Supervised vs unsupervised learning
The single question that determines which class of algorithm to use — does your data have labels?
- Define supervised learning as learning a mapping from inputs to labelled outputs
- Define unsupervised learning as finding structure in data without labels
- Describe semi-supervised learning and when it arises in practice
- Identify which paradigm a given problem requires
Every machine learning problem starts with the same diagnostic question: does your training data include the answers? That single question separates two fundamentally different algorithmic families, each with its own assumptions, methods, and failure modes.
Supervised learning
In supervised learning, each training example comes with a label — the correct answer the model is trying to learn to produce. The algorithm learns a mapping from inputs to outputs by comparing its predictions against those labels and adjusting until the gap is small.
The two main forms are:
- Regression: the label is a continuous number. Predicting tomorrow's temperature, estimating a house price, forecasting next quarter's revenue. The model outputs a real-valued number, and the loss function measures how far off it was.
- Classification: the label is a category. Spam or not-spam, digit 0–9, tumour benign or malignant. The model outputs a class (or a probability distribution over classes).
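The "compare predictions against labels and adjust" loop can be sketched for the regression case in a few lines of plain Python. This is a minimal illustration, not a production method: the data, the learning rate, and the names `fit_line` and `mse` are all assumptions made up for the example.

```python
# Toy supervised regression: learn y = w*x + b from labelled (x, y) pairs.
# Data and hyperparameters are hypothetical, chosen only for illustration.

def fit_line(xs, ys, lr=0.01, epochs=5000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gap between the current predictions and the labels
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Adjust the parameters to shrink the gap
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the line y = w*x + b on the labelled data."""
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical labelled data: distance (hundreds of km) -> days to deliver.
# The labels were generated from days = 0.5 * distance + 1.0.
distances = [1.0, 2.0, 3.0, 4.0, 5.0]
days = [1.5, 2.0, 2.5, 3.0, 3.5]

w, b = fit_line(distances, days)
print(f"w={w:.3f}, b={b:.3f}, mse={mse(distances, days, w, b):.2e}")
```

A classifier follows the same pattern with a categorical label and a different loss; only the kind of target changes.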
Three concrete examples with the "what's the target?" test:
| Problem | Input features | Target | Type |
|---|---|---|---|
| Email spam detection | Word frequencies, sender | Spam / not-spam | Classification |
| Predicting loan default | Income, credit history | Default / no default | Classification |
| Estimating shipping time | Distance, weight, carrier | Days to deliver | Regression |
The discriminator is simple: if there is a specific target you want to predict, and someone could in principle label each row with it by hand (even if that would be expensive), you have a supervised problem.
Unsupervised learning
Unsupervised learning removes the labels entirely. The algorithm receives only inputs and must find structure — patterns, groupings, or compressed representations — on its own.
Two main forms:
- Clustering: partition examples into groups where members are more similar to each other than to members of other groups. k-means and DBSCAN are canonical examples. No labels tell the algorithm what the groups mean, so you interpret them after the fact. The number of clusters is not given by the data either: k-means requires you to choose it up front, while DBSCAN derives it from density parameters that you set.
- Dimensionality reduction: compress high-dimensional data into fewer dimensions while preserving as much structure as possible. PCA (principal component analysis) and t-SNE are common. Both are useful for visualisation; PCA is also widely used as a preprocessing step before supervised learning, whereas t-SNE is mainly a visualisation tool.
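To make clustering concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The function name `kmeans`, the toy points, and the choice to seed the centroids with the first `k` points are assumptions for illustration; library implementations use better initialisation and a convergence check.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm. The first k points are used as initial centroids."""
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignments


# Hypothetical unlabelled data: two visibly separate groups of 2-D points
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]

centroids, assignments = kmeans(points, k=2)
print(assignments)  # cluster index for each point
print(centroids)    # one centroid per cluster
```

No labels are passed in: the cluster indices are arbitrary identifiers, and deciding what each group means is left to whoever reads the result.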
Three examples:
| Problem | Input | No target because… |
|---|---|---|
| Customer segmentation | Purchase history | No pre-defined groups exist |
| Anomaly detection in server logs | Log feature vectors | "Normal" is not labelled |
| Document topic modelling | Word counts | Topics are latent, not labelled |
Semi-supervised learning
In practice, labels are expensive. A medical dataset might have a million scans but only ten thousand reviewed by a radiologist. Semi-supervised learning combines a small labelled set with a large unlabelled set, because the unlabelled data still carries information about the input distribution even without labels. Self-training is one of the simplest approaches: train on the labelled set, label the unlabelled examples the model is most confident about, add them to the training set, and retrain.
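A self-training loop can be sketched as follows. This is a toy under stated assumptions: the model is a one-dimensional nearest-centroid classifier, "confidence" is the distance margin between the two closest class centroids, and the names `centroids_of`, `predict`, `self_train`, and `min_margin`, along with the data, are invented for the example.

```python
def centroids_of(xs, ys):
    """Mean feature value per class (a 1-D nearest-centroid model)."""
    model = {}
    for label in sorted(set(ys)):
        members = [x for x, y in zip(xs, ys) if y == label]
        model[label] = sum(members) / len(members)
    return model


def predict(model, x):
    """Return (label, margin). Assumes at least two classes in the model.

    The margin is how much closer x is to the winning centroid than to
    the runner-up; a larger margin is treated as higher confidence.
    """
    ranked = sorted(model, key=lambda label: abs(x - model[label]))
    best, second = ranked[0], ranked[1]
    return best, abs(x - model[second]) - abs(x - model[best])


def self_train(labelled_x, labelled_y, unlabelled_x, min_margin=2.0, max_rounds=10):
    xs, ys = list(labelled_x), list(labelled_y)
    pool = list(unlabelled_x)
    for _ in range(max_rounds):
        model = centroids_of(xs, ys)
        confident, remaining = [], []
        for x in pool:
            label, margin = predict(model, x)
            (confident if margin >= min_margin else remaining).append((x, label))
        if not confident:
            break  # nothing left that the model is sure about
        # Treat confident predictions as labels and retrain next round
        for x, label in confident:
            xs.append(x)
            ys.append(label)
        pool = [x for x, _ in remaining]
    return centroids_of(xs, ys), pool


# Hypothetical data: one scalar score per scan, two scans reviewed by hand
reviewed_x = [1.0, 9.0]
reviewed_y = ["benign", "malignant"]
unreviewed_x = [0.5, 2.0, 3.5, 6.5, 8.0, 9.5, 5.0]

model, leftover = self_train(reviewed_x, reviewed_y, unreviewed_x)
print(model)     # class centroids after absorbing confident pseudo-labels
print(leftover)  # examples the model never became confident about
```

The ambiguous example stays unlabelled rather than being forced into a class. That is the main safeguard in self-training, since a wrong pseudo-label is reinforced on every later round.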
The "what's the target?" test is a reliable heuristic, but it has an edge case: reinforcement learning, where the "label" is a delayed reward signal, not a pre-labelled example. That's a third paradigm. At this level, supervised and unsupervised cover the vast majority of data science problems.
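The "what's the target?" test can be written down as a small helper. This is a rough sketch, not a rule: the function `identify_paradigm` and the sample rows are hypothetical, reinforcement learning is out of scope, and treating every numeric target as regression is an assumption that misfires on integer-coded categories such as digits 0–9.

```python
def identify_paradigm(rows, target=None):
    """Guess the learning paradigm from the presence and type of a target column."""
    if target is None:
        return "unsupervised"
    labels = [row.get(target) for row in rows]
    known = [value for value in labels if value is not None]
    if not known:
        return "unsupervised"      # a target was named but no row has it
    if len(known) < len(labels):
        return "semi-supervised"   # only some rows are labelled
    # bool is a subclass of int in Python, so exclude it explicitly
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in known
    )
    return "supervised (regression)" if numeric else "supervised (classification)"


# Hypothetical rows, loosely based on the examples in this lesson
shipping = [{"distance_km": 120, "days": 2.0}, {"distance_km": 900, "days": 5.5}]
emails = [{"n_links": 7, "label": "spam"}, {"n_links": 0, "label": "not-spam"}]
scans = [{"score": 0.41, "diagnosis": "benign"}, {"score": 0.77, "diagnosis": None}]
purchases = [{"basket_total": 35.0}, {"basket_total": 210.0}]

print(identify_paradigm(shipping, "days"))
print(identify_paradigm(emails, "label"))
print(identify_paradigm(scans, "diagnosis"))
print(identify_paradigm(purchases))
```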
Where to go next
Now that you can classify a problem by paradigm, the next lesson examines a universal challenge in supervised learning: the bias-variance tradeoff — the tension between models that are too simple and models that are too complex.