Sklearn pipelines
Chain a scaler and a model into one object, and see why Pipeline.fit() is the most reliable way to keep test-set information from leaking into preprocessing.
- Construct a Pipeline of StandardScaler and LogisticRegression
- Explain why fitting the pipeline prevents leakage that manual scaling would allow
- Pass the pipeline to cross_val_score and interpret the result
Preprocessing and modelling are logically inseparable: the same transformations
that ran during training must run identically at inference. Doing them manually
(fit the scaler on the training set, then apply that fitted scaler to the test
set with transform) is correct when done carefully, but it is fragile: one
misplaced fit_transform on the test set, or on the full dataset before
splitting, lets held-out statistics shape the preprocessing and can bias every
downstream metric.
Pipeline eliminates that class of error by making the correct behaviour the
only behaviour.
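To make the failure mode concrete, here is a minimal sketch of the manual workflow, first done correctly and then with the misplaced fit_transform. The dataset (load_breast_cancer), the split settings, and the variable names are assumptions chosen for illustration, not part of the lesson's own code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed example data; any numeric feature matrix would do.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Correct manual workflow: learn mean/std from the training set only,
# then reuse the same fitted scaler on the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Mistaken variant: fit_transform on the test set recomputes mean/std
# from the test data instead of reusing the training-set statistics.
wrong_scaler = StandardScaler()
X_test_wrong = wrong_scaler.fit_transform(X_test)

# The two scaled test sets differ because they rely on different statistics.
same_result = np.allclose(X_test_scaled, X_test_wrong)
print("Scaled test sets identical:", same_result)
```

The mistaken variant runs without any error or warning, which is what makes it easy to miss in a longer script.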
Building a pipeline
When pipe.fit(X_train, y_train) runs, the pipeline calls
scaler.fit_transform(X_train), then passes the scaled data to
clf.fit(scaled_X_train, y_train). The scaler stores the training-set mean
and standard deviation. When pipe.predict(X_test) runs, it calls
scaler.transform(X_test) using those same training-set statistics — not
new ones computed from the test set. That is the leakage-free guarantee.
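The behaviour described above can be sketched as follows. The step names "scaler" and "clf" match the names used in this lesson; the dataset, the split settings, and max_iter=1000 are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data and split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each step is a (name, estimator) tuple; the last step is the model.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() fits the scaler on X_train, transforms X_train, then fits the classifier.
pipe.fit(X_train, y_train)

# predict() only calls transform() on the scaler, reusing the stored statistics.
y_pred = pipe.predict(X_test)

# The stored mean is the training-set mean, not a test-set or full-data mean.
fitted_scaler = pipe.named_steps["scaler"]
uses_train_mean = np.allclose(fitted_scaler.mean_, X_train.mean(axis=0))
print("Scaler mean equals training-set mean:", uses_train_mean)

test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Inspecting named_steps["scaler"].mean_ is one simple way to confirm which data the scaler actually saw.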
Cross-validation with a pipeline
Passing a pipeline to cross_val_score extends the same guarantee across
every fold: the scaler is refitted on each fold's training portion and only
applied, never fitted, on that fold's held-out portion.
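A minimal sketch of that call, assuming the same illustrative dataset and a 5-fold split (both are assumptions, as is max_iter=1000):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data.
X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Pass the unfitted pipeline and the raw, unscaled data.
# cross_val_score clones and refits the whole pipeline for every fold.
scores = cross_val_score(pipe, X, y, cv=5)

print("Fold accuracies:", scores.round(3))
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```

Note that X is passed unscaled: scaling it beforehand would reintroduce the leakage the pipeline is meant to prevent.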
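The mean of the fold scores is your honest estimate of generalisation accuracy;
the standard deviation tells you how much that estimate varies across folds. A
high standard deviation relative to the mean suggests the dataset is small or
the splits have unusual class distributions. For classifiers, cross_val_score
already uses StratifiedKFold when cv is an integer, so in those cases pass
StratifiedKFold(shuffle=True) explicitly, or use RepeatedStratifiedKFold to
average over several different splits.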
Pipeline steps are named (the first element of each tuple). Names matter for
hyperparameter tuning: GridSearchCV uses the stepname__param convention to
target parameters inside a pipeline. For example,
{"clf__max_iter": [100, 500]} tunes LogisticRegression.max_iter through
the pipeline interface without touching the pipeline structure.
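As a rough illustration of the stepname__param convention, the sketch below tunes the pipeline with GridSearchCV. The clf__max_iter grid comes from the example above; the clf__C values, the dataset, and cv=5 are assumptions added for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data.
X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Keys follow "<step name>__<parameter name>".
param_grid = {
    "clf__max_iter": [100, 500],   # from the lesson text
    "clf__C": [0.1, 1.0, 10.0],    # assumed extra grid, for illustration only
}

# Every candidate is evaluated with the full pipeline refitted per fold,
# so the scaler never sees the held-out portion during fitting.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

After fitting, search.best_estimator_ is itself a fitted pipeline, so it can be used for prediction directly.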
Where to go next
Next: saving models — serialising a fitted pipeline with joblib so it can
be loaded and used for inference without refitting.