Sklearn pipelines

Chain a scaler and a model into one object, and see why Pipeline.fit() is the most reliable way to keep test-set information from leaking into preprocessing.

Data Science | Advanced | 10 min read
By the end of this lesson you will be able to:
  • Construct a Pipeline of StandardScaler and LogisticRegression
  • Explain why fitting the pipeline prevents leakage that manual scaling would allow
  • Pass the pipeline to cross_val_score and interpret the result

Preprocessing and modelling are logically inseparable: the same transformations that ran during training must run identically at inference. Doing them manually (fit the scaler on the training set, then apply that same fitted scaler to the test set) is correct when done carefully, but it is fragile: one misplaced fit_transform on the test set, or on the full dataset before splitting, leaks test-set information into preprocessing and can distort every downstream metric.
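
To make the fragility concrete, here is a minimal sketch of the manual approach, showing the careful version next to the one-line slip. The dataset (load_breast_cancer), the split settings, and the variable names are assumptions chosen for illustration; they do not come from the lesson.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative dataset and split (assumptions, not part of the lesson).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Careful manual version: fit on the training set only,
# then reuse the fitted scaler on the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The slip: fit_transform on the test set computes fresh
# statistics from the test data instead of reusing the training ones.
leaky_scaler = StandardScaler()
X_test_leaky = leaky_scaler.fit_transform(X_test)

# The two "scaled test sets" are different arrays.
same_result = np.allclose(X_test_scaled, X_test_leaky)
print("Identical test features:", same_result)
```

Both versions run without any error or warning, which is what makes the mistake easy to miss in review.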

Pipeline eliminates that class of error by making the correct behaviour the only behaviour.

Building a pipeline

When pipe.fit(X_train, y_train) runs, the pipeline calls scaler.fit_transform(X_train), then passes the scaled data to clf.fit(scaled_X_train, y_train). The scaler stores the training-set mean and standard deviation. When pipe.predict(X_test) runs, it calls scaler.transform(X_test) using those same training-set statistics — not new ones computed from the test set. That is the leakage-free guarantee.
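
The sequence described above can be sketched as follows. The step names "scaler" and "clf" match the names used in this lesson; the dataset (load_breast_cancer), the split settings, and max_iter=1000 are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset and split (assumptions, not part of the lesson).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each step is a (name, estimator) tuple; the last step is the model.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() fits the scaler on X_train, transforms X_train, then fits the classifier.
pipe.fit(X_train, y_train)

# The fitted scaler holds training-set statistics only.
fitted_scaler = pipe.named_steps["scaler"]
stats_from_train = np.allclose(fitted_scaler.mean_, X_train.mean(axis=0))
print("Scaler mean equals training-set mean:", stats_from_train)

# predict() and score() call scaler.transform(X_test) with those stored statistics.
y_pred = pipe.predict(X_test)
print("Test accuracy:", pipe.score(X_test, y_test))
```

Inspecting pipe.named_steps["scaler"].mean_ after fitting is a quick way to confirm which data the preprocessing statistics came from.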

Cross-validation with a pipeline

Passing a pipeline to cross_val_score extends the same guarantee across every fold: the scaler is refitted on each fold's training portion and is only applied to, never fitted on, that fold's held-out portion.
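
A minimal sketch of that call, assuming the same illustrative dataset (load_breast_cancer) and five folds; both choices are assumptions rather than settings taken from the lesson.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset (an assumption, not part of the lesson).
X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score clones the whole pipeline for each fold, so the scaler
# is fitted from scratch on that fold's training portion every time.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")

print("Fold accuracies:", scores)
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```

Note that the unfitted pipeline is passed in; there is no need to call pipe.fit beforehand, because each fold fits its own copy.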

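The mean is your honest estimate of generalisation accuracy; the standard deviation tells you how much that estimate varies across folds. A high standard deviation relative to the mean suggests the dataset is small or the splits have unusual class distributions. For a classifier, cross_val_score already uses stratified folds when cv is an integer, so in those cases pass an explicit StratifiedKFold (for example with shuffle=True and a fixed random_state) to control how the folds are drawn.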
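
As a sketch of passing an explicit splitter, the example below builds a StratifiedKFold and hands it to cross_val_score. The dataset, shuffle=True, and random_state=0 are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset (an assumption, not part of the lesson).
X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Explicit splitter: each fold keeps roughly the overall class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Positive-class rate in each held-out fold, to compare with the overall rate.
fold_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]
print("Overall positive rate:", round(y.mean(), 3))
print("Per-fold positive rates:", [round(r, 3) for r in fold_rates])

scores = cross_val_score(pipe, X, y, cv=cv)
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```

Printing the per-fold class rates alongside the scores helps separate variance caused by fold composition from variance caused by a small dataset.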

Pipeline steps are named (the first element of each tuple). Names matter for hyperparameter tuning: GridSearchCV uses the stepname__param convention to target parameters inside a pipeline. For example, {"clf__max_iter": [100, 500]} tunes LogisticRegression.max_iter through the pipeline interface without touching the pipeline structure.
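
The naming convention can be sketched with a small grid search. The clf__max_iter values are the ones quoted above; the extra clf__C values, the dataset, and cv=5 are assumptions added only to make the search more than a single-parameter toy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset (an assumption, not part of the lesson).
X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Keys follow stepname__param: "clf" is the step name, "max_iter" the parameter.
param_grid = {
    "clf__max_iter": [100, 500],
    "clf__C": [0.1, 1.0, 10.0],  # illustrative addition, not from the lesson
}

# The whole pipeline is refitted for every candidate and every fold.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

Calling pipe.get_params().keys() lists every name that can be used as a grid key, which is a handy check when a key is rejected as invalid.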

Where to go next

Next: saving models — serialising a fitted pipeline with joblib so it can be loaded and used for inference without refitting.
