Testing data pipelines
Unit-test transformation functions, validate DataFrame schemas with pandera, and assert that no test-set statistics leaked into preprocessing fitted for training.
- Write unit tests for a data transformation function using plain assert
- Validate DataFrame column types and value ranges with pandera
- Write an assertion that confirms preprocessing statistics were computed on training data only, never on test data
Data pipelines fail silently. A transformation bug does not raise an exception — it produces a subtly wrong DataFrame that propagates through training and degrades model quality in ways that only show up at evaluation time, or worse, in production. Testing is the practice of detecting those failures before they propagate.
Unit-testing transformation functions
The first rule: write transformations as pure functions that take a DataFrame and return a DataFrame. Pure functions are testable — you can construct a small input, call the function, and assert the output matches expectations.
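As a minimal sketch of that pattern, the example below tests a small transformation with plain assert statements. The function add_price_per_unit and the column names total_price and quantity are hypothetical, chosen only for illustration; they are not part of any real pipeline.

```python
import pandas as pd


def add_price_per_unit(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation: return a copy with a price_per_unit column."""
    out = df.copy()  # copy so the caller's DataFrame is never mutated
    out["price_per_unit"] = out["total_price"] / out["quantity"]
    return out


def test_computes_expected_values():
    df = pd.DataFrame({"total_price": [10.0, 30.0], "quantity": [2, 3]})
    result = add_price_per_unit(df)
    assert result["price_per_unit"].tolist() == [5.0, 10.0]


def test_does_not_mutate_input():
    df = pd.DataFrame({"total_price": [10.0], "quantity": [2]})
    add_price_per_unit(df)
    assert list(df.columns) == ["total_price", "quantity"]


def test_preserves_row_count():
    df = pd.DataFrame({"total_price": [1.0, 2.0, 3.0], "quantity": [1, 1, 1]})
    assert len(add_price_per_unit(df)) == len(df)


# Minimal runner: call each test directly; a failing assert raises AssertionError.
for test in (
    test_computes_expected_values,
    test_does_not_mutate_input,
    test_preserves_row_count,
):
    test()
print("all transformation tests passed")
```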
These tests are framework-free (no pytest import needed in the runner). In a
real project, these would be in a tests/ directory and run with pytest.
The key habit is the same: construct a small, controlled input; call the
function; assert the output is exactly what you expect.
Schema validation with pandera
Transformation tests check logic; schema tests check structure. pandera lets
you declare the expected schema of a DataFrame — column names, types, value
ranges — and raise an error when data violates it.
Asserting no data leakage
Data leakage in a preprocessing step means the test set's statistics influenced
the training data's transformation. The canonical example: fitting a
StandardScaler on the combined train+test set instead of train only. The
assertion is simple: verify that the scaler's fitted mean equals the mean of
the training set alone, not the mean of the combined data.
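One way to express that check is sketched below, using synthetic data generated purely for illustration. It compares the fitted scaler's attributes against statistics recomputed from the training split, and confirms the scaler saw exactly as many rows as the training set contains.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

# Split first, then fit the scaler on the training rows only.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)

# The same fitted scaler is then applied to both splits.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The fitted statistics must match the training split alone.
assert np.allclose(scaler.mean_, X_train.mean(axis=0))
assert np.allclose(scaler.scale_, X_train.std(axis=0))
assert scaler.n_samples_seen_ == X_train.shape[0]

# Sanity check that the comparison can tell the two cases apart:
# the training mean should differ from the mean of the combined data.
assert not np.allclose(scaler.mean_, X.mean(axis=0))
```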
In practice, leakage is harder to detect than this minimal example suggests.
Subtle leakage occurs when a feature is engineered using a statistic computed
over the full dataset before splitting — for example, computing a target-mean
encoding before calling train_test_split. The rule: all statistics must
be computed after splitting, on training data only.
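The target-mean encoding case can be sketched as follows. The helpers fit_target_means and apply_target_means, the city and target columns, and the fallback to the overall training mean for unseen categories are all assumptions for this example. A fixed positional split stands in for train_test_split so the numbers are easy to check by hand.

```python
import numpy as np
import pandas as pd


def fit_target_means(train: pd.DataFrame, col: str, target: str):
    """Hypothetical helper: learn per-category target means from training rows only."""
    means = train.groupby(col)[target].mean()
    fallback = train[target].mean()  # used for categories unseen in training
    return means, fallback


def apply_target_means(df: pd.DataFrame, col: str, means: pd.Series, fallback: float):
    """Hypothetical helper: encode a column using statistics learned elsewhere."""
    out = df.copy()
    out[col + "_encoded"] = out[col].map(means).fillna(fallback)
    return out


df = pd.DataFrame(
    {
        "city": ["a", "a", "b", "b", "a", "b", "c", "a"],
        "target": [1, 0, 1, 1, 0, 0, 1, 1],
    }
)

# Split FIRST (positional split stands in for train_test_split).
train, test = df.iloc[:6], df.iloc[6:]

# Then compute the encoding statistics from the training rows only.
means, fallback = fit_target_means(train, "city", "target")
train_enc = apply_target_means(train, "city", means, fallback)
test_enc = apply_target_means(test, "city", means, fallback)

# Test-set targets never influenced the encoding: city "c" appears only in
# the test rows, so it receives the training fallback, not its own target.
assert "c" not in means.index
assert np.isclose(test_enc["city_encoded"].iloc[0], 0.5)
assert np.isclose(test_enc["city_encoded"].iloc[1], 1 / 3)
```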
Where to go next
The lab brings everything together: a complete, tested, reproducible ML pipeline in a single notebook — ingest, clean, engineer, split, scale, train, evaluate, and persist.
Reproducible pipelines
Pin seeds, hash intermediates, version your data — the checklist for pipelines that produce the same output from the same input, every time.
Lab: build a complete ML pipeline
Ingest, clean, engineer features, split, scale, train, evaluate, and persist — each step as a tested, reproducible function.