Testing data pipelines

Unit-test transformation functions, validate DataFrame schemas with pandera, and assert that no training-set statistics leaked into test data.

Data pipelines fail silently. A transformation bug does not raise an exception — it produces a subtly wrong DataFrame that propagates through training and degrades model quality in ways that only show up at evaluation time, or worse, in production. Testing is the practice of detecting those failures before they propagate.

Unit-testing transformation functions

The first rule: write transformations as pure functions that take a DataFrame and return a DataFrame. Pure functions are testable — you can construct a small input, call the function, and assert the output matches expectations.

Python — editable, runs in your browser

import pandas as pd
import numpy as np

def clip_outliers(df: pd.DataFrame, col: str, lo: float, hi: float) -> pd.DataFrame:
  """Clip values in col to [lo, hi]. Returns a new DataFrame."""
  result = df.copy()
  result[col] = result[col].clip(lower=lo, upper=hi)
  return result

# --- Tests ---
def test_clip_outliers_basic():
  df = pd.DataFrame({"x": [-10, 0, 5, 200]})
  out = clip_outliers(df, "x", 0, 100)
  assert list(out["x"]) == [0, 0, 5, 100], f"Expected [0,0,5,100], got {list(out['x'])}"
  print("test_clip_outliers_basic: PASSED")

def test_clip_outliers_no_mutation():
  df = pd.DataFrame({"x": [-10, 200]})
  _ = clip_outliers(df, "x", 0, 100)
  assert df["x"].iloc[0] == -10, "Input DataFrame was mutated — bug in clip_outliers"
  print("test_clip_outliers_no_mutation: PASSED")

def test_clip_outliers_preserves_shape():
  df = pd.DataFrame({"x": range(50), "y": range(50)})
  out = clip_outliers(df, "x", 10, 40)
  assert out.shape == df.shape, f"Shape changed: {df.shape} -> {out.shape}"
  print("test_clip_outliers_preserves_shape: PASSED")

test_clip_outliers_basic()
test_clip_outliers_no_mutation()
test_clip_outliers_preserves_shape()

These tests are framework-free (no pytest import needed in the runner). In a real project, these would be in a tests/ directory and run with pytest. The key habit is the same: construct a small, controlled input; call the function; assert the output is exactly what you expect.

Schema validation with pandera

Transformation tests check logic; schema tests check structure. pandera lets you declare the expected schema of a DataFrame — column names, types, value ranges — and raise an error when data violates it.

Python — editable, runs in your browser

Asserting no data leakage

Data leakage in a preprocessing step means the test set's statistics influenced the training data's transformation. The canonical example: fitting a StandardScaler on the combined train+test set instead of train only. The assertion is simple: verify that the scaler's mean was computed before the test set existed:

Python — editable, runs in your browser

In practice, leakage is harder to detect than this minimal example suggests. Subtle leakage occurs when a feature is engineered using a statistic computed over the full dataset before splitting — for example, computing a target-mean encoding before calling train_test_split. The rule: all statistics must be computed after splitting, on training data only.

Where to go next

The lab brings everything together: a complete, tested, reproducible ML pipeline in a single notebook — ingest, clean, engineer, split, scale, train, evaluate, and persist.

Finished reading? Mark it complete to track your progress.

Unit-testing transformation functions

Schema validation with pandera

Asserting no data leakage

Where to go next

On this page