Reproducible pipelines
Pin seeds, hash intermediates, version your data — the checklist for pipelines that produce the same output from the same input, every time.
- Apply idempotency to data pipeline steps
- Pin random seeds in Python and NumPy to ensure deterministic outputs
- Use file checksums to detect when intermediate data has changed
A reproducible pipeline produces the same outputs from the same inputs, regardless of when, where, or by whom it is run. That is a stricter requirement than it sounds. Most notebooks and scripts that "work" are not reproducible — they depend on implicit state, random seeds that change each run, or intermediate files whose provenance is unknown.
Reproducibility matters for three practical reasons: debugging (if two runs produce different results, which is correct?), collaboration (a colleague cannot verify your work if they cannot replicate it), and trust (a model whose training cannot be reproduced cannot be audited).
Idempotency
An idempotent operation produces the same result whether it runs once or a thousand times on the same input. Write each pipeline step as a pure function: given the same input data, it always produces the same output, with no side effects on shared state.
The most common violation is a function that reads the current date, system environment variables, or network resources at call time. These make the output a function of context, not just input. Wherever possible, make that context explicit: pass the date in as an argument instead of calling datetime.now() inside the function.
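As a minimal sketch of that advice, the step below takes the reference date as a parameter. The function name `recent_orders`, the `as_of` parameter, and the record layout are hypothetical, chosen only for illustration.

```python
from datetime import date, timedelta


def recent_orders(orders, as_of: date, days: int = 7):
    """Return orders dated within `days` before `as_of` (inclusive of as_of).

    The reference date is an explicit input, so the same arguments always
    give the same result, no matter when the pipeline runs.
    """
    cutoff = as_of - timedelta(days=days)
    return [o for o in orders if cutoff < o["date"] <= as_of]


# Hypothetical records, for illustration only.
orders = [
    {"id": 1, "date": date(2024, 1, 20)},
    {"id": 2, "date": date(2024, 1, 28)},
    {"id": 3, "date": date(2024, 2, 2)},
]

# The caller fixes the date once, at the edge of the pipeline.
selected = recent_orders(orders, as_of=date(2024, 1, 31))
print([o["id"] for o in selected])
```

A version that called `date.today()` internally would return a different subset each week for the same `orders`, which is exactly the hidden dependence on context described above.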
Pinning random seeds
Any step that uses randomness must pin a seed. In Python:
```python
import random
import numpy as np

random.seed(42)
np.random.seed(42)
```

For NumPy's newer generator API (preferred):

```python
rng = np.random.default_rng(42)
```

For sklearn, pass random_state=42 to every function or constructor that accepts it: train_test_split, RandomForestClassifier, KMeans, etc.
A single unseeded step anywhere in the pipeline breaks end-to-end
reproducibility.
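One way to keep seeding local to a step is to give the step its own generator instead of relying on global state. The sketch below uses only the standard library's `random.Random`; the helper name `sample_rows` is hypothetical. With NumPy, a generator created by `np.random.default_rng(seed)` could be passed around in the same manner.

```python
import random


def sample_rows(rows, k, seed):
    """Draw k rows using a private, seeded generator.

    Because the generator is local, other code that calls the global
    random functions cannot shift the sequence this step sees.
    """
    rng = random.Random(seed)
    return rng.sample(rows, k)


rows = list(range(100))

first = sample_rows(rows, 5, seed=42)
random.random()  # unrelated use of the global generator in between
second = sample_rows(rows, 5, seed=42)

print(first == second)  # the same seed gives the same sample
```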
Hashing intermediate files
If a pipeline step is expensive (loading 50 GB, training a model for 2 hours), you want to cache its output and skip it on subsequent runs when the input has not changed. The correct way to detect "input has not changed" is a checksum:
```python
import hashlib


def file_hash(path: str) -> str:
    # Read in 64 KiB chunks so large files are never loaded fully into memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

Store the hash alongside the cached output. Before running the step, check whether the input hash matches the stored hash. If it matches, load the cache. If not, rerun and update the hash. DVC implements this hash-based logic; Make applies the same skip-if-unchanged idea, but compares file modification times instead of content hashes.
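The check-then-skip logic described above can be sketched as follows. The `run_cached` helper, the `.sha256` sidecar file convention, and the toy `uppercase` step are assumptions made for this example, not part of any particular tool.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def run_cached(input_path: Path, output_path: Path, step) -> bool:
    """Run step(input_path, output_path) only if the input has changed.

    Returns True if the step ran, False if the cached output was reused.
    """
    # Hypothetical convention: the input hash lives next to the output.
    hash_path = output_path.with_name(output_path.name + ".sha256")
    current = file_hash(input_path)
    if output_path.exists() and hash_path.exists():
        if hash_path.read_text().strip() == current:
            return False  # input unchanged: keep the cached output
    step(input_path, output_path)
    hash_path.write_text(current)  # record the hash only after success
    return True


def uppercase(src: Path, dst: Path) -> None:
    # Stand-in for an expensive step.
    dst.write_text(src.read_text().upper())


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "in.txt", Path(tmp) / "out.txt"
    src.write_text("hello")
    ran_first = run_cached(src, dst, uppercase)   # no cache yet: runs
    ran_second = run_cached(src, dst, uppercase)  # input unchanged: skipped
    src.write_text("changed")
    ran_third = run_cached(src, dst, uppercase)   # input changed: runs again
    final_output = dst.read_text()
```

Writing the hash only after the step finishes means a crashed run leaves no matching hash behind, so the next run repeats the step instead of trusting a partial output.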
Versioning training data
A model is only as reproducible as the data it was trained on. Store a reference to the exact version of the training dataset alongside the saved model — either a hash of the data file, a DVC data version, or an immutable S3 object version. Without this, "retrain with the same data" is not possible six months later when the source database has been modified.
The most common reproducibility failure is forgetting to pin the seed on a
train/test split. If the split is different, every downstream metric is
different. Always pass random_state= to train_test_split, and store the
seed value in your model metadata.
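A minimal sketch of such a metadata record, tying the last two points together, might look like this. The field names `data_sha256` and `split_seed`, and both helper functions, are hypothetical; a real project would likely add a model version and library versions as well.

```python
import hashlib
import json


def training_metadata(data: bytes, seed: int) -> str:
    """Serialize the facts needed to repeat a training run."""
    record = {
        "data_sha256": hashlib.sha256(data).hexdigest(),
        "split_seed": seed,
    }
    # sort_keys keeps the file byte-for-byte stable across runs.
    return json.dumps(record, sort_keys=True, indent=2)


def data_matches(metadata_json: str, data: bytes) -> bool:
    """Check whether `data` is the exact dataset the model was trained on."""
    record = json.loads(metadata_json)
    return record["data_sha256"] == hashlib.sha256(data).hexdigest()


# Hypothetical dataset contents, for illustration only.
data = b"id,label\n1,0\n2,1\n"
meta = training_metadata(data, seed=42)

print(data_matches(meta, data))               # same bytes as at training time
print(data_matches(meta, data + b"3,1\n"))    # dataset has since been modified
```

Saved next to the model file, a record like this lets a later retraining run confirm it has the same data and reuse the same split seed before any metrics are compared.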
Where to go next
With reproducibility principles established, the next lesson operationalises them: testing pipelines — unit tests for transformation functions, schema validation with pandera, and assertions that detect data leakage.