Lab: harden a pipeline
Take a brittle three-step pipeline and make it production-ready — add idempotency checkpoints, atomic writes, exponential-backoff retry logic, and failure alerting.
- Add checkpoint files to a pipeline so completed steps are skipped on restart
- Make file writes atomic with tempfile + os.replace
- Wrap HTTP calls with exponential-backoff retry logic
- Fire a Slack webhook notification when the pipeline fails
This lab starts with the kind of pipeline that appears in every team's codebase eventually: three steps that mostly work, no retry, no checkpoints, and a half-written file on disk whenever the process is killed mid-run. Your job is to harden it using every pattern from this module.
The brittle pipeline
Read through the starter code carefully. It has three deliberate weaknesses:
- No checkpoints — a restart re-runs everything from the top.
- Writes the output file incrementally — a crash mid-write leaves a partial file.
- No retry on the (simulated) HTTP step — a single timeout aborts the run.
Run it and observe the retry messages on the fetch and upload steps. Then work through the checkpoints below.
Checkpoint 1 — verify idempotency
Change the reset_checkpoints() call at the top of run_pipeline() to a
comment, then re-run. Every step should print "already done, skipping" because
the .done files still exist from the first run. This is the correct behaviour
for a production restart after a crash.
Restore the call when you are done.
In production you would not automatically clear checkpoints on every run.
Include a --force CLI flag (from argparse) that calls reset_checkpoints()
only when the operator explicitly wants a full re-run.
Checkpoint 2 — trigger the Slack alert
Set TRANSFORM_FAIL = True near the top of the cell, re-run, and confirm that:
- Fetch succeeds and marks its checkpoint.
- Transform raises a
ValueError(a permanent error — no retry). - The orchestrator catches it and calls
_send_slack_alert. - The
ValueErrorpropagates after the alert so the process exits non-zero.
In production the alert would POST to a Slack incoming webhook URL stored as an environment variable:
import os, requests
SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK_URL"]
def send_slack_alert(message: str) -> None:
requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=10)Reset TRANSFORM_FAIL = False before continuing.
Checkpoint 3 — inspect the atomic writes
Add a print statement immediately after the first tmp.write(content) line in
write_atomically to print the temp filename. Verify that:
- The
.tmpfile appears inWORK_DIRduring the write. - After
os.replace, only the final filename remains.
This demonstrates that downstream steps never see a partially-written file.
Checkpoint 4 — simulate a mid-run crash
Raise KeyboardInterrupt inside step_transform (before the atomic write
completes) by adding raise KeyboardInterrupt() on the first line. Re-run and
verify:
- Fetch completes and marks its checkpoint.
- Transform raises before writing — no partial file exists.
- On a third run (without clearing checkpoints), fetch is skipped and transform retries from a clean state.
KeyboardInterrupt is not a subclass of Exception, so the broad
except Exception in the orchestrator will not catch it. That is intentional —
operator interrupts should not fire the Slack alert. Use BaseException only if
you need a catch-all that also fires on interrupt.
Extending the pattern
A production-grade version of this pipeline would add:
- Structured logging (
logging.getLogger(__name__)) replacingprintcalls, with a JSON formatter so logs are queryable in Datadog or CloudWatch. - Run IDs in checkpoint filenames (
{name}_{run_date}.done) so multiple daily runs are tracked independently. - Dead-letter storage — when the upload fails permanently, write the failed payload to a file or queue for manual review rather than discarding it.
- Metrics — a counter incremented on each retry attempt, published to Prometheus or StatsD, so you can alert on sustained high retry rates before they become full failures.
Where to go next
Module complete. Next up: Testing Automation Scripts — the patterns you have built so far are only trustworthy if they are tested. The next module covers mocking the filesystem and HTTP layer so you can verify pipeline logic without touching real files or real APIs.
Retry in practice
Use the tenacity library to wrap a flaky function with exponential backoff and per-exception routing, and see it recover from transient failures automatically.
Why scripts need tests
Automation scripts run unattended, touch real systems, and fail silently — making them the most dangerous category of code to ship without tests.