Code of the Day
AdvancedRobust Pipelines

Lab: harden a pipeline

Take a brittle three-step pipeline and make it production-ready — add idempotency checkpoints, atomic writes, exponential-backoff retry logic, and failure alerting.

Lab · optionalWorkflowAdvanced35 min
By the end of this lesson you will be able to:
  • Add checkpoint files to a pipeline so completed steps are skipped on restart
  • Make file writes atomic with tempfile + os.replace
  • Wrap HTTP calls with exponential-backoff retry logic
  • Fire a Slack webhook notification when the pipeline fails

This lab starts with the kind of pipeline that appears in every team's codebase eventually: three steps that mostly work, no retry, no checkpoints, and a half-written file on disk whenever the process is killed mid-run. Your job is to harden it using every pattern from this module.

The brittle pipeline

Read through the starter code carefully. It has three deliberate weaknesses:

  1. No checkpoints — a restart re-runs everything from the top.
  2. Writes the output file incrementally — a crash mid-write leaves a partial file.
  3. No retry on the (simulated) HTTP step — a single timeout aborts the run.
Python — editable, runs in your browser

Run it and observe the retry messages on the fetch and upload steps. Then work through the checkpoints below.

Checkpoint 1 — verify idempotency

Change the reset_checkpoints() call at the top of run_pipeline() to a comment, then re-run. Every step should print "already done, skipping" because the .done files still exist from the first run. This is the correct behaviour for a production restart after a crash.

Restore the call when you are done.

In production you would not automatically clear checkpoints on every run. Include a --force CLI flag (from argparse) that calls reset_checkpoints() only when the operator explicitly wants a full re-run.

Checkpoint 2 — trigger the Slack alert

Set TRANSFORM_FAIL = True near the top of the cell, re-run, and confirm that:

  1. Fetch succeeds and marks its checkpoint.
  2. Transform raises a ValueError (a permanent error — no retry).
  3. The orchestrator catches it and calls _send_slack_alert.
  4. The ValueError propagates after the alert so the process exits non-zero.

In production the alert would POST to a Slack incoming webhook URL stored as an environment variable:

import os, requests

SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK_URL"]

def send_slack_alert(message: str) -> None:
    requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=10)

Reset TRANSFORM_FAIL = False before continuing.

Checkpoint 3 — inspect the atomic writes

Add a print statement immediately after the first tmp.write(content) line in write_atomically to print the temp filename. Verify that:

  1. The .tmp file appears in WORK_DIR during the write.
  2. After os.replace, only the final filename remains.

This demonstrates that downstream steps never see a partially-written file.

Checkpoint 4 — simulate a mid-run crash

Raise KeyboardInterrupt inside step_transform (before the atomic write completes) by adding raise KeyboardInterrupt() on the first line. Re-run and verify:

  1. Fetch completes and marks its checkpoint.
  2. Transform raises before writing — no partial file exists.
  3. On a third run (without clearing checkpoints), fetch is skipped and transform retries from a clean state.

KeyboardInterrupt is not a subclass of Exception, so the broad except Exception in the orchestrator will not catch it. That is intentional — operator interrupts should not fire the Slack alert. Use BaseException only if you need a catch-all that also fires on interrupt.

Extending the pattern

A production-grade version of this pipeline would add:

  • Structured logging (logging.getLogger(__name__)) replacing print calls, with a JSON formatter so logs are queryable in Datadog or CloudWatch.
  • Run IDs in checkpoint filenames ({name}_{run_date}.done) so multiple daily runs are tracked independently.
  • Dead-letter storage — when the upload fails permanently, write the failed payload to a file or queue for manual review rather than discarding it.
  • Metrics — a counter incremented on each retry attempt, published to Prometheus or StatsD, so you can alert on sustained high retry rates before they become full failures.

Where to go next

Module complete. Next up: Testing Automation Scripts — the patterns you have built so far are only trustworthy if they are tested. The next module covers mocking the filesystem and HTTP layer so you can verify pipeline logic without touching real files or real APIs.

Finished reading? Mark it complete to track your progress.

On this page