Lab: harden a pipeline

Take a brittle three-step pipeline and make it production-ready — add idempotency checkpoints, atomic writes, exponential-backoff retry logic, and failure alerting.

This lab starts with the kind of pipeline that appears in every team's codebase eventually: three steps that mostly work, no retry, no checkpoints, and a half-written file on disk whenever the process is killed mid-run. Your job is to harden it using every pattern from this module.

The brittle pipeline

Read through the code in this section carefully. It is the hardened version; the brittle pipeline it replaces had three deliberate weaknesses:

No checkpoints — a restart re-runs everything from the top.
Writes the output file incrementally — a crash mid-write leaves a partial file.
No retry on the (simulated) HTTP step — a single timeout aborts the run.
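
For contrast, a brittle starting point might look roughly like the sketch below. It is an illustration, not the lab's actual starter code: the names `brittle_fetch`, `brittle_run`, `DEMO_DIR`, and `OUT_FILE`, and the `crash_after` parameter, are assumptions made up for this example. It shows the second weakness in action: a crash halfway through the loop leaves an unparseable file at the destination path.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical scratch location, used only by this sketch.
DEMO_DIR = Path(tempfile.mkdtemp(prefix="brittle_demo_"))
OUT_FILE = DEMO_DIR / "output.json"

def brittle_fetch():
    # Weakness 3: no retry, so a single timeout here would abort the run.
    return [{"id": i, "value": i * 10} for i in range(1, 6)]

def brittle_run(crash_after=None):
    # Weakness 1: no checkpoints, so every call starts again from the fetch.
    records = brittle_fetch()
    # Weakness 2: the destination is opened directly and filled record by record.
    with open(OUT_FILE, "w") as fh:
        fh.write("[")
        for n, rec in enumerate(records):
            if crash_after is not None and n == crash_after:
                raise RuntimeError("simulated crash mid-write")
            fh.write(("," if n else "") + json.dumps(rec))
        fh.write("]")

# Simulate a crash after two records have been written.
try:
    brittle_run(crash_after=2)
except RuntimeError:
    pass

partial = OUT_FILE.read_text()
try:
    json.loads(partial)
    is_valid = True
except json.JSONDecodeError:
    is_valid = False

print(f"file exists: {OUT_FILE.exists()}, valid JSON: {is_valid}")
```

The file exists but cannot be parsed, which is exactly the state a downstream step should never be able to observe.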


import os, time, random, tempfile
from pathlib import Path

# ── Infrastructure stubs ─────────────────────────────────────────────────────
WORK_DIR = Path("/tmp/lab_pipeline")
WORK_DIR.mkdir(exist_ok=True)
CHECKPOINT_DIR = WORK_DIR / ".checkpoints"
CHECKPOINT_DIR.mkdir(exist_ok=True)

# Failure injection — flip these to watch the pipeline recover
FETCH_FAIL_TIMES  = 2   # number of times fetch raises before succeeding
TRANSFORM_FAIL    = False
UPLOAD_FAIL_TIMES = 1

_fetch_count    = 0
_upload_count   = 0

class TransientError(Exception): pass

def _simulated_fetch(url):
  """Raises TransientError for the first FETCH_FAIL_TIMES calls."""
  global _fetch_count
  _fetch_count += 1
  if _fetch_count <= FETCH_FAIL_TIMES:
      raise TransientError(f"timeout on attempt {_fetch_count}")
  return [{"id": i, "value": i * 10} for i in range(1, 6)]

def _simulated_upload(data):
  global _upload_count
  _upload_count += 1
  if _upload_count <= UPLOAD_FAIL_TIMES:
      raise TransientError(f"service unavailable (attempt {_upload_count})")
  return True

def _send_slack_alert(message):
  """In production: requests.post(SLACK_WEBHOOK_URL, json={"text": message})"""
  print(f"[ALERT] Slack notification sent: {message}")

# ── Checkpoint helpers ────────────────────────────────────────────────────────
def step_done(name):
  return (CHECKPOINT_DIR / f"{name}.done").exists()

def mark_done(name):
  (CHECKPOINT_DIR / f"{name}.done").touch()

def reset_checkpoints():
  for f in CHECKPOINT_DIR.glob("*.done"):
      f.unlink()

# ── Atomic write helper ───────────────────────────────────────────────────────
def write_atomically(destination, content):
  with tempfile.NamedTemporaryFile(
      mode="w", dir=destination.parent, suffix=".tmp", delete=False
  ) as tmp:
      tmp.write(content)
      tmp_path = tmp.name
  os.replace(tmp_path, destination)

# ── Retry helper (mirrors tenacity behaviour) ─────────────────────────────────
def retry_with_backoff(fn, max_attempts=4, base=0.05, max_wait=0.3,
                     retriable=(TransientError,)):
  for attempt in range(1, max_attempts + 1):
      try:
          return fn()
      except retriable as exc:
          if attempt == max_attempts:
              raise
          wait = random.uniform(0, min(base * (2 ** (attempt - 1)), max_wait))
          print(f"  [retry] attempt {attempt} failed: {exc}. Retrying in {wait:.3f}s")
          time.sleep(wait)

# ── Hardened pipeline ─────────────────────────────────────────────────────────

def step_fetch():
  print("[fetch] starting...")
  records = retry_with_backoff(
      lambda: _simulated_fetch("https://api.example.com/data"),
      retriable=(TransientError,),
  )
  import json
  write_atomically(WORK_DIR / "raw.json", json.dumps(records, indent=2))
  print(f"[fetch] wrote {len(records)} records atomically")
  return records

def step_transform():
  print("[transform] starting...")
  if TRANSFORM_FAIL:
      raise ValueError("Bad input data — permanent error, do not retry")
  import json
  raw = json.loads((WORK_DIR / "raw.json").read_text())
  result = [{"id": r["id"], "value": r["value"] * 2} for r in raw]
  write_atomically(WORK_DIR / "transformed.json", json.dumps(result, indent=2))
  print(f"[transform] wrote {len(result)} records atomically")

def step_upload():
  print("[upload] starting...")
  import json
  data = json.loads((WORK_DIR / "transformed.json").read_text())
  retry_with_backoff(
      lambda: _simulated_upload(data),
      retriable=(TransientError,),
  )
  print(f"[upload] uploaded {len(data)} records successfully")

# ── Orchestrator ──────────────────────────────────────────────────────────────

def run_pipeline():
  reset_checkpoints()   # start clean for this demo
  steps = [
      ("fetch",     step_fetch),
      ("transform", step_transform),
      ("upload",    step_upload),
  ]

  try:
      for name, fn in steps:
          if step_done(name):
              print(f"[{name}] already done, skipping")
          else:
              fn()
              mark_done(name)
      print()
      print("Pipeline completed successfully.")
  except Exception as exc:
      _send_slack_alert(f"Pipeline failed at an unrecoverable step: {exc}")
      raise

run_pipeline()

Run it and observe the retry messages on the fetch and upload steps. Then work through the checkpoints below.

Checkpoint 1 — verify idempotency

Comment out the reset_checkpoints() call at the top of run_pipeline(), then re-run. Every step should print "already done, skipping" because the .done files still exist from the first run. This is the behaviour you want when a production run restarts after a crash: steps that already completed are not repeated.

Restore the call when you are done.

In production you would not clear checkpoints automatically on every run. Instead, add a --force CLI flag (using argparse) that calls reset_checkpoints() only when the operator explicitly wants a full re-run.
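
One way to wire up such a flag is sketched below. The `parse_args` and `main` functions and the temporary `CHECKPOINT_DIR` are assumptions for this example; in the lab, `main` would go on to call `run_pipeline()` with its unconditional `reset_checkpoints()` call removed.

```python
import argparse
import tempfile
from pathlib import Path

# Stand-in for the lab's CHECKPOINT_DIR so the sketch runs on its own.
CHECKPOINT_DIR = Path(tempfile.mkdtemp(prefix="ckpt_demo_"))

def reset_checkpoints():
    for f in CHECKPOINT_DIR.glob("*.done"):
        f.unlink()

def parse_args(argv=None):
    # argv=None makes argparse read the real command line (sys.argv[1:]).
    parser = argparse.ArgumentParser(description="Run the hardened pipeline.")
    parser.add_argument(
        "--force",
        action="store_true",
        help="clear all checkpoints and re-run every step",
    )
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    if args.force:
        reset_checkpoints()
    # The pipeline itself would run here (hypothetical: run_pipeline()).
    return sorted(p.name for p in CHECKPOINT_DIR.glob("*.done"))

# Demo: pretend a previous run already finished the fetch step.
(CHECKPOINT_DIR / "fetch.done").touch()
kept = main([])              # normal run: the checkpoint survives
cleared = main(["--force"])  # forced run: the checkpoint is removed
print(kept, cleared)
```

Because `--force` uses `action="store_true"`, it defaults to `False`, so a plain restart always resumes from the existing checkpoints.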

Checkpoint 2 — trigger the Slack alert

Set TRANSFORM_FAIL = True near the top of the cell, re-run, and confirm that:

Fetch succeeds and marks its checkpoint.
Transform raises a ValueError (a permanent error — no retry).
The orchestrator catches it and calls _send_slack_alert.
The ValueError propagates after the alert so the process exits non-zero.
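
The sequence above can be reproduced in miniature. The sketch below uses a simplified `retry` helper (no sleep, no jitter) and a list named `alerts` in place of `_send_slack_alert`; both are stand-ins invented for this example. Unlike the lab code, it wraps the transform in `retry` as well, purely to show that a `ValueError` falls outside the `retriable` tuple and is raised on the first attempt.

```python
class TransientError(Exception):
    pass

def retry(fn, max_attempts=4, retriable=(TransientError,)):
    """Simplified stand-in for the lab's retry helper (no sleep, no jitter)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts:
                raise

alerts = []                         # stands in for _send_slack_alert
calls = {"fetch": 0, "transform": 0}

def fetch():
    calls["fetch"] += 1
    if calls["fetch"] < 3:          # fail twice, then succeed
        raise TransientError("timeout")

def transform():
    calls["transform"] += 1
    raise ValueError("bad input data")

def run():
    try:
        retry(fetch)
        retry(transform)            # ValueError is not retriable, so no second attempt
    except Exception as exc:
        alerts.append(f"Pipeline failed: {exc}")
        raise                       # re-raise so the caller still sees the failure

propagated = None
try:
    run()
except ValueError as exc:
    propagated = exc

print(calls, alerts)
```

The fetch is attempted three times, the transform only once, and exactly one alert is recorded before the original exception reaches the caller.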

In production the alert would POST to a Slack incoming webhook URL stored as an environment variable:

import os, requests

SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK_URL"]

def send_slack_alert(message: str) -> None:
    requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=10)

Reset TRANSFORM_FAIL = False before continuing.

Checkpoint 3 — inspect the atomic writes

Add a print statement immediately after the tmp.write(content) line in write_atomically that prints the temp filename (tmp.name). Verify that:

The .tmp file appears in WORK_DIR during the write.
After os.replace, only the final filename remains.

This demonstrates that downstream steps never see a partially-written file.
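
The same check can be scripted. The sketch below copies the lab's `write_atomically` and adds two snapshots taken while the temp file is still open; the `work_dir` path and the `seen_during_write` list are assumptions added for the demonstration.

```python
import os
import tempfile
from pathlib import Path

# Throwaway directory so the demo does not touch the lab's WORK_DIR.
work_dir = Path(tempfile.mkdtemp(prefix="atomic_demo_"))
destination = work_dir / "raw.json"
destination.write_text("old contents")

seen_during_write = []

def write_atomically(destination, content):
    with tempfile.NamedTemporaryFile(
        mode="w", dir=destination.parent, suffix=".tmp", delete=False
    ) as tmp:
        tmp.write(content)
        tmp_path = tmp.name
        # Snapshot 1: which file types are in the directory right now.
        seen_during_write.append(sorted(p.suffix for p in work_dir.iterdir()))
        # Snapshot 2: what a reader of the destination would see right now.
        seen_during_write.append(destination.read_text())
    # Swap the finished temp file into place in a single step.
    os.replace(tmp_path, destination)

write_atomically(destination, "new contents")
after = sorted(p.name for p in work_dir.iterdir())
print(seen_during_write, after, destination.read_text())
```

While the write is in progress a reader still sees the complete old file; after `os.replace` it sees the complete new one, and the `.tmp` file is gone.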

Checkpoint 4 — simulate a mid-run crash

Simulate a crash by adding raise KeyboardInterrupt() as the first line of step_transform, so the step dies before its atomic write starts. Re-run and verify:

Fetch completes and marks its checkpoint.
Transform raises before writing — no partial file exists.
On the next run (remove the raise, and comment out reset_checkpoints() so the checkpoints are not cleared), fetch is skipped and transform runs again from a clean state.

KeyboardInterrupt is not a subclass of Exception, so the broad except Exception in the orchestrator will not catch it. That is intentional — operator interrupts should not fire the Slack alert. Use BaseException only if you need a catch-all that also fires on interrupt.
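
A compact way to see both effects together is sketched below, under a few assumptions: the step bodies are reduced to log entries, the `crash` switch replaces editing the source, and `alerts` stands in for `_send_slack_alert`. The first run is interrupted inside the transform; the second run resumes from the surviving fetch checkpoint.

```python
import tempfile
from pathlib import Path

ckpt_dir = Path(tempfile.mkdtemp(prefix="resume_demo_"))
alerts, log = [], []
crash = {"transform": True}     # flip to False to let transform succeed

def step_done(name):
    return (ckpt_dir / f"{name}.done").exists()

def mark_done(name):
    (ckpt_dir / f"{name}.done").touch()

def fetch():
    log.append("fetch ran")

def transform():
    if crash["transform"]:
        raise KeyboardInterrupt()   # simulated operator interrupt
    log.append("transform ran")

def run_pipeline():
    try:
        for name, fn in [("fetch", fetch), ("transform", transform)]:
            if step_done(name):
                log.append(f"{name} skipped")
            else:
                fn()
                mark_done(name)     # only reached if fn() returned normally
    except Exception as exc:        # does not match KeyboardInterrupt
        alerts.append(str(exc))
        raise

# Run 1: interrupted inside transform.
try:
    run_pipeline()
except KeyboardInterrupt:
    log.append("interrupted")

# Run 2: same checkpoints, crash switched off.
crash["transform"] = False
run_pipeline()
print(log, alerts)
```

No alert is recorded for the interrupt, and the second run skips the fetch because its `.done` marker was written before the crash.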

Extending the pattern

A production-grade version of this pipeline would add:

Structured logging (logging.getLogger(__name__)) replacing print calls, with a JSON formatter so logs are queryable in Datadog or CloudWatch.
Run IDs in checkpoint filenames ({name}_{run_date}.done) so each day's run is tracked independently of the previous day's.
Dead-letter storage — when the upload fails permanently, write the failed payload to a file or queue for manual review rather than discarding it.
Metrics — a counter incremented on each retry attempt, published to Prometheus or StatsD, so you can alert on sustained high retry rates before they become full failures.
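
Three of these extensions are small enough to sketch with the standard library alone. Everything named here (`JsonFormatter`, `checkpoint_path`, `dead_letter`, the `pipeline_demo` logger, and the `dead_letter.jsonl` filename) is a hypothetical choice for illustration, not part of the lab code. Metrics are left out because they depend on which client library you use.

```python
import json
import logging
import tempfile
from datetime import date
from pathlib import Path

work_dir = Path(tempfile.mkdtemp(prefix="extend_demo_"))

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

logger = logging.getLogger("pipeline_demo")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def checkpoint_path(name, run_date):
    # One marker per step per run date, e.g. fetch_2024-01-02.done
    return work_dir / f"{name}_{run_date.isoformat()}.done"

def dead_letter(payload, reason):
    # Append the failed payload as one JSON line for later manual review.
    path = work_dir / "dead_letter.jsonl"
    with open(path, "a") as fh:
        fh.write(json.dumps({"reason": reason, "payload": payload}) + "\n")
    logger.info("payload written to dead-letter file")
    return path

checkpoint_path("fetch", date.today()).touch()
dead_letter([{"id": 1, "value": 20}], "upload failed after retries")
```

A real deployment would likely add a timestamp and the run ID to every log record, and would send dead-lettered payloads to a queue rather than a local file.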

Where to go next

Module complete. Next up: Testing Automation Scripts — the patterns you have built so far are only trustworthy if they are tested. The next module covers mocking the filesystem and HTTP layer so you can verify pipeline logic without touching real files or real APIs.
