Checkpoints and atomic writes

Implement the two core idempotency patterns in Python — checkpoint marker files and atomic write-then-rename — so your pipelines survive crashes and restarts cleanly.

The previous lesson defined idempotency and the two patterns that enforce it. Here you will run both patterns against a simulated multi-step pipeline and see how the checkpoint system skips completed steps when the pipeline is re-invoked.

The patterns in code

The step_done / mark_done helpers are intentionally simple — a .done file per step is sufficient for most pipelines. The atomic write function wraps tempfile and os.replace into a single reusable utility.


import os
import tempfile
from pathlib import Path

# ── Checkpoint helpers ──────────────────────────────────────────────
CHECKPOINT_DIR = Path("/tmp/pipeline_checkpoints")
CHECKPOINT_DIR.mkdir(exist_ok=True)

def step_done(name):
  return (CHECKPOINT_DIR / f"{name}.done").exists()

def mark_done(name):
  (CHECKPOINT_DIR / f"{name}.done").touch()

def clear_checkpoints():
  """Remove all markers — use this to force a full re-run."""
  for f in CHECKPOINT_DIR.glob("*.done"):
      f.unlink()

# ── Atomic write helper ─────────────────────────────────────────────
OUTPUT_DIR = Path("/tmp/pipeline_output")
OUTPUT_DIR.mkdir(exist_ok=True)

def write_atomically(destination, content):
  with tempfile.NamedTemporaryFile(
      mode="w",
      dir=destination.parent,
      suffix=".tmp",
      delete=False,
  ) as tmp:
      tmp.write(content)
      tmp_path = tmp.name
  os.replace(tmp_path, destination)  # atomic rename
  return destination

# ── Simulated pipeline steps ────────────────────────────────────────
def fetch():
  print("  [fetch] downloading data...")
  data = "id,value\n1,100\n2,200\n3,300\n"
  write_atomically(OUTPUT_DIR / "raw.csv", data)
  print("  [fetch] wrote raw.csv atomically")

def transform():
  print("  [transform] processing...")
  raw = (OUTPUT_DIR / "raw.csv").read_text()
  rows = raw.strip().split("\n")[1:]  # skip header
  doubled = [r.split(",")[0] + "," + str(int(r.split(",")[1]) * 2)
             for r in rows]
  result = "id,value\n" + "\n".join(doubled) + "\n"
  write_atomically(OUTPUT_DIR / "transformed.csv", result)
  print("  [transform] wrote transformed.csv atomically")

def upload():
  print("  [upload] sending to warehouse (simulated)...")
  data = (OUTPUT_DIR / "transformed.csv").read_text()
  print("  [upload] rows sent:", data.count("\n") - 1)

# ── Idempotent pipeline ──────────────────────────────────────────────
def run_pipeline():
  print("=== Run 1 (first time — all steps execute) ===")
  for step_name, step_fn in [("fetch", fetch), ("transform", transform), ("upload", upload)]:
      if step_done(step_name):
          print(f"  [{step_name}] already done, skipping")
      else:
          step_fn()
          mark_done(step_name)

  print()
  print("=== Run 2 (pipeline restarted — all steps skipped) ===")
  for step_name, step_fn in [("fetch", fetch), ("transform", transform), ("upload", upload)]:
      if step_done(step_name):
          print(f"  [{step_name}] already done, skipping")
      else:
          step_fn()
          mark_done(step_name)

clear_checkpoints()  # start clean for this demo
run_pipeline()

Run this and observe: every step prints its work message on the first pass, and every step is skipped on the second pass. Because nothing is rewritten, the output files are the same after the second pass as after the first.
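The demo only exercises the case where everything succeeds. The sketch below shows the more interesting case of a crash partway through: the marker for a step is written only after the step returns, so a failed step is retried on the next invocation while completed steps are skipped. The demo_dir location, the fail_transform flag, and the run_once helper are hypothetical names introduced for this illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical demo directory; a fresh temp dir keeps this sketch isolated.
demo_dir = Path(tempfile.mkdtemp(prefix="crash_demo_"))

def step_done(name):
    return (demo_dir / f"{name}.done").exists()

def mark_done(name):
    (demo_dir / f"{name}.done").touch()

calls = []             # records which steps actually executed
fail_transform = True  # switched off after the simulated crash

def fetch():
    calls.append("fetch")

def transform():
    calls.append("transform")
    if fail_transform:
        raise RuntimeError("simulated crash in transform")

def upload():
    calls.append("upload")

STEPS = [("fetch", fetch), ("transform", transform), ("upload", upload)]

def run_once():
    for name, fn in STEPS:
        if step_done(name):
            continue
        fn()             # if this raises, the marker below is never written
        mark_done(name)

# First invocation crashes partway through.
try:
    run_once()
except RuntimeError as exc:
    print("run 1 failed:", exc)

first_run = list(calls)
calls.clear()

# Second invocation: fetch is skipped, transform is retried, upload runs.
fail_transform = False
run_once()
second_run = list(calls)

print("run 1 executed:", first_run)
print("run 2 executed:", second_run)
```

The ordering inside run_once is the design choice that matters: calling mark_done before the step, or in a finally block, would record a failed step as complete.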

What makes the write atomic

tempfile.NamedTemporaryFile(dir=destination.parent) creates the temp file in the same directory as the destination. This is the critical detail: os.replace() is atomic only when the source and destination are on the same filesystem. If the temp file were written to /tmp/ while the destination sat on a mounted network share, the rename would cross filesystems and would typically fail with an OSError instead of completing atomically.

Always pass dir=destination.parent, not a fixed /tmp path, unless you know both paths live on the same filesystem. In the demo above the output is also in /tmp so it works — but in production, put the output wherever the pipeline expects it and the temp file will follow.
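One gap in the helper above is that a failure during the write leaves a stray .tmp file next to the destination. A possible hardened variant, sketched here under the assumption that you also want the bytes flushed to disk before the rename, cleans up the temp file on any error. The name write_atomically_safe and the atomic_demo_ directory are hypothetical and not part of the lesson's code.

```python
import os
import tempfile
from pathlib import Path

def write_atomically_safe(destination, content):
    """Hypothetical hardened variant of write_atomically."""
    destination = Path(destination)
    # Same-directory temp file, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=destination.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, destination)
    except BaseException:
        # Remove the temp file; the previous destination is left untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    return destination

# Usage: a failed write leaves the old file and no leftovers.
out_dir = Path(tempfile.mkdtemp(prefix="atomic_demo_"))
target = out_dir / "report.csv"
write_atomically_safe(target, "id,value\n1,100\n")

try:
    write_atomically_safe(target, None)  # write(None) raises TypeError
except TypeError:
    pass

remaining = sorted(p.name for p in out_dir.iterdir())
print(repr(target.read_text()))
print(remaining)
```

Whether the fsync call is worth its cost depends on how much you care about surviving power loss rather than just process crashes; for many pipelines the simpler helper is enough.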

Adapting to your pipeline

Two things to parameterise when you use these patterns for real:

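Run ID in checkpoint names. If you run the pipeline multiple times per day, include a date or run ID in the marker name and use the same key in both helpers: step_done(f"{name}_{run_id}") and mark_done(f"{name}_{run_id}"). Otherwise every run after the first finds the old markers and does nothing.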
Checkpoint location. Store checkpoints outside the output directory so that clearing outputs does not also clear checkpoints. A .checkpoints/ directory at the project root works well.
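Both adjustments can be combined in a few lines. The following is a minimal sketch, assuming a run ID built from today's date plus a counter and a temporary directory standing in for the project root; project_root, the run ID format, and the two-argument helper signatures are illustrative assumptions rather than part of the lesson's code.

```python
import tempfile
from datetime import date
from pathlib import Path

# Stand-in for a real project root (assumption for this sketch).
project_root = Path(tempfile.mkdtemp(prefix="project_"))

# Checkpoints live outside the output directory.
CHECKPOINT_DIR = project_root / ".checkpoints"
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)

def marker_path(name, run_id):
    # One marker per (step, run) pair, e.g. fetch_2024-01-31_01.done
    return CHECKPOINT_DIR / f"{name}_{run_id}.done"

def step_done(name, run_id):
    return marker_path(name, run_id).exists()

def mark_done(name, run_id):
    marker_path(name, run_id).touch()

today = date.today().isoformat()
run_a = f"{today}_01"  # hypothetical run ID format
run_b = f"{today}_02"

mark_done("fetch", run_a)
done_in_a = step_done("fetch", run_a)
done_in_b = step_done("fetch", run_b)

print("run A sees fetch as done:", done_in_a)
print("run B sees fetch as done:", done_in_b)
```

Routing both helpers through a single marker_path function keeps the key format in one place, so step_done and mark_done cannot drift apart.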

Where to go next

Next: retry logic — checkpoints and atomic writes handle the "already done" case. Retrying transient failures handles the "not done yet, but worth trying again" case. Together they make a pipeline that is both safe to restart and resilient to intermittent errors.
