
Checkpoints and atomic writes

Implement the two core idempotency patterns in Python — checkpoint marker files and atomic write-then-rename — so your pipelines survive crashes and restarts cleanly.

By the end of this lesson you will be able to:
  • Implement a checkpoint file pattern using step_done() and mark_done() helpers
  • Write output atomically using tempfile.NamedTemporaryFile and os.replace()
  • Detect and skip already-completed steps when a pipeline restarts

The previous lesson defined idempotency and the two patterns that enforce it. Here you will run both patterns against a simulated multi-step pipeline and see how the checkpoint system skips completed steps when the pipeline is re-invoked.

The patterns in code

The step_done / mark_done helpers are intentionally simple — a .done file per step is sufficient for most pipelines. The atomic write function wraps tempfile and os.replace into a single reusable utility.
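A minimal sketch of both patterns working together, under a few stated assumptions: the helpers take the checkpoint directory as an explicit argument, and the three step names, the run_pipeline function, and the JSON payload are hypothetical placeholders rather than part of any real pipeline.

```python
import json
import os
import tempfile
from pathlib import Path


def step_done(checkpoint_dir: Path, name: str) -> bool:
    """Return True if a .done marker already exists for this step."""
    return (checkpoint_dir / f"{name}.done").exists()


def mark_done(checkpoint_dir: Path, name: str) -> None:
    """Create an empty .done marker once the step has fully succeeded."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    (checkpoint_dir / f"{name}.done").touch()


def atomic_write(destination: Path, text: str) -> None:
    """Write text to a temp file beside destination, then rename it into place."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    # delete=False: the file must outlive the with-block so it can be renamed.
    with tempfile.NamedTemporaryFile(
        "w", dir=destination.parent, suffix=".tmp", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(text)
        tmp.flush()
        os.fsync(tmp.fileno())  # push the bytes to disk before the rename
    os.replace(tmp.name, destination)


def run_pipeline(work_dir: Path) -> list[str]:
    """Run three hypothetical steps, skipping any that are already marked done."""
    checkpoints = work_dir / "checkpoints"
    outputs = work_dir / "outputs"
    log = []
    for name in ("extract", "transform", "load"):
        if step_done(checkpoints, name):
            print(f"skipping {name} (already done)")
            log.append(f"skip {name}")
            continue
        print(f"running {name}")
        atomic_write(outputs / f"{name}.json", json.dumps({"step": name}))
        # Mark the step only after its output is safely in place.
        mark_done(checkpoints, name)
        log.append(f"run {name}")
    return log


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_root:
        work_dir = Path(tmp_root)
        run_pipeline(work_dir)  # first pass: every step runs
        run_pipeline(work_dir)  # second pass: every step is skipped
```

The ordering inside the loop matters: the marker is written only after os.replace() has returned, so a crash between the two leaves a step that simply runs again rather than one that is marked done without output.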


Run this and observe: every step prints its work message on the first pass, and every step is skipped on the second pass. The output files are identical.

What makes the write atomic

tempfile.NamedTemporaryFile(dir=destination.parent) creates the temp file in the same directory as the destination. This is the critical detail: os.replace() is atomic only when the source and destination are on the same filesystem. If the temp file lived in /tmp/ and the destination on a mounted network share, os.replace() would typically fail with an OSError rather than move the file, and a copy-based fallback such as shutil.move() would not be atomic.

Always pass dir=destination.parent, not a fixed /tmp path, unless you know both paths live on the same filesystem. In the demo above the output is also in /tmp so it works — but in production, put the output wherever the pipeline expects it and the temp file will follow.
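To see what the rename buys you, the sketch below simulates a crash halfway through a write and checks that the destination still holds its previous content. The names atomic_write_lines and failing_lines are hypothetical, and the cleanup in the except branch is one reasonable choice rather than a required part of the pattern.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_lines(destination: Path, lines) -> None:
    """Stream lines to a temp file in destination's directory, then rename it."""
    tmp = tempfile.NamedTemporaryFile(
        "w", dir=destination.parent, suffix=".tmp", delete=False, encoding="utf-8"
    )
    try:
        with tmp:  # closes the temp file on exit, even if a write fails
            for line in lines:
                tmp.write(line + "\n")
        os.replace(tmp.name, destination)
    except BaseException:
        # Remove the partial temp file; destination was never touched.
        Path(tmp.name).unlink(missing_ok=True)
        raise


def failing_lines():
    """Yield one row, then fail, to simulate a crash mid-write."""
    yield "row 1"
    raise RuntimeError("simulated crash mid-write")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        target = Path(root) / "report.txt"
        atomic_write_lines(target, ["old row"])
        try:
            atomic_write_lines(target, failing_lines())
        except RuntimeError:
            pass
        print(target.read_text())  # the earlier content, not a half-written file
        print([p.name for p in Path(root).iterdir()])  # no stray .tmp file
```

A reader of report.txt therefore sees either the complete old file or the complete new one, never a partial write.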

Adapting to your pipeline

Two things to parameterise when you use these patterns for real:

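  1. Run ID in checkpoint names. If you run the pipeline multiple times per day, include a date or run ID in the key you pass to both helpers, for example mark_done(f"{name}_{run_id}") with a matching step_done(f"{name}_{run_id}"). Otherwise every run after the first finds the existing markers and skips all of its steps.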
  2. Checkpoint location. Store checkpoints outside the output directory so that clearing outputs does not also clear checkpoints. A .checkpoints/ directory at the project root works well.
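Both adjustments can be sketched together. This version assumes the helpers accept the run ID as a separate argument (a variation on the single-string form shown above), and the directory names and run IDs are illustrative only.

```python
import shutil
import tempfile
from pathlib import Path


def checkpoint_path(checkpoint_dir: Path, name: str, run_id: str) -> Path:
    """Marker path that is unique per step and per run."""
    return checkpoint_dir / f"{name}_{run_id}.done"


def step_done(checkpoint_dir: Path, name: str, run_id: str) -> bool:
    return checkpoint_path(checkpoint_dir, name, run_id).exists()


def mark_done(checkpoint_dir: Path, name: str, run_id: str) -> None:
    path = checkpoint_path(checkpoint_dir, name, run_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        project = Path(root)
        checkpoints = project / ".checkpoints"  # kept apart from outputs/
        outputs = project / "outputs"
        outputs.mkdir()

        mark_done(checkpoints, "extract", "2024-01-01")
        print(step_done(checkpoints, "extract", "2024-01-01"))  # True
        print(step_done(checkpoints, "extract", "2024-01-02"))  # False: new run

        # Clearing outputs leaves the checkpoint markers untouched.
        shutil.rmtree(outputs)
        print(step_done(checkpoints, "extract", "2024-01-01"))  # True
```

In a scheduled job the run ID might come from the current date or from the scheduler; whichever source you choose, it has to be stable across a restart of the same run, or the restarted run will not recognise its own markers.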

Where to go next

Next: retry logic — checkpoints and atomic writes handle the "already done" case. Retrying transient failures handles the "not done yet, but worth trying again" case. Together they make a pipeline that is both safe to restart and resilient to intermittent errors.
