Lab: DAG pipeline

Express a four-step pipeline as both a Makefile and a Prefect flow, run both, and compare the developer experience of each approach.

The same pipeline can be modelled in many ways. This lab builds a fetch → clean → aggregate → report pipeline in both Make and Prefect, then asks you to reflect on which model fits which context. Understanding both gives you the vocabulary to choose deliberately rather than by default.

The pipeline

Four steps with clear file inputs and outputs:

Step	Input	Output
`fetch`	API URL	`data/raw.json`
`clean`	`data/raw.json`	`data/clean.json`
`aggregate`	`data/clean.json`	`data/agg.json`
`report`	`data/agg.json`	`reports/summary.txt`
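The table above is a dependency graph in disguise: each step needs the output of the one before it. As a minimal sketch, the same graph can be written as a Python dictionary and ordered with the standard-library `graphlib` module (Python 3.9+). The `PIPELINE` and `execution_order` names are illustrative and not part of the lab's scripts.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it needs.
PIPELINE = {
    "fetch": set(),
    "clean": {"fetch"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
}

def execution_order(graph: dict) -> list:
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(graph).static_order())

order = execution_order(PIPELINE)
print(" -> ".join(order))
```

Both Make and Prefect do a version of this ordering for you; they differ in how the edges are declared (file prerequisites versus values passed between function calls).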

Part 1 — Makefile

The Makefile below encodes the full dependency graph. Read it, then answer the questions in Checkpoint 1.

# Makefile — fetch → clean → aggregate → report pipeline

DATA    := data
REPORTS := reports

.PHONY: all clean

all: $(REPORTS)/summary.txt

# Step 1: fetch raw data
$(DATA)/raw.json:
	mkdir -p $(DATA)
	python scripts/fetch.py --output $@

# Step 2: clean and validate
$(DATA)/clean.json: $(DATA)/raw.json
	python scripts/clean.py --input $< --output $@

# Step 3: aggregate
$(DATA)/agg.json: $(DATA)/clean.json
	python scripts/aggregate.py --input $< --output $@

# Step 4: generate report
$(REPORTS)/summary.txt: $(DATA)/agg.json
	mkdir -p $(REPORTS)
	python scripts/report.py --input $< --output $@

clean:
	rm -rf $(DATA) $(REPORTS)

Checkpoint 1 — incremental rebuilds

1. Run `make all` (assuming the scripts exist). All four steps execute.
2. Simulate a manual edit by updating the file's timestamp: `touch data/clean.json`.
3. Run `make all` again. Which steps re-run? Which are skipped?

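Expected: fetch is skipped because data/raw.json already exists and has no prerequisites. clean is skipped because clean.json is newer than raw.json. agg.json is now older than clean.json, so aggregate re-runs, and its fresh output in turn makes report re-run.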

This is incremental rebuilding — one of Make's core strengths.
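The rule Make applies in Checkpoint 1 can be approximated in a few lines of Python: a target is stale if it is missing or older than any of its prerequisites. This is a sketch of the idea only, not Make's actual implementation; the `is_stale` helper and the temporary files are assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path

def is_stale(target: Path, prerequisites: list) -> bool:
    """Approximate Make's rule: rebuild if missing or older than any prerequisite."""
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(p.stat().st_mtime > target_mtime for p in prerequisites)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    raw, clean, agg = root / "raw.json", root / "clean.json", root / "agg.json"

    # Create the files with explicit, increasing modification times,
    # mimicking the state after a first full run.
    for offset, path in enumerate([raw, clean, agg]):
        path.write_text("{}")
        os.utime(path, (1_000 + offset, 1_000 + offset))

    agg_stale_before = is_stale(agg, [clean])

    # Equivalent of `touch data/clean.json`: bump the modification time.
    os.utime(clean, (2_000, 2_000))

    clean_stale = is_stale(clean, [raw])
    agg_stale_after = is_stale(agg, [clean])

print("agg stale before touch:", agg_stale_before)
print("clean stale after touch:", clean_stale)
print("agg stale after touch:", agg_stale_after)
```

Only the targets downstream of the touched file become stale, which is why aggregate and report re-run while fetch and clean do not.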

Part 2 — Prefect flow

The Prefect version of the same pipeline is runnable in the demo below.

Python — editable, runs in your browser

import json, os, tempfile, time
from pathlib import Path

# ── Prefect simulator (same as previous lesson) ───────────────────────────────
_task_states = []

def task(fn=None, retries=0, retry_delay_seconds=0):
  def decorator(f):
      def wrapper(*args, **kwargs):
          for attempt in range(1, retries + 2):
              try:
                  result = f(*args, **kwargs)
                  _task_states.append((f.__name__, "Completed"))
                  return result
              except Exception as exc:
                  if attempt <= retries:
                      _task_states.append((f.__name__, f"Retry {attempt}"))
                      time.sleep(retry_delay_seconds)  # honour the configured delay before retrying
                  else:
                      _task_states.append((f.__name__, "Failed"))
                      raise
      wrapper.__name__ = f.__name__
      return wrapper
  return decorator(fn) if fn is not None else decorator

def flow(fn=None, name=None):
  def decorator(f):
      def wrapper(*args, **kwargs):
          print(f"=== Flow '{name or f.__name__}' starting ===")
          result = f(*args, **kwargs)
          print(f"=== Flow completed ===")
          return result
      return wrapper
  return decorator(fn) if fn is not None else decorator

# ── Four pipeline tasks ───────────────────────────────────────────────────────
@task
def fetch(url: str) -> list:
  # Simulated fetch — in production: requests.get(url).json()
  records = [{"id": i, "region": "EU" if i % 2 else "US", "revenue": i * 150}
             for i in range(1, 9)]
  print(f"  [fetch]     fetched {len(records)} raw records")
  return records

@task
def clean(records: list) -> list:
  cleaned = [r for r in records if r["revenue"] > 0]
  print(f"  [clean]     kept {len(cleaned)}/{len(records)} valid records")
  return cleaned

@task
def aggregate(records: list) -> dict:
  totals = {}
  for r in records:
      totals[r["region"]] = totals.get(r["region"], 0) + r["revenue"]
  print(f"  [aggregate] aggregated into {len(totals)} regions: {totals}")
  return totals

@task
def report(aggregated: dict) -> str:
  lines = ["Revenue by region", "─" * 30]
  for region, total in sorted(aggregated.items()):
      lines.append(f"  {region:6s}  {total:>10,d}")
  summary = "\n".join(lines)
  print(f"  [report]    generated summary ({len(lines)} lines)")
  return summary

# ── Flow ───────────────────────────────────────────────────────────────────────
@flow(name="revenue-pipeline")
def pipeline(url: str = "https://api.example.com/revenue") -> str:
  raw        = fetch(url)
  validated  = clean(raw)
  aggregated = aggregate(validated)
  return report(aggregated)

# ── Run ────────────────────────────────────────────────────────────────────────
summary = pipeline()
print()
print(summary)
print()
print("Task states:")
for name, state in _task_states:
  print(f"  {name}: {state}")

Checkpoint 2 — add a retry to fetch

Modify fetch to raise ConnectionError on the first call (use a module-level counter as in the previous lesson), and add retries=2 to the @task decorator. Re-run and confirm the retry fires and the pipeline completes.

In the Makefile version, adding retry behaviour to a single step would require wrapping the script invocation in a shell retry loop — significantly more awkward.

Checkpoint 3 — add a parallel branch

Add a fifth task, export_to_csv, that also depends on the clean output but is independent of aggregate. In Prefect:

@task
def export_to_csv(records: list) -> str:
    # write records to CSV string
    ...

@flow(name="revenue-pipeline-v2")
def pipeline(url: str = "https://api.example.com/revenue") -> str:
    raw        = fetch(url)
    validated  = clean(raw)
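    # aggregate and export_to_csv are independent of each other. Called directly
    # like this they run one after the other; in real Prefect, aggregate.submit(validated)
    # and export_to_csv.submit(validated) let a task runner execute them concurrently.
    aggregated = aggregate(validated)
    csv_text   = export_to_csv(validated)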
    return report(aggregated)

In the Makefile, you would add a new target with $(DATA)/clean.json as its prerequisite and list it as a prerequisite of all. That is equally concise, and make -j2 would build the two branches in parallel, but any logic beyond invoking the script has to be written in shell rather than Python.
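To make the idea of independent branches concrete, here is a sketch that runs the two branches concurrently with the standard-library `ThreadPoolExecutor`. It stands in for a Prefect task runner and is not Prefect's API; the body of `export_to_csv` is a hypothetical implementation, since the lab leaves it as an exercise.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

def aggregate(records: list) -> dict:
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0) + r["revenue"]
    return totals

def export_to_csv(records: list) -> str:
    """Hypothetical implementation: serialise the records to a CSV string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "region", "revenue"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

validated = [{"id": i, "region": "EU" if i % 2 else "US", "revenue": i * 150}
             for i in range(1, 5)]

# Both branches read the same cleaned records and neither needs the other's
# result, so they can be submitted together and collected afterwards.
with ThreadPoolExecutor(max_workers=2) as pool:
    agg_future = pool.submit(aggregate, validated)
    csv_future = pool.submit(export_to_csv, validated)
    aggregated = agg_future.result()
    csv_text = csv_future.result()

print(aggregated)
print(csv_text)
```

The submit-then-collect pattern mirrors how futures are used with a real task runner: the graph's shape decides what may overlap, and the runner decides what actually does.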

Comparison

Concern	Makefile	Prefect
Incremental rebuilds	First-class (timestamp comparison)	No timestamp-based rebuilds; task result caching is available instead
Per-step retries	Shell loop workaround	Declarative
Run history	None	Full UI
Parametrised runs	Environment variables or `make VAR=value` overrides	Typed Python args
Parallel execution	`-j` flag	Opt-in via `.submit()` and a task runner
Dependencies	File timestamps	Data flow
Infrastructure needed	None (Make is available on most Unix-like systems)	None for local runs; a Prefect server or Prefect Cloud for the UI and scheduling

Choose Make when: your pipeline is file-to-file, incremental rebuilds matter, and you want zero Python dependencies in the orchestration layer.

Choose Prefect when: you need retries, observability, scheduling, or parametrised runs, and your team works primarily in Python.

These tools are not mutually exclusive. A common pattern is to use Make for the data-heavy file transformation stages (where incremental rebuilds save hours) and wrap the whole Makefile in a Prefect flow for scheduling, retries, and alerting.
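A minimal sketch of that hybrid pattern, using only the standard library: a Python function shells out to a command and retries on a non-zero exit code. The `run_step` and `run_makefile` names are hypothetical, and `run_makefile` assumes `make` is on the PATH with a Makefile in the working directory. With real Prefect installed, functions like these are the ones you would decorate as a task or flow.

```python
import subprocess
import sys
import time

def run_step(command: list, retries: int = 0, delay_seconds: float = 0.0) -> int:
    """Run a command, retrying on a non-zero exit code.

    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    for attempt in range(1, retries + 2):
        completed = subprocess.run(command)
        if completed.returncode == 0:
            return attempt
        if attempt <= retries:
            time.sleep(delay_seconds)
    raise RuntimeError(f"{command!r} failed after {retries + 1} attempts")

def run_makefile(target: str = "all") -> int:
    # Assumption: `make` is installed and a Makefile exists in the current directory.
    return run_step(["make", target], retries=2, delay_seconds=5)

# Demonstrate with a command that always succeeds, so the sketch runs
# even where no Makefile is present.
attempts = run_step([sys.executable, "-c", "print('step ok')"], retries=1)
print("attempts used:", attempts)
```

Make still decides which files need rebuilding; the Python layer only decides when to invoke it and what to do if it fails.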

Where to go next

Module complete. Next up: Containerised Workflows — packaging your hardened, tested, orchestrated pipeline into a Docker image so it runs identically in development, CI, and production.
