Lab: test suite
Write a complete test suite for a pipeline script — covering config loading, data transformation, mocked HTTP, file output, and error paths.
- Write a test for config loading that catches missing required keys
- Test a data transformation function with known inputs and expected outputs
- Test an HTTP fetch function with a mocked transport
- Test file output with a temporary directory
- Test an error path to confirm exceptions surface rather than being swallowed
A pipeline without tests is a time bomb. This lab provides a small but realistic pipeline module and guides you through writing five tests — one for each layer that matters. By the end, every function in the module has at least one test, and every error path is covered.
The pipeline module under test
Read through the module carefully before writing tests. Note:
- load_config reads from a dict (simulating os.environ) and raises on missing keys.
- transform_records is pure: it only computes.
- fetch_records calls an external URL.
- write_report writes JSON to disk.
- run is the orchestrator: it calls all four in sequence.
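For orientation, a module matching these descriptions might look roughly like the sketch below. The function names and the OUTPUT_DIR key come from the lab text. Everything else is an assumption made for illustration: the API_URL key, KeyError as the exception type, the report.json file name, the use of urllib, and the decision to keep zero values. Treat it as a reading aid, not as the lab's actual code.

```python
import json
import urllib.request
from pathlib import Path

# Assumed key names; only OUTPUT_DIR is named in the lab text.
REQUIRED_KEYS = ("API_URL", "OUTPUT_DIR")


def load_config(env):
    """Return the required settings from a dict such as os.environ."""
    missing = [key for key in REQUIRED_KEYS if key not in env]
    if missing:
        # The exception type is an assumption; the lab only says it "raises".
        raise KeyError(f"missing required config keys: {', '.join(missing)}")
    return {"api_url": env["API_URL"], "output_dir": Path(env["OUTPUT_DIR"])}


def transform_records(raw):
    """Pure function: cast each value to int and drop negative values."""
    parsed = ({"id": rec["id"], "value": int(rec["value"])} for rec in raw)
    return [rec for rec in parsed if rec["value"] >= 0]


def fetch_records(url):
    """Fetch a JSON list of records from an external URL."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def write_report(records, output_dir):
    """Write records as JSON into output_dir and return the file path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    dest = output_dir / "report.json"  # assumed file name
    dest.write_text(json.dumps(records, indent=2))
    return dest


def run(env):
    """Orchestrator: config -> fetch -> transform -> write."""
    config = load_config(env)
    raw = fetch_records(config["api_url"])
    return write_report(transform_records(raw), config["output_dir"])
```

Each function touches a different layer (configuration, pure computation, network, filesystem), which is why each one calls for a different testing technique.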
Checkpoint 1 — break a test intentionally
Change the transform_records function so it does not drop negative values. Run
the tests. test_transform_records should fail with a clear message. Restore the
filter and confirm all tests pass again.
This is the most important habit: verify that tests actually catch the bug they claim to catch, not just that they pass when the code is correct.
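The effect of this checkpoint can be sketched with two versions of the function side by side. Both functions below are hypothetical stand-ins, not the lab's code, and check_transform is an illustrative helper written in the manual if/else style rather than as a pytest test.

```python
def transform_records(raw):
    """Hypothetical stand-in: cast values to int and drop negative values."""
    parsed = ({"id": r["id"], "value": int(r["value"])} for r in raw)
    return [r for r in parsed if r["value"] >= 0]


def transform_records_broken(raw):
    """The same function with the negative-value filter removed (the injected bug)."""
    return [{"id": r["id"], "value": int(r["value"])} for r in raw]


def check_transform(transform):
    """Manual-style check: return (passed, message) instead of raising."""
    raw = [{"id": "a", "value": "10"}, {"id": "b", "value": "-5"}]
    expected = [{"id": "a", "value": 10}]
    result = transform(raw)
    if result == expected:
        return True, "PASS test_transform_records"
    return False, f"FAIL test_transform_records: expected {expected}, got {result}"


# Run the same check against the correct and the broken implementation.
for fn in (transform_records, transform_records_broken):
    passed, message = check_transform(fn)
    print(message)
```

If the check still reported PASS for the broken version, the test would not be guarding the behaviour it claims to guard.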
Checkpoint 2 — test the orchestrator
run calls four functions in sequence. Add a test that:
- Mocks fetch_records to return a list of two records.
- Uses a tmp_path-style temporary directory for OUTPUT_DIR.
- Calls run(env) with appropriate env values.
- Asserts the output file exists and contains the transformed data.
This is an integration test within the module — it exercises all four functions together without hitting the real network.
An integration test that mocks only the network boundary (not the filesystem) is often the most valuable test you can write. It proves that the pieces fit together, not just that each piece works in isolation.
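The four steps above can be sketched as a single test. The pipeline functions here are compact hypothetical stand-ins; the API_URL key, the report.json file name, and the example URL are all assumptions. Because the stand-ins live in the same file as the test, the mock is installed with patch.dict on this file's globals. Against a real, importable module you would patch the attribute on that module instead.

```python
import json
import tempfile
from pathlib import Path
from unittest.mock import Mock, patch


# --- Hypothetical stand-ins for the pipeline module (names and keys assumed) ---
def load_config(env):
    missing = [k for k in ("API_URL", "OUTPUT_DIR") if k not in env]
    if missing:
        raise KeyError(f"missing required config keys: {missing}")
    return {"api_url": env["API_URL"], "output_dir": Path(env["OUTPUT_DIR"])}


def fetch_records(url):
    raise RuntimeError("real network call; must be mocked in tests")


def transform_records(raw):
    parsed = ({"id": r["id"], "value": int(r["value"])} for r in raw)
    return [r for r in parsed if r["value"] >= 0]


def write_report(records, output_dir):
    dest = Path(output_dir) / "report.json"
    dest.write_text(json.dumps(records))
    return dest


def run(env):
    config = load_config(env)
    raw = fetch_records(config["api_url"])
    return write_report(transform_records(raw), config["output_dir"])


# --- The orchestrator test ---
def test_run_writes_transformed_records(tmp_path):
    fake_fetch = Mock(return_value=[{"id": "a", "value": "10"},
                                    {"id": "b", "value": "-5"}])
    env = {"API_URL": "https://example.invalid/records",
           "OUTPUT_DIR": str(tmp_path)}

    # Only the network boundary is replaced; the filesystem write is real.
    with patch.dict(globals(), {"fetch_records": fake_fetch}):
        dest = run(env)

    fake_fetch.assert_called_once_with("https://example.invalid/records")
    assert dest.exists()
    assert json.loads(dest.read_text()) == [{"id": "a", "value": 10}]


# Outside pytest, a TemporaryDirectory plays the role of the tmp_path fixture.
with tempfile.TemporaryDirectory() as tmp:
    test_run_writes_transformed_records(Path(tmp))
```

The negative record is deliberately included in the fake payload so that the assertion on the file contents also proves the transform step ran.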
Checkpoint 3 — add a permission error path
Modify write_report to raise PermissionError if output_dir is /root
(unwritable for non-root users on a standard Linux system). Write a test that:
- Calls write_report(records, Path("/root")).
- Asserts the PermissionError is raised.
Then revert the change. The point is to practise writing a test before writing the code — red first, then green.
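A minimal sketch of this checkpoint, assuming a stand-in write_report with an explicit guard on /root. The guard, the error message, and the report.json file name are illustrative assumptions. Checking the path explicitly keeps the test deterministic even when the suite itself runs as root.

```python
import json
from pathlib import Path


def write_report(records, output_dir):
    """Hypothetical stand-in with the checkpoint's temporary guard added."""
    output_dir = Path(output_dir)
    if output_dir == Path("/root"):
        raise PermissionError(f"cannot write report to {output_dir}")
    dest = output_dir / "report.json"  # assumed file name
    dest.write_text(json.dumps(records))
    return dest


def test_write_report_rejects_root():
    """Pass only if the error surfaces instead of being swallowed."""
    try:
        write_report([{"id": "a", "value": 1}], Path("/root"))
    except PermissionError:
        return  # expected: the exception reached the caller
    raise AssertionError("write_report did not raise PermissionError")


test_write_report_rejects_root()
```

Under pytest, the try/except is usually replaced by a with pytest.raises(PermissionError): block around the call.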
Converting to real pytest
When you run these as a proper pytest suite, replace the manual if/else checks
with assert statements and let pytest handle the output formatting:
```python
import json

# transform_records and write_report are imported from the pipeline module under test.


def test_transform_records():
    raw = [{"id": "a", "value": "10"}, {"id": "b", "value": "-5"}]
    result = transform_records(raw)
    assert len(result) == 1
    assert result[0]["value"] == 10


def test_write_report_file_contents(tmp_path):
    records = [{"id": "a", "value": 42}]
    dest = write_report(records, tmp_path)
    assert dest.exists()
    assert json.loads(dest.read_text()) == records
```

Note how tmp_path comes in as a pytest fixture: no tempfile.TemporaryDirectory context manager is needed.
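The mocked HTTP test can be written in the same assert style. The sketch below assumes fetch_records is built on urllib.request.urlopen, which the lab text does not confirm; the stand-in function, the timeout value, and the example URL are hypothetical. If the real module uses a different HTTP client, the patch target changes but the shape of the test stays the same.

```python
import json
import urllib.error
import urllib.request
from unittest.mock import MagicMock, patch


def fetch_records(url):
    """Hypothetical stand-in: GET a URL and decode a JSON list."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def test_fetch_records_parses_json():
    payload = [{"id": "a", "value": "10"}]
    # urlopen is used as a context manager, so the fake body is configured
    # on the object returned by __enter__.
    fake_response = MagicMock()
    fake_response.__enter__.return_value.read.return_value = (
        json.dumps(payload).encode("utf-8")
    )

    with patch("urllib.request.urlopen", return_value=fake_response) as fake_urlopen:
        result = fetch_records("https://example.invalid/records")

    fake_urlopen.assert_called_once_with("https://example.invalid/records", timeout=10)
    assert result == payload


def test_fetch_records_surfaces_network_errors():
    error = urllib.error.URLError("connection refused")
    with patch("urllib.request.urlopen", side_effect=error):
        try:
            fetch_records("https://example.invalid/records")
        except urllib.error.URLError:
            return  # expected: the failure is not swallowed
    raise AssertionError("fetch_records swallowed the network error")
```

Neither test opens a socket, so both run offline and finish in milliseconds.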
Where to go next
Module complete. Next up: Workflow Orchestration — once your pipelines are tested and hardened, the next step is expressing their dependencies explicitly as DAGs and running them with a proper orchestration tool.
Mocking HTTP in practice
Use the responses library to intercept requests calls in tests — assert the right URL was called, return fake JSON, and test your error-handling code with a simulated 500.
DAG thinking
Express a data pipeline as a directed acyclic graph — identify dependencies, find the critical path, and understand why cycles must be forbidden.