Lab: Subprocess pipeline
Build a script that generates a file list with a system call, processes metadata in Python, and writes a JSON summary — end-to-end practice for the shell and processes module.
- Call an external tool and capture its output
- Parse the captured output in Python
- Aggregate data by a category (file extension)
- Write a structured JSON summary to a file
This lab applies the subprocess pipeline pattern to a practical task: inventory the files in a directory, aggregate statistics by file extension, and write a JSON summary. The same shape appears constantly in real automation work — generate data from an external tool, process it in Python, persist the result.
You will use subprocess, os, json, and sys (all in the standard library), so nothing
needs installing.
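As a rough sketch of that generate, process, persist shape, the snippet below runs a child process, captures its output, and summarises it as JSON. To stay portable it launches the current Python interpreter (sys.executable) as a stand-in for a real external tool, and the file names it prints are made up for illustration.

```python
import json
import subprocess
import sys

# Generate: run an external command and capture its standard output.
# sys.executable stands in for a real tool such as find (an assumption
# made here so the sketch runs on any platform).
result = subprocess.run(
    [sys.executable, "-c", "print('a.txt'); print('b.py'); print('c.txt')"],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # decode bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)

# Process: split the captured text into lines and drop empty ones.
names = [line for line in result.stdout.splitlines() if line]

# Persist: serialise a small summary (printed here instead of written to disk).
summary = {"total_files": len(names), "names": names}
print(json.dumps(summary, indent=2))
```

The three stages of the lab follow the same outline, with a real file listing in place of the hard-coded names.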
What you are building
A script with three stages:
- Generate: get a list of files using a subprocess call or os.walk()
- Aggregate: count files and total bytes per extension
- Write: serialise the result as JSON
Checkpoint 1: Generate the file list
Start by calling a system tool to list files. On Linux/macOS, find . -type f lists
all files under the current directory. This checkpoint uses os.walk() as the data
source instead, which gives the same result with no platform dependency.
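A minimal sketch of the os.walk() version might look like this. The helper name walk_files is chosen here for illustration, and file_list matches the variable name used in the text.

```python
import os


def walk_files(root="."):
    """Return the path of every file under root, found with os.walk()."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            paths.append(os.path.join(dirpath, fname))
    return paths


file_list = walk_files(".")
print(f"Found {len(file_list)} files")
for path in file_list[:10]:  # show only the first few paths
    print(path)
```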
You should see a list of file paths. The exact count depends on the runner environment.
If file_list is empty, check that the walk starts from a path that exists.
Checkpoint 2: Parse and aggregate
Process the file list: extract the extension, get the file size, and accumulate counts and total bytes per extension:
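One way to write this step is sketched below. It rebuilds the file list with os.walk() so the block runs on its own, and it records a size of 0 for any file that cannot be sized, which is a choice made here so that every listed file is still counted.

```python
import os


def aggregate(file_list):
    """Count files and total bytes per (lower-cased) extension."""
    stats = {}
    for path in file_list:
        _, ext = os.path.splitext(path)
        ext = ext.lower() or "(no extension)"
        try:
            size = os.path.getsize(path)
        except OSError:
            size = 0  # file vanished or is unreadable; still count it
        entry = stats.setdefault(ext, {"count": 0, "total_bytes": 0})
        entry["count"] += 1
        entry["total_bytes"] += size
    return stats


# Rebuild the file list from checkpoint 1 so this block is self-contained.
file_list = [
    os.path.join(dirpath, fname)
    for dirpath, _dirnames, filenames in os.walk(".")
    for fname in filenames
]

stats = aggregate(file_list)
for ext, entry in sorted(stats.items()):
    print(f"{ext}: {entry['count']} files, {entry['total_bytes']} bytes")

# Sanity check: per-extension counts should add up to the list length.
print(sum(e["count"] for e in stats.values()) == len(file_list))
```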
Check that the extension counts add up to the total file count from checkpoint 1. If
they do not, there is a gap in the iteration — probably an OSError on a file that
disappeared between listing and sizing.
os.path.splitext("report.tar.gz") returns ("report.tar", ".gz") — it splits
on the last dot only. For double extensions like .tar.gz, you would need extra
logic. For most purposes, splitting on the last dot is correct.
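The behaviour is easy to confirm interactively. The sketch below also shows one possible form of that extra logic: a hypothetical full_extension helper that checks a short, hand-picked list of compound suffixes before falling back to os.path.splitext(). Neither the helper nor the list is part of the lab script.

```python
import os

print(os.path.splitext("report.tar.gz"))  # ('report.tar', '.gz')
print(os.path.splitext("README"))         # ('README', '')
print(os.path.splitext(".bashrc"))        # ('.bashrc', '')

# Hypothetical list of compound suffixes to treat as a single extension.
KNOWN_COMPOUND = (".tar.gz", ".tar.bz2", ".tar.xz")


def full_extension(path):
    """Return a known compound suffix if present, else the last suffix."""
    name = os.path.basename(path).lower()
    for suffix in KNOWN_COMPOUND:
        if name.endswith(suffix):
            return suffix
    return os.path.splitext(name)[1]


print(full_extension("backups/report.tar.gz"))  # .tar.gz
print(full_extension("notes.v2.txt"))           # .txt
```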
Checkpoint 3: Write the JSON summary
Serialise the aggregated stats to JSON and write them to a file:
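A minimal sketch of this step, writing to an in-memory io.StringIO() buffer so nothing touches the disk. The stats dictionary here is hypothetical sample data standing in for the result of checkpoint 2.

```python
import io
import json

# Hypothetical sample of the aggregated stats from checkpoint 2.
stats = {
    ".py": {"count": 3, "total_bytes": 4200},
    ".txt": {"count": 2, "total_bytes": 310},
}

report = {
    "total_files": sum(e["count"] for e in stats.values()),
    "total_bytes": sum(e["total_bytes"] for e in stats.values()),
    "by_extension": stats,
}

buffer = io.StringIO()  # in-memory stand-in for a real file object
json.dump(report, buffer, indent=2, sort_keys=True)
print(buffer.getvalue())
```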
The output is valid JSON. Replace io.StringIO() with open("file_summary.json", "w")
to write to disk. The rest of the logic is identical.
Putting it all together
Here is the complete script as it would run from the command line:
import json
import os
import subprocess
import sys


def list_files(root="."):
    """Use subprocess find, fall back to os.walk if find is unavailable."""
    try:
        result = subprocess.run(
            ["find", root, "-type", "f"],
            capture_output=True,
            text=True,
            check=True,
        )
        # One path per line; drop any empty lines.
        return [p for p in result.stdout.strip().split("\n") if p]
    except (subprocess.CalledProcessError, FileNotFoundError):
        # find is missing or exited with an error: walk the tree in Python.
        paths = []
        for dirpath, _, files in os.walk(root):
            for fname in files:
                paths.append(os.path.join(dirpath, fname))
        return paths


def aggregate(file_list):
    """Return per-extension counts and byte totals, plus the overall size."""
    stats = {}
    total_size = 0
    for path in file_list:
        _, ext = os.path.splitext(path)
        ext = ext.lower() or "(no extension)"
        try:
            size = os.path.getsize(path)
        except OSError:
            # The file vanished or is unreadable: count it with size 0.
            size = 0
        total_size += size
        if ext not in stats:
            stats[ext] = {"count": 0, "total_bytes": 0}
        stats[ext]["count"] += 1
        stats[ext]["total_bytes"] += size
    return stats, total_size


def write_report(file_list, stats, total_size, path="file_summary.json"):
    """Write the summary as indented JSON to path."""
    report = {
        "total_files": len(file_list),
        "total_bytes": total_size,
        "by_extension": stats,
    }
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
    print(f"Report written to {path}")


def main():
    # Optional first argument: the directory to inventory.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    files = list_files(root)
    stats, total_size = aggregate(files)
    write_report(files, stats, total_size)


if __name__ == "__main__":
    main()

The list_files function tries the subprocess approach first and falls back to
os.walk — a graceful degradation pattern that makes the script portable without
sacrificing the subprocess exercise.
Where to go next
The next module covers scheduling and configuration — once your script works, how do you run it automatically on a schedule, and how do you make it configurable without editing source code each time?