Lab: Subprocess pipeline

Build a script that generates a file list with a system call, processes metadata in Python, and writes a JSON summary — end-to-end practice for the shell and processes module.

This lab applies the subprocess pipeline pattern to a practical task: inventory the files in a directory, aggregate statistics by file extension, and write a JSON summary. The same shape appears constantly in real automation work — generate data from an external tool, process it in Python, persist the result.

You will use subprocess, os, json, and sys, all in the standard library, so nothing needs installing.

What you are building

A script with three stages:

Generate — get a list of files using a subprocess call or os.walk()
Aggregate — count files and total bytes per extension
Write — serialise the result as JSON

Checkpoint 1: Generate the file list

Start by calling a system tool to list files. On Linux/macOS, find . -type f lists every regular file under the current directory. The runner for this checkpoint uses os.walk() as the data source instead, which gives essentially the same list with no platform dependency.
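
A minimal sketch of this checkpoint, assuming the walk starts at the current directory and the result is stored in a list named file_list. The helper name walk_files is hypothetical, chosen only for this illustration.

```python
import os

def walk_files(root="."):
    """Return the path of every file under root, found with os.walk()."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            paths.append(os.path.join(dirpath, fname))
    return paths

file_list = walk_files(".")
print(f"Found {len(file_list)} files")
for path in file_list[:10]:  # show only the first few paths
    print(path)
```

Each path is built with os.path.join(dirpath, fname) because os.walk() yields bare file names, not full paths.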

You should see a list of file paths. The exact count depends on the runner environment. If file_list is empty, check that the walk starts from a path that exists.

Checkpoint 2: Parse and aggregate

Process the file list: extract the extension, get the file size, and accumulate counts and total bytes per extension:
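
A sketch of the aggregation step. To stay self-contained it first creates a small throwaway directory; the sample file names and the helper name aggregate_by_extension are illustrative assumptions, not part of the lab's runner.

```python
import os
import tempfile

def aggregate_by_extension(file_list):
    """Return {extension: {"count": n, "total_bytes": b}} for the given paths."""
    stats = {}
    for path in file_list:
        _, ext = os.path.splitext(path)
        ext = ext.lower() or "(no extension)"
        try:
            size = os.path.getsize(path)
        except OSError:
            size = 0  # file vanished or is unreadable; still count it
        entry = stats.setdefault(ext, {"count": 0, "total_bytes": 0})
        entry["count"] += 1
        entry["total_bytes"] += size
    return stats

# Hypothetical sample data: three small files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    samples = {"a.txt": b"hello", "b.TXT": b"hi", "README": b"docs"}
    file_list = []
    for name, data in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(data)
        file_list.append(path)
    stats = aggregate_by_extension(file_list)

print(stats)
# Sanity check: per-extension counts should sum to the number of files.
print(sum(e["count"] for e in stats.values()) == len(file_list))
```

Lowercasing the extension groups a.txt and b.TXT under the same key, and files without a dot are collected under a readable placeholder rather than an empty string.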

Check that the extension counts add up to the total file count from checkpoint 1. If they do not, a file was skipped during the iteration, most likely because os.path.getsize() raised OSError on a file that disappeared between listing and sizing.

os.path.splitext("report.tar.gz") returns ("report.tar", ".gz") — it splits on the last dot only. For double extensions like .tar.gz, you would need extra logic. For most purposes, splitting on the last dot is correct.
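
The last-dot behaviour can be checked directly. The helper full_suffix below is a hypothetical example of the "extra logic" for double extensions, and the tuple of compound suffixes it recognises is an assumption made for illustration.

```python
import os

print(os.path.splitext("report.tar.gz"))  # ('report.tar', '.gz')
print(os.path.splitext("README"))         # ('README', '')
print(os.path.splitext(".bashrc"))        # ('.bashrc', ''): a leading dot is not an extension

# Assumed list of compound suffixes worth keeping together.
COMPOUND_SUFFIXES = (".tar.gz", ".tar.bz2", ".tar.xz")

def full_suffix(path):
    """Return a known compound suffix if the name ends with one, else the last-dot extension."""
    lowered = path.lower()
    for suffix in COMPOUND_SUFFIXES:
        if lowered.endswith(suffix):
            return suffix
    return os.path.splitext(path)[1].lower()

print(full_suffix("report.tar.gz"))  # .tar.gz
print(full_suffix("notes.txt"))      # .txt
```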

Checkpoint 3: Write the JSON summary

Serialise the aggregated stats to JSON and write them to a file:
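
A sketch of the serialisation step that writes to an in-memory io.StringIO buffer. The stats values are made-up sample data shaped like the result of checkpoint 2, not real output.

```python
import io
import json

# Hypothetical aggregate, shaped like the result of checkpoint 2.
stats = {
    ".py": {"count": 3, "total_bytes": 4200},
    ".txt": {"count": 1, "total_bytes": 120},
}
report = {
    "total_files": sum(e["count"] for e in stats.values()),
    "total_bytes": sum(e["total_bytes"] for e in stats.values()),
    "by_extension": stats,
}

buffer = io.StringIO()
json.dump(report, buffer, indent=2, sort_keys=True)
text = buffer.getvalue()
print(text)

# Round-trip check: the text parses back to the same structure.
print(json.loads(text) == report)
```

json.dump() accepts any object with a write() method, which is why an in-memory buffer and a real file are interchangeable here.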

The output is valid JSON. To write to disk, replace io.StringIO() with open("file_summary.json", "w"), ideally inside a with block so the file is closed afterwards. The rest of the logic is identical.

Putting it all together

Here is the complete script as it would run from the command line:

import os
import json
import subprocess
import sys

def list_files(root="."):
    """List files with the find command; fall back to os.walk if find is missing or fails."""
    try:
        result = subprocess.run(
            ["find", root, "-type", "f"],
            capture_output=True,
            text=True,
            check=True,
        )
        return [p for p in result.stdout.splitlines() if p]
    except (subprocess.CalledProcessError, FileNotFoundError):
        paths = []
        for dirpath, _, files in os.walk(root):
            for fname in files:
                paths.append(os.path.join(dirpath, fname))
        return paths

def aggregate(file_list):
    """Return (per-extension stats, total bytes) for the given paths."""
    stats = {}
    total_size = 0
    for path in file_list:
        _, ext = os.path.splitext(path)
        ext = ext.lower() or "(no extension)"
        try:
            size = os.path.getsize(path)
        except OSError:
            # File vanished or is unreadable: still count it, with size 0.
            size = 0
        total_size += size
        if ext not in stats:
            stats[ext] = {"count": 0, "total_bytes": 0}
        stats[ext]["count"] += 1
        stats[ext]["total_bytes"] += size
    return stats, total_size

def write_report(file_list, stats, total_size, path="file_summary.json"):
    """Write the totals and per-extension stats to path as indented JSON."""
    report = {
        "total_files": len(file_list),
        "total_bytes": total_size,
        "by_extension": stats,
    }
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
    print(f"Report written to {path}")

def main():
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    files = list_files(root)
    stats, total_size = aggregate(files)
    write_report(files, stats, total_size)

if __name__ == "__main__":
    main()

The list_files function tries the subprocess approach first and falls back to os.walk — a graceful degradation pattern that makes the script portable without sacrificing the subprocess exercise.
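
The two exceptions that trigger the fallback come from different failures, which a short experiment can show. In this sketch the helper classify is hypothetical, no-such-tool-xyz is a made-up command name assumed not to exist on the system, and sys.executable stands in for an external tool that exits with an error.

```python
import subprocess
import sys

def classify(cmd):
    """Run cmd and report which failure mode, if any, occurred."""
    try:
        subprocess.run(cmd, capture_output=True, text=True, check=True)
    except FileNotFoundError:
        return "missing"  # the executable could not be found at all
    except subprocess.CalledProcessError as exc:
        return f"failed with exit code {exc.returncode}"
    return "ok"

print(classify(["no-such-tool-xyz"]))
print(classify([sys.executable, "-c", "import sys; sys.exit(3)"]))
print(classify([sys.executable, "-c", "pass"]))
```

In list_files both cases lead to the same os.walk fallback. One consequence is that a find run that exits non-zero, as it typically does after a permission error on a single directory, has its partial output discarded in favour of the walk.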

Where to go next

The next module covers scheduling and configuration — once your script works, how do you run it automatically on a schedule, and how do you make it configurable without editing source code each time?
