Don't buffer what you can stream

Why reading an entire file into memory crashes large-input tools, and how generator-based streaming fixes it.

Utilities · Advanced · 6 min read
By the end of this lesson you will be able to:
  • Explain why buffering a 10 GB file crashes a CLI tool
  • Describe how generators produce values on demand
  • Identify the Unix line-at-a-time processing pattern

The most common performance mistake in CLI tools is not a slow algorithm; it is loading the entire input into memory before doing any work. A tool that processes log files works fine in development (100 KB logs) and is killed for running out of memory in production (50 GB logs).

The cost of buffering

Consider a log processor that does this:

with open("access.log") as f:
    lines = f.readlines()          # entire file in RAM

for line in lines:
    process(line)

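readlines() allocates a Python list containing every line as a separate string object. A 10 GB file needs at least 10 GB of RAM for the text alone, plus a fixed cost for each string object and its list slot (roughly 50 bytes per line in CPython). For short log lines that overhead can push the total to 20 to 30 GB. On a server with 16 GB of RAM, the process raises MemoryError or is killed by the OS before it finishes.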
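A rough way to see where that overhead comes from is to measure it on a small sample. The sketch below sums `sys.getsizeof` over a list of synthetic lines. The sample lines and the helper name `buffered_size` are illustrative assumptions, and exact byte counts vary by Python version and platform.

```python
import sys

def buffered_size(lines):
    """Approximate bytes held by a list of str lines (the list plus each str)."""
    return sys.getsizeof(lines) + sum(sys.getsizeof(line) for line in lines)

# Hypothetical sample: 10,000 short access-log-style lines.
sample = [f'10.0.0.{i % 256} - - "GET /item/{i} HTTP/1.1" 200\n' for i in range(10_000)]

raw_bytes = sum(len(line.encode("utf-8")) for line in sample)
held_bytes = buffered_size(sample)

print(f"raw text:  {raw_bytes} bytes")
print(f"in memory: {held_bytes} bytes ({held_bytes / raw_bytes:.2f}x)")
```

The ratio grows as lines get shorter, because the fixed per-object cost is paid once per line.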

The fix requires no new dependencies and no architectural change:

with open("access.log") as f:
    for line in f:                 # one line at a time
        process(line)

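The file object is an iterator. Iterating over it yields one line at a time; once a line has been processed, nothing refers to it and its memory can be reclaimed. Memory use is bounded by the longest line plus a small read buffer, regardless of file size. The work is identical, so the runtime is comparable, but peak allocation drops from O(n) in the file size to O(1).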
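The difference in peak allocation can be checked with the standard library's `tracemalloc`. This is a minimal sketch under stated assumptions: the temporary file, its contents, and the helper names (`peak_bytes`, `count_buffered`, `count_streaming`) are made up for illustration, and `tracemalloc` only sees allocations made through Python's allocator.

```python
import os
import tempfile
import tracemalloc

def peak_bytes(func, path):
    """Run func(path) and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

def count_buffered(path):
    with open(path) as f:
        lines = f.readlines()      # every line is held at once
    return sum(1 for _ in lines)

def count_streaming(path):
    with open(path) as f:
        return sum(1 for _ in f)   # only the current line is held

# Hypothetical test file: 50,000 short lines in a temporary location.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    for i in range(50_000):
        tmp.write(f"10.0.0.{i % 256} GET /item/{i}\n")
    log_path = tmp.name

try:
    buffered_peak = peak_bytes(count_buffered, log_path)
    streaming_peak = peak_bytes(count_streaming, log_path)
finally:
    os.remove(log_path)

print(f"readlines(): {buffered_peak} bytes peak")
print(f"iteration:   {streaming_peak} bytes peak")
```

The buffered peak scales with the number of lines written, while the streaming peak should stay roughly flat if the file is made larger.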

Generators extend the pattern

A function that yields values instead of building a list is a generator. The caller pulls one item at a time; the generator runs until it hits yield, then pauses. No items accumulate:

def parse_log_lines(path):
    with open(path) as f:
        for line in f:
            if line.strip():
                yield line.rstrip()

def extract_ips(lines):
    for line in lines:
        yield line.split()[0]   # first field is the IP

for ip in extract_ips(parse_log_lines("access.log")):
    record(ip)

Each stage in the pipeline processes one line, passes it to the next stage, and immediately forgets it. You can chain ten stages this way and peak memory is still O(1 line).
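To see that items really move through the stages one at a time, the stages can record when they run. The sketch below is an illustration only: the `trace` list, the in-memory `sample` lines, and the stage names `non_blank` and `first_field` are assumptions standing in for the file-based functions above.

```python
trace = []  # records the order in which the stages actually run

def non_blank(lines):
    for line in lines:
        if line.strip():
            trace.append(f"parse {line.split()[0]}")
            yield line.rstrip()

def first_field(lines):
    for line in lines:
        trace.append(f"extract {line.split()[0]}")
        yield line.split()[0]

sample = ["10.0.0.1 GET /a\n", "\n", "10.0.0.2 GET /b\n"]

pipeline = first_field(non_blank(sample))
print(trace)            # [] : building the pipeline runs no stage code

first = next(pipeline)  # pulls exactly one line through both stages
print(first)            # 10.0.0.1
print(trace)            # ['parse 10.0.0.1', 'extract 10.0.0.1']

rest = list(pipeline)   # draining continues one line at a time
print(rest)             # ['10.0.0.2']
print(trace)
```

The final trace alternates parse and extract entries, which shows that the second line is not parsed until the first has passed through every stage.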

The Unix philosophy connection

Unix tools such as grep, awk, sed, and cut are line-at-a-time processors connected by pipes. grep does not read the entire input before printing matches; it prints each match as it finds it. Your tool, when used in a pipeline, should behave the same way.

for line in sys.stdin achieves this automatically:

import sys

for line in sys.stdin:
    result = process(line.rstrip())
    print(result)

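The tool consumes input as it arrives, composes naturally with other Unix tools, handles unbounded streams, and never holds more than one line of input. One caveat: when stdout is a pipe rather than a terminal, Python buffers output in blocks, so results can arrive downstream in bursts. Pass flush=True to print, or run python with -u, if the next stage needs each result immediately.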
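A slightly more defensive version of this loop might flush after each result and exit cleanly when the downstream reader closes the pipe early. This is a sketch, not a definitive implementation: `stream_filter` and `main` are hypothetical names, `str.upper` stands in for the real `process` step, and the `BrokenPipeError` handling follows the pattern suggested in the Python `signal` documentation.

```python
import io
import os
import sys

def stream_filter(source, sink, transform):
    """Write transform(line) for each input line, flushing after every result."""
    for line in source:
        sink.write(transform(line.rstrip("\n")) + "\n")
        sink.flush()  # hand the result downstream now, not when a buffer fills

def main():
    """Entry point for real use: filter stdin to stdout, uppercasing each line."""
    try:
        stream_filter(sys.stdin, sys.stdout, str.upper)
    except BrokenPipeError:
        # The reader went away (for example `| head -n 1`). Point stdout at
        # devnull so the interpreter's final flush does not raise again.
        os.dup2(os.open(os.devnull, os.O_WRONLY), sys.stdout.fileno())
        sys.exit(1)

# In-memory demonstration; a real tool would call main() instead.
demo_in = io.StringIO("alpha\nbeta\n")
demo_out = io.StringIO()
stream_filter(demo_in, demo_out, str.upper)
print(demo_out.getvalue(), end="")
```

Flushing on every line trades throughput for latency, so a batch tool whose output is only read at the end may prefer the default buffering.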

Some operations genuinely need to hold most or all of the input: sorting, computing an exact median, and deduplication (which must remember every distinct item seen so far). When you need to buffer, be explicit about it and document the memory requirement. The problem is accidental buffering, not intentional buffering.
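One way to make intentional buffering explicit is to state the memory cost in the docstring and keep the buffering step visible in the code. The two functions below are hypothetical examples written for this purpose, not part of any library.

```python
import statistics

def unique_lines(lines):
    """Yield each distinct line once, in first-seen order.

    Memory: O(number of distinct lines), because every distinct line
    seen so far is kept in a set.
    """
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

def exact_median(values):
    """Return the exact median of values.

    Memory: O(n). The whole input is buffered on purpose.
    """
    buffered = list(values)   # intentional, documented buffering
    return statistics.median(buffered)

print(list(unique_lines(["a", "b", "a", "c", "b"])))  # ['a', 'b', 'c']
print(exact_median(iter([12, 7, 30])))                # 12
```

`unique_lines` still streams its output, so it composes with the generator stages shown earlier; only its memory use grows with the input.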

Where to go next

Next: streaming in practice — a side-by-side Runnable comparing a list-buffering function to its generator equivalent, with memory measurements.
