Streaming in practice

Rewrite a list-buffering file processor as a generator pipeline and measure the memory difference.

Theory is easy to accept; the memory numbers make it concrete. The demo below generates 100,000 lines in memory, runs both the buffering and streaming versions, and shows the size difference directly.

Side-by-side comparison

Run both versions and observe the allocation sizes:

Python — editable, runs in your browser

import sys

# Generate 100 000 lines without writing to disk
lines = ["2025-06-01 INFO processed record " + str(i) for i in range(100_000)]

# --- Buffering version: collects all transformed lines into a list ---
def process_all(input_lines):
    return [line.upper() for line in input_lines]

# --- Streaming version: yields one line at a time ---
def process_stream(input_lines):
    for line in input_lines:
        yield line.upper()

# Measure the buffering result
result_list = process_all(lines)
list_size = sys.getsizeof(result_list)
# sys.getsizeof only counts the list container, not the strings.
# Total = container + sum of string sizes.
total_size = list_size + sum(sys.getsizeof(s) for s in result_list)

print("Buffering version:")
print(f"  List container: {list_size:,} bytes")
print(f"  Total (container + strings): {total_size:,} bytes")
print(f"  ({total_size / 1_048_576:.1f} MB for 100 000 lines)")

# Measure the streaming version — the generator itself is tiny
gen = process_stream(lines)
gen_size = sys.getsizeof(gen)

print()
print("Streaming version:")
print(f"  Generator object size: {gen_size} bytes")
print("  (No strings allocated until consumed)")

# Confirm they produce the same output — consume just the first 3 lines
gen2 = process_stream(lines)
buffered_sample = result_list[:3]
streaming_sample = [next(gen2) for _ in range(3)]
print()
print("Output matches:", buffered_sample == streaming_sample)

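The generator object itself takes a few hundred bytes at most, regardless of input size. The list version allocated several megabytes (roughly 9 MB on a 64-bit CPython build; the exact figure varies by version and platform) to hold every transformed string before the caller could read a single one.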
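The sys.getsizeof comparison only measures the objects at rest. To see what each version costs while it is being consumed, one option is to record peak allocations with the standard library's tracemalloc module. The sketch below is an illustration under stated assumptions, not part of the original demo: make_lines, peak_bytes, consume_buffered, and consume_streamed are hypothetical helper names, and the input is produced lazily so that the source itself is not a list.

```python
import tracemalloc

# Hypothetical lazy input, so the source is not already buffered in a list.
def make_lines(n):
    for i in range(n):
        yield "2025-06-01 INFO processed record " + str(i)

def process_all(input_lines):
    return [line.upper() for line in input_lines]

def process_stream(input_lines):
    for line in input_lines:
        yield line.upper()

def peak_bytes(consume):
    """Run consume() and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        consume()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

def consume_buffered():
    # The whole list exists before the loop body runs once.
    total = 0
    for line in process_all(make_lines(100_000)):
        total += len(line)
    return total

def consume_streamed():
    # Only one transformed line is alive at a time.
    total = 0
    for line in process_stream(make_lines(100_000)):
        total += len(line)
    return total

buffered_peak = peak_bytes(consume_buffered)
streamed_peak = peak_bytes(consume_streamed)

print(f"Buffered peak: {buffered_peak:,} bytes")
print(f"Streamed peak: {streamed_peak:,} bytes")
```

Both consumers do the same work (summing line lengths), so any gap between the two peaks comes from how the lines are held, not from what is done with them.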

Building a pipeline

Generators compose by passing one into another. Each stage is a function that takes an iterable and yields transformed values:

Python — editable, runs in your browser
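A minimal sketch of such a pipeline, assuming log lines of the form "date level message". The stage names (read_lines, only_errors, extract_message) and the sample data are illustrative assumptions, not taken from the lesson:

```python
def read_lines(source):
    """Stage 1: strip the trailing newline from each raw line."""
    for line in source:
        yield line.rstrip("\n")

def only_errors(lines):
    """Stage 2: keep only lines whose level field is ERROR."""
    for line in lines:
        if " ERROR " in line:
            yield line

def extract_message(lines):
    """Stage 3: drop the date and level, keep the message text."""
    for line in lines:
        _date, _level, message = line.split(" ", 2)
        yield message

# Sample input; in a real processor this could be an open file object,
# which is itself a lazy iterable of lines.
raw = [
    "2025-06-01 INFO processed record 1\n",
    "2025-06-01 ERROR failed record 2\n",
    "2025-06-01 INFO processed record 3\n",
    "2025-06-01 ERROR failed record 4\n",
]

# Compose the stages by nesting: nothing runs until the loop pulls a value.
pipeline = extract_message(only_errors(read_lines(raw)))

for message in pipeline:
    print(message)
```

Each line travels through all three stages before the next line is read, so no stage ever holds more than one item.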

Python's built-in map() and filter() are also lazy — they return iterators, not lists. map(str.upper, lines) is a streaming equivalent of the list comprehension [line.upper() for line in lines]. Use them when the transformation is a single function call; use explicit generators when you need multiple statements or conditional logic.
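As a small illustration, map() and filter() can be chained the same way as generator functions. The sample lines and the even-record predicate below are assumptions chosen for the example:

```python
lines = ["2025-06-01 INFO processed record " + str(i) for i in range(5)]

# Neither call does any work yet; both return lazy iterators.
upper_lines = map(str.upper, lines)
even_lines = filter(
    lambda line: int(line.rsplit(" ", 1)[1]) % 2 == 0,  # keep even record numbers
    upper_lines,
)

first = next(even_lines)   # pulls just enough input to find one match
rest = list(even_lines)    # drains whatever remains

print(first)
print(rest)
```

The lambda here is about as far as filter() stretches comfortably; once the condition needs more than one expression, an explicit generator function tends to read better.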

When to reach for itertools

The itertools module in the standard library provides streaming combinators: chain() (concatenate iterables), islice() (take the first N items), groupby() (group consecutive items that share a key), and tee() (split a single iterator into two or more independent ones). Reaching for these before writing your own loop often produces code that is both cleaner and more memory-efficient.
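A short sketch that uses all four combinators together. The two sample lists and the variable names are assumptions made for illustration:

```python
from itertools import chain, groupby, islice, tee

monday = ["INFO start", "INFO load", "ERROR disk"]
tuesday = ["ERROR disk", "INFO retry", "INFO done"]

# chain: treat both days as one stream without building a combined list.
all_lines = chain(monday, tuesday)

# islice: take only the first five items from the stream.
first_five = islice(all_lines, 5)

# tee: split the stream so it can be consumed twice.
for_grouping, for_counting = tee(first_five)

# groupby: collapse consecutive lines that share the same level.
runs = [
    (level, len(list(group)))
    for level, group in groupby(for_grouping, key=lambda line: line.split()[0])
]
count = sum(1 for _ in for_counting)

print(runs)
print(count)
```

One caveat on tee(): it buffers every item that one branch has consumed and another has not, so draining one branch completely before touching the other (as above) holds the whole stream in memory. It pays off mainly when the branches advance at roughly the same pace.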

Where to go next

Next: memory profiling — using tracemalloc to find the specific lines responsible for peak allocations, so you know exactly where to apply the streaming refactor.
