Data transformation patterns

Sort, filter, group, count, and reshape data using Python's built-in tools.

By the end of this lesson you will be able to:
  • Sort collections using sorted() with a key= function
  • Filter data with list comprehensions vs the filter() built-in
  • Group records by a field using collections.defaultdict
  • Count occurrences with collections.Counter
  • Use zip() and enumerate() for paired iteration
  • Merge dicts with the | operator and ** unpacking

Once data is loaded from a file or API, most programs spend their time transforming it — sorting, grouping, counting, reshaping. Python's standard library has well-chosen tools for each of these patterns; knowing them means reaching for a loop and a scratch dict far less often.

Sorting with sorted() and key=

sorted() returns a new sorted list without changing the original. The key= parameter takes a function that extracts the value to sort by:

people = [{"name": "Charlie", "age": 30}, {"name": "Alice", "age": 25}]

by_name = sorted(people, key=lambda p: p["name"])
by_age  = sorted(people, key=lambda p: p["age"], reverse=True)

lambda here is just a small anonymous function — read lambda p: p["name"] as "given a person, return their name". Passing reverse=True sorts descending.

For simple attribute or key access, operator.attrgetter and operator.itemgetter are ready-made alternatives to a lambda, and are usually slightly faster:

from operator import itemgetter
by_age = sorted(people, key=itemgetter("age"))
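itemgetter also accepts several keys at once, in which case it returns a tuple and the sort compares those tuples field by field. A minimal sketch, assuming a hypothetical staff list with an extra "dept" field that is not part of the lesson's data:

```python
from operator import itemgetter

# Hypothetical sample data, for illustration only.
staff = [
    {"name": "Charlie", "dept": "eng", "age": 30},
    {"name": "Alice", "dept": "mkt", "age": 25},
    {"name": "Bob", "dept": "eng", "age": 25},
]

# With two keys, itemgetter returns a (dept, age) tuple,
# so records are ordered by dept first and by age within each dept.
by_dept_then_age = sorted(staff, key=itemgetter("dept", "age"))

names_in_order = [p["name"] for p in by_dept_then_age]
print(names_in_order)  # ['Bob', 'Charlie', 'Alice']
```

The original staff list is left untouched, since sorted() builds a new list.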

Filtering

A list comprehension with an if clause is almost always the clearest filter:

adults = [p for p in people if p["age"] >= 18]

The built-in filter(func, iterable) does the same thing but returns a lazy iterator. Use it when you're chaining many operations and don't want intermediate lists; use a comprehension when you want immediate, readable results.
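The laziness of filter() has one practical consequence worth seeing: the iterator can be consumed only once. The sketch below assumes a hypothetical is_adult helper and one extra record added for illustration:

```python
people = [
    {"name": "Charlie", "age": 30},
    {"name": "Alice", "age": 25},
    {"name": "Dana", "age": 12},  # hypothetical extra record
]

def is_adult(person):
    """Hypothetical predicate: True for people aged 18 or over."""
    return person["age"] >= 18

# filter() does no work yet; it only wraps the list in an iterator.
adults_iter = filter(is_adult, people)

# Iterating pulls matching items one at a time.
adult_names = [p["name"] for p in adults_iter]
print(adult_names)  # ['Charlie', 'Alice']

# The iterator is now exhausted, so a second pass yields nothing.
leftover = list(adults_iter)
print(leftover)  # []
```

If the filtered result is needed more than once, a list comprehension (or list(filter(...))) avoids this surprise.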

Grouping with defaultdict

Grouping is the "put items into buckets by a key" operation. A collections.defaultdict(list) removes the "is the key in the dict yet?" check:

from collections import defaultdict

records = [
    {"dept": "eng", "name": "Alice"},
    {"dept": "mkt", "name": "Bob"},
    {"dept": "eng", "name": "Carol"},
]

by_dept = defaultdict(list)
for r in records:
    by_dept[r["dept"]].append(r["name"])

# {"eng": ["Alice", "Carol"], "mkt": ["Bob"]}

On first access of a missing key, defaultdict(list) automatically inserts an empty list — so by_dept[key].append(...) never raises a KeyError.
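The same auto-insert behavior applies to plain reads, not just to .append(). A small sketch of that side effect, using hypothetical department names:

```python
from collections import defaultdict

by_dept = defaultdict(list)
by_dept["eng"].append("Alice")

# Merely reading a missing key with [] inserts it with an empty list.
_ = by_dept["ops"]
ops_present = "ops" in by_dept
print(ops_present)     # True
print(dict(by_dept))   # {'eng': ['Alice'], 'ops': []}

# .get() looks up without inserting, like on a regular dict.
missing = by_dept.get("hr")
hr_present = "hr" in by_dept
print(missing)      # None
print(hr_present)   # False
```

When checking whether a group exists, prefer `key in by_dept` or .get() so the check itself does not create an empty bucket.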

Counting with Counter

collections.Counter counts occurrences in one call and offers a handy most_common() method:

from collections import Counter

votes = ["yes", "no", "yes", "yes", "no", "abstain"]
counts = Counter(votes)

print(counts["yes"])           # 3
print(counts.most_common(2))   # [('yes', 3), ('no', 2)]

Counter also supports arithmetic: adding two counters merges their tallies, which is useful when aggregating counts from multiple sources.
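A brief sketch of Counter arithmetic, assuming two hypothetical batches of votes collected separately:

```python
from collections import Counter

# Hypothetical tallies from two separate sources.
morning = Counter(["yes", "no", "yes"])
evening = Counter(["yes", "abstain", "no", "no"])

# Addition sums the counts key by key.
total = morning + evening
print(total["yes"])      # 3
print(total["no"])       # 3
print(total["abstain"])  # 1

# Missing keys read as 0 instead of raising KeyError.
print(total["maybe"])    # 0

# Subtraction keeps only positive results.
diff = morning - evening
print(dict(diff))        # {'yes': 1}
```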

zip() and enumerate() for paired iteration

zip() walks two iterables in lockstep, pairing their items:

names  = ["Alice", "Bob", "Carol"]
scores = [91, 85, 78]

for name, score in zip(names, scores):
    print(f"{name}: {score}")

enumerate() adds an index to any iteration, replacing the manual i = 0; i += 1 pattern:

for i, name in enumerate(names, start=1):
    print(f"{i}. {name}")   # 1. Alice, 2. Bob, ...

Both return lazy iterators, so they're memory-efficient over large collections.
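One detail to keep in mind: zip() stops at the shortest input, silently dropping the unmatched items. The sketch below deliberately uses a scores list that is one item short to show this, along with the common dict(zip(...)) idiom:

```python
names = ["Alice", "Bob", "Carol"]
scores = [91, 85]  # one score short, on purpose

# zip() stops as soon as the shorter input runs out; "Carol" is dropped.
pairs = list(zip(names, scores))
print(pairs)  # [('Alice', 91), ('Bob', 85)]

# Pairing keys with values is a common way to build a dict.
score_by_name = dict(zip(names, scores))
print(score_by_name)  # {'Alice': 91, 'Bob': 85}

# Wrapping enumerate() in list() materializes the lazy iterator.
indexed = list(enumerate(names, start=1))
print(indexed)  # [(1, 'Alice'), (2, 'Bob'), (3, 'Carol')]
```

When the inputs are expected to have equal length, checking len() beforehand catches mismatches that zip() would otherwise hide.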

Flattening nested lists

A nested comprehension with two for clauses flattens one level:

nested = [[1, 2], [3, 4], [5]]
flat = [item for sublist in nested for item in sublist]   # [1, 2, 3, 4, 5]

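itertools.chain.from_iterable performs the same one-level flatten lazily, which helps with large inputs. Neither approach handles deeper nesting; arbitrarily nested lists need a recursive function.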
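A short sketch of itertools.chain.from_iterable on the same data. Note that, like the comprehension, it removes exactly one level of nesting; the deeper list is a hypothetical input added to show that limit:

```python
from itertools import chain

nested = [[1, 2], [3, 4], [5]]

# chain.from_iterable yields items lazily; list() collects them.
flat = list(chain.from_iterable(nested))
print(flat)  # [1, 2, 3, 4, 5]

# Only one level is removed: inner lists survive as-is.
deeper = [[1, [2, 3]], [4]]  # hypothetical deeper input
one_level = list(chain.from_iterable(deeper))
print(one_level)  # [1, [2, 3], 4]
```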

Merging dicts

Python 3.9 added the | operator for dict merging — it creates a new dict with all keys from both, and the right-hand side wins on conflicts:

defaults = {"color": "blue", "size": "M"}
overrides = {"size": "L", "weight": 200}

merged = defaults | overrides
# {"color": "blue", "size": "L", "weight": 200}

On Python 3.5 through 3.8, ** unpacking in a dict literal does the same, and it extends naturally to more than two dicts:

merged = {**defaults, **overrides}
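With several sources, the later ones keep winning on conflicts, and the inputs are never modified. A minimal sketch, where the session dict is a hypothetical third source added for illustration:

```python
defaults = {"color": "blue", "size": "M"}
overrides = {"size": "L", "weight": 200}
session = {"color": "red"}  # hypothetical third source

# Later dicts win on conflicting keys: session beats overrides beats defaults.
merged = {**defaults, **overrides, **session}
print(merged)  # {'color': 'red', 'size': 'L', 'weight': 200}

# The merge builds a new dict; the inputs are unchanged.
print(defaults)  # {'color': 'blue', 'size': 'M'}
```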

When you find yourself writing a loop that fills a dict, check whether defaultdict, Counter, or a comprehension already expresses the same intent more clearly. The collections module and these built-ins exist precisely to replace ad-hoc accumulation loops.

Where to go next

Next: String manipulation and pattern matching — extracting and validating structured data inside strings using the re module.