Data transformation patterns
Sort, filter, group, count, and reshape data using Python's built-in tools.
- Sort collections using sorted() with a key= function
- Filter data with list comprehensions vs the filter() built-in
- Group records by a field using collections.defaultdict
- Count occurrences with collections.Counter
- Use zip() and enumerate() for paired iteration
- Merge dicts with the | operator and ** unpacking
Once data is loaded from a file or API, most programs spend their time transforming it — sorting, grouping, counting, reshaping. Python's standard library has well-chosen tools for each of these patterns; knowing them means reaching for a loop and a scratch dict far less often.
Sorting with sorted() and key=
sorted() returns a new sorted list without changing the original. The key=
parameter takes a function that extracts the value to sort by:
people = [{"name": "Charlie", "age": 30}, {"name": "Alice", "age": 25}]
by_name = sorted(people, key=lambda p: p["name"])
by_age = sorted(people, key=lambda p: p["age"], reverse=True)
Here, lambda is just a small anonymous function: read lambda p: p["name"] as
"given a person, return their name". Passing reverse=True sorts descending.
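As a minimal sketch of the two points above, the snippet below sorts on more than one field by returning a tuple from the key function, then checks that the input list is left untouched. The staff list and its values are made up for illustration.

```python
# Hypothetical sample data, not from any real dataset.
staff = [
    {"name": "Charlie", "age": 30},
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30},
]

# A tuple key sorts by its first element, then breaks ties with the second.
# Negating age gives oldest-first while names stay alphabetical.
oldest_first = sorted(staff, key=lambda p: (-p["age"], p["name"]))
names_in_order = [p["name"] for p in oldest_first]
print(names_in_order)  # ['Bob', 'Charlie', 'Alice']

# sorted() built a new list; the input order is unchanged.
print([p["name"] for p in staff])  # ['Charlie', 'Alice', 'Bob']
```

The negation trick only works for numeric fields. For mixed directions on non-numeric fields, one option is to sort twice, least significant key first, relying on the fact that Python's sort is stable.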
For simple attribute or key access, operator.attrgetter and operator.itemgetter
are slightly faster, and often more readable, alternatives to a lambda:
from operator import itemgetter
by_age = sorted(people, key=itemgetter("age"))
Filtering
A list comprehension with an if clause is almost always the clearest filter:
adults = [p for p in people if p["age"] >= 18]
The built-in filter(func, iterable) does the same thing but returns a lazy
iterator. Use it when you're chaining many operations and don't want intermediate
lists; use a comprehension when you want immediate, readable results.
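To make the lazy behaviour concrete, here is a small sketch that chains filter() and map() and only materialises a list at the end. The is_adult helper and the sample records are hypothetical, introduced just for this example.

```python
# Hypothetical sample data for illustration.
people = [
    {"name": "Alice", "age": 25},
    {"name": "Dana", "age": 15},
    {"name": "Charlie", "age": 30},
]

def is_adult(person):
    """Return True for people aged 18 or over."""
    return person["age"] >= 18

# filter() does no work yet; it returns an iterator.
lazy_adults = filter(is_adult, people)

# Chaining map() onto it still builds no intermediate list.
lazy_names = map(lambda p: p["name"], lazy_adults)

# Items are only produced when something consumes the chain.
adult_names = list(lazy_names)
print(adult_names)  # ['Alice', 'Charlie']

# The comprehension gives the same result in one readable step.
same_names = [p["name"] for p in people if p["age"] >= 18]
print(same_names == adult_names)  # True
```

One consequence of laziness is that the iterator is single-use: calling list(lazy_names) a second time returns an empty list.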
Grouping with defaultdict
Grouping is the "put items into buckets by a key" operation. A
collections.defaultdict(list) removes the "is the key in the dict yet?" check:
from collections import defaultdict
records = [
    {"dept": "eng", "name": "Alice"},
    {"dept": "mkt", "name": "Bob"},
    {"dept": "eng", "name": "Carol"},
]
by_dept = defaultdict(list)
for r in records:
    by_dept[r["dept"]].append(r["name"])
# {"eng": ["Alice", "Carol"], "mkt": ["Bob"]}
On first access of a missing key, defaultdict(list) automatically inserts an
empty list, so by_dept[key].append(...) never raises a KeyError.
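The automatic insertion has a side effect worth seeing once: merely reading a missing key with square brackets creates it. The sketch below, using made-up department names, shows that behaviour and two ways to avoid it.

```python
from collections import defaultdict

# Hypothetical records for illustration.
records = [
    {"dept": "eng", "name": "Alice"},
    {"dept": "mkt", "name": "Bob"},
    {"dept": "eng", "name": "Carol"},
]

by_dept = defaultdict(list)
for r in records:
    by_dept[r["dept"]].append(r["name"])

# Indexing a missing key creates it as a side effect.
by_dept["ops"]
print("ops" in by_dept)  # True

# .get() and `in` do not trigger the default factory.
print(by_dept.get("hr"))  # None
print("hr" in by_dept)  # False

# Converting to a plain dict restores the usual KeyError behaviour.
plain = dict(by_dept)
print(plain["eng"])  # ['Alice', 'Carol']
```

Converting with dict() once grouping is finished is a reasonable habit when the result is handed to code that expects missing keys to raise.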
Counting with Counter
collections.Counter counts occurrences in one call and offers a handy
most_common() method:
from collections import Counter
votes = ["yes", "no", "yes", "yes", "no", "abstain"]
counts = Counter(votes)
print(counts["yes"]) # 3
print(counts.most_common(2)) # [("yes", 3), ("no", 2)]
Counter also supports arithmetic: adding two counters merges their tallies, which is useful when aggregating counts from multiple sources.
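A short sketch of that arithmetic, assuming two hypothetical polling stations whose votes need to be combined:

```python
from collections import Counter

# Hypothetical tallies from two polling stations.
station_a = Counter(["yes", "no", "yes"])
station_b = Counter(["no", "abstain", "yes"])

# + adds counts key by key and returns a new Counter.
total = station_a + station_b
print(total["yes"])  # 3
print(total["no"])  # 2
print(total["abstain"])  # 1

# A key that was never counted reads as 0 instead of raising KeyError.
print(total["spoiled"])  # 0

# update() adds counts in place, which suits a running total.
running = Counter()
for batch in (station_a, station_b):
    running.update(batch)
print(running == total)  # True
```

The + form is convenient for a handful of counters; update() avoids creating a new object on every step when there are many.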
zip() and enumerate() for paired iteration
zip() walks two iterables in lockstep, pairing their items:
names = ["Alice", "Bob", "Carol"]
scores = [91, 85, 78]
for name, score in zip(names, scores):
    print(f"{name}: {score}")
enumerate() adds an index to any iteration, replacing the manual i = 0; i += 1 pattern:
for i, name in enumerate(names, start=1):
    print(f"{i}. {name}") # 1. Alice, 2. Bob, ...
Both return lazy iterators, so they're memory-efficient over large collections.
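One behaviour to be aware of is that zip() stops at the shortest input. The sketch below uses a deliberately mismatched scores list to show this, then combines zip() with dict() and enumerate(); the data is illustrative only.

```python
names = ["Alice", "Bob", "Carol"]
scores = [91, 85]  # deliberately one item short

# zip() stops at the shortest input, so Carol is silently dropped.
pairs = list(zip(names, scores))
print(pairs)  # [('Alice', 91), ('Bob', 85)]

# dict() accepts the pairs directly, giving a name -> score lookup.
score_by_name = dict(zip(names, scores))
print(score_by_name)  # {'Alice': 91, 'Bob': 85}

# enumerate() and zip() compose: number each pair starting from 1.
ranked = [
    f"{i}. {name}: {score}"
    for i, (name, score) in enumerate(zip(names, scores), start=1)
]
print(ranked)  # ['1. Alice: 91', '2. Bob: 85']
```

On Python 3.10 and later, zip(names, scores, strict=True) raises ValueError when the lengths differ, which is usually preferable to losing data quietly.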
Flattening nested lists
A nested comprehension with two for clauses flattens one level:
nested = [[1, 2], [3, 4], [5]]
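flat = [item for sublist in nested for item in sublist] # [1, 2, 3, 4, 5]
itertools.chain.from_iterable(nested) performs the same one-level flattening and can read more clearly. Neither approach handles arbitrarily deep nesting; that needs a recursive helper.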
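For comparison, here is a sketch of itertools.chain.from_iterable next to a recursive helper. Note that chain.from_iterable removes exactly one level of nesting, so arbitrarily deep lists need something like the hypothetical flatten_deep function below, which is written for this example and is not part of the standard library.

```python
from itertools import chain

nested = [[1, 2], [3, 4], [5]]

# chain.from_iterable flattens exactly one level, like the comprehension.
flat = list(chain.from_iterable(nested))
print(flat)  # [1, 2, 3, 4, 5]

# Inner lists one level further down are left as they are.
partly_flat = list(chain.from_iterable([[1, [2]], [3]]))
print(partly_flat)  # [1, [2], 3]

def flatten_deep(items):
    """Hypothetical helper: recursively flatten nested lists of any depth."""
    result = []
    for item in items:
        if isinstance(item, list):
            result.extend(flatten_deep(item))
        else:
            result.append(item)
    return result

deep = [1, [2, [3, [4]], 5]]
fully_flat = flatten_deep(deep)
print(fully_flat)  # [1, 2, 3, 4, 5]
```

The helper only descends into list objects, an assumption made to keep strings and other iterables intact; adjust the isinstance check if tuples should be flattened too.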
Merging dicts
Python 3.9 added the | operator for dict merging — it creates a new dict with
all keys from both, and the right-hand side wins on conflicts:
defaults = {"color": "blue", "size": "M"}
overrides = {"size": "L", "weight": 200}
merged = defaults | overrides
# {"color": "blue", "size": "L", "weight": 200}
On Python 3.5 through 3.8, ** unpacking in a dict literal does the same, and
both forms extend to more than two dicts:
merged = {**defaults, **overrides}
When you find yourself writing a loop that fills a dict, check whether
defaultdict, Counter, or a comprehension already expresses the same intent
more clearly. The collections module and these built-ins exist precisely to
replace ad-hoc accumulation loops.
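As an illustration of that advice, the sketch below writes a counting loop by hand and then expresses the same intent with Counter, defaultdict, and a dict comprehension. The word list is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical input for illustration.
words = ["apple", "avocado", "banana", "apple", "blueberry"]

# Ad-hoc loop with a scratch dict.
manual_counts = {}
for w in words:
    if w not in manual_counts:
        manual_counts[w] = 0
    manual_counts[w] += 1

# Counter states the same intent in one line.
counts = Counter(words)
print(dict(counts) == manual_counts)  # True

# Grouping by first letter with defaultdict removes the membership check.
by_letter = defaultdict(list)
for w in words:
    by_letter[w[0]].append(w)
print(dict(by_letter))

# A dict comprehension replaces a loop that fills a dict one key at a time.
lengths = {w: len(w) for w in counts}
print(lengths["banana"])  # 6
```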
Where to go next
Next: String manipulation and pattern matching — extracting and validating
structured data inside strings using the re module.