Cache locality in practice
Temporal and spatial locality, the matrix traversal example, and how struct layout choices change real-world performance.
- Define cache locality and give a concrete example of code that exploits it
Why cache locality matters in your code
Cache locality is the single most actionable architectural concept for application developers. It comes in two flavours:
Temporal locality — if you access a value, you are likely to access it again soon. Keep hot data in variables (registers/L1) rather than re-loading it from memory. Loop invariants pulled outside a loop exploit temporal locality.
Spatial locality — if you access one address, you are likely to access nearby addresses soon. Processing data in memory order exploits the fact that the entire cache line was already fetched.
The matrix traversal example
The classic demonstration is matrix traversal. A matrix stored in row-major order
has rows laid out contiguously in memory. Iterating row-by-row (inner loop over
columns) is cache-friendly: you march through memory sequentially. Iterating
column-by-column (inner loop over rows) skips num_columns × element_size bytes
between accesses — every access is likely a cache miss.
# Row-major traversal — cache friendly (accesses elements in memory order)
total = 0
for row in matrix:
for val in row:
total += val
# Column-major traversal — cache hostile (jumps across rows in memory)
total = 0
for col in range(len(matrix[0])):
for row in range(len(matrix)):
total += matrix[row][col]On a large matrix this difference can be 5–10x in wall-clock time — not because the algorithm changes, but because the access pattern changes how many cache misses occur.
Struct-of-arrays vs array-of-structs
Data structure layout also matters. A struct-of-arrays (SoA) layout groups all values of one field together; an array-of-structs (AoS) layout groups all fields of one record together. If a hot loop reads only one field, SoA gives better spatial locality. If a loop reads all fields of each record in turn, AoS is better. Game engines and high-performance numerical code choose SoA or AoS deliberately for exactly this reason.
// Array-of-Structs (AoS) — all fields of each entity together
struct Entity { float x, y, z; int health; float speed; };
Entity entities[1000];
// Struct-of-Arrays (SoA) — each field packed with its peers
float x[1000], y[1000], z[1000];
int health[1000];
float speed[1000];A physics loop that only updates x, y, z runs faster on SoA because those
three arrays are dense in memory. The health and speed fields — not touched by
the physics loop — stay out of the cache entirely.
The CPU you are writing for today has out-of-order execution, hardware prefetchers, and speculative execution — mechanisms that mask some of the penalties described here. Do not guess: measure with a profiler before restructuring data layouts. Premature optimisation around cache misses in code that is not actually bottlenecked on memory wastes time and reduces readability.
Where to go next
Understanding how the CPU sees your code makes OS-level concepts click into place — processes, threads, context switching, and virtual memory all interact directly with the register and cache state described here. The Operating Systems track continues the story from where the hardware leaves off.
Knowledge check
- 1.Which of the following programming practices improve cache locality?
- 2.A physics loop only updates the x, y, z position fields of 10 000 entities — it never reads health or speed. Which layout gives better cache performance?