The CPU and memory hierarchy

The fetch-decode-execute cycle and the speed/capacity trade-offs across registers, cache, RAM, and disk.

When you write a loop in Python or C, you probably think of it as "doing work some number of times." The CPU sees something very different: a stream of individual instructions, each manipulating a handful of bytes inside a tiny set of storage cells called registers. Everything about performance — and a surprising amount about correctness — flows from understanding how that stream moves through the hardware.

The fetch-decode-execute cycle

Every CPU core repeats the same three-step cycle billions of times per second:

Fetch — The program counter (PC) register holds the address of the next instruction. The CPU loads that instruction from memory into an instruction register.
Decode — Specialised circuits parse the binary instruction: which operation is this? Which registers are the operands? Is there an immediate value embedded in the instruction?
Execute — The arithmetic-logic unit (ALU) or another functional unit performs the operation and writes the result back to a register (or to memory). The PC advances — or, for a branch, jumps — to the next instruction's address.

This cycle, in its pure form, takes one clock cycle per instruction on an ideal machine. Real CPUs are far from ideal, and the gap between the theoretical maximum and what code actually achieves is where almost all performance work happens.

Registers vs cache vs RAM vs disk

The memory hierarchy exists because fast storage is expensive and physically small:

Level         Size            Latency         Notes
-----------   ----------      -----------     --------------------------------
Registers     ~1 KB total     0 cycles        On-chip; ALU operands live here
L1 cache      32–64 KB        ~4 cycles       Per core; instruction + data split
L2 cache      256 KB–1 MB     ~12 cycles      Per core; unified
L3 cache      4–32 MB         ~40 cycles      Shared across cores on same chip
DRAM (RAM)    4–64 GB         ~200 cycles     Main memory; off-chip
SSD           256 GB–4 TB     ~50 000 cycles  NVMe, block-based
HDD           1–20 TB         ~10 000 000 cycles  Mechanical seek

The numbers are approximate and vary by CPU generation, but the ratios are the lesson: an L1 cache hit is fifty times faster than a DRAM access. A disk seek is fifty thousand times slower. When you hear someone say "this algorithm is fast," the real question is often "fast on what level of the hierarchy?"

Registers are the only operands the ALU can directly use. All computation reads from and writes to registers; loads and stores move data between registers and the cache/memory system. A modern desktop CPU has sixteen to thirty-two general-purpose integer registers and a similar number of floating-point/SIMD registers.

Cache is a set of small, fast memories that sit between the registers and DRAM. Cache operates on cache lines — fixed-size blocks, typically 64 bytes. When a load instruction misses the L1 cache, the hardware fetches an entire 64- byte line from the next level of the hierarchy, even if you only wanted 4 bytes. This prefetches nearby data, which is great if your access pattern is sequential (you're likely to use those nearby bytes) and wasteful if it isn't.

An L1 cache miss costs roughly 40–200 cycles depending on whether the data is in L2, L3, or DRAM. A function that does 5 arithmetic operations but suffers one DRAM miss per iteration is bottlenecked on memory, not the CPU. Profilers that count cache misses (Linux perf, Intel VTune) reveal this directly.

Knowledge check

1.
In the fetch-decode-execute cycle, what does the program counter (PC) register hold?
The result of the last arithmetic operationThe address of the next instruction to fetchThe number of clock cycles elapsedThe currently executing process ID
2.
When the CPU reads a single byte from DRAM, only that one byte is transferred into the cache.
TrueFalse

Finished reading? Mark it complete to track your progress.

The fetch-decode-execute cycle

Registers vs cache vs RAM vs disk

Knowledge check

On this page