The CPU and memory hierarchy
The fetch-decode-execute cycle and the speed/capacity trade-offs across registers, cache, RAM, and disk.
- Describe the fetch-decode-execute cycle and what happens at each stage
- Distinguish registers, L1/L2/L3 cache, RAM, and disk by speed and capacity
When you write a loop in Python or C, you probably think of it as "doing work some number of times." The CPU sees something very different: a stream of individual instructions, each manipulating a handful of bytes inside a tiny set of storage cells called registers. Everything about performance — and a surprising amount about correctness — flows from understanding how that stream moves through the hardware.
The fetch-decode-execute cycle
Every CPU core repeats the same three-step cycle billions of times per second:
- Fetch — The program counter (PC) register holds the address of the next instruction. The CPU loads that instruction from memory into an instruction register.
- Decode — Specialised circuits parse the binary instruction: which operation is this? Which registers are the operands? Is there an immediate value embedded in the instruction?
- Execute — The arithmetic-logic unit (ALU) or another functional unit performs the operation and writes the result back to a register (or to memory). The PC advances — or, for a branch, jumps — to the next instruction's address.
This cycle, in its pure form, takes one clock cycle per instruction on an ideal machine. Real CPUs are far from ideal, and the gap between the theoretical maximum and what code actually achieves is where almost all performance work happens.
Registers vs cache vs RAM vs disk
The memory hierarchy exists because fast storage is expensive and physically small:
Level Size Latency Notes
----------- ---------- ----------- --------------------------------
Registers ~1 KB total 0 cycles On-chip; ALU operands live here
L1 cache 32–64 KB ~4 cycles Per core; instruction + data split
L2 cache 256 KB–1 MB ~12 cycles Per core; unified
L3 cache 4–32 MB ~40 cycles Shared across cores on same chip
DRAM (RAM) 4–64 GB ~200 cycles Main memory; off-chip
SSD 256 GB–4 TB ~50 000 cycles NVMe, block-based
HDD 1–20 TB ~10 000 000 cycles Mechanical seekThe numbers are approximate and vary by CPU generation, but the ratios are the lesson: an L1 cache hit is fifty times faster than a DRAM access. A disk seek is fifty thousand times slower. When you hear someone say "this algorithm is fast," the real question is often "fast on what level of the hierarchy?"
Registers are the only operands the ALU can directly use. All computation reads from and writes to registers; loads and stores move data between registers and the cache/memory system. A modern desktop CPU has sixteen to thirty-two general-purpose integer registers and a similar number of floating-point/SIMD registers.
Cache is a set of small, fast memories that sit between the registers and DRAM. Cache operates on cache lines — fixed-size blocks, typically 64 bytes. When a load instruction misses the L1 cache, the hardware fetches an entire 64- byte line from the next level of the hierarchy, even if you only wanted 4 bytes. This prefetches nearby data, which is great if your access pattern is sequential (you're likely to use those nearby bytes) and wasteful if it isn't.
An L1 cache miss costs roughly 40–200 cycles depending on whether the data is
in L2, L3, or DRAM. A function that does 5 arithmetic operations but suffers
one DRAM miss per iteration is bottlenecked on memory, not the CPU. Profilers
that count cache misses (Linux perf, Intel VTune) reveal this directly.
Knowledge check
- 1.In the fetch-decode-execute cycle, what does the program counter (PC) register hold?
- 2.When the CPU reads a single byte from DRAM, only that one byte is transferred into the cache.