Why Architecture Matters to Developers
Understanding how the CPU, memory, and cache work lets you write faster code. Two O(n) algorithms with the same Big-O can differ by 10× in running time purely because one has good cache locality and the other does not. Architecture knowledge also underpins OS design, embedded systems, and systems programming.
CPU Components
The CPU (Central Processing Unit) is the brain of the computer. It executes instructions: arithmetic, logic, memory access, and control flow. Modern CPUs are incredibly complex, but the conceptual model from the 1940s (the stored-program, von Neumann design) still applies.
| Component | Role |
|---|---|
| ALU (Arithmetic Logic Unit) | Performs arithmetic (+, -, *, /) and logic (AND, OR, NOT, XOR) operations |
| Control Unit | Fetches instructions from memory, decodes them, coordinates ALU, registers, and memory |
| Registers | Tiny, ultra-fast storage inside the CPU (PC, SP, general-purpose: RAX, RBX…). Access in ~1 cycle |
| Program Counter (PC) | Holds the address of the next instruction to execute |
| Stack Pointer (SP) | Points to the top of the stack, which holds return addresses and local variables for function calls |
The Instruction Cycle
Every instruction a CPU executes follows the same cycle, repeating billions of times per second.
1. FETCH — Read the instruction at address [PC] from memory
2. DECODE — Interpret the instruction (opcode + operands)
3. EXECUTE — ALU performs the operation (add, compare, load, store…)
4. WRITEBACK — Write results back to registers or memory
After EXECUTE: PC ← PC + instruction_size (or branch target for jumps)
// Example: "ADD R1, R2, R3" → R1 = R2 + R3
// 1. Fetch: read the instruction word (opcode 0x01 plus register fields R1, R2, R3) from memory
// 2. Decode: "this is an ADD instruction on registers"
// 3. Execute: ALU computes R2 + R3
// 4. Writeback: store result in R1
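The cycle above can be sketched as a toy interpreter. This is a minimal illustration, not a model of any real CPU: the `run` function, the tuple encoding of instructions, and the `SUB` opcode are all hypothetical, and each instruction is assumed to occupy one slot so that `instruction_size` is 1.

```python
def run(program, registers):
    """Execute a list of (opcode, dest, src1, src2) tuples on a register dict."""
    pc = 0  # program counter: index of the next instruction (hypothetical encoding)
    while pc < len(program):
        instruction = program[pc]                # FETCH: read the instruction at [PC]
        opcode, dest, src1, src2 = instruction   # DECODE: split opcode and operands
        if opcode == "ADD":                      # EXECUTE: the "ALU" does the work
            result = registers[src1] + registers[src2]
        elif opcode == "SUB":
            result = registers[src1] - registers[src2]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        pc += 1                                  # PC <- PC + instruction_size (1 here)
        registers[dest] = result                 # WRITEBACK: store result in a register
    return registers


# "ADD R1, R2, R3" -> R1 = R2 + R3
regs = run([("ADD", "R1", "R2", "R3")], {"R1": 0, "R2": 5, "R3": 7})
print(regs["R1"])  # 12
```

A real control unit does this in hardware and handles branches by writing a target address into the PC instead of incrementing it; the sketch leaves branches out to stay short.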
Pipelining
Instead of waiting for one instruction to fully complete before fetching the next, pipelining overlaps stages. Think of a car wash assembly line — while one car is being rinsed, another is being soaped.
Without pipelining (4 instructions, 4 stages each = 16 cycles):
Cycle: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
I1: F D E W
I2: - - - - F D E W
I3: - - - - - - - - F D E W
I4: - - - - - - - - - - - - F D E W
With 4-stage pipeline (4 instructions = 7 cycles):
Cycle: 1 2 3 4 5 6 7
I1: F D E W
I2: - F D E W
I3: - - F D E W
I4: - - - F D E W
// Throughput = 1 instruction/cycle (after pipeline fills)
// vs 1 instruction/4 cycles without pipelining
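The cycle counts in the two diagrams follow from simple formulas, sketched here under the assumption of an ideal pipeline (one cycle per stage, no stalls or flushes). The function names are hypothetical.

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction runs all of its stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages):
    """The first instruction takes n_stages cycles; each later one finishes 1 cycle after."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


print(cycles_unpipelined(4, 4))  # 16
print(cycles_pipelined(4, 4))    # 7

# With many instructions the speedup approaches the number of stages.
speedup = cycles_unpipelined(1000, 4) / cycles_pipelined(1000, 4)
print(f"{speedup:.2f}")  # 3.99
```

Real pipelines fall short of this bound because of the hazards described in the next section.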
Pipeline Hazards
| Hazard | Cause | Solution |
|---|---|---|
| Data Hazard | I2 needs a value that I1 hasn't written back yet | Forwarding (pass result early), stall (insert bubble) |
| Control Hazard | Branch: pipeline fetches wrong instructions | Branch prediction; flush on misprediction |
| Structural Hazard | Two stages need the same hardware simultaneously | Duplicate resources (separate I-cache and D-cache) |
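To make the branch prediction entry more concrete, here is a sketch of one classic scheme, a 2-bit saturating counter. The table above does not say which predictor a given CPU uses, so treat this as one illustrative option; the function name and the example branch history are hypothetical.

```python
def count_mispredictions(outcomes, state=2):
    """Predict a single branch with one 2-bit saturating counter.

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    `outcomes` is a sequence of booleans: True means the branch was taken.
    """
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredictions += 1  # a real pipeline would flush wrongly fetched instructions here
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredictions


# A loop branch: taken 9 times, then not taken once on exit, run 3 times.
loop_branch = ([True] * 9 + [False]) * 3
print(count_mispredictions(loop_branch))  # 3
```

The counter needs two wrong guesses in a row to flip its prediction, so a single loop exit costs one misprediction and does not disturb the prediction for the next run of the loop.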
Memory Hierarchy
Memory is organised in a hierarchy: the closer a level is to the CPU, the faster, smaller, and more expensive per byte it is. The key principle is locality: recently used data is likely to be reused (temporal locality), and data near a recent access is likely to be used soon (spatial locality). Caches exploit both.
| Level | Latency | Size | Where |
|---|---|---|---|
| Registers | ~1 cycle | ~1 KB | Inside CPU core |
| L1 Cache | ~4 cycles | 32–64 KB per core | On-chip, per core |
| L2 Cache | ~12 cycles | 256 KB – 1 MB per core | On-chip, per core |
| L3 Cache | ~40 cycles | 4–32 MB shared | On-chip, shared across cores |
| RAM (DRAM) | ~100 cycles | 8–64 GB | On motherboard |
| SSD | ~100,000 cycles (tens of µs) | 256 GB – 4 TB | External to CPU |
| HDD | ~10M+ cycles (several ms) | 1–20 TB | External to CPU |
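One way to see why the upper levels matter so much is the average memory access time (AMAT): each level's latency plus its miss rate times the cost of going one level further out. The sketch below uses the cache and RAM latencies from the table; the miss rates are made-up illustrative values, and the function name is hypothetical.

```python
def average_access_time(levels, memory_latency):
    """Return the average access time in cycles.

    `levels` lists (latency_cycles, miss_rate) pairs from L1 outward.
    `memory_latency` is the cost of an access that misses every cache.
    """
    amat = memory_latency
    # Work inward from RAM: each level adds its own latency plus
    # the fraction of accesses that must continue to the next level.
    for latency, miss_rate in reversed(levels):
        amat = latency + miss_rate * amat
    return amat


caches = [
    (4, 0.05),   # L1: ~4 cycles, assumed 5% miss rate
    (12, 0.20),  # L2: ~12 cycles, assumed 20% miss rate
    (40, 0.50),  # L3: ~40 cycles, assumed 50% miss rate
]
print(f"{average_access_time(caches, 100):.1f}")  # 5.5
```

Under these assumed miss rates the average access costs about 5.5 cycles, close to the L1 latency, even though RAM is roughly 25× slower than L1.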
Cache-friendly code
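Row-major vs column-major iteration: C stores a 2D array row by row in one contiguous block, and Java and JS build 2D arrays from one array per row, so in all three the elements of a row typically sit next to each other in memory. Iterating row by row therefore walks consecutive addresses, which is cache-friendly. Iterating column by column jumps a full row ahead on every step, so each access may miss the cache. Choosing a loop order that scans along rows can make matrix multiplication around 5× faster.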
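The effect can be shown without timing anything by counting misses in a simulated cache. This is a rough sketch under stated assumptions: a direct-mapped cache with 64 lines of 64 bytes, 8-byte elements, and a 64×64 matrix laid out row by row. All names and sizes are hypothetical and chosen to keep the numbers easy to follow.

```python
LINE_BYTES = 64     # assumed cache line size
NUM_LINES = 64      # assumed direct-mapped cache with 64 lines (4 KB total)
ELEMENT_BYTES = 8   # assumed size of one matrix element


def count_misses(addresses):
    """Count cache misses for a stream of byte addresses."""
    cached = [None] * NUM_LINES  # which memory line each cache slot currently holds
    misses = 0
    for address in addresses:
        line = address // LINE_BYTES
        slot = line % NUM_LINES  # direct-mapped: each line has exactly one slot
        if cached[slot] != line:
            misses += 1
            cached[slot] = line
    return misses


def element_address(row, col, n_cols):
    """Byte address of element (row, col) in a row-major layout."""
    return (row * n_cols + col) * ELEMENT_BYTES


N = 64
row_major = [element_address(r, c, N) for r in range(N) for c in range(N)]
col_major = [element_address(r, c, N) for c in range(N) for r in range(N)]

print(count_misses(row_major))  # 512
print(count_misses(col_major))  # 4096
```

Row-by-row traversal misses once per cache line (1 in 8 accesses), while column-by-column traversal misses on every access in this configuration because each line is evicted before its neighbouring elements are used.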
CISC vs RISC
| Aspect | CISC (x86, x86-64) | RISC (ARM, RISC-V) |
|---|---|---|
| Instructions | Many, variable-length, complex | Few, fixed-length, simple |
| Memory ops | Can operate on memory directly | Load/store only — must move to register first |
| Pipeline-friendly | Harder (variable instruction length) | Easier (uniform stages) |
| Power | Higher (desktops, servers) | Lower (phones, tablets, Apple M-series) |
| Modern note | Modern x86 CPUs internally translate to RISC-like micro-ops | ARM dominates mobile; Apple Silicon (M1 through M4) is widely reported to lead x86 in performance per watt |
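The "Memory ops" row can be illustrated with a toy translation from a memory-operand instruction into load/store-style steps, loosely in the spirit of the micro-op note above. This is not how any real decoder is specified: the instruction tuples, the `[name]` syntax for memory operands, the temporary registers `T0` and `T1`, and the function name are all hypothetical.

```python
def to_load_store(instruction):
    """Split a toy (op, dest, src) instruction into load/store-style steps.

    Operands written as "[name]" are memory locations; anything else is a register.
    """
    op, dest, src = instruction
    steps = []
    if src.startswith("["):
        steps.append(("LOAD", "T1", src))    # bring the memory source into a register
        src = "T1"
    target = dest
    if dest.startswith("["):
        steps.append(("LOAD", "T0", dest))   # bring the memory destination into a register
        target = "T0"
    steps.append((op, target, target, src))  # register-only arithmetic: target = target op src
    if dest.startswith("["):
        steps.append(("STORE", dest, "T0"))  # write the result back to memory
    return steps


# CISC-style: add register R1 directly to a value in memory.
print(to_load_store(("ADD", "[counter]", "R1")))
# [('LOAD', 'T0', '[counter]'), ('ADD', 'T0', 'T0', 'R1'), ('STORE', '[counter]', 'T0')]
```

A load/store architecture makes the programmer or compiler write the three steps explicitly, while a register-only instruction such as `("ADD", "R2", "R1")` passes through as a single step.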
Key Takeaways
- The instruction cycle — Fetch, Decode, Execute, Writeback — is the fundamental loop of every CPU.
- Pipelining overlaps the stages of consecutive instructions, raising throughput toward one instruction per cycle.
- Cache behaviour is often the dominant factor in real-world performance, so write cache-friendly code (sequential access patterns).
- Latency grows steeply down the hierarchy: an L1 hit costs ~4 cycles, an L1 miss that hits in L2 ~12 cycles (about 3× slower), a RAM access ~100 cycles, and an SSD access at least ~10,000 cycles.
- RISC dominates mobile and increasingly servers (ARM/RISC-V); CISC (x86) dominates desktops and traditional servers.