Quick reference for CPU components, pipelining stages, cache levels, and memory hierarchy.
CPU Components
| Component | Role |
| ALU | Arithmetic (+, -, *, /) and logic (AND, OR, NOT, XOR) operations |
| Control Unit | Fetch, decode instructions; coordinate ALU, registers, memory |
| Registers | Ultra-fast storage inside CPU (~1 cycle). PC, SP, general-purpose (RAX, RBX…) |
| PC (Program Counter) | Address of the next instruction to execute |
| SP (Stack Pointer) | Address of the top of the call stack, which holds return addresses and local variables |
Instruction Cycle
1. FETCH — Load instruction at [PC] from memory
2. DECODE — Interpret opcode + operands
3. EXECUTE — ALU performs the operation
4. WRITEBACK — Write result to register or memory
   After each instruction: PC ← PC + instruction_size (or the branch target)
// Example: ADD R1, R2, R3 → R1 = R2 + R3
// Fetch: opcode + R1, R2, R3 | Decode: "ADD registers"
// Execute: ALU computes R2+R3 | Writeback: store in R1
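The four steps above can be sketched as a toy interpreter. This is only an illustration: the instruction objects, the register names, and the `runProgram` helper are invented for the example and do not model any real ISA.

```javascript
// Toy machine: a register file plus a "memory" holding instruction objects.
// The instruction format is hypothetical and exists only for this sketch.
function runProgram(program, registers) {
  const regs = { ...registers };
  let pc = 0; // program counter: index of the next instruction

  while (pc < program.length) {
    // 1. FETCH: load the instruction at [PC]
    const instr = program[pc];

    // 2. DECODE: split it into opcode and operands
    const { op, dst, src1, src2 } = instr;

    // 3. EXECUTE: the "ALU" computes the result
    let result;
    switch (op) {
      case "ADD": result = regs[src1] + regs[src2]; break;
      case "SUB": result = regs[src1] - regs[src2]; break;
      default: throw new Error(`unknown opcode: ${op}`);
    }

    // 4. WRITEBACK: store the result, then advance PC by one instruction
    regs[dst] = result;
    pc += 1;
  }
  return regs;
}

// ADD R1, R2, R3 → R1 = R2 + R3
const finalRegs = runProgram(
  [{ op: "ADD", dst: "R1", src1: "R2", src2: "R3" }],
  { R1: 0, R2: 5, R3: 7 }
);
console.log(finalRegs.R1); // 12
```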
Pipelining
// Without pipelining: 4 instructions × 4 stages = 16 cycles
// With 4-stage pipeline: 4 + 3 = 7 cycles (then 1 instr/cycle steady state)
Cycle: 1 2 3 4 5 6 7
I1:    F D E W
I2:      F D E W
I3:        F D E W
I4:          F D E W
// Throughput = 1 instruction/cycle after pipeline fills
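The cycle counts above follow from two small formulas: n × k without pipelining, and k + (n − 1) with an ideal k-stage pipeline. A minimal sketch, assuming no hazards or stalls (the function names are made up for this example):

```javascript
// n = number of instructions, k = number of pipeline stages.
function cyclesWithoutPipeline(n, k) {
  return n * k; // each instruction runs all k stages before the next starts
}

function cyclesWithPipeline(n, k) {
  if (n === 0) return 0;
  return k + (n - 1); // k cycles to fill, then one instruction completes per cycle
}

console.log(cyclesWithoutPipeline(4, 4)); // 16
console.log(cyclesWithPipeline(4, 4));    // 7
```

For long instruction streams the ratio of the two approaches k, which is why the ideal speedup of a k-stage pipeline is quoted as k×.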
Pipeline Hazards
| Hazard | Cause | Fix |
| Data | I2 needs a value I1 hasn't written yet | Forwarding (pass result early) or stall (bubble) |
| Control | Branch — wrong instructions fetched | Branch prediction; flush on misprediction |
| Structural | Two stages need same hardware simultaneously | Duplicate resources (separate I-cache + D-cache) |
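To make the data hazard row concrete, here is a rough sketch that flags read-after-write dependencies between adjacent instructions. It assumes a simplified, hypothetical instruction format (`dst` plus a `srcs` array); real hardware compares register numbers across several in-flight instructions, not just neighbours.

```javascript
// Flags the case "I2 needs a value I1 hasn't written yet" for adjacent instructions.
function findAdjacentRawHazards(program) {
  const hazards = [];
  for (let i = 1; i < program.length; i++) {
    const prev = program[i - 1];
    const curr = program[i];
    if (curr.srcs.includes(prev.dst)) {
      // curr reads a register that prev is still producing
      hazards.push({ producer: i - 1, consumer: i, register: prev.dst });
    }
  }
  return hazards;
}

const program = [
  { text: "ADD R1, R2, R3", dst: "R1", srcs: ["R2", "R3"] },
  { text: "SUB R4, R1, R5", dst: "R4", srcs: ["R1", "R5"] }, // reads R1 right after it is written
  { text: "ADD R6, R7, R8", dst: "R6", srcs: ["R7", "R8"] }, // independent
];

console.log(findAdjacentRawHazards(program));
// [ { producer: 0, consumer: 1, register: 'R1' } ]
```

Each flagged pair is where the pipeline would either forward the result or insert a bubble.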
Memory Hierarchy
| Level | Latency | Size | Location |
| Registers | ~1 cycle | ~1 KB | Inside CPU core |
| L1 Cache | ~4 cycles | 32–64 KB/core | On-chip, per core |
| L2 Cache | ~12 cycles | 256 KB–1 MB/core | On-chip, per core |
| L3 Cache | ~40 cycles | 4–32 MB shared | On-chip, shared |
| RAM (DRAM) | ~100 cycles | 8–64 GB | Motherboard |
| SSD | ~10,000 cycles | 256 GB–4 TB | Storage device (off-chip) |
| HDD | ~5M cycles | 1–20 TB | Storage device (off-chip) |
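One way to see why the upper levels matter is to estimate the average latency of a memory access. The sketch below uses the cycle counts from the table as the cost of being served at each level. The hit rates (90%, 80%, 50%) are invented for illustration and are not measurements, and `expectedLatency` is a hypothetical helper.

```javascript
// levels: caches ordered fastest first, each with a latency (cycles) and a hit rate (0..1).
// backingLatency: cost of the final level that always "hits" (RAM here).
function expectedLatency(levels, backingLatency) {
  let reachProbability = 1; // chance that an access misses every level so far
  let expected = 0;
  for (const { latency, hitRate } of levels) {
    expected += reachProbability * hitRate * latency;
    reachProbability *= 1 - hitRate;
  }
  return expected + reachProbability * backingLatency;
}

const caches = [
  { name: "L1", latency: 4, hitRate: 0.9 },  // hit rates are assumptions
  { name: "L2", latency: 12, hitRate: 0.8 },
  { name: "L3", latency: 40, hitRate: 0.5 },
];

console.log(expectedLatency(caches, 100).toFixed(2)); // "5.96"
```

Even with one access in ten missing L1, the average stays close to the L1 latency, which is the point of the hierarchy.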
Cache-Friendly Code
// C 2D arrays are row-major (a[0][0], a[0][1], … a[0][n-1], then a[1][0], …).
// Java/JS use arrays of arrays: each row is its own array, so row-by-row is still the sequential order.
// Row-major iteration → sequential access → cache-friendly ✓
for (let i = 0; i < N; i++) {
  for (let j = 0; j < N; j++) {
    sum += matrix[i][j]; // walks along one row at a time
  }
}
// Column-major iteration → jumps across rows → cache thrash ✗
for (let j = 0; j < N; j++) {
  for (let i = 0; i < N; i++) {
    sum += matrix[i][j]; // each access may be a cache miss
  }
}
// Locality principles:
// Temporal — recently used data is likely to be reused soon
// Spatial — nearby memory addresses are likely to be used next
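A quick way to observe spatial locality is to time both traversal orders over the same data. This sketch stores the matrix in a flat `Float64Array` so the row-major layout is explicit. The size `N = 1024` and the helper names are arbitrary choices for the example, and the timings vary by machine and JavaScript engine.

```javascript
const N = 1024;
// Flat row-major storage: element (i, j) lives at index i * N + j.
const matrix = new Float64Array(N * N);
for (let k = 0; k < matrix.length; k++) matrix[k] = k % 7;

function sumRowMajor(m, n) {
  let sum = 0;
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      sum += m[i * n + j]; // stride of 1 element
    }
  }
  return sum;
}

function sumColumnMajor(m, n) {
  let sum = 0;
  for (let j = 0; j < n; j++) {
    for (let i = 0; i < n; i++) {
      sum += m[i * n + j]; // stride of n elements
    }
  }
  return sum;
}

function timeIt(label, fn) {
  const start = Date.now();
  const result = fn();
  console.log(`${label}: sum=${result}, ${Date.now() - start} ms`);
  return result;
}

const rowSum = timeIt("row-major", () => sumRowMajor(matrix, N));
const colSum = timeIt("column-major", () => sumColumnMajor(matrix, N));
```

Both loops compute the same sum. On most machines the column-major version is slower, though the gap depends on cache sizes and the JIT.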
CISC vs RISC
| Aspect | CISC (x86, x86-64) | RISC (ARM, RISC-V) |
| Instructions | Many, variable-length, complex | Few, fixed-length, simple |
| Memory ops | Can operate directly on memory | Load/store only — register first |
| Pipeline | Harder (variable instruction length) | Easier (uniform stages) |
| Power | Typically higher (desktops, servers) | Typically lower (phones, tablets, Apple M-series) |
| Note | Modern x86 internally translates instructions to RISC-like micro-ops | Apple M-series chips are known for strong perf/watt compared with x86 laptops |
Key Rules
- Instruction cycle: Fetch → Decode → Execute → Writeback. Repeats billions of times/second.
- Pipelining: ideal steady-state throughput = 1 instruction/cycle for a single-issue pipeline. Hazards cause stalls that reduce this.
- Cache miss penalty (from the latency table, relative to a ~4-cycle L1 hit): L2 = ~3× slower; L3 = ~10×; RAM = ~25×; SSD = ~2,500×.
- Write cache-friendly code: sequential memory access, struct-of-arrays over array-of-structs for hot loops.
- ARM dominates mobile and is gaining server market share; RISC-V is the open, royalty-free ISA on the rise.
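The struct-of-arrays rule can be illustrated with a small sketch. The particle fields (`x`, `y`, `mass`) and function names are hypothetical. The idea is that a hot loop touching only `x` reads one dense typed array in the SoA layout, instead of hopping between separate objects in the AoS layout.

```javascript
const COUNT = 1000;

// Array-of-structs: one object per particle; fields of different particles
// are not guaranteed to sit next to each other in memory.
const particlesAoS = Array.from({ length: COUNT }, (_, i) => ({
  x: i,
  y: 2 * i,
  mass: 1,
}));

// Struct-of-arrays: one contiguous typed array per field.
const particlesSoA = {
  x: new Float64Array(COUNT),
  y: new Float64Array(COUNT),
  mass: new Float64Array(COUNT),
};
for (let i = 0; i < COUNT; i++) {
  particlesSoA.x[i] = i;
  particlesSoA.y[i] = 2 * i;
  particlesSoA.mass[i] = 1;
}

// Hot loop that only needs x.
function sumXAoS(particles) {
  let sum = 0;
  for (let i = 0; i < particles.length; i++) sum += particles[i].x;
  return sum;
}

function sumXSoA(soa) {
  let sum = 0;
  for (let i = 0; i < soa.x.length; i++) sum += soa.x[i]; // sequential reads
  return sum;
}

console.log(sumXAoS(particlesAoS)); // 499500
console.log(sumXSoA(particlesSoA)); // 499500
```

Both layouts give the same result. Whether SoA is measurably faster depends on the engine and data size, so treat it as something to benchmark rather than assume.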