Computer Architecture Cheatsheet

Quick reference for CPU components, pipelining stages, cache levels, and memory hierarchy.


CPU Components

Component             Role
ALU                   Arithmetic (+, -, *, /) and logic (AND, OR, NOT, XOR) operations
Control Unit          Fetch, decode instructions; coordinate ALU, registers, memory
Registers             Ultra-fast storage inside CPU (~1 cycle). PC, SP, general-purpose (RAX, RBX…)
PC (Program Counter)  Address of the next instruction to execute
SP (Stack Pointer)    Top of the current stack frame (function calls + local variables)

Instruction Cycle

1. FETCH     — Load instruction at [PC] from memory
2. DECODE    — Interpret opcode + operands
3. EXECUTE   — ALU performs the operation
4. WRITEBACK — Write result to register or memory
             — PC ← PC + instruction_size  (or branch target)

// Example: ADD R1, R2, R3  →  R1 = R2 + R3
// Fetch: opcode + R1, R2, R3  |  Decode: "ADD registers"
// Execute: ALU computes R2+R3  |  Writeback: store in R1
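
The four steps above can be sketched as a loop over a toy machine. This is a minimal illustration, not a model of any real CPU: the `run` function, the instruction object shape (`op`, `dest`, `src1`, `src2`), and the two-opcode instruction set are all hypothetical.

```javascript
// Toy fetch/decode/execute/writeback loop.
// Assumption: instructions are plain objects and each occupies one program slot.
function run(program, registers) {
  let pc = 0; // program counter: index of the next instruction
  while (pc < program.length) {
    const instr = program[pc];              // FETCH: load instruction at [PC]
    const { op, dest, src1, src2 } = instr; // DECODE: split opcode and operands
    let result;                             // EXECUTE: the "ALU" computes the result
    switch (op) {
      case "ADD": result = registers[src1] + registers[src2]; break;
      case "SUB": result = registers[src1] - registers[src2]; break;
      default: throw new Error(`unknown opcode: ${op}`);
    }
    registers[dest] = result;               // WRITEBACK: store result in a register
    pc += 1;                                // PC ← PC + instruction_size (1 slot here)
  }
  return registers;
}

// ADD R1, R2, R3  →  R1 = R2 + R3
const regs = run(
  [{ op: "ADD", dest: "R1", src1: "R2", src2: "R3" }],
  { R1: 0, R2: 5, R3: 7 }
);
console.log(regs.R1); // 12
```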

Pipelining

// Without pipelining: 4 instructions × 4 stages = 16 cycles
// With 4-stage pipeline: 4 + 3 = 7 cycles  (then 1 instr/cycle steady state)

Cycle:  1   2   3   4   5   6   7
I1:     F   D   E   W
I2:         F   D   E   W
I3:             F   D   E   W
I4:                 F   D   E   W

// Throughput = 1 instruction/cycle after pipeline fills
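
The cycle counts above generalize to any number of instructions and stages. The sketch below assumes an ideal pipeline (one cycle per stage, no hazards or stalls); the function names are made up for illustration.

```javascript
// Cycle counts for an ideal pipeline: one cycle per stage, no stalls.
function unpipelinedCycles(instructions, stages) {
  // Each instruction runs all stages before the next one starts.
  return instructions * stages;
}

function pipelinedCycles(instructions, stages) {
  if (instructions === 0) return 0;
  // The first instruction takes `stages` cycles to fill the pipeline;
  // every later instruction completes one cycle after the previous one.
  return stages + (instructions - 1);
}

function speedup(instructions, stages) {
  return unpipelinedCycles(instructions, stages) / pipelinedCycles(instructions, stages);
}

console.log(unpipelinedCycles(4, 4)); // 16
console.log(pipelinedCycles(4, 4));   // 7
console.log(speedup(1000, 4));        // approaches the stage count as the instruction count grows
```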

Pipeline Hazards

Hazard      Cause                                         Fix
Data        I2 needs a value I1 hasn't written yet        Forwarding (pass result early) or stall (bubble)
Control     Branch: wrong instructions fetched            Branch prediction; flush on misprediction
Structural  Two stages need same hardware simultaneously  Duplicate resources (separate I-cache + D-cache)
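
To make the data hazard row concrete, the sketch below flags read-after-write dependencies between back-to-back instructions. It is a simplification: the instruction format (`text`, `dest`, `srcs`) and the `findRawHazards` helper are hypothetical, and it only compares adjacent instructions, whereas real hardware checks as many in-flight instructions as its pipeline depth and forwarding paths require.

```javascript
// Flag cases where an instruction reads a register that the
// immediately preceding instruction has not yet written back.
function findRawHazards(program) {
  const hazards = [];
  for (let i = 1; i < program.length; i++) {
    const prev = program[i - 1];
    const curr = program[i];
    if (prev.dest !== null && curr.srcs.includes(prev.dest)) {
      hazards.push({ producer: prev.text, consumer: curr.text, register: prev.dest });
    }
  }
  return hazards;
}

const program = [
  { text: "ADD R1, R2, R3", dest: "R1", srcs: ["R2", "R3"] },
  { text: "SUB R4, R1, R5", dest: "R4", srcs: ["R1", "R5"] }, // reads R1 right after it is produced
  { text: "ADD R6, R7, R8", dest: "R6", srcs: ["R7", "R8"] }, // independent of its predecessor
];

const hazards = findRawHazards(program);
console.log(hazards); // one entry: SUB depends on ADD through R1
```

Each flagged pair is a place where the pipeline would need forwarding or a stall.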

Memory Hierarchy

Level       Latency         Size              Location
Registers   ~1 cycle        ~1 KB             Inside CPU core
L1 Cache    ~4 cycles       32–64 KB/core     On-chip, per core
L2 Cache    ~12 cycles      256 KB–1 MB/core  On-chip, per core
L3 Cache    ~40 cycles      4–32 MB shared    On-chip, shared
RAM (DRAM)  ~100 cycles     8–64 GB           Motherboard
SSD         ~10,000 cycles  256 GB–4 TB       External
HDD         ~5M cycles      1–20 TB           External
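
One way to read the latency column is to combine it with hit rates into an expected cost per memory access (often called average memory access time). The sketch below takes the cache and RAM latencies from the table; the hit rates are invented for illustration and vary widely by workload, and `expectedLatency` is a hypothetical helper.

```javascript
// Latencies (cycles) come from the table above.
// Hit rates are assumptions, not measurements: each one is the fraction
// of accesses reaching that level that are served there.
const levels = [
  { name: "L1", latency: 4, hitRate: 0.95 },
  { name: "L2", latency: 12, hitRate: 0.80 },
  { name: "L3", latency: 40, hitRate: 0.50 },
  { name: "RAM", latency: 100, hitRate: 1.0 }, // last level always serves the access
];

function expectedLatency(levels) {
  let reach = 1.0; // probability that an access gets this far down the hierarchy
  let total = 0;
  for (const { latency, hitRate } of levels) {
    total += reach * hitRate * latency; // accesses served at this level
    reach *= 1 - hitRate;               // the rest fall through to the next level
  }
  return total;
}

console.log(expectedLatency(levels).toFixed(2)); // "4.98" with these assumed hit rates
```

With these made-up rates the average stays close to the L1 latency, which is why high hit rates in the upper levels matter so much.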

Cache-Friendly Code

// C stores a 2D array row-major in one contiguous block: m[0][0], m[0][1], … m[0][N-1], m[1][0], …
// Java and JS use an array of row arrays: elements within a row are typically adjacent, rows may be scattered.
// Row-by-row iteration → sequential access → cache-friendly ✓
let sum = 0;
for (let i = 0; i < N; i++)
  for (let j = 0; j < N; j++)
    sum += matrix[i][j];    // walks consecutive addresses within each row

// Column-by-column iteration → jumps to a different row on every access → poor cache use ✗
for (let j = 0; j < N; j++)
  for (let i = 0; i < N; i++)
    sum += matrix[i][j];    // each access may be a cache miss

// Locality principles:
// Temporal  — recently used data is likely to be reused soon
// Spatial   — nearby memory addresses are likely to be used next
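
The effect of spatial locality can be measured with a small timing script. This sketch assumes a runtime with a global `performance.now()` (recent Node.js or a browser) and uses a flat `Float64Array` so the row-major layout is explicit; the matrix size and function names are arbitrary choices. Timings depend on the machine and the JIT, so no specific numbers are claimed.

```javascript
const N = 1024; // arbitrary matrix size for illustration
// Flat row-major storage: element (i, j) lives at index i * N + j.
const matrix = new Float64Array(N * N).fill(1);

function sumRowOrder(m, n) {
  let sum = 0;
  for (let i = 0; i < n; i++)
    for (let j = 0; j < n; j++)
      sum += m[i * n + j]; // stride 1: consecutive addresses
  return sum;
}

function sumColumnOrder(m, n) {
  let sum = 0;
  for (let j = 0; j < n; j++)
    for (let i = 0; i < n; i++)
      sum += m[i * n + j]; // stride n: jumps a full row each step
  return sum;
}

function timeMs(fn) {
  const start = performance.now();
  const result = fn();
  return { result, ms: performance.now() - start };
}

const rowRun = timeMs(() => sumRowOrder(matrix, N));
const colRun = timeMs(() => sumColumnOrder(matrix, N));
console.log(`row order:    ${rowRun.ms.toFixed(2)} ms`);
console.log(`column order: ${colRun.ms.toFixed(2)} ms`);
```

Both traversals compute the same sum; only the access pattern differs, so any timing gap comes from the memory system rather than the arithmetic.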

CISC vs RISC

Aspect        CISC (x86, x86-64)                                       RISC (ARM, RISC-V)
Instructions  Many, variable-length, complex                           Few, mostly fixed-length, simple
Memory ops    Can operate directly on memory                           Load/store only: data goes through registers first
Pipeline      Harder (variable instruction length)                     Easier (uniform stages)
Power         Higher (desktops, servers)                               Lower (phones, tablets, M-series)
Note          Modern x86 internally translates to RISC-like micro-ops  Apple M-series (M1 to M4) generally leads comparable x86 chips in perf/watt

Key Rules

  • Instruction cycle: Fetch → Decode → Execute → Writeback. Repeats billions of times/second.
  • Pipelining: steady-state throughput = 1 instruction/cycle. Hazards cause stalls that reduce this.
  • Cache miss penalty, relative to an L1 hit (~4 cycles): L2 hit ≈ 3× slower; L3 hit ≈ 10×; RAM ≈ 25×; SSD ≈ 2,500×.
  • Write cache-friendly code: sequential memory access, struct-of-arrays over array-of-structs for hot loops.
  • ARM dominates mobile and is gaining server market; RISC-V is the open-source ISA on the rise.
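
The struct-of-arrays rule above can be illustrated with a hot loop that reads a single field. This is a sketch under assumptions: the particle fields (`x`, `y`, `z`, `mass`), the count, and the function names are all hypothetical, and how much the layout helps depends on the runtime and the workload.

```javascript
const COUNT = 100000; // arbitrary particle count for illustration

// Array-of-structs: one object per particle, so x values are
// interleaved with y, z, and mass (and spread across heap objects).
const particlesAoS = Array.from({ length: COUNT }, (_, i) => ({ x: i, y: 0, z: 0, mass: 1 }));

// Struct-of-arrays: one contiguous typed array per field,
// so a loop over x touches only x values.
const particlesSoA = {
  x: Float64Array.from({ length: COUNT }, (_, i) => i),
  y: new Float64Array(COUNT),
  z: new Float64Array(COUNT),
  mass: new Float64Array(COUNT).fill(1),
};

function sumXAoS(particles) {
  let sum = 0;
  for (const p of particles) sum += p.x; // loads each object to read one field
  return sum;
}

function sumXSoA(particles) {
  let sum = 0;
  const xs = particles.x;
  for (let i = 0; i < xs.length; i++) sum += xs[i]; // sequential reads of one array
  return sum;
}

console.log(sumXAoS(particlesAoS) === sumXSoA(particlesSoA)); // true: same data, different layout
```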