Computer Architecture Cheatsheet

CPU Components

Component	Role
ALU	Arithmetic (+, -, *, /) and logic (AND, OR, NOT, XOR) operations
Control Unit	Fetch, decode instructions; coordinate ALU, registers, memory
Registers	Ultra-fast storage inside CPU (~1 cycle). PC, SP, general-purpose (RAX, RBX…)
PC (Program Counter)	Address of the next instruction to execute
SP (Stack Pointer)	Address of the top of the call stack (return addresses, saved registers, local variables)

Instruction Cycle

1. FETCH     — Load instruction at [PC] from memory
2. DECODE    — Interpret opcode + operands
3. EXECUTE   — ALU performs the operation
4. WRITEBACK — Write result to register or memory
             — PC ← PC + instruction_size  (or branch target)

// Example: ADD R1, R2, R3  →  R1 = R2 + R3
// Fetch: opcode + R1, R2, R3  |  Decode: "ADD registers"
// Execute: ALU computes R2+R3  |  Writeback: store in R1
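
The four stages can be sketched as a toy interpreter. This is a minimal illustration, not a model of real hardware: the "ADD Rd, Rs1, Rs2" text format, the register names, and the runProgram function are all assumptions made for the example.

```javascript
// Toy machine: registers are a plain object, the program is an array of strings.
function runProgram(program, regs) {
  let pc = 0; // program counter, here an index into the program array
  while (pc < program.length) {
    const raw = program[pc];                                      // FETCH: read instruction at [PC]
    const [opcode, rd, rs1, rs2] = raw.replace(/,/g, " ").split(/\s+/); // DECODE: opcode + operands
    let result;
    switch (opcode) {                                             // EXECUTE: the "ALU"
      case "ADD": result = regs[rs1] + regs[rs2]; break;
      case "SUB": result = regs[rs1] - regs[rs2]; break;
      case "AND": result = regs[rs1] & regs[rs2]; break;
      default: throw new Error(`unknown opcode: ${opcode}`);
    }
    regs[rd] = result;                                            // WRITEBACK: store result in Rd
    pc += 1;                                                      // PC <- PC + instruction_size
  }
  return regs;
}

const finalRegs = runProgram(
  ["ADD R1, R2, R3", "SUB R4, R1, R2"],
  { R1: 0, R2: 5, R3: 7, R4: 0 }
);
console.log(finalRegs); // R1 = 5 + 7 = 12, R4 = 12 - 5 = 7
```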

Pipelining

// Without pipelining: 4 instructions × 4 stages = 16 cycles
// With 4-stage pipeline: 4 + 3 = 7 cycles  (then 1 instr/cycle steady state)

Cycle:  1   2   3   4   5   6   7
I1:     F   D   E   W
I2:         F   D   E   W
I3:             F   D   E   W
I4:                 F   D   E   W

// Throughput = 1 instruction/cycle after pipeline fills
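
The cycle counts above follow from a simple formula: n instructions on a k-stage pipeline take k + (n - 1) cycles when nothing stalls. A small sketch, with illustrative function names and the no-hazard assumption stated in the comments:

```javascript
// Cycle counts for n instructions on a k-stage design, assuming no hazards or stalls.
function cyclesUnpipelined(n, k) {
  return n * k; // each instruction runs all k stages before the next starts
}

function cyclesPipelined(n, k) {
  return n === 0 ? 0 : k + (n - 1); // k cycles to fill, then one instruction completes per cycle
}

function speedup(n, k) {
  return cyclesUnpipelined(n, k) / cyclesPipelined(n, k);
}

console.log(cyclesUnpipelined(4, 4)); // 16
console.log(cyclesPipelined(4, 4));   // 7
console.log(speedup(1000, 4));        // approaches k = 4 as n grows
```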

Pipeline Hazards

Hazard	Cause	Fix
Data	I2 needs a value I1 hasn't written yet	Forwarding (pass result early) or stall (bubble)
Control	Branch — wrong instructions fetched	Branch prediction; flush on misprediction
Structural	Two stages need same hardware simultaneously	Duplicate resources (separate I-cache + D-cache)
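
A data (read-after-write) hazard from the table can be detected by checking whether an instruction reads a register that a recent instruction writes. The sketch below is only an illustration: the { text, dest, srcs } record shape, the findRawHazards name, and the default window of 2 following instructions are assumptions, and the real window depends on pipeline depth and forwarding.

```javascript
// Report read-after-write dependencies between instructions that are close together.
function findRawHazards(instrs, window = 2) {
  const hazards = [];
  for (let i = 0; i < instrs.length; i++) {
    const dest = instrs[i].dest;
    if (!dest) continue; // instruction writes no register
    for (let j = i + 1; j <= i + window && j < instrs.length; j++) {
      if (instrs[j].srcs.includes(dest)) {
        hazards.push({ producer: i, consumer: j, reg: dest });
      }
      if (instrs[j].dest === dest) break; // register overwritten; later readers depend on j instead
    }
  }
  return hazards;
}

const program = [
  { text: "ADD R1, R2, R3", dest: "R1", srcs: ["R2", "R3"] },
  { text: "SUB R4, R1, R5", dest: "R4", srcs: ["R1", "R5"] }, // reads R1 right after it is produced
  { text: "AND R6, R7, R8", dest: "R6", srcs: ["R7", "R8"] }, // independent of the others
];
const hazards = findRawHazards(program);
console.log(hazards); // one entry: instruction 0 -> instruction 1 on R1
```

Each reported pair is a place where the hardware must forward the result or insert a bubble.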

Memory Hierarchy

Level	Latency	Size	Location
Registers	~1 cycle	~1 KB	Inside CPU core
L1 Cache	~4 cycles	32–64 KB/core	On-chip, per core
L2 Cache	~12 cycles	256 KB–1 MB/core	On-chip, per core
L3 Cache	~40 cycles	4–32 MB shared	On-chip, shared
RAM (DRAM)	~100 cycles	8–64 GB	Motherboard
SSD	~100,000 cycles (tens of µs at a few GHz)	256 GB–4 TB	Storage device (NVMe/SATA)
HDD	~10M+ cycles (several ms at a few GHz)	1–20 TB	Storage device (SATA)
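
The cache and RAM latencies in the table can be combined into an average memory access time (AMAT). In this sketch the latencies come from the table, but the hit rates are made-up illustrative values, and each latency is treated as the time spent at that level before falling through to the next.

```javascript
// Latencies in cycles from the table; hit rates are assumed values for illustration only.
const levels = [
  { name: "L1", latency: 4, hitRate: 0.95 },
  { name: "L2", latency: 12, hitRate: 0.80 },
  { name: "L3", latency: 40, hitRate: 0.70 },
  { name: "RAM", latency: 100, hitRate: 1.0 }, // assume RAM always hits (no page faults)
];

// AMAT = latency(L1) + missRate(L1) * (latency(L2) + missRate(L2) * (...))
function averageAccessCycles(hierarchy) {
  let amat = 0;
  for (let i = hierarchy.length - 1; i >= 0; i--) {
    amat = hierarchy[i].latency + (1 - hierarchy[i].hitRate) * amat;
  }
  return amat;
}

console.log(averageAccessCycles(levels).toFixed(1)); // "5.3" with these assumed hit rates
```

Even with a 95% L1 hit rate, the rare misses add more than a cycle to the average access, which is why miss rates matter so much.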

Cache-Friendly Code

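// C 2D arrays are row-major and contiguous (a[0][0], a[0][1], …, a[0][n-1], a[1][0], …)
// Java/JS use arrays of row arrays: each row is typically contiguous, but rows may sit far apart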
// Row-major iteration → sequential access → cache-friendly ✓
for (let i = 0; i < N; i++)
  for (let j = 0; j < N; j++)
    sum += matrix[i][j]     // accesses consecutive memory addresses

// Column-major iteration → jumps across rows → cache thrash ✗
for (let j = 0; j < N; j++)
  for (let i = 0; i < N; i++)
    sum += matrix[i][j]     // each access may be a cache miss

// Locality principles:
// Temporal  — recently used data is likely to be reused soon
// Spatial   — nearby memory addresses are likely to be used next
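
To see the effect of traversal order with a truly contiguous layout in JS, one option is a flat typed array indexed as i * N + j. This is a rough sketch: the size N = 512, the helper names, and the Date.now() timing are assumptions, and measured times vary with the engine and hardware.

```javascript
// N x N matrix stored row-major in one Float64Array: element (i, j) lives at index i * N + j.
const N = 512;
const matrix = new Float64Array(N * N).fill(1);

function sumRowMajor(m, n) {
  let sum = 0;
  for (let i = 0; i < n; i++)
    for (let j = 0; j < n; j++)
      sum += m[i * n + j]; // stride 1: consecutive addresses
  return sum;
}

function sumColumnMajor(m, n) {
  let sum = 0;
  for (let j = 0; j < n; j++)
    for (let i = 0; i < n; i++)
      sum += m[i * n + j]; // stride n: jumps n * 8 bytes on every step
  return sum;
}

// Coarse wall-clock timer; good enough to compare the two orders on large N.
function timeMs(fn) {
  const start = Date.now();
  const result = fn();
  return { result, ms: Date.now() - start };
}

const rowRun = timeMs(() => sumRowMajor(matrix, N));
const colRun = timeMs(() => sumColumnMajor(matrix, N));
console.log(`row-major: ${rowRun.ms} ms, column-major: ${colRun.ms} ms`);
```

Both functions compute the same sum; only the access pattern differs. Increasing N until the matrix no longer fits in cache makes the gap easier to observe.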

CISC vs RISC

Aspect	CISC (x86, x86-64)	RISC (ARM, RISC-V)
Instructions	Many, variable-length, complex	Fewer, mostly fixed-length, simple
Memory ops	Can operate directly on memory	Load/store only — register first
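Pipeline	Harder to decode and pipeline (variable instruction length)	Easier (fixed-length instructions, uniform stages)
Power	Typically higher (desktops, servers)	Typically lower (phones, tablets, Apple M-series)
Note	Modern x86 cores decode instructions into RISC-like micro-ops internally	Apple M-series chips (M1 through M4) are noted for strong perf/watt versus comparable x86 parts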
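
The "Memory ops" row can be illustrated with a toy translation: a CISC-style add with a memory operand becomes separate load, add, and store steps in a load/store style. The instruction syntax, the tmp register, and the toLoadStoreOps function are all hypothetical; real x86 decoders are far more involved.

```javascript
// Toy model only: split "ADD [addr], reg" into load / add / store steps.
function toLoadStoreOps(instr) {
  const match = /^ADD \[(\w+)\], (\w+)$/.exec(instr);
  if (!match) return [instr]; // register-only instructions pass through unchanged
  const [, addr, reg] = match;
  return [
    `LOAD  tmp, [${addr}]`,   // bring the memory operand into a register first
    `ADD   tmp, tmp, ${reg}`, // operate on registers only
    `STORE [${addr}], tmp`,   // write the result back to memory
  ];
}

console.log(toLoadStoreOps("ADD [counter], R1")); // three load/store-style steps
console.log(toLoadStoreOps("ADD R1, R2, R3"));    // unchanged, already register-only
```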

Key Rules

Instruction cycle: Fetch → Decode → Execute → Writeback. Repeats billions of times/second.
Pipelining: ideal steady-state throughput = 1 instruction/cycle for a single-issue pipeline. Hazards cause stalls that reduce this.
Cache miss penalty: relative to a ~4-cycle L1 hit, L2 is ~3× slower, L3 ~10×, RAM ~25×; SSD and HDD are thousands to millions of times slower.
Write cache-friendly code: sequential memory access, struct-of-arrays over array-of-structs for hot loops.
ARM dominates mobile and is gaining server market share; RISC-V is an open-standard ISA on the rise.
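
The struct-of-arrays rule above can be sketched in JS with typed arrays. The particle fields, COUNT, and the function names are illustrative assumptions; the point is that a hot loop over one field touches one contiguous array instead of hopping between objects.

```javascript
const COUNT = 1000;

// Array-of-structs: one object per particle; the x values are interleaved with other fields.
const particlesAoS = Array.from({ length: COUNT }, (_, i) => ({ x: i, y: 0, z: 0, mass: 1 }));

// Struct-of-arrays: one contiguous typed array per field.
const particlesSoA = {
  x: Float64Array.from({ length: COUNT }, (_, i) => i),
  y: new Float64Array(COUNT),
  z: new Float64Array(COUNT),
  mass: new Float64Array(COUNT).fill(1),
};

function sumXAoS(particles) {
  let sum = 0;
  for (let i = 0; i < particles.length; i++) sum += particles[i].x; // one object lookup per element
  return sum;
}

function sumXSoA(particles) {
  let sum = 0;
  const xs = particles.x;
  for (let i = 0; i < xs.length; i++) sum += xs[i]; // sequential reads from one array
  return sum;
}

console.log(sumXAoS(particlesAoS), sumXSoA(particlesSoA)); // both 499500
```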