Intermediate, ~15 min read

Computer Architecture

CPU design, instruction cycle, pipelining, cache hierarchy, and memory systems.

CPU, Pipelining, Cache, Memory Hierarchy

Why Architecture Matters to Developers

Understanding how the CPU, memory, and cache work lets you write faster code. Two O(n) algorithms, one with good cache behaviour and one with poor locality, can differ by 10× or more in running time even though they have the same Big-O complexity. Architecture knowledge also underpins OS design, embedded systems, and systems programming.

CPU Components

The CPU (Central Processing Unit) is the brain of the computer. It executes instructions: arithmetic, logic, memory access, and control flow. Modern CPUs are incredibly complex, but the stored-program (von Neumann) model from the 1940s still describes them conceptually.

Component | Role
ALU (Arithmetic Logic Unit) | Performs arithmetic (+, -, *, /) and logic (AND, OR, NOT, XOR) operations
Control Unit | Fetches instructions from memory, decodes them, coordinates ALU, registers, and memory
Registers | Tiny, ultra-fast storage inside the CPU (PC, SP, general-purpose: RAX, RBX…). Access in ~1 cycle
Program Counter (PC) | Holds the address of the next instruction to execute
Stack Pointer (SP) | Points to the top of the current stack frame (for function calls and local variables)

The Instruction Cycle

Every instruction a CPU executes follows the same cycle, repeating billions of times per second.

1. FETCH    — Read the instruction at address [PC] from memory
2. DECODE   — Interpret the instruction (opcode + operands)
3. EXECUTE  — ALU performs the operation (add, compare, load, store…)
4. WRITEBACK — Write results back to registers or memory

After EXECUTE: PC ← PC + instruction_size  (or branch target for jumps)

// Example: "ADD R1, R2, R3" → R1 = R2 + R3
// 1. Fetch: read the instruction word (opcode 0x01 plus register fields R1, R2, R3) from memory
// 2. Decode: "this is an ADD instruction on registers"
// 3. Execute: ALU computes R2 + R3
// 4. Writeback: store result in R1
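
As a rough illustration of the four steps above, the following Python sketch runs a toy CPU. The ISA is hypothetical: the tuple encoding, the opcodes `ADD`, `SUB` and `HALT`, the `ToyCPU` class and its four registers are assumptions made for the example, not a real instruction set. The sketch advances PC right after the fetch, which gives the same result as updating it after execute because this toy ISA has no branches.

```python
from dataclasses import dataclass, field

# Hypothetical toy ISA: each instruction is a tuple (opcode, dest, src1, src2)
# and occupies one memory slot, so instruction_size is 1 here.
ADD, SUB, HALT = 0x01, 0x02, 0xFF


@dataclass
class ToyCPU:
    memory: list  # program memory: a list of instruction tuples
    regs: dict = field(default_factory=lambda: {f"R{i}": 0 for i in range(4)})
    pc: int = 0   # program counter

    def step(self) -> bool:
        """Run one fetch/decode/execute/writeback cycle. Returns False on HALT."""
        # 1. FETCH: read the instruction at address [PC], then advance PC.
        instruction = self.memory[self.pc]
        self.pc += 1
        # 2. DECODE: split the instruction into opcode and operand fields.
        opcode, dest, src1, src2 = instruction
        if opcode == HALT:
            return False
        # 3. EXECUTE: the "ALU" computes the result from the source registers.
        if opcode == ADD:
            result = self.regs[src1] + self.regs[src2]
        elif opcode == SUB:
            result = self.regs[src1] - self.regs[src2]
        else:
            raise ValueError(f"unknown opcode {opcode:#x}")
        # 4. WRITEBACK: store the result in the destination register.
        self.regs[dest] = result
        return True

    def run(self) -> None:
        while self.step():
            pass


program = [
    (ADD, "R1", "R2", "R3"),      # R1 = R2 + R3
    (SUB, "R0", "R1", "R2"),      # R0 = R1 - R2
    (HALT, None, None, None),
]
cpu = ToyCPU(memory=program)
cpu.regs.update({"R2": 5, "R3": 7})
cpu.run()
print(cpu.regs["R1"], cpu.regs["R0"])  # 12 7
```

Real hardware does the same work with wires and latches rather than a Python loop, but the order of the steps is the same.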

Pipelining

Instead of waiting for one instruction to fully complete before fetching the next, pipelining overlaps stages. Think of a car wash assembly line — while one car is being rinsed, another is being soaped.

Without pipelining (4 instructions, 4 stages each = 16 cycles):
Cycle:  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
I1:     F   D   E   W
I2:                 F   D   E   W
I3:                             F   D   E   W
I4:                                         F   D   E   W

With 4-stage pipeline (4 instructions = 7 cycles):
Cycle:  1   2   3   4   5   6   7
I1:     F   D   E   W
I2:     -   F   D   E   W
I3:     -   -   F   D   E   W
I4:     -   -   -   F   D   E   W

// Throughput = 1 instruction/cycle (after pipeline fills)
// vs 1 instruction/4 cycles without pipelining
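
The cycle counts in the two diagrams follow a simple pattern: without pipelining, n instructions on k stages take n × k cycles; with an ideal pipeline they take k + (n − 1) cycles, since the first instruction needs k cycles and each later one finishes one cycle after its predecessor. The sketch below checks this under the assumption of no stalls; the function names are made up for the example.

```python
def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    """Ideal pipeline: k cycles to fill, then one instruction completes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def speedup(n_instructions: int, n_stages: int) -> float:
    """Ratio of unpipelined to pipelined cycles (requires n_instructions >= 1)."""
    return cycles_unpipelined(n_instructions, n_stages) / cycles_pipelined(
        n_instructions, n_stages
    )


print(cycles_unpipelined(4, 4), cycles_pipelined(4, 4))  # 16 7
print(speedup(1000, 4))  # approaches 4 (the stage count) as n grows
```

For long instruction streams the speedup tends towards the number of stages, which is why the fill time of the pipeline matters less and less.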

Pipeline Hazards

Hazard | Cause | Solution
Data Hazard | I2 needs a value that I1 hasn't written back yet | Forwarding (pass result early), stall (insert bubble)
Control Hazard | Branch: pipeline fetches wrong instructions | Branch prediction; flush on misprediction
Structural Hazard | Two stages need the same hardware simultaneously | Duplicate resources (separate I-cache and D-cache)
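
To make the data hazard row concrete, here is a minimal sketch that counts cycles for an in-order 4-stage pipeline (F, D, E, W) with and without forwarding. It rests on several simplifying assumptions that are not in the text above: registers are read in D, results are written in W and only become readable in the following cycle, all instructions are single-cycle ALU operations, and forwarding passes a result from the end of one E stage into the next. The `Instr` type and `total_cycles` function are hypothetical names.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str     # register written by the instruction
    srcs: tuple   # registers read by the instruction


def total_cycles(program, forwarding: bool) -> int:
    """Cycles needed to run `program` on an in-order F/D/E/W pipeline."""
    if not program:
        return 0
    decode_cycle = 1   # so that the first instruction decodes in cycle 2
    available = {}     # register -> earliest decode cycle that can use its value
    for ins in program:
        # In-order issue: at most one instruction enters D per cycle.
        earliest = decode_cycle + 1
        # Data hazard: wait (stall) until every source value is usable.
        for reg in ins.srcs:
            earliest = max(earliest, available.get(reg, 0))
        decode_cycle = earliest
        if forwarding:
            # Result is forwarded from E, so the next instruction can use it.
            available[ins.dest] = decode_cycle + 1
        else:
            # Result is written in W (decode + 2) and readable one cycle later.
            available[ins.dest] = decode_cycle + 3
    return decode_cycle + 2   # W stage of the last instruction


program = [
    Instr("R1", ("R2", "R3")),   # I1: R1 = R2 + R3
    Instr("R4", ("R1", "R5")),   # I2: needs R1 from I1 (data hazard)
    Instr("R6", ("R7", "R8")),   # I3: independent
    Instr("R9", ("R4", "R1")),   # I4: needs R4 from I2
]
print(total_cycles(program, forwarding=False))  # 10 (3 stall cycles)
print(total_cycles(program, forwarding=True))   # 7  (the ideal count)
```

Under these assumptions forwarding removes every stall; real pipelines still stall in some cases, for example when an instruction needs a value that a load has not yet fetched from memory.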

Memory Hierarchy

Memory is organised in a hierarchy: the closer a level is to the CPU, the faster it is, but also the more expensive per byte and therefore the smaller. The key principle is locality: recently used data is likely to be reused (temporal locality), and data near a recent access is likely to be used soon (spatial locality). Caches exploit both.

Level | Latency (approx.) | Size | Where
Registers | ~1 cycle | ~1 KB | Inside CPU core
L1 Cache | ~4 cycles | 32–64 KB per core | On-chip, per core
L2 Cache | ~12 cycles | 256 KB – 1 MB per core | On-chip, per core
L3 Cache | ~40 cycles | 4–32 MB shared | On-chip, shared across cores
RAM (DRAM) | ~100–300 cycles | 8–64 GB | On motherboard
SSD | ~10,000–100,000+ cycles | 256 GB – 4 TB | External to CPU
HDD | ~5M+ cycles | 1–20 TB | External to CPU
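
One way to see why hit rates matter so much is the average memory access time (AMAT): each level contributes its own latency plus its miss rate times the cost of going one level further out. The sketch below uses the approximate cache latencies from the table and takes 100 cycles for RAM. The hit rates are invented for illustration, and treating each latency as an additive cost per level is a simplifying assumption.

```python
def amat(levels, memory_latency):
    """Average memory access time in cycles.

    levels: list of (hit_latency_cycles, hit_rate) pairs, ordered from L1 outward.
    memory_latency: cost in cycles of going all the way to RAM.
    """
    cost = memory_latency
    # Work from the outermost cache inward:
    # cost(level) = hit_latency + miss_rate * cost(next level out)
    for hit_latency, hit_rate in reversed(levels):
        cost = hit_latency + (1.0 - hit_rate) * cost
    return cost


# Hypothetical hit rates for code with good and poor locality.
good_locality = [(4, 0.95), (12, 0.80), (40, 0.50)]
poor_locality = [(4, 0.60), (12, 0.50), (40, 0.50)]

print(round(amat(good_locality, 100), 2))  # 5.5 cycles per access
print(round(amat(poor_locality, 100), 2))  # 26.8 cycles per access
```

With these made-up numbers, dropping the L1 hit rate from 95% to 60% makes the average access almost five times slower, even though the hardware is identical.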

Cache-friendly code

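Row-major vs column-major iteration: in C, 2D arrays are stored row by row, and in Java and JavaScript a 2D array is an array of row arrays, so the elements of each row are typically still contiguous. Iterating row by row accesses consecutive memory, which is cache-friendly. Iterating column by column jumps from row to row, so each access may miss the cache. On large matrices, this single change can make matrix multiplication around 5× faster.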
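
The effect can be reproduced without timing anything by simulating a cache and counting misses. The sketch below models a small direct-mapped cache (64 lines of 64 bytes, an arbitrary toy configuration) and feeds it the byte addresses touched by a row-by-row and a column-by-column walk over a 256 × 256 matrix of 8-byte elements stored row-major. All names and sizes are assumptions chosen for the example; real caches are larger and set-associative, so real ratios differ.

```python
def count_misses(addresses, line_size=64, num_lines=64):
    """Simulate a direct-mapped cache and count misses for a stream of byte addresses."""
    tags = [None] * num_lines   # which memory block each cache line currently holds
    misses = 0
    for addr in addresses:
        block = addr // line_size   # memory block containing this address
        index = block % num_lines   # the only cache line this block may occupy
        if tags[index] != block:
            tags[index] = block     # miss: fetch the block, evicting the old one
            misses += 1
    return misses


def row_by_row(rows, cols, elem_size=8):
    """Addresses touched when the inner loop runs along a row."""
    for r in range(rows):
        for c in range(cols):
            yield (r * cols + c) * elem_size


def column_by_column(rows, cols, elem_size=8):
    """Addresses touched when the inner loop runs down a column."""
    for c in range(cols):
        for r in range(rows):
            yield (r * cols + c) * elem_size


row_misses = count_misses(row_by_row(256, 256))
col_misses = count_misses(column_by_column(256, 256))
print(row_misses, col_misses)  # 8192 65536
```

Row-by-row traversal misses once per 64-byte line (one miss per 8 elements), while column-by-column traversal misses on every access in this toy configuration, because consecutive accesses are 2,048 bytes apart and keep evicting each other.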

CISC vs RISC

Aspect | CISC (x86, x86-64) | RISC (ARM, RISC-V)
Instructions | Many, variable-length, complex | Few, fixed-length, simple
Memory ops | Can operate on memory directly | Load/store only: must move to register first
Pipeline-friendly | Harder (variable instruction length) | Easier (uniform stages)
Power | Higher (desktops, servers) | Lower (phones, tablets, Apple M-series)
Modern note | Modern x86 CPUs internally translate to RISC-like micro-ops | ARM dominates mobile; Apple Silicon (M1/M4) outperforms x86 in perf/watt
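
The "Memory ops" and "Modern note" rows can be illustrated with a toy translation from a CISC-style instruction that operates on memory into load/store-style micro-ops. This is a purely conceptual sketch: the tuple encoding, the temporary registers `T0` and `T1`, and the `expand` function are hypothetical and do not describe how any real x86 decoder works.

```python
def is_mem(operand) -> bool:
    """A memory operand is written as ("mem", address) in this toy encoding."""
    return isinstance(operand, tuple) and operand[0] == "mem"


def expand(op, dest, src):
    """Expand `dest = dest <op> src` into load/store-style micro-ops.

    Operands are register names (strings) or ("mem", address) tuples.
    T0 and T1 are hypothetical temporary registers.
    """
    micro_ops = []
    src_reg = src
    if is_mem(src):
        # A memory source must first be loaded into a register.
        micro_ops.append(("LOAD", "T1", src))
        src_reg = "T1"
    if is_mem(dest):
        # Read-modify-write on memory becomes load, operate, store.
        micro_ops.append(("LOAD", "T0", dest))
        micro_ops.append((op, "T0", "T0", src_reg))
        micro_ops.append(("STORE", dest, "T0"))
    else:
        micro_ops.append((op, dest, dest, src_reg))
    return micro_ops


# CISC-style: add a register to a value in memory, in one instruction.
for micro_op in expand("ADD", ("mem", 0x1000), "RAX"):
    print(micro_op)
# ('LOAD', 'T0', ('mem', 4096))
# ('ADD', 'T0', 'T0', 'RAX')
# ('STORE', ('mem', 4096), 'T0')
```

A register-to-register instruction passes through unchanged as a single micro-op, which is the form a load/store ISA uses for all of its arithmetic.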

Key Takeaways

  • The instruction cycle — Fetch, Decode, Execute, Writeback — is the fundamental loop of every CPU.
  • Pipelining overlaps the stages of successive instructions, raising ideal throughput to one instruction per cycle.
  • Cache behaviour is often the dominant factor in real-world performance, so write cache-friendly code (sequential access patterns).
  • Latency grows steeply down the hierarchy: an L1 hit costs ~4 cycles, an L2 hit ~12, an L3 hit ~40, a RAM access ~100 or more, and an SSD access ~10,000 or more.
  • RISC (ARM, RISC-V) dominates mobile and is increasingly used in servers; CISC (x86) dominates desktops and traditional servers.