Computer Architecture · contentintech

Why Architecture Matters to Developers

Understanding how the CPU, memory, and cache work lets you write faster code. Two O(n) algorithms can differ by 10× in practice when one has good cache behaviour and the other has poor locality, even though their Big-O is identical. Architecture knowledge also underpins OS design, embedded systems, and systems programming.

CPU Components

The CPU (Central Processing Unit) is the brain of the computer. It executes instructions: arithmetic, logic, memory access, and control flow. Modern CPUs are incredibly complex, but the stored-program (von Neumann) model from the 1940s still applies.

Component	Role
ALU (Arithmetic Logic Unit)	Performs arithmetic (+, -, *, /) and logic (AND, OR, NOT, XOR) operations
Control Unit	Fetches instructions from memory, decodes them, coordinates ALU, registers, and memory
Registers	Tiny, ultra-fast storage inside the CPU (PC, SP, general-purpose: RAX, RBX…). Access in ~1 cycle
Program Counter (PC)	Holds the address of the next instruction to execute
Stack Pointer (SP)	Points to the top of the stack, which holds the frames for function calls and local variables

The Instruction Cycle

Every instruction a CPU executes follows the same cycle, repeating billions of times per second.

1. FETCH     - Read the instruction at address [PC] from memory
2. DECODE    - Interpret the instruction (opcode + operands)
3. EXECUTE   - Perform the operation: the ALU computes a result, a comparison, or a memory address (add, compare, load, store…)
4. WRITEBACK - Write results back to registers or memory

Each cycle also updates the PC: PC ← PC + instruction_size  (or the branch target for taken jumps)

// Example: "ADD R1, R2, R3" → R1 = R2 + R3
// 1. Fetch: read the encoded instruction (opcode 0x01 plus register fields R1, R2, R3) from memory
// 2. Decode: "this is an ADD instruction on registers"
// 3. Execute: ALU computes R2 + R3
// 4. Writeback: store result in R1
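
The four steps above can be sketched as a toy interpreter loop. This is a minimal illustration, not a model of any real CPU: the tuple-based instruction format, the opcode names (ADD, SUB, HALT), and the dictionary used as a register file are all assumptions made for the example.

```python
# Toy fetch-decode-execute-writeback loop.
# Assumptions: instructions are (opcode, dest, src1, src2) tuples and the
# register file is a plain dict. Real hardware works on binary encodings.

def run(program, registers):
    """Execute the program and return the updated register dict."""
    pc = 0  # program counter: index of the next instruction
    while pc < len(program):
        instruction = program[pc]                 # 1. FETCH
        opcode, dest, src1, src2 = instruction    # 2. DECODE
        pc += 1                                   # advance PC to the next instruction
        if opcode == "ADD":                       # 3. EXECUTE
            result = registers[src1] + registers[src2]
        elif opcode == "SUB":
            result = registers[src1] - registers[src2]
        elif opcode == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        registers[dest] = result                  # 4. WRITEBACK
    return registers


regs = run(
    [("ADD", "R1", "R2", "R3"), ("SUB", "R4", "R1", "R2"), ("HALT", None, None, None)],
    {"R1": 0, "R2": 5, "R3": 7, "R4": 0},
)
print(regs["R1"], regs["R4"])  # 12 7
```

The second instruction reads R1 right after the first one writes it, which is harmless in this one-at-a-time loop but matters once instructions overlap in a pipeline.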

Pipelining

Instead of waiting for one instruction to fully complete before fetching the next, pipelining overlaps stages. Think of a car wash assembly line — while one car is being rinsed, another is being soaped.

Without pipelining (4 instructions, 4 stages each = 16 cycles):
Cycle:  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
I1:     F   D   E   W
I2:                 F   D   E   W
I3:                             F   D   E   W
I4:                                         F   D   E   W

With 4-stage pipeline (4 instructions = 7 cycles):
Cycle:  1   2   3   4   5   6   7
I1:     F   D   E   W
I2:     -   F   D   E   W
I3:     -   -   F   D   E   W
I4:     -   -   -   F   D   E   W

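// n instructions on a k-stage pipeline take k + (n - 1) cycles: 4 + 3 = 7 here
// Ideal throughput = 1 instruction/cycle once the pipeline is full (no stalls)
// vs 1 instruction/4 cycles without pipelining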
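
The cycle counts in the two diagrams follow from simple arithmetic, sketched below. The function names are made up for this example, and the model assumes an ideal pipeline with one cycle per stage and no stalls.

```python
# Cycle counts for an ideal pipeline (no hazards, one cycle per stage).
# Function names are illustrative, not from any library.

def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages):
    """The first instruction takes n_stages cycles; each later one finishes 1 cycle after."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


print(cycles_unpipelined(4, 4))  # 16
print(cycles_pipelined(4, 4))    # 7

# With many instructions the speedup approaches the number of stages.
speedup = cycles_unpipelined(1000, 4) / cycles_pipelined(1000, 4)
print(round(speedup, 2))
```

The speedup for 4 instructions is only 16/7, because filling the pipeline costs 3 cycles; the fill cost becomes negligible as the instruction count grows.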

Pipeline Hazards

Hazard	Cause	Solution
Data Hazard	I2 needs a value that I1 hasn't written back yet	Forwarding (pass result early), stall (insert bubble)
Control Hazard	Branch: pipeline fetches wrong instructions	Branch prediction; flush on misprediction
Structural Hazard	Two stages need the same hardware simultaneously	Duplicate resources (separate I-cache and D-cache)
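
As a rough illustration of the data hazard row, the sketch below scans a program for adjacent instructions where the second reads a register the first writes (a read-after-write dependency). The tuple format and the function name are assumptions for this example; real hardware detects the dependency in the pipeline itself and resolves it by forwarding or stalling.

```python
# Detect read-after-write (RAW) dependencies between adjacent instructions.
# Assumption: instructions are (opcode, dest, src1, src2) tuples.

def find_adjacent_raw_hazards(program):
    """Return (producer_index, consumer_index, register) for each adjacent RAW pair."""
    hazards = []
    for i in range(len(program) - 1):
        _, dest, _, _ = program[i]
        _, _, src1, src2 = program[i + 1]
        if dest is not None and dest in (src1, src2):
            hazards.append((i, i + 1, dest))
    return hazards


program = [
    ("ADD", "R1", "R2", "R3"),  # writes R1
    ("SUB", "R4", "R1", "R5"),  # reads R1 before the ADD has written it back
    ("AND", "R6", "R7", "R8"),  # independent of the instructions above
]
print(find_adjacent_raw_hazards(program))  # [(0, 1, 'R1')]
```

Only adjacent pairs are checked here; whether instructions further apart also conflict depends on the pipeline depth and on whether forwarding is available.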

Memory Hierarchy

Memory is organised in a hierarchy: the closer a level sits to the CPU, the faster, smaller, and more expensive per byte it is. The key principle is locality: recently used data is likely to be reused (temporal locality), and data near a recent access is likely to be used soon (spatial locality). Caches exploit both.

Level	Typical latency (order of magnitude)	Typical size	Where
Registers	~1 cycle	~1 KB	Inside CPU core
L1 Cache	~4 cycles	32–64 KB per core	On-chip, per core
L2 Cache	~12 cycles	256 KB – 1 MB per core	On-chip, per core
L3 Cache	~40 cycles	4–32 MB shared	On-chip, shared across cores
RAM (DRAM)	~100 cycles	8–64 GB	On motherboard
SSD	~10,000 cycles	256 GB – 4 TB	External to CPU
HDD	~5M cycles	1–20 TB	External to CPU

Cache-friendly code

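Row-major vs column-major iteration: in C, a 2D array is stored row by row in one contiguous block; in Java and JavaScript, a 2D array is an array of row arrays, so the elements within each row are still adjacent. Iterating row by row accesses consecutive memory, which is cache-friendly. Iterating column by column jumps from row to row, so each access may miss the cache. On matrices too large to fit in cache, this single change can make matrix multiplication around 5× faster.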
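
The effect can be reproduced with a toy cache model that counts misses for both traversal orders over a row-major array. Everything here is an assumption chosen for illustration: a 64-byte cache line, 8-byte elements, and a tiny fully associative LRU cache of 8 lines. Real caches are larger and set-associative, so treat the numbers as a demonstration of the pattern only.

```python
from collections import OrderedDict

# Toy model: count cache misses for a sequence of byte addresses.
# Assumptions: 64-byte lines, fully associative, LRU replacement, 8 lines total.

def count_misses(addresses, line_size=64, n_lines=8):
    cache = OrderedDict()  # keys are line numbers, ordered oldest to newest
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > n_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


def row_major_addresses(rows, cols, elem_size=8):
    """Visit a row-major array row by row (consecutive addresses)."""
    return [(r * cols + c) * elem_size for r in range(rows) for c in range(cols)]


def column_major_addresses(rows, cols, elem_size=8):
    """Visit the same row-major array column by column (stride of one full row)."""
    return [(r * cols + c) * elem_size for c in range(cols) for r in range(rows)]


ROWS, COLS = 64, 64
row_misses = count_misses(row_major_addresses(ROWS, COLS))
col_misses = count_misses(column_major_addresses(ROWS, COLS))
print(row_misses, col_misses)  # 512 4096
```

Row order misses once per cache line (8 elements share each line), while column order touches 64 different lines before returning to the first, so every access misses in this small cache.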

CISC vs RISC

Aspect	CISC (x86, x86-64)	RISC (ARM, RISC-V)
Instructions	Many, variable-length, complex	Fewer, mostly fixed-length, simple
Memory ops	Can operate on memory directly	Load/store only — must move to register first
Pipeline-friendly	Harder (variable instruction length)	Easier (uniform stages)
Power	Typically higher (desktops, servers)	Typically lower (phones, tablets, Apple M-series)
Modern note	Modern x86 CPUs internally translate to RISC-like micro-ops	ARM dominates mobile; Apple Silicon (M1/M4) often outperforms x86 in perf/watt
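
The "Memory ops" and "Modern note" rows can be illustrated with a toy translator that splits a memory-operand instruction into load, operate, and store steps. This is a hypothetical sketch: the tuple format, the bracket notation for memory operands, the `tmp` register, and the function name are all invented here, and real x86 micro-op decoding is far more involved.

```python
# Toy translation of a CISC-style memory-operand ADD into RISC-like steps.
# Assumptions: instructions are (opcode, dest, src) tuples meaning
# dest = dest op src, and a memory operand is written as "[address]".

def to_micro_ops(instruction):
    opcode, dest, src = instruction
    if opcode == "ADD" and dest.startswith("["):
        return [
            ("LOAD", "tmp", dest),   # bring the memory value into a register
            ("ADD", "tmp", src),     # operate on registers only
            ("STORE", dest, "tmp"),  # write the result back to memory
        ]
    return [instruction]  # register-only instructions pass through unchanged


print(to_micro_ops(("ADD", "[0x1000]", "R1")))
print(to_micro_ops(("ADD", "R2", "R1")))
```

A load/store architecture would require the programmer or compiler to write the three steps explicitly, which is the distinction the table draws.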

Key Takeaways

The instruction cycle — Fetch, Decode, Execute, Writeback — is the fundamental loop of every CPU.
Pipelining overlaps the stages of different instructions, so several instructions are in flight at once and throughput improves drastically.
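Cache behaviour is often the dominant factor in real-world performance, so write cache-friendly code (sequential access patterns).
Using the figures in the memory hierarchy table, an L1 hit costs ~4 cycles, an L2 hit ~12 cycles (about 3× slower), a RAM access ~100 cycles (about 25× slower), and an SSD access ~10,000 cycles (about 2,500× slower).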
RISC dominates mobile and increasingly servers (ARM/RISC-V); CISC (x86) dominates desktops and traditional servers.