The CPU in 20 Minutes

Type this right now

// save as step.c, compile: gcc -g -O0 -o step step.c
#include <stdio.h>
int main() {
    int a = 10;
    int b = 32;
    int x = a + b;
    printf("x = %d\n", x);
    return 0;
}

Now run it under GDB:

$ gdb ./step
(gdb) break main
(gdb) run
(gdb) info registers
(gdb) si
(gdb) info registers

Watch the rip register change. That register is the CPU's bookmark — it just moved to the next instruction. You're watching the CPU work, one step at a time.


The only thing a CPU does

A CPU does exactly one thing, over and over, billions of times per second:

    ┌──────────────────────────┐
    │                          │
    │   1. FETCH instruction   │◄──── Read bytes at address in rip
    │          │               │
    │          ▼               │
    │   2. DECODE instruction  │◄──── Figure out what the bytes mean
    │          │               │
    │          ▼               │
    │   3. EXECUTE instruction │◄──── Do the thing (add, move, jump...)
    │          │               │
    │          ▼               │
    │   4. Advance rip         │◄──── Point to the next instruction
    │          │               │
    │          └───────────────┘
    └──────────────────────────┘

Fetch. Decode. Execute. Advance. That's it. Every program you've ever used — your web browser, your OS, a video game — is just this loop running at 3+ billion iterations per second.


Registers: the CPU's own storage

Before the CPU can add two numbers, those numbers need to be inside the CPU. RAM is far away (relatively speaking). So the CPU has its own tiny, blazing-fast storage: registers.

On x86-64, you have 16 general-purpose registers, each 64 bits (8 bytes) wide:

    ┌──────────────────────────────────────────────┐
    │               x86-64 Registers               │
    ├──────────┬───────────────────────────────────┤
    │   rax    │  Accumulator, return values       │
    │   rbx    │  General purpose (callee-saved)   │
    │   rcx    │  Counter, 4th argument            │
    │   rdx    │  Data, 3rd argument               │
    │   rsi    │  Source index, 2nd argument       │
    │   rdi    │  Destination index, 1st argument  │
    │   rbp    │  Base pointer (frame pointer)     │
    │   rsp    │  Stack pointer ◄── top of stack   │
    │   r8-r15 │  Additional general-purpose regs  │
    ├──────────┼───────────────────────────────────┤
    │   rip    │  Instruction pointer (CPU's       │
    │          │  bookmark, points to NEXT instr.) │
    │   rflags │  Status flags (zero, carry, etc.) │
    └──────────┴───────────────────────────────────┘

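When GDB shows you info registers, you're seeing the core of the CPU's working state: the general-purpose registers, rip, and the flags. (Floating-point and vector registers are left out of that listing; info all-registers shows those too.)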

Fun Fact: All 16 general-purpose registers together hold just 128 bytes. Your smallest source file is probably bigger. Yet these 128 bytes are where all the real work happens.


Two special registers you must know

rip — the instruction pointer. The CPU is reading a list of instructions, one after another. rip tells it where it currently is.

    Address        Instruction
    ─────────────────────────────
    0x401000       mov  rax, 10       ◄── rip points here
    0x401007       mov  rbx, 32
    0x40100e       add  rax, rbx

After executing mov rax, 10, the CPU advances rip to 0x401007. Jump instructions (jmp, je, call) change rip to a different address — that's how if, for, and function calls work.

rsp — the stack pointer. Always points to the top of the current function's stack frame. Moves down on function calls, back up on returns.


Inside the CPU: the big picture

    ┌──────────────────────────────────────────────────┐
    │                      CPU                          │
    │                                                   │
    │   ┌──────────────┐      ┌──────────────────────┐ │
    │   │ Control Unit │      │     Registers        │ │
    │   │              │      │  rax rbx rcx rdx     │ │
    │   │ Fetches and  │      │  rsi rdi rbp rsp     │ │
    │   │ decodes      │      │  r8-r15              │ │
    │   │ instructions │      │  rip  rflags         │ │
    │   └──────┬───────┘      └──────────┬──────────┘ │
    │          │                          │            │
    │          ▼                          ▼            │
    │   ┌──────────────────────────────────────┐      │
    │   │     ALU (Arithmetic Logic Unit)       │      │
    │   │   add, sub, mul, and, or, xor, cmp   │      │
    │   └──────────────────┬───────────────────┘      │
    └──────────────────────┼───────────────────────────┘
                    ┌──────┴──────┐
                    │ Memory Bus  │
                    └──────┬──────┘
                    ┌──────┴──────┐
                    │     RAM     │
                    └─────────────┘

Control Unit fetches and decodes. ALU does math. Registers hold the data. Memory bus connects to RAM — fast, but much slower than registers.


How x = a + b really executes

C and Rust side by side:

// C
int a = 10;
int b = 32;
int x = a + b;

// Rust
#![allow(unused)]
fn main() {
    let a: i32 = 10;
    let b: i32 = 32;
    let x: i32 = a + b;
}

Both compile to something like this (simplified, -O0; a Rust debug build will differ in details, such as an added overflow check):

mov  DWORD PTR [rbp-4], 10     ; store 10 on the stack (a)
mov  DWORD PTR [rbp-8], 32     ; store 32 on the stack (b)
mov  eax, DWORD PTR [rbp-4]    ; load a into eax
add  eax, DWORD PTR [rbp-8]    ; add b to eax
mov  DWORD PTR [rbp-12], eax   ; store result as x

What do you think happens? Why rbp-4, rbp-8, rbp-12? Why negative offsets?

Reveal: The stack grows downward. rbp is the base of the current frame. Local variables go at decreasing addresses below it.

Trace the CPU's state at each step:

    Instruction                  │ eax   │ [rbp-4] │ [rbp-8] │ [rbp-12]
    ─────────────────────────────┼───────┼─────────┼─────────┼─────────
    mov DWORD PTR [rbp-4], 10   │  ???  │   10    │   ???   │   ???
    mov DWORD PTR [rbp-8], 32   │  ???  │   10    │    32   │   ???
    mov eax, DWORD PTR [rbp-4]  │  10   │   10    │    32   │   ???
    add eax, DWORD PTR [rbp-8]  │  42   │   10    │    32   │   ???
    mov DWORD PTR [rbp-12], eax │  42   │   10    │    32   │    42

Five instructions. That's what x = a + b actually is.


Clock speed: how fast is fast?

    3 GHz = 3,000,000,000 ticks/second
    One tick = 0.33 nanoseconds ≈ time for light to travel 10 cm

A simple add takes 1 cycle. A memory load from RAM can take hundreds of cycles. This is why caches matter — next chapter.

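Fun Fact: At 3 GHz, your CPU ticks through about 90 billion clock cycles (3 billion × 30) during a 30-second YouTube ad. At roughly one instruction per cycle, that's on the order of 100 billion instructions, and each one went through fetch-decode-execute.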


Seeing it with your own eyes: GDB

$ gcc -g -O0 -o step step.c
$ gdb ./step
(gdb) break main
(gdb) run
(gdb) info registers rip rsp
rip            0x401136    0x401136 <main+4>
rsp            0x7fffffffde10  0x7fffffffde10
(gdb) si
(gdb) info registers rip
rip            0x40113d    0x40113d <main+11>

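rip moved from 0x401136 to 0x40113d, which is 7 bytes forward. The instruction was 7 bytes long. The CPU read it, executed it, and advanced rip by exactly that amount. (Your addresses will probably differ from these; the differences between them are what matter.)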

(gdb) si
(gdb) info registers rip
rip            0x401144    0x401144 <main+18>

You're watching fetch-decode-execute happen in real time.

The same works in Rust:

fn main() {
    let a: i32 = 10;
    let b: i32 = 32;
    let x: i32 = a + b;
    println!("x = {}", x);
}

$ rustc -g -o step_rs step.rs
$ gdb ./step_rs
(gdb) break step::main
(gdb) run
(gdb) si
(gdb) info registers rip rsp

Same CPU. Same registers. Same loop. The language is different; the CPU doesn't care.


Recap

    ┌─────────────────────────────────────────────────┐
    │  The CPU does ONE thing:                         │
    │  Fetch → Decode → Execute → Advance rip          │
    │                                                   │
    │  It works with:                                   │
    │  • Registers (tiny, fast, inside the CPU)        │
    │  • RAM (huge, slow, outside the CPU)             │
    │                                                   │
    │  Key registers:                                   │
    │  • rip  = where am I in the instruction list?    │
    │  • rsp  = where is the top of the stack?         │
    │  • rax  = general workhorse, return values       │
    │                                                   │
    │  Clock speed (3 GHz) = 3 billion cycles/sec      │
    │  Simple instruction = ~1 cycle = ~0.3 ns         │
    └─────────────────────────────────────────────────┘

Task

  1. Compile step.c with gcc -g -O0 -o step step.c.
  2. Open it in GDB: gdb ./step.
  3. Set a breakpoint on main, run, and use si to step through 10 instructions.
  4. After each step, run info registers rip rax rbp rsp.
  5. Write down how rip changes. Is each increment the same? (No — instructions have different lengths.)
  6. Find the instruction where eax becomes 42. That's the add.

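Bonus: Run objdump -d step | less and find the main function. Match each instruction address with what GDB showed you. They should be identical if the binary is not position-independent. If GDB shows much larger addresses (for example, starting with 0x5555...), your toolchain built a position-independent executable and loaded it at a base address; recompile with gcc -no-pie -g -O0 -o step step.c, or compare only the offsets within main.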

In the next chapter, we'll find out why loading data from memory is 100x to 1000x slower than working with registers — and how caches save us from that penalty.