How Programs Really Run: From CPU to Process Memory

Type this right now

// save as where.c, compile: gcc -g -o where where.c
#include <stdio.h>
#include <stdlib.h>

int global = 99;

int main() {
    int stack_var = 42;
    int *heap_var = malloc(sizeof(int));
    *heap_var = 7;

    printf("Code:   %p\n", (void *)main);
    printf("Global: %p\n", (void *)&global);
    printf("Stack:  %p\n", (void *)&stack_var);
    printf("Heap:   %p\n", (void *)heap_var);

    free(heap_var);
    return 0;
}

Run it. You'll see four wildly different addresses. Those addresses tell a story — a story about how your operating system, your CPU, and your compiler conspire to make your program work.

By the end of this book, you'll know exactly why each of those addresses is where it is.


What this book is about

When you write int x = 42; in C or let x: i32 = 42; in Rust, a staggering amount of machinery activates. The compiler translates your intent into machine instructions. The linker stitches object files into a binary. The OS loader maps that binary into virtual memory. The CPU fetches instructions one by one, reading and writing to registers and RAM.

Most programmers treat all of that as a black box. This book opens the box.

We won't hand-wave. We won't say "the OS handles it" and move on. We'll show you the actual mechanism — the CPU instruction, the kernel data structure, the page table entry — and then give you a tool to observe it yourself on a real, running system.

Who this is for

You should already know how to write code in C or Rust (or both). You don't need to be an expert. If you can write a function, allocate memory, and compile a program, you have enough background.

What you probably don't know yet:

  • Why your stack variable lives at 0x7ffd... but your heap variable lives at 0x55a...
  • What the CPU actually does with mov rax, [rbp-8]
  • Why a segfault happens at the hardware level
  • What strace is showing you and why it matters
  • How Rust's borrow checker maps to the same physical reality as C's raw pointers

That's what we're here for.

C and Rust, side by side

Every concept in this book is demonstrated in both C and Rust. Not because one is better — because seeing the same low-level reality from two different languages makes both clearer.

C gives you no guardrails. You see the raw mechanism. Rust adds compile-time guarantees. You see how safety is enforced without runtime cost.

Same CPU. Same instructions. Same virtual memory. Different contracts with the programmer.

How to read this book

Every chapter follows the same structure:

  1. Something you can type and run in 30 seconds. Seeing is believing.
  2. The concept, explained with ASCII diagrams. No hand-waving.
  3. The mechanism. What the CPU/OS/compiler actually does.
  4. Code in C and Rust. Side by side.
  5. A tool to observe it live. GDB, strace, /proc, objdump, readelf.
  6. A hands-on task. You learn by doing, not by reading.

You can read front-to-back, or jump to any chapter that interests you. But Part I (The Machine) is worth reading first — everything else builds on it.

What you'll need

  • A Linux system (native, WSL2, or a VM all work)
  • GCC and Rust (rustc/cargo) installed
  • GDB, strace, objdump, readelf (standard on most Linux distros)
  • A text editor and a terminal

That's it. No special hardware. No expensive tools. Everything we use is free and open source.

The journey

Part I:   The Machine         — CPU, memory, instructions, privilege
Part II:  The Illusion        — How your program sees memory
Part III: The Binary          — ELF files, compilation, linking, loading
Part IV:  The Mechanism       — Virtual memory, page tables, page faults
Part V:   Allocation          — malloc, Rust's allocator, data layout
Part VI:  Threads and Safety  — Shared memory, C vs Rust tradeoffs
Part VII: Observe and Build   — Tools and experiments

By the end, int x = 42 won't be magic anymore. You'll know the register it passes through, the cache line it occupies, the page table entry that maps it, and the virtual address the OS assigned to it.

Let's start with the CPU.

The CPU in 20 Minutes

Type this right now

// save as step.c, compile: gcc -g -O0 -o step step.c
#include <stdio.h>
int main() {
    int a = 10;
    int b = 32;
    int x = a + b;
    printf("x = %d\n", x);
    return 0;
}

Now run it under GDB:

$ gdb ./step
(gdb) break main
(gdb) run
(gdb) info registers
(gdb) si
(gdb) info registers

Watch the rip register change. That register is the CPU's bookmark — it just moved to the next instruction. You're watching the CPU work, one step at a time.


The only thing a CPU does

A CPU does exactly one thing, over and over, billions of times per second:

    ┌──────────────────────────┐
    │                          │
    │   1. FETCH instruction   │◄──── Read bytes at address in rip
    │          │               │
    │          ▼               │
    │   2. DECODE instruction  │◄──── Figure out what the bytes mean
    │          │               │
    │          ▼               │
    │   3. EXECUTE instruction │◄──── Do the thing (add, move, jump...)
    │          │               │
    │          ▼               │
    │   4. Advance rip         │◄──── Point to the next instruction
    │          │               │
    │          └── back to 1.  │
    └──────────────────────────┘

Fetch. Decode. Execute. Advance. That's it. Every program you've ever used — your web browser, your OS, a video game — is just this loop running at 3+ billion iterations per second.


Registers: the CPU's own storage

Before the CPU can add two numbers, those numbers need to be inside the CPU. RAM is far away (relatively speaking). So the CPU has its own tiny, blazing-fast storage: registers.

On x86-64, you have 16 general-purpose registers, each 64 bits (8 bytes) wide:

    ┌──────────┬───────────────────────────────────┐
    │               x86-64 Registers               │
    ├──────────┼───────────────────────────────────┤
    │   rax    │  Accumulator, return values       │
    │   rbx    │  General purpose (callee-saved)   │
    │   rcx    │  Counter, 4th argument            │
    │   rdx    │  Data, 3rd argument               │
    │   rsi    │  Source index, 2nd argument       │
    │   rdi    │  Destination index, 1st argument  │
    │   rbp    │  Base pointer (frame pointer)     │
    │   rsp    │  Stack pointer ◄── top of stack   │
    │   r8-r15 │  Additional general-purpose regs  │
    ├──────────┼───────────────────────────────────┤
    │   rip    │  Instruction pointer (CPU's       │
    │          │  bookmark, points to NEXT instr.) │
    │   rflags │  Status flags (zero, carry, etc.) │
    └──────────┴───────────────────────────────────┘

When GDB shows you info registers, you're seeing the CPU's entire working state.

Fun Fact: All 16 general-purpose registers together hold just 128 bytes. Your smallest source file is probably bigger. Yet these 128 bytes are where all the real work happens.


Two special registers you must know

rip — the instruction pointer. The CPU is reading a list of instructions, one after another. rip tells it where it currently is.

    Address        Instruction
    ─────────────────────────────
    0x401000       mov  rax, 10       ◄── rip points here
    0x401007       mov  rbx, 32
    0x40100e       add  rax, rbx

After executing mov rax, 10, the CPU advances rip to 0x401007. Jump instructions (jmp, je, call) change rip to a different address — that's how if, for, and function calls work.

rsp — the stack pointer. Always points to the top of the current function's stack frame. Moves down on function calls, back up on returns.


Inside the CPU: the big picture

    ┌──────────────────────────────────────────────────┐
    │                      CPU                          │
    │                                                   │
    │   ┌──────────────┐      ┌──────────────────────┐ │
    │   │ Control Unit │      │     Registers        │ │
    │   │              │      │  rax rbx rcx rdx     │ │
    │   │ Fetches and  │      │  rsi rdi rbp rsp     │ │
    │   │ decodes      │      │  r8-r15              │ │
    │   │ instructions │      │  rip  rflags         │ │
    │   └──────┬───────┘      └──────────┬──────────┘ │
    │          │                          │            │
    │          ▼                          ▼            │
    │   ┌──────────────────────────────────────┐      │
    │   │     ALU (Arithmetic Logic Unit)       │      │
    │   │   add, sub, mul, and, or, xor, cmp   │      │
    │   └──────────────────┬───────────────────┘      │
    └──────────────────────┼───────────────────────────┘
                    ┌──────┴──────┐
                    │ Memory Bus  │
                    └──────┬──────┘
                    ┌──────┴──────┐
                    │     RAM     │
                    └─────────────┘

Control Unit fetches and decodes. ALU does math. Registers hold the data. Memory bus connects to RAM — fast, but much slower than registers.


How x = a + b really executes

C and Rust side by side:

int a = 10;
int b = 32;
int x = a + b;
#![allow(unused)]
fn main() {
let a: i32 = 10;
let b: i32 = 32;
let x: i32 = a + b;
}

Both compile to something like this (simplified, -O0):

mov  DWORD PTR [rbp-4], 10     ; store 10 on the stack (a)
mov  DWORD PTR [rbp-8], 32     ; store 32 on the stack (b)
mov  eax, DWORD PTR [rbp-4]    ; load a into eax
add  eax, DWORD PTR [rbp-8]    ; add b to eax
mov  DWORD PTR [rbp-12], eax   ; store result as x

What do you think happens? Why rbp-4, rbp-8, rbp-12? Why negative offsets?

Reveal: The stack grows downward. rbp is the base of the current frame. Local variables go at decreasing addresses below it.

Trace the CPU's state at each step:

    Instruction                  │ eax   │ [rbp-4] │ [rbp-8] │ [rbp-12]
    ─────────────────────────────┼───────┼─────────┼─────────┼─────────
    mov DWORD PTR [rbp-4], 10    │  ???  │   10    │   ???   │   ???
    mov DWORD PTR [rbp-8], 32    │  ???  │   10    │    32   │   ???
    mov eax, DWORD PTR [rbp-4]   │  10   │   10    │    32   │   ???
    add eax, DWORD PTR [rbp-8]   │  42   │   10    │    32   │   ???
    mov DWORD PTR [rbp-12], eax  │  42   │   10    │    32   │    42

Five instructions. That's what x = a + b actually is.


Clock speed: how fast is fast?

    3 GHz = 3,000,000,000 ticks/second
    One tick = 0.33 nanoseconds ≈ time for light to travel 10 cm

A simple add takes 1 cycle. A memory load from RAM can take hundreds of cycles. This is why caches matter — next chapter.

Fun Fact: At 3 GHz, your CPU executes roughly 100 billion instructions during a 30-second YouTube ad. Each one went through fetch-decode-execute.


Seeing it with your own eyes: GDB

$ gcc -g -O0 -o step step.c
$ gdb ./step
(gdb) break main
(gdb) run
(gdb) info registers rip rsp rax
rip            0x401136    0x401136 <main+4>
rsp            0x7fffffffde10  0x7fffffffde10
(gdb) si
(gdb) info registers rip
rip            0x40113d    0x40113d <main+11>

rip moved from 0x401136 to 0x40113d — 7 bytes forward. The instruction was 7 bytes long. The CPU read it, executed it, and advanced rip by exactly that amount.

(gdb) si
(gdb) info registers rip
rip            0x401144    0x401144 <main+18>

You're watching fetch-decode-execute happen in real time.

The same works in Rust:

fn main() {
    let a: i32 = 10;
    let b: i32 = 32;
    let x: i32 = a + b;
    println!("x = {}", x);
}
$ rustc -g -o step_rs step.rs
$ gdb ./step_rs
(gdb) break step::main
(gdb) run
(gdb) si
(gdb) info registers rip rsp

Same CPU. Same registers. Same loop. The language is different; the CPU doesn't care.


Recap

    ┌─────────────────────────────────────────────────┐
    │  The CPU does ONE thing:                         │
    │  Fetch → Decode → Execute → Advance rip          │
    │                                                   │
    │  It works with:                                   │
    │  • Registers (tiny, fast, inside the CPU)        │
    │  • RAM (huge, slow, outside the CPU)             │
    │                                                   │
    │  Key registers:                                   │
    │  • rip  = where am I in the instruction list?    │
    │  • rsp  = where is the top of the stack?         │
    │  • rax  = general workhorse, return values       │
    │                                                   │
    │  Clock speed (3 GHz) = 3 billion cycles/sec      │
    │  Simple instruction = ~1 cycle = ~0.3 ns         │
    └─────────────────────────────────────────────────┘

Task

  1. Compile step.c with gcc -g -O0 -o step step.c.
  2. Open it in GDB: gdb ./step.
  3. Set a breakpoint on main, run, and use si to step through 10 instructions.
  4. After each step, run info registers rip rax rbp rsp.
  5. Write down how rip changes. Is each increment the same? (No — instructions have different lengths.)
  6. Find the instruction where eax becomes 42. That's the add.

Bonus: Run objdump -d step | less and find the main function. Match each instruction address with what GDB showed you. They should be identical.

In the next chapter, we'll find out why loading data from memory is 100x to 1000x slower than working with registers — and how caches save us from that penalty.

The Memory Hierarchy: Why Speed Has Layers

Type this right now

// save as cache_test.c, compile: gcc -O2 -o cache_test cache_test.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE (64 * 1024 * 1024)  // 64 Mi ints (~67 million) = 256 MB

int main() {
    int *arr = malloc(SIZE * sizeof(int));
    for (int i = 0; i < SIZE; i++) arr[i] = i;

    clock_t start;
    long long sum;

    // Sequential access
    sum = 0; start = clock();
    for (int i = 0; i < SIZE; i++) sum += arr[i];
    double seq = (double)(clock() - start) / CLOCKS_PER_SEC;
    long long seq_sum = sum;   // save before the next loop reuses 'sum'

    // Strided access (every 16th element = one access per 64-byte cache line)
    sum = 0; start = clock();
    for (int i = 0; i < SIZE; i += 16) sum += arr[i];
    double stride = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("Sequential: %.3f sec  (sum=%lld)\n", seq, seq_sum);
    printf("Stride-16:  %.3f sec  (sum=%lld)\n", stride, sum);
    printf("Stride does 16x fewer adds but is NOT 16x faster.\n");
    free(arr);
}

Run it. The stride-16 loop does 16 times fewer additions but is nowhere near 16x faster. Why? Because memory access time dominates, and every 16th element is a new cache line.


The brutal reality: speed vs. size

    ┌───────────────┬──────────────┬─────────────┬────────────────────┐
    │    Level      │   Latency    │    Size      │  Slower than regs? │
    ├───────────────┼──────────────┼─────────────┼────────────────────┤
    │  Registers    │    ~0.3 ns   │   ~128 B    │        1x          │
    │  L1 Cache     │     ~1 ns    │   32-64 KB  │        3x          │
    │  L2 Cache     │     ~4 ns    │   256 KB    │       13x          │
    │  L3 Cache     │    ~12 ns    │   8-32 MB   │       40x          │
    │  RAM (DDR5)   │   ~100 ns    │   8-128 GB  │      300x          │
    │  SSD (NVMe)   │ ~100,000 ns  │   0.5-4 TB  │   300,000x         │
    │  HDD          │~10,000,000 ns│   1-20 TB   │ 30,000,000x        │
    └───────────────┴──────────────┴─────────────┴────────────────────┘

If a register access were 1 second, RAM would be 5 minutes. An SSD would be 3.5 days. A hard drive seek would be almost an entire year.


The pyramid

                      ▲
                     /|\          Registers: 128 B, 0.3 ns
                    / | \
                   /  |  \        L1: 32-64 KB, 1 ns
                  /   |   \
                 /    |    \      L2: 256 KB, 4 ns
                /     |     \
               /      |      \    L3: 8-32 MB, 12 ns
              /       |       \
             /        |        \  RAM: 8-128 GB, 100 ns
            /         |         \
           /          |          \ SSD/HDD: TBs, us-ms
          /───────────────────────\

    FASTER, SMALLER ▲        ▼ SLOWER, BIGGER

Physics forces this tradeoff. Faster storage requires transistors closer to the CPU, which limits capacity. The speed of light itself imposes a minimum latency for reaching distant storage.


Cache lines: the unit of transfer

The CPU never reads a single byte from RAM. It reads in cache lines of 64 bytes.

When you access array[0], the CPU fetches a 64-byte block containing array[0] through array[15] (assuming 4-byte ints):

    You access array[0]. RAM sends back a 64-byte cache line:
    ┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
    │ [0]│ [1]│ [2]│ [3]│ [4]│ [5]│ [6]│ [7]│ [8]│ [9]│[10]│[11]│[12]│[13]│[14]│[15]│
    └────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
    ◄──────────────────── 64 bytes ─────────────────────►

    Now array[1] through array[15] are in L1 cache: ~1 ns instead of ~100 ns.

This is spatial locality: data near what you just accessed is already cached.


Two kinds of locality

Spatial locality — accessing nearby addresses in sequence:

for (int i = 0; i < N; i++)
    sum += array[i];  // Each access is 4 bytes after the last

You pay the ~100 ns RAM penalty once per cache line and get 15 more accesses essentially free.

Temporal locality — accessing the same data again soon:

int count = 0;
for (int i = 0; i < N; i++)
    count++;  // 'count' stays in L1 the entire loop

What do you think happens? Your local variables are on the stack. Is stack memory special hardware?

Reveal: Same RAM as everything else. But the top of the stack is accessed constantly (temporal locality) in a contiguous region (spatial locality). It's always in L1 cache. That's why local variables are fast.


Why linked lists are slow

Each node is somewhere on the heap. Following each next pointer is a potential cache miss:

    Linked list (pointer chasing):
    Node A @ 0x1000     Node B @ 0x5F00      Node C @ 0x3200
    ┌─────┬──────┐      ┌─────┬──────┐       ┌─────┬──────┐
    │ val │ next─┼─────►│ val │ next─┼──────►│ val │ NULL │
    └─────┴──────┘      └─────┴──────┘       └─────┴──────┘
    Each arrow = potential cache miss = ~100 ns

    Array with 3 elements @ 0x1000:
    ┌─────┬─────┬─────┐
    │ [0] │ [1] │ [2] │   All in one cache line = ~100 ns total
    └─────┴─────┴─────┘

This is why Vec<T> in Rust and arrays in C demolish linked lists in almost every benchmark.

Fun Fact: Bjarne Stroustrup (creator of C++) showed that even for insertion in the middle — the textbook linked-list use case — std::vector wins. Finding the insertion point dominates the cost: in a vector it's a sequential, cache-friendly scan, and the O(n) memmove is sequential too, while in a list the O(n) traversal pays a potential cache miss per node. The list's O(1) pointer update can't make up for that.


Cache-friendly data layout

A game with 10,000 entities. Array of Structs — the obvious way:

struct Entity { float x, y, z; float vx, vy, vz; int health; };
struct Entity entities[10000];
#![allow(unused)]
fn main() {
struct Entity { x: f32, y: f32, z: f32, vx: f32, vy: f32, vz: f32, health: i32 }
let entities: Vec<Entity> = Vec::with_capacity(10000);
}

Struct of Arrays — cache-friendly when processing one field at a time:

struct Entities { float x[10000], y[10000], z[10000];
                  float vx[10000], vy[10000], vz[10000]; int health[10000]; };
#![allow(unused)]
fn main() {
struct Entities { x: Vec<f32>, y: Vec<f32>, z: Vec<f32>,
                  vx: Vec<f32>, vy: Vec<f32>, vz: Vec<f32>, health: Vec<i32> }
}

If your physics update only touches positions and velocities:

    AoS cache line (64 bytes):
    │ x₀ y₀ z₀ vx₀ vy₀ vz₀ hp₀ pad │ x₁ y₁ z₁ vx₁ vy₁ vz₁ hp₁ pad │
    2 entities per line, health + padding wasted

    SoA cache line (64 bytes):
    │ x₀ x₁ x₂ x₃ x₄ x₅ x₆ x₇ x₈ x₉ x₁₀ x₁₁ x₁₂ x₁₃ x₁₄ x₁₅ │
    16 x-values per line, every byte useful

SoA can be 2-4x faster for batch processing.


The stride experiment

This program shows the cache hierarchy directly:

// save as stride.c, compile: gcc -O2 -o stride stride.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ARRAY_SIZE (32 * 1024 * 1024)
int main() {
    int *arr = malloc(ARRAY_SIZE * sizeof(int));
    for (int i = 0; i < ARRAY_SIZE; i++) arr[i] = 1;
    int strides[] = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024};
    for (int s = 0; s < 11; s++) {
        int stride = strides[s];
        volatile long long sum = 0;
        clock_t start = clock();
        for (int iter = 0; iter < 10; iter++)
            for (int i = 0; i < ARRAY_SIZE; i += stride) sum += arr[i];
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        long long accesses = (long long)10 * (ARRAY_SIZE / stride);
        printf("Stride %5d: %.2f ns/access\n",
               stride, elapsed * 1e9 / accesses);
    }
    free(arr);
}

Typical results:

    Stride     1: 0.68 ns/access    ◄── sequential, everything cached
    Stride    16: 2.85 ns/access    ◄── one access per cache line
    Stride   256: 8.43 ns/access
    Stride  1024: 9.02 ns/access    ◄── dominated by RAM latency

The same program in Rust produces the same results — same hardware, same cache hierarchy:

use std::time::Instant;
fn main() {
    let size: usize = 32 * 1024 * 1024;
    let arr: Vec<i32> = vec![1; size];
    let strides = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024];
    for &stride in &strides {
        let mut sum: i64 = 0;
        let start = Instant::now();
        for _ in 0..10 {
            let mut i = 0;
            while i < size { sum += arr[i] as i64; i += stride; }
        }
        let ns = start.elapsed().as_nanos() as f64 / (10 * size / stride) as f64;
        println!("Stride {:5}: {:.2} ns/access (sum={})", stride, ns, sum);
    }
}

Recap

    ┌────────────────────────────────────────────────────────────┐
    │  Memory is a hierarchy: fast-small at top, slow-huge below │
    │  CPU fetches data in 64-byte cache lines                   │
    │                                                            │
    │  Spatial locality: access nearby addresses (arrays win)    │
    │  Temporal locality: re-use recent data (loops, stack win)  │
    │                                                            │
    │  Stack is fast: always in L1 cache                         │
    │  Linked lists are slow: every node is a cache miss         │
    │  SoA can be 2-4x faster than AoS for batch processing     │
    └────────────────────────────────────────────────────────────┘

Task

  1. Compile and run stride.c (or the Rust version). Record ns/access for strides 1, 16, and 1024.
  2. Calculate: how many times slower is stride-1024 vs. stride-1?
  3. Run perf stat -e L1-dcache-load-misses ./stride and watch miss counts climb with stride.
  4. Challenge: Modify the program to use random access — allocate an array of random indices, access arr[indices[i]]. Compare ns/access to sequential. This is what pointer chasing looks like to the cache.

Next up: the actual instructions the CPU executes, and how your C and Rust code becomes assembly.

Instructions and the x86-64 ISA

Type this right now

// save as add.c
int add(int a, int b) { return a + b; }
$ gcc -S -O1 -o add.s add.c
$ cat add.s

You just turned C into assembly. It'll be about 5-10 lines of actual instructions. Everything else is metadata. Let's learn to read it.


What is an ISA?

An Instruction Set Architecture is the set of instructions a CPU understands. x86-64 (also called AMD64) is the ISA on virtually every desktop, laptop, and server from Intel and AMD.

It's not abstract. It's a specific, concrete list of operations encoded as bytes in memory. When you compile C or Rust, the compiler translates your code into these exact instructions.


The essential instructions

You don't need hundreds. These 16 cover 90% of compiler output:

    ┌──────────┬──────────────────────────────────────────────────────┐
    │ Instr.   │ What it does                                        │
    ├──────────┼──────────────────────────────────────────────────────┤
    │ mov a, b │ Copy value from b into a                            │
    │ add a, b │ a = a + b                                           │
    │ sub a, b │ a = a - b                                           │
    │ lea a, b │ Load Effective Address: a = address of b            │
    │          │ (also used for quick math: lea rax, [rbx+rcx*4])    │
    │ push val │ Subtract 8 from rsp, store val at [rsp]             │
    │ pop  reg │ Load [rsp] into reg, add 8 to rsp                  │
    │ call lbl │ Push rip (return address), jump to lbl              │
    │ ret      │ Pop address from stack, jump to it                  │
    │ cmp a, b │ Compute a - b, set flags (don't store result)      │
    │ jmp lbl  │ Jump unconditionally (set rip = lbl)               │
    │ je  lbl  │ Jump if equal (zero flag set)                      │
    │ jne lbl  │ Jump if not equal                                   │
    │ jl  lbl  │ Jump if less than (signed)                          │
    │ jg  lbl  │ Jump if greater than (signed)                       │
    │ syscall  │ Transfer control to the kernel                      │
    │ nop      │ Do nothing (used for alignment)                     │
    └──────────┴──────────────────────────────────────────────────────┘

How a + b becomes assembly

int add(int a, int b) { return a + b; }
#![allow(unused)]
fn main() {
pub fn add(a: i32, b: i32) -> i32 { a + b }
}

Both produce (at -O1 / -C opt-level=1):

add:
    lea    eax, [rdi+rsi]
    ret

Two instructions. The System V AMD64 calling convention puts arg1 in rdi, arg2 in rsi. lea eax, [rdi+rsi] computes the sum in one instruction. Return value goes in eax.

Fun Fact: lea stands for "Load Effective Address." Designed for address math, but compilers abuse it as a fast calculator — addition and multiplication in one instruction without touching the flags register.
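You can catch the compiler doing this. A tiny sketch (the assembly in the comment is what GCC typically emits at -O1 with -masm=intel; your exact output may differ):

```c
// save as lea5.c, compile: gcc -S -O1 -masm=intel lea5.c && cat lea5.s
int times5(int x) { return x * 5; }

// Typical output: one lea, no imul:
//   times5:
//       lea     eax, [rdi+rdi*4]    ; x + 4*x = 5*x, flags untouched
//       ret
```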


How if/else becomes assembly

int max(int a, int b) { return (a > b) ? a : b; }
#![allow(unused)]
fn main() {
pub fn max(a: i32, b: i32) -> i32 { if a > b { a } else { b } }
}

At -O1:

max:
    cmp     edi, esi       ; compare a and b
    mov     eax, esi       ; assume b is the answer
    cmovg   eax, edi       ; IF a > b, overwrite with a
    ret

What do you think happens? The compiler didn't use jmp or je. Why not?

Reveal: Branches cause pipeline stalls when mispredicted. cmovg (conditional move) avoids the branch entirely — the CPU always executes it but conditionally writes the result.

At -O0 you'd see the explicit version: cmp sets flags, jle conditionally jumps to the else branch. That's how every if works at the hardware level.


How function calls work

call does two things in one instruction: push the return address, then jump.

    Before CALL:              After CALL:
    rip = 0x401020            rip = 0x401200 (function start)
    rsp = 0x7fff10            rsp = 0x7fff08 (moved down 8)

    Stack:                    Stack:
    ┌──────────┐ 0x7fff10    ┌──────────┐ 0x7fff10
    │  ...     │             │  ...     │
    └──────────┘             ├──────────┤ 0x7fff08 ◄── rsp
                             │ 0x401025 │  return address
                             └──────────┘

ret does the reverse: pop the address, set rip to it.


The stack in assembly: prologue and epilogue

my_function:
    push   rbp            ; save caller's base pointer
    mov    rbp, rsp       ; set up our own base pointer
    sub    rsp, 32        ; room for local variables
    ; ... function body: locals at [rbp-4], [rbp-8], etc.
    mov    rsp, rbp       ; deallocate locals
    pop    rbp            ; restore caller's base pointer
    ret
    Higher addresses
    ┌─────────────────────┐
    │  caller's frame     │
    ├─────────────────────┤
    │  return address     │ ◄── pushed by 'call'
    ├─────────────────────┤
    │  saved rbp          │ ◄── rbp points here
    ├─────────────────────┤
    │  local var 1        │ [rbp-4]
    │  local var 2        │ [rbp-8]
    ├─────────────────────┤
    │                     │ ◄── rsp (top of stack)
    └─────────────────────┘
    Lower addresses

With optimization, the compiler often skips this entirely (frame pointer omission) — uses rsp directly, making code faster but debugging harder.


System calls: the ONLY door to the kernel

Your program (Ring 3) cannot access files, screens, or networks directly. The syscall instruction is the only way to ask the kernel.

Here's write(1, "hello\n", 6) in assembly:

mov    rax, 1          ; syscall number: write
mov    rdi, 1          ; fd: stdout
lea    rsi, [msg]      ; pointer to string
mov    rdx, 6          ; byte count
syscall                ; >>> kernel handles it, result in rax <<<

Convention: syscall number in rax, args in rdi, rsi, rdx, r10, r8, r9.


A complete "Hello, World" in x86-64 assembly

No C library. No runtime. Just raw syscalls.

# save as hello.s
# build: as -o hello.o hello.s && ld -o hello hello.o
# (GNU as defaults to AT&T syntax and uses '#' for comments,
#  so we switch to Intel syntax explicitly)
        .intel_syntax noprefix
        .section .text
        .global _start
_start:
        mov     rax, 1          # syscall: write
        mov     rdi, 1          # fd: stdout
        lea     rsi, [msg]      # buffer
        mov     rdx, 14         # count
        syscall
        mov     rax, 60         # syscall: exit
        mov     rdi, 0          # status: 0
        syscall

        .section .rodata
msg:    .ascii "Hello, World!\n"
$ as -o hello.o hello.s && ld -o hello hello.o && ./hello
Hello, World!

A dozen-odd lines. Two syscalls. A complete program at the last human-readable level above raw machine bytes.


Godbolt: the compiler explorer

godbolt.org — paste C or Rust on the left, see assembly on the right, live. Try int square(int x) { return x * x; } in C and pub fn square(x: i32) -> i32 { x * x } in Rust. Both produce mov eax, edi; imul eax, edi; ret. Same CPU language, different programmer languages.

Fun Fact: Godbolt color-codes which assembly lines map to which source lines.


Reading compiler output: a walkthrough

long sum_array(int *arr, int len) {
    long total = 0;
    for (int i = 0; i < len; i++) total += arr[i];
    return total;
}

At -O1:

sum_array:
        mov     eax, 0          ; total = 0
        mov     ecx, 0          ; i = 0
        jmp     .L2
.L3:
        movsx   rdx, DWORD PTR [rdi+rcx*4]   ; rdx = arr[i]
        add     rax, rdx        ; total += arr[i]
        add     ecx, 1          ; i++
.L2:
        cmp     ecx, esi        ; i < len?
        jl      .L3             ; if yes, loop body
        ret
    rdi = arr (1st arg)     esi = len (2nd arg)
    rax = total (return)    ecx = i (counter)
    [rdi+rcx*4] = arr + i*4 = arr[i]  (CPU does array indexing in hardware)

The *4 in [rdi+rcx*4] computes the byte offset — the CPU has built-in support for scaled-index addressing.


Recap

    ┌──────────────────────────────────────────────────────────┐
    │  x86-64 is the concrete ISA on your CPU.                 │
    │                                                          │
    │  ~16 instructions cover 90% of compiler output:          │
    │  mov, add, sub, lea, push, pop, call, ret,              │
    │  cmp, jmp, je/jne/jl/jg, syscall, nop                   │
    │                                                          │
    │  if/else → cmp + conditional jump (or cmov)              │
    │  function call → call (push rip, jump) / ret (pop rip)   │
    │  syscall → the ONLY way to talk to the kernel            │
    │                                                          │
    │  C and Rust produce the same assembly at the same        │
    │  optimization level.                                     │
    └──────────────────────────────────────────────────────────┘

Task

  1. Write a simple C function (try int factorial(int n) with a loop).
  2. Compile with gcc -S -O0 -o fact_O0.s fact.c and gcc -S -O2 -o fact_O2.s fact.c.
  3. In the -O0 version, identify: the prologue, where locals are stored, the cmp+jump for the loop, and the epilogue.
  4. In -O2, see how much the compiler eliminated. Can you still match assembly to source?
  5. Bonus: Paste the same function in Godbolt with both GCC and rustc at -O2. Compare.

Next: the wall between your code and the kernel — and why the CPU enforces it with hardware.

Privilege, Protection, and System Calls

Type this right now

$ cat > hello.c << 'EOF'
#include <stdio.h>
int main() { printf("Hello from C!\n"); return 0; }
EOF
$ gcc -o hello hello.c
$ strace ./hello 2>&1 | tail -20

You'll see a flood of output like execve(...), brk(...), mmap(...), write(1, "Hello from C!\n", 14). Those are system calls — every interaction between your program and the OS. Your simple printf triggered dozens of them.


The two worlds

Your CPU has (at least) two privilege levels, enforced by the hardware itself:

    ┌─────────────────────────────────────────────────────────┐
    │                    Ring 3: User Mode                     │
    │                                                         │
    │  Your program lives here.                               │
    │  CAN: execute instructions, read/write own memory       │
    │  CANNOT: access hardware, read other processes' memory,  │
    │          change page tables, halt the CPU                │
    ├═════════════════════════════════════════════════════════╡
    │  ▲▲▲  HARDWARE-ENFORCED WALL  ▲▲▲                       │
    │  The ONLY way through: the syscall instruction.         │
    ├═════════════════════════════════════════════════════════╡
    │                    Ring 0: Kernel Mode                   │
    │                                                         │
    │  The OS kernel lives here.                              │
    │  CAN: everything — all memory, all hardware, all regs   │
    └─────────────────────────────────────────────────────────┘

This isn't a software convention. The CPU tracks the current privilege level (CPL) in hardware, in the low two bits of the cs segment register. When it says "Ring 3," certain instructions are physically impossible. The CPU will fault if you try.

What do you think happens? What stops a program from changing that privilege bit to Ring 0?

Reveal: The instruction to change privilege level is itself privileged. You can't run it from Ring 3. Hardware catch-22, by design. The only way into Ring 0 is through controlled entry points the kernel sets up at boot time.


The x86 protection rings

x86 defines four rings, but modern OSes use only two:

                    ┌───────────────────┐
                    │   Ring 0: Kernel  │  Full access
                    ├───────────────────┤
                    │   Ring 1 (unused) │
                    ├───────────────────┤
                    │   Ring 2 (unused) │
                    ├───────────────────┤
                    │   Ring 3: User    │  Your program
                    └───────────────────┘

Why you can't read another process's memory

Two mechanisms work together:

1. Virtual memory. Each process has its own page table. Process A's address 0x4000 maps to a completely different physical location than Process B's 0x4000.

2. Privilege levels. Page tables live in kernel memory. Only Ring 0 can modify them. You can't remap your own pages because writing to cr3 (page table base register) is privileged.

    Process A                    Process B
    ┌──────────┐                 ┌──────────┐
    │ 0x4000 ──┼──┐              │ 0x4000 ──┼──┐
    └──────────┘  │              └──────────┘  │
                  ▼ page table A              ▼ page table B
              ┌──────────┐               ┌──────────┐
              │ phys:    │               │ phys:    │
              │ 0x1A000  │               │ 0x3F000  │
              └──────────┘               └──────────┘

    Same virtual address → different physical memory.
    Page tables in kernel memory — Ring 3 can't touch them.

The CPU enforces this on every memory access. Hardware gate, not software check.


System calls: the controlled gateway

Your program can't read files, write to the screen, or allocate memory from the OS directly. It asks the kernel through system calls.

    Your program (Ring 3)         Kernel (Ring 0)
    ─────────────────────         ────────────────
    1. Set up arguments
       rax = syscall number
       rdi, rsi, rdx = args

    2. Execute 'syscall'  ──────► 3. CPU switches to Ring 0
                                     Kernel validates args
                                     Kernel does the work

    5. Continue in Ring 3 ◄────── 4. Result in rax
                                     CPU switches back to Ring 3

The syscall instruction atomically saves rip/rflags, switches to Ring 0, and jumps to the kernel's entry point. The kernel runs sysret to switch back.


Common system calls

    ┌───────────┬────────┬───────────────────────────────────────────┐
    │  Syscall  │ Number │ What it does                              │
    ├───────────┼────────┼───────────────────────────────────────────┤
    │  read     │   0    │ Read bytes from a file descriptor         │
    │  write    │   1    │ Write bytes to a file descriptor          │
    │  open     │   2    │ Open a file, get a file descriptor        │
    │  close    │   3    │ Close a file descriptor                   │
    │  mmap     │   9    │ Map a file or allocate memory pages       │
    │  mprotect │  10    │ Change memory protection (r/w/x)          │
    │  brk      │  12    │ Expand/shrink the heap                    │
    │  fork     │  57    │ Create a copy of the current process      │
    │  execve   │  59    │ Replace process with a new program        │
    │  exit     │  60    │ Terminate the process                     │
    └───────────┴────────┴───────────────────────────────────────────┘

strace: watching syscalls in real time

C hello world

$ strace ./hello 2>&1 | grep -E "execve|write|exit|mmap|brk"
execve("./hello", ["./hello"], [...])        = 0
brk(NULL)                                    = 0x55a8c3d2f000
mmap(NULL, 2228224, PROT_READ, ...)          = 0x7f3a...
mmap(0x7f3a..., 1540096, PROT_READ|PROT_EXEC, ...) = 0x7f3a...
brk(0x55a8c3d50000)                          = 0x55a8c3d50000
write(1, "Hello from C!\n", 14)              = 14
exit_group(0)                                = ?

Your one-line printf caused: execve (load program), mmap (load libc), brk (allocate buffers), write (the actual output), exit_group (terminate).

Rust hello world

fn main() { println!("Hello from Rust!"); }

$ rustc -o hello_rs hello.rs
$ strace ./hello_rs 2>&1 | grep -E "write|exit"
write(1, "Hello from Rust!\n", 17)           = 17
exit_group(0)                                = ?

Same write syscall. Both languages eventually call write(1, ...) to put text on your screen.

Fun Fact: A statically linked C hello world can have as few as 3-4 syscalls total. A statically linked Rust hello world typically has 20-30 because Rust's runtime initializes signal handlers, thread-local storage, and a panic handler.


What happens when you break the rules

int main() {
    int *p = (int *)0xDEADBEEF;  // an address we don't own
    return *p;                    // try to read it
}

Here's the chain of events:

    1. CPU executes: mov eax, [0xDEADBEEF]
    2. CPU consults page table → page not mapped
    3. CPU generates PAGE FAULT exception (hardware, not software)
    4. CPU switches to Ring 0, jumps to kernel's fault handler
    5. Kernel: address 0xDEADBEEF is not valid for this process
    6. Kernel sends SIGSEGV signal to the process
    7. Default handler: terminate + core dump
    8. You see: "Segmentation fault (core dumped)"

Verify with strace:

$ strace ./segfault 2>&1 | tail -3
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xdeadbeef} ---
+++ killed by SIGSEGV (core dumped) +++

SEGV_MAPERR = address not mapped. This is a hardware mechanism. The CPU caught it before the read completed.


Other protection violations

    ┌───────────────────┬──────────────────────────────────────────┐
    │  Violation        │ What happens                             │
    ├───────────────────┼──────────────────────────────────────────┤
    │  Read unmapped    │ Page fault → SIGSEGV (SEGV_MAPERR)       │
    │  memory           │                                          │
    │  Write to         │ Page fault → SIGSEGV (SEGV_ACCERR)       │
    │  read-only page   │ e.g., writing to string literals         │
    │  Execute non-     │ Page fault → SIGSEGV (NX bit)            │
    │  executable page  │                                          │
    │  Privileged       │ #GP fault → SIGSEGV on Linux             │
    │  instruction      │ (e.g., hlt or cli from Ring 3)           │
    └───────────────────┴──────────────────────────────────────────┘

All follow the same pattern: CPU detects violation in hardware, generates exception, kernel converts to signal, process dies.

In Rust, most of these are prevented at compile time. But via unsafe or FFI, the hardware mechanism is identical.


The complete picture

    ┌──────────────────────────────────────────────────┐
    │  Your Program (Ring 3)                            │
    │  ┌─────────────┐   ┌──────────────┐             │
    │  │ Your code    │──►│ C lib / Rust │             │
    │  │ (main, etc.) │   │ runtime      │             │
    │  └─────────────┘   └──────┬───────┘             │
    │                    printf / println               │
    │                    malloc / Vec::push             │
    │                           │                      │
    │                    ┌──────▼──────┐               │
    │                    │   syscall   │               │
    ├════════════════════╪════════════════════════════╡
    │  Kernel (Ring 0)   │                             │
    │              ┌─────▼──────┐                      │
    │              │  Syscall   │                      │
    │              │  handler   │                      │
    │              └──────┬─────┘                      │
    │           ┌─────────┼─────────┐                  │
    │           ▼         ▼         ▼                  │
    │      Filesystem  Memory    Network               │
    ├══════════╪═════════╪═════════╪══════════════════╡
    │  HW     Disk      RAM      NIC                   │
    └──────────────────────────────────────────────────┘

Your code calls a library. The library executes syscall. CPU switches to Ring 0. Kernel does the work. CPU switches back. Your program continues.

Every. Single. Time. No shortcuts, no backdoors.


Recap

    ┌──────────────────────────────────────────────────────────┐
    │  Ring 0 / Ring 3 = HARDWARE-enforced privilege levels     │
    │                                                          │
    │  Your program cannot:                                    │
    │  • Access another process's memory (page tables)         │
    │  • Access hardware directly (privileged instructions)    │
    │  • Modify its own page tables (privileged registers)     │
    │                                                          │
    │  syscall = the ONLY door to Ring 0                       │
    │  Segfaults = hardware events: page fault → SIGSEGV       │
    │  strace shows every syscall your program makes           │
    └──────────────────────────────────────────────────────────┘

Task

  1. Run strace -c ./hello (C) and strace -c ./hello_rs (Rust). Compare total syscall counts.
  2. Run strace -e write ./hello to filter for just write calls.
  3. Violations. Compile int main() { return *(int *)0; } and int main() { char *s = "hello"; s[0] = 'H'; return 0; }. Strace each. Note SEGV_MAPERR vs. SEGV_ACCERR.
  4. Bonus: Write assembly that executes hlt from user mode. What signal do you get?

Next: we leave the hardware behind and enter the illusion — how your program thinks memory is laid out.

Your Program's View of Memory

Type this right now

Open a terminal and run:

$ cat /proc/self/maps

You just asked the cat process to show you its own memory map. You'll see something like this:

55a3b2c00000-55a3b2c02000 r--p 00000000 08:01 1234567  /usr/bin/cat
55a3b2c02000-55a3b2c07000 r-xp 00002000 08:01 1234567  /usr/bin/cat
55a3b2c07000-55a3b2c0a000 r--p 00007000 08:01 1234567  /usr/bin/cat
55a3b2c0a000-55a3b2c0b000 r--p 00009000 08:01 1234567  /usr/bin/cat
55a3b2c0b000-55a3b2c0c000 rw-p 0000a000 08:01 1234567  /usr/bin/cat
55a3b3e00000-55a3b3e21000 rw-p 00000000 00:00 0        [heap]
7f8a12000000-7f8a12200000 r--p 00000000 08:01 2345678  /usr/lib/libc.so.6
...
7ffd4a100000-7ffd4a122000 rw-p 00000000 00:00 0        [stack]

That is a real, live view of a process's memory. Every process on your system has one. Including the ones you write.


The flat address space illusion

Your program believes it has a simple, flat stretch of memory from address 0 all the way up. On a 64-bit system, addresses are 64 bits wide, but only 48 bits are actually used:

0x0000_0000_0000_0000   ← Bottom of address space
        |
        |   User space (your program lives here)
        |
0x0000_7FFF_FFFF_FFFF   ← Top of user space
        |
        |   (non-canonical gap — addresses here cause a fault)
        |
0xFFFF_8000_0000_0000   ← Bottom of kernel space
        |
        |   Kernel space (off-limits to your code)
        |
0xFFFF_FFFF_FFFF_FFFF   ← Top of address space

Your program gets the lower half. The kernel takes the upper half. Try to touch kernel space and the CPU itself — not the OS, the hardware — blocks you.

🧠 What do you think happens?

If you write *(int *)0xFFFF800000000000 = 42; in a C program, what happens? Does the compiler stop you? Does the OS stop you? Does the CPU stop you? Try it and observe the error message.


See the layout with your own code — C

This program prints the address of something from each major memory region:

// save as memview.c — compile: gcc -o memview memview.c
#include <stdio.h>
#include <stdlib.h>

int global_var = 42;            // .data section

int main() {
    int stack_var = 7;          // stack
    int *heap_var = malloc(64); // heap

    printf("Code  (main):  %p\n", (void *)main);
    printf("Global:        %p\n", (void *)&global_var);
    printf("Heap:          %p\n", (void *)heap_var);
    printf("Stack:         %p\n", (void *)&stack_var);

    free(heap_var);
    return 0;
}

Run it:

$ gcc -o memview memview.c && ./memview
Code  (main):  0x55a3b2c02149
Global:        0x55a3b2c04010
Heap:          0x55a3b3e002a0
Stack:         0x7ffd4a121a5c

Look at the addresses. Really look at them.

  • Code (0x55a...): relatively low
  • Global (0x55a...): right next to code
  • Heap (0x55a3b3...): higher, but still in the 0x55... range
  • Stack (0x7ffd...): way up high, near the top of user space

The layout reveals itself through raw addresses.


Same thing in Rust

// save as memview.rs — compile: rustc -o memview_rs memview.rs

static GLOBAL_VAR: i32 = 42;

fn main() {
    let stack_var: i32 = 7;
    let heap_var = Box::new(64);

    println!("Code  (main):  {:p}", main as fn() as *const ());
    println!("Global:        {:p}", &GLOBAL_VAR as *const i32);
    println!("Heap:          {:p}", &*heap_var as *const i32);
    println!("Stack:         {:p}", &stack_var as *const i32);
}

$ rustc -o memview_rs memview.rs && ./memview_rs
Code  (main):  0x55f8a1005b30
Global:        0x55f8a1009000
Heap:          0x55f8a1a2b9d0
Stack:         0x7ffc3b2e1d44

Same pattern. Same regions. Same operating system underneath. Rust's ownership model is a compile-time concept — at runtime, it's the same address space as C.


Parsing /proc/self/maps

Every line in /proc/self/maps has this format:

address          perms offset  dev   inode  pathname
55a3b2c02000-... r-xp  00002000 08:01 12345 /usr/bin/cat

Column      Meaning
──────      ───────
address     Virtual address range (start-end)
perms       r = read, w = write, x = execute, p = private, s = shared
offset      Offset into the file (for file-backed mappings)
dev         Device (major:minor)
inode       Inode number of the file
pathname    File path, or [heap], [stack], [vdso], etc.

The permissions tell you what each region is:

  • r-xp → code (read + execute, no write)
  • rw-p → data, heap, stack (read + write, no execute)
  • r--p → read-only data (constants, string literals)

💡 Fun Fact

The [vdso] entry stands for "virtual dynamic shared object." It's a tiny piece of kernel code mapped into every process's address space so that certain system calls (like gettimeofday) can run without actually entering the kernel. It's a performance trick — the kernel pretends to be a shared library.


The big picture

High addresses
0x7FFF_FFFF_FFFF ┌─────────────────────────┐
                 │        Stack             │  ← grows downward
                 │   (local vars, frames)   │
                 ├─────────────────────────┤
                 │                         │
                 │    (unmapped gap)        │
                 │                         │
                 ├─────────────────────────┤
                 │        Heap              │  ← grows upward
                 │   (malloc, Box::new)     │
                 ├─────────────────────────┤
                 │     BSS (zeroed globals) │
                 ├─────────────────────────┤
                 │     Data (init globals)  │
                 ├─────────────────────────┤
                 │     Text (your code)     │
0x0000_0000_0000 └─────────────────────────┘
Low addresses

Every program you've ever run has this shape. The next chapter dissects each region in detail.


🔧 Task

Write a C program and a Rust program that each print the address of:

  1. A function (code region)
  2. A global/static variable (data region)
  3. A local variable (stack)
  4. A heap-allocated value

Run both. Compare the addresses. Do they fall in the same general ranges? Now run cat /proc/<PID>/maps for each (use getpid() in C or std::process::id() in Rust to print the PID, then sleep so you can inspect the maps file before the process exits).

Bonus: Pipe the output of /proc/self/maps through your own program. In C:

FILE *f = fopen("/proc/self/maps", "r");
char line[256];
while (fgets(line, sizeof(line), f)) printf("%s", line);
fclose(f);

In Rust:

#![allow(unused)]
fn main() {
println!("{}", std::fs::read_to_string("/proc/self/maps").unwrap());
}

Anatomy of a Process Address Space

Type this right now

// save as regions.c — compile: gcc -g -o regions regions.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int initialized_global = 42;     // .data
int uninitialized_global;        // .bss
const char *string_lit = "I live in .rodata";

int main() {
    int stack_var = 1;
    int *heap_var = malloc(64);

    printf("--- Memory Regions ---\n");
    printf("Text  (main):     %p\n", (void *)main);
    printf("Rodata (string):  %p\n", (void *)string_lit);
    printf("Data  (init):     %p\n", (void *)&initialized_global);
    printf("BSS   (uninit):   %p\n", (void *)&uninitialized_global);
    printf("Heap  (malloc):   %p\n", (void *)heap_var);
    printf("Stack (local):    %p\n", (void *)&stack_var);
    printf("PID: %d (inspect /proc/%d/maps)\n", getpid(), getpid());

    sleep(30); // time to inspect
    free(heap_var);
    return 0;
}

Compile and run it. While it sleeps, open another terminal and run cat /proc/<PID>/maps. You'll see every region we're about to discuss.


THE Diagram

This is the memory layout of a running process on x86-64 Linux. Commit it to memory.

 0xFFFF_FFFF_FFFF_FFFF ┌─────────────────────────────────────────────┐
                        │                                             │
                        │            Kernel Space                     │
                        │   (mapped into every process, but you       │
                        │    can't touch it — ring 0 only)            │
                        │                                             │
 0xFFFF_8000_0000_0000  ├─────────────────────────────────────────────┤
                        │                                             │
                        │       (non-canonical address gap)           │
                        │                                             │
 0x0000_7FFF_FFFF_FFFF  ├─────────────────────────────────────────────┤
                        │                                             │
                        │   Stack          [rw-p]                     │
                        │   grows ↓ downward                          │
                        │   (local vars, return addrs, saved regs)    │
                        │                                             │
                        ├ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┤
                        │   Guard page     [---p]  (unmapped)         │
                        ├─────────────────────────────────────────────┤
                        │                                             │
                        │   Memory-mapped region                      │
                        │   (shared libraries: libc.so, ld-linux.so)  │
                        │   (mmap'd files, anonymous mmap)            │
                        │                                             │
                        ├─────────────────────────────────────────────┤
                        │                                             │
                        │                 (gap)                       │
                        │                                             │
                        ├─────────────────────────────────────────────┤
                        │                                             │
                        │   Heap           [rw-p]                     │
                        │   grows ↑ upward                            │
                        │   (malloc, calloc, Box::new, Vec::new)      │
                        │                                             │
                        ├─────────────────────────────────────────────┤
                        │   BSS            [rw-p]                     │
                        │   (uninitialized globals, zeroed at load)   │
                        ├─────────────────────────────────────────────┤
                        │   Data           [rw-p]                     │
                        │   (initialized globals: int x = 42)         │
                        ├─────────────────────────────────────────────┤
                        │   Rodata         [r--p]                     │
                        │   (string literals, const arrays)           │
                        ├─────────────────────────────────────────────┤
                        │   Text           [r-xp]                     │
                        │   (your compiled code — machine instr.)     │
 0x0000_0000_0000_0000  └─────────────────────────────────────────────┘

Now let's walk through each region from bottom to top.


Text: your compiled code

The .text section holds your program's machine instructions — the compiled output of every function you wrote.

Property       Value
────────       ─────
Permissions    r-xp (read, execute, no write)
Source         Loaded from the ELF binary
Lifetime       Entire process lifetime
Who manages    OS loader maps it from disk

Read and execute, but not writable. This is enforced by the hardware (page table permissions). If your code could rewrite itself, every buffer overflow would be an arbitrary code execution exploit. Modern toolchains and kernels enforce W^X: a page is writable or executable, never both. The OS sets the page permissions; the CPU's NX bit enforces them on every instruction fetch.

🧠 What do you think happens?

What if you cast a function pointer to int * and try to write to it?

int *p = (int *)main;
*p = 0x90909090; // NOP sled?

Try it. The CPU will raise a fault before the write completes.


Data: initialized globals

int answer = 42;                // C: goes in .data

static ANSWER: i32 = 42;        // Rust: goes in .data

This section holds global and static variables that have explicit initial values. The values are stored in the ELF binary itself — when you cat the binary, the bytes 2a 00 00 00 (42 in little-endian) are literally sitting in the file.

Property       Value
────────       ─────
Permissions    rw-p (read + write)
Source         Values loaded from ELF binary
Lifetime       Entire process lifetime
Who manages    OS loader

BSS: uninitialized globals

int counter;                    // C: goes in .bss (implicitly zero)
static int buffer[4096];       // C: 16KB of zeros — in .bss

static mut COUNTER: i32 = 0;    // Rust: goes in .bss (explicitly zero)

BSS stands for "Block Started by Symbol" — an old assembler directive. What matters: the OS zeroes this memory at load time. The values are not stored in the binary.

Property       Value
────────       ─────
Permissions    rw-p (read + write)
Source         Zeroed by OS at load time — NOT stored on disk
Lifetime       Entire process lifetime
Who manages    OS loader

💡 Fun Fact

If you declare static int bigarray[1000000]; in C, your binary does NOT grow by 4MB. The ELF file just records "I need 4,000,000 bytes of BSS." The OS allocates and zeroes them when the process starts. This is why BSS exists — it would be absurd to store millions of zeros on disk.

To see the savings yourself:

$ readelf -S regions | grep -E "\.data|\.bss"
  [24] .data    PROGBITS  0000000000004000  003000  000008  0  WA  0  0  8
  [25] .bss     NOBITS    0000000000004008  003008  000004  0  WA  0  0  4

Notice .bss is NOBITS. Zero bytes on disk. Full size in memory.


Heap: dynamic allocation

int *p = malloc(100);           // C: heap allocation

let p = Box::new(42);           // Rust: heap allocation
let v = vec![1, 2, 3];          // Rust: heap allocation (via Vec)

The heap is where dynamic allocations live. It starts just above BSS and grows upward toward higher addresses.

Property       Value
────────       ─────
Permissions    rw-p (read + write)
Source         Allocated at runtime via brk or mmap system calls
Lifetime       Until explicitly freed (C) or dropped (Rust)
Who manages    The allocator (malloc/free), kernel provides pages

The heap is managed in two layers:

Your code       malloc(100)  /  Box::new(42)
                    │
                    ▼
Allocator       glibc malloc / jemalloc / etc.
(user space)    Maintains free lists, splits/merges blocks
                    │
                    ▼
Kernel          brk() for small allocations
                mmap() for large allocations (>128KB)

We'll dissect the allocator in Chapter 20. For now, know that malloc doesn't call the kernel every time — it maintains its own pool.


Stack: function call frames

void foo() {
    int x = 10;     // lives on the stack
    int arr[100];   // 400 bytes on the stack
}

fn foo() {
    let x: i32 = 10;        // lives on the stack
    let arr = [0i32; 100];  // 400 bytes on the stack
}

The stack starts near the top of user space and grows downward toward lower addresses.

Property       Value
────────       ─────
Permissions    rw-p (read + write, no execute)
Source         Allocated by the OS when the process starts
Lifetime       Until the function returns
Who manages    The CPU (rsp register), compiler (frame layout)
Size limit     Default 8MB (ulimit -s)

Every function call pushes a frame onto the stack: return address, saved registers, local variables. Every return pops it. The stack pointer (rsp) moves up and down — that's it. No allocator, no free lists, no fragmentation. One register, one instruction to allocate, one instruction to free.

That's why the stack is fast.


Memory-mapped regions

Between the heap and the stack, you'll find memory-mapped regions. These include:

  • Shared libraries: libc.so, ld-linux-x86-64.so, libpthread.so
  • Anonymous mappings: large malloc calls (>128KB) use mmap instead of brk
  • File mappings: mmap() can map a file directly into your address space

7f8a12000000-7f8a12200000 r--p  /usr/lib/x86_64-linux-gnu/libc.so.6
7f8a12200000-7f8a12395000 r-xp  /usr/lib/x86_64-linux-gnu/libc.so.6
7f8a12395000-7f8a123ed000 r--p  /usr/lib/x86_64-linux-gnu/libc.so.6

Notice libc has multiple entries — different sections (code, read-only data, writable data) are mapped with different permissions. Same library, different protection levels.


Kernel space: here be dragons

0xFFFF_8000_0000_0000  and above

The top half of the address space is reserved for the kernel. It's mapped into every process's page table, but the page table entries are marked supervisor only. The CPU checks your current privilege level (ring 3 for user code) against the page permissions (ring 0 required for kernel pages). If you try to access kernel memory from user code, the CPU raises a page fault. The kernel handles it by sending your process a SIGSEGV.

You interact with kernel space only through system calls — read, write, mmap, brk. Those switch the CPU to ring 0, run kernel code, then switch back. That boundary is absolute.


Rust: same layout, different guarantees

Here's the key insight: Rust programs have the exact same memory layout as C programs.

static GLOBAL: i32 = 42;          // .data — same as C

static UNINIT: std::sync::atomic::AtomicI32 =
    std::sync::atomic::AtomicI32::new(0);  // .bss (zero-initialized)

fn main() {                        // .text — same as C
    let local = 10;                // stack — same as C
    let boxed = Box::new(20);      // heap  — same as C
}

Ownership, borrowing, lifetimes — they exist only at compile time. The generated machine code uses the same stack, the same heap, the same text/data/bss sections. rustc doesn't invent a new memory model. It enforces rules about how you use the one that already exists.

💡 Fun Fact

You can link Rust and C code together into a single binary. They share the same address space, the same heap, the same stack. A Rust function can call a C function (via extern "C") and the stack frames interleave seamlessly. There's no boundary at runtime — only at compile time.


🔧 Task

Write a program (in C or Rust — or both) that places data in every region:

  1. A function → text
  2. An initialized global → data
  3. An uninitialized global → BSS
  4. A string literal → rodata
  5. A malloc/Box::new → heap
  6. A local variable → stack

Print the address of each. Then, while the program is sleeping, run:

$ cat /proc/<PID>/maps

For each printed address, find the corresponding line in the maps output. Verify:

  • The address falls within the range on that line
  • The permissions match what you'd expect (code is r-xp, globals are rw-p, etc.)
  • The pathname column tells you whether it's from your binary, a library, or anonymous

Bonus: Use readelf -S ./regions to list all sections. Find .text, .data, .bss, and .rodata. Compare their sizes with what you'd predict from your code.

The Stack: Disciplined and Fast

Type this right now

// save as stacktrace.c — compile: gcc -g -o stacktrace stacktrace.c
#include <stdio.h>

void c() {
    int z = 3;
    printf("c: &z = %p\n", (void *)&z);
}

void b() {
    int y = 2;
    printf("b: &y = %p\n", (void *)&y);
    c();
}

void a() {
    int x = 1;
    printf("a: &x = %p\n", (void *)&x);
    b();
}

int main() {
    a();
    return 0;
}

$ gcc -g -o stacktrace stacktrace.c && ./stacktrace
a: &x = 0x7ffd3a1b1c5c
b: &y = 0x7ffd3a1b1c3c
c: &z = 0x7ffd3a1b1c1c

Each address is lower than the last. The stack grows downward. Every function call you've ever made used this mechanism. Let's see exactly how.


What is a stack frame?

When you call a function, the CPU creates a stack frame — a block of memory on the stack that holds everything that function needs:

High addresses (top of stack at process start)
                 ┌──────────────────────────┐
                 │   main's frame           │
                 │  ┌────────────────────┐  │
                 │  │ return address     │  │ ← where to go when main returns
                 │  │ saved rbp          │  │
                 │  │ (main's locals)    │  │
                 │  └────────────────────┘  │
                 ├──────────────────────────┤
                 │   a's frame              │
                 │  ┌────────────────────┐  │
                 │  │ return address     │  │ addr in main after call a()
                 │  │ saved rbp          │  │ → points to main's saved rbp
                 │  │ x = 1              │  │ [rbp - 4]
                 │  └────────────────────┘  │
                 ├──────────────────────────┤
                 │   b's frame              │
                 │  ┌────────────────────┐  │
                 │  │ return address     │  │ addr in a after call b()
                 │  │ saved rbp          │  │ → points to a's saved rbp
                 │  │ y = 2              │  │ [rbp - 4]
                 │  └────────────────────┘  │
                 ├──────────────────────────┤
                 │   c's frame              │
                 │  ┌────────────────────┐  │
                 │  │ return address     │  │ addr in b after call c()
                 │  │ saved rbp          │  │ → points to b's saved rbp
                 │  │ z = 3              │  │ [rbp - 4]
  rsp ──────►    │  └────────────────────┘  │
                 └──────────────────────────┘
Low addresses (stack grows this direction)

Two registers drive the whole thing:

  • rsp (stack pointer): always points to the top of the stack (the lowest used address)
  • rbp (base pointer): points to the base of the current frame — a stable reference point

How call and ret work

When the CPU executes call some_function:

call some_function
; is equivalent to:
push rip          ; push address of next instruction onto stack
jmp some_function ; jump to the function

When the CPU executes ret:

ret
; is equivalent to:
pop rip           ; pop address from stack into instruction pointer
; execution continues at that address

That's it. call = push return address + jump. ret = pop return address + jump back.


The function prologue and epilogue

Every compiled function begins with a prologue and ends with an epilogue. This is the bookkeeping that creates and destroys stack frames.

Prologue (at the start of every function):

push rbp          ; save caller's base pointer
mov  rbp, rsp     ; set our base pointer to current stack top
sub  rsp, N       ; reserve N bytes for local variables

Epilogue (at the end of every function):

leave             ; equivalent to: mov rsp, rbp; pop rbp
ret               ; pop return address, jump to it

Let's trace through calling a():

Before call a():
  rsp → [... main's stuff ...]

1. call a          → push return address, jump to a
   rsp → [ret_addr | ... main's stuff ...]

2. push rbp        → save main's rbp
   rsp → [main_rbp | ret_addr | ... main's stuff ...]

3. mov rbp, rsp    → rbp now points here
   rbp → [main_rbp | ret_addr | ... main's stuff ...]

4. sub rsp, 16     → reserve space for locals
   rsp → [ padding | x = ? | main_rbp | ret_addr | ...]
          ^^^^^^^^^^^^^^^^
          a's local variables (x sits at [rbp - 4], next to the saved rbp)

🧠 What do you think happens?

If you never execute the epilogue — say, you longjmp out of a deeply nested call — what happens to the stack? longjmp restores rsp to where it was at the matching setjmp, so the skipped frames' space is reclaimed in one jump. But the skipped epilogues never run, and neither does any cleanup tied to those frames. Think about what this means for resources acquired in the functions you jumped over.


Local variables: just offsets from rbp

The compiler doesn't give your local variables names at runtime. They're just offsets:

void example() {
    int a = 10;    // [rbp - 4]
    int b = 20;    // [rbp - 8]
    long c = 30;   // [rbp - 16]
}

The assembly:

example:
    push rbp
    mov  rbp, rsp
    sub  rsp, 16
    mov  DWORD PTR [rbp-4],  10    ; a
    mov  DWORD PTR [rbp-8],  20    ; b
    mov  QWORD PTR [rbp-16], 30    ; c
    leave
    ret

No names. No types. Just memory locations relative to rbp. The "variable" is a fiction that exists only in your source code and your debugger's symbol table.


C: extra stack tricks

C gives you a few extra ways to use the stack:

alloca — allocate on the stack dynamically:

#include <alloca.h>

void risky() {
    int n = 1000;
    char *buf = alloca(n);  // n bytes on the stack, freed on return
    buf[0] = 'A';
}

This just moves rsp down by n bytes. No malloc, no free, no heap involvement. Fast, but dangerous — if n is too large, you blow through the stack limit with no warning.

Variable-Length Arrays (VLAs):

void process(int n) {
    int arr[n];   // VLA: n elements on the stack
    arr[0] = 42;
}

Same idea as alloca — the compiler adjusts rsp at runtime. Same risks.


Rust: stack discipline

Rust uses the same stack mechanics but gives you fewer footguns:

#![allow(unused)]
fn main() {
fn example() {
    let a: i32 = 10;           // stack — same as C
    let b = [0u8; 100];        // stack — fixed-size array, size known at compile time
    let c = Box::new(42);      // heap — when you need dynamic or large allocation
    // no alloca, no VLAs
}
}

Rust doesn't have alloca or VLAs. If you need a runtime-sized buffer, you use Vec<T>, which allocates on the heap. This is a deliberate choice — the Rust team decided that stack overflows from unchecked dynamic stack allocation are too dangerous to allow in safe code.

If you really need stack allocation with a runtime size, you'd use a crate or unsafe code. But the standard path is: fixed sizes on the stack, dynamic sizes on the heap.


Stack size and limits

Check your current stack size limit:

$ ulimit -s
8192

That's 8192 KB = 8 MB. This limit applies to the main thread's stack; every thread you create gets its own stack, sized separately.

What this means in practice:

8 MB = 8,388,608 bytes
int takes 4 bytes
Maximum ints on the stack ≈ 2,000,000

But stack frames have overhead (return address, saved rbp, alignment),
and you have many nested function calls, so the practical limit is lower.

You can change it:

$ ulimit -s 16384   # 16 MB stack
$ ulimit -s unlimited  # no limit (dangerous)

💡 Fun Fact

Threads created with pthread_create get their own stack, and the default size varies by platform: glibc on Linux derives it from the RLIMIT_STACK soft limit (typically 8MB), musl defaults to well under 1MB, and macOS gives secondary threads 512KB. You can configure it with pthread_attr_setstacksize. Rust's std::thread::Builder exposes stack_size() for the same purpose.


Guard pages: the safety net

The OS doesn't just give you 8MB of stack and hope for the best. Below the stack sits a guard region — memory that is unmapped or mapped with no permissions — whose sole job is to catch stack overflows:

                 ┌───────────────────────┐
                 │   Stack               │  8 MB of rw-p pages
                 │   (used portion)      │
                 │                       │
                 │   (unused portion)    │
                 ├───────────────────────┤
  rsp could     │   Guard Page          │  ---p (no permissions)
  reach here →  │   (unmapped, 4KB)     │
                 ├───────────────────────┤
                 │   Other mappings...   │
                 └───────────────────────┘

When rsp drops into the guard page, the CPU triggers a page fault. The kernel sees that the faulting address is in the guard region and sends SIGSEGV to the process. Game over.


Stack overflow: what really happens

// save as overflow.c — compile: gcc -o overflow overflow.c
#include <stdio.h>

void recurse(int depth) {
    char buffer[4096];  // 4KB per frame, eats the stack fast
    buffer[0] = 'A';   // touch it so compiler doesn't optimize it away
    printf("depth: %d, &buffer = %p\n", depth, (void *)buffer);
    recurse(depth + 1);
}

int main() {
    recurse(0);
    return 0;
}
$ ./overflow
depth: 0, &buffer = 0x7ffd3a1b0c50
depth: 1, &buffer = 0x7ffd3a1afbe0
depth: 2, &buffer = 0x7ffd3a1aeb70
...
depth: 2040, &buffer = 0x7ffd39924f50
depth: 2041, &buffer = 0x7ffd39923ee0
Segmentation fault (core dumped)

Each frame is ~4KB. 8MB / 4KB = ~2048 frames. The numbers check out.

The same thing in Rust:

fn recurse(depth: u32) {
    let buffer = [0u8; 4096];
    // Use black_box to prevent optimization
    std::hint::black_box(&buffer);
    println!("depth: {}, &buffer = {:p}", depth, &buffer);
    recurse(depth + 1);
}

fn main() {
    recurse(0);
}

Rust's behavior is the same — a stack overflow triggers a SIGSEGV. The Rust runtime catches it and prints a friendlier message (thread 'main' has overflowed its stack), but the underlying mechanism is identical: rsp hits the guard page, CPU faults, kernel delivers a signal.


Why the stack is fast

Let's compare stack allocation to heap allocation:

Stack allocation:
    sub rsp, 32        ; one instruction — done

Stack deallocation:
    add rsp, 32        ; one instruction — done
    (or just: leave / ret)

Heap allocation:
    call malloc         ; enters allocator
    → search free lists ; find a suitable block
    → maybe call brk()  ; ask kernel for more memory
    → update metadata   ; bookkeeping
    → return pointer    ; finally done

Heap deallocation:
    call free           ; enters allocator
    → validate pointer  ; is this a real allocation?
    → coalesce blocks   ; merge with adjacent free blocks
    → update free list  ; bookkeeping
    → maybe call munmap ; return pages to kernel

The stack is also always "hot" in the CPU cache — the top of the stack was accessed by the previous instruction, so it's almost certainly in L1 cache. Heap allocations can be scattered across memory, causing cache misses.

💡 Fun Fact

The x86 push instruction does two things in one: it decrements rsp by 8 and writes the value to [rsp]. It's so common that the CPU has special hardware to accelerate it. On modern Intel CPUs, push can execute with the same throughput as a simple mov — the stack engine handles the rsp update for free.


Inspecting frames with GDB

You can see the actual stack frames in a debugger:

$ gcc -g -o stacktrace stacktrace.c
$ gdb ./stacktrace
(gdb) break c
(gdb) run
Breakpoint 1, c () at stacktrace.c:5
(gdb) backtrace
#0  c () at stacktrace.c:5
#1  b () at stacktrace.c:12
#2  a () at stacktrace.c:18
#3  main () at stacktrace.c:22

(gdb) info frame
Stack level 0, frame at 0x7fffffffde30:
 rip = 0x555555555149 in c (stacktrace.c:5);
 saved rip = 0x555555555185
 called by frame at 0x7fffffffde50
 ...

(gdb) info registers rsp rbp
rsp  0x7fffffffde10
rbp  0x7fffffffde20

The backtrace command reconstructs the call chain. With frame pointers present (as in this unoptimized -g build), that's a walk up a linked list: each saved rbp points to the previous frame's saved rbp. Optimized builds often omit frame pointers, in which case the debugger falls back on DWARF unwind tables — same result, different bookkeeping.


🔧 Task

  1. Write a recursive function in C (and Rust) that prints its depth and the address of a local variable at each level. Run it and observe:

    • Addresses decrease (stack grows down)
    • Eventually it crashes (stack overflow)
    • The number of frames matches ulimit -s / frame_size
  2. Before running, calculate: with ulimit -s 8192 (8MB) and a 4096-byte local array per frame, how many levels of recursion should you get? Run it and compare.

  3. Try ulimit -s 16384 and run again. Do you get roughly twice as many frames?

  4. Check dmesg | tail after the crash — you should see something like:

    [12345.678] overflow[9999]: segfault at 7ffd39920ff0 ip ... sp ... error 6
    

    The kernel logged the fault address. Confirm it's near the bottom of the stack region shown in /proc/<PID>/maps.

The Heap: Flexible and Dangerous

Type this right now

// save as heap_intro.c — compile: gcc -o heap_intro heap_intro.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   // for strstr

int main() {
    int *p = malloc(100);
    printf("malloc(100) returned: %p\n", (void *)p);

    printf("\nProcess maps (heap region):\n");
    FILE *f = fopen("/proc/self/maps", "r");
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (strstr(line, "[heap]")) printf("  %s", line);
    }
    fclose(f);

    free(p);
    return 0;
}
$ gcc -o heap_intro heap_intro.c && ./heap_intro
malloc(100) returned: 0x55a8b7e002a0

Process maps (heap region):
  55a8b7e00000-55a8b7e21000 rw-p 00000000 00:00 0  [heap]

You asked for 100 bytes. The kernel gave the allocator 135,168 bytes (0x21000 = 132KB). That discrepancy is the whole story of this chapter.


The two sides of dynamic allocation

C — explicit control, explicit danger:

int *p = malloc(100);       // allocate 100 bytes
*p = 42;                    // use it
free(p);                    // you MUST remember to do this

Rust — same allocation, automatic cleanup:

#![allow(unused)]
fn main() {
let p = Box::new(42);       // allocate on heap
println!("{}", *p);         // use it
// p is dropped here — free() called automatically
}

Vec, String, Box — every Rust heap type calls the allocator underneath. The difference is that Rust's ownership system guarantees free happens exactly once, at the right time.


The allocator as middleman

Your code never talks to the kernel directly for heap memory. There's a middleman:

┌─────────────────┐
│   Your Code     │    malloc(100)  /  Box::new(42)
│                 │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Allocator     │    glibc malloc  /  jemalloc  /  mimalloc
│   (user space)  │    Maintains free lists, bins, arenas
│                 │    Splits large blocks, coalesces freed ones
└────────┬────────┘
         │  Needs more memory?
         ▼
┌─────────────────┐
│   Kernel        │    brk()  — move the program break
│                 │    mmap() — map new anonymous pages
└─────────────────┘

The allocator requests large chunks from the kernel, then carves them up to serve your small malloc calls. This amortizes the cost of system calls — calling the kernel is expensive (hundreds of cycles), but adjusting a pointer in a free list is cheap (a few cycles).


brk and sbrk: the old way

The simplest heap mechanism is brk/sbrk. The kernel maintains a pointer called the program break — the end of the data segment. You can move it:

Before any allocation:

     ┌──────────────┐
     │  Text         │
     │  Data         │
     │  BSS          │
     └──────┬───────┘
            │ ← program break (brk)
            │
            ▼  (unmapped)


After sbrk(4096):

     ┌──────────────┐
     │  Text         │
     │  Data         │
     │  BSS          │
     ├──────────────┤
     │  Heap (4KB)   │ ← new memory
     └──────┬───────┘
            │ ← program break moved up by 4096
            ▼  (unmapped)
#include <stdio.h>
#include <unistd.h>

void *old_brk = sbrk(0);        // get current break
sbrk(4096);                     // move break up by 4096 bytes
void *new_brk = sbrk(0);        // verify
printf("grew by: %ld\n", (char *)new_brk - (char *)old_brk);

Simple. Contiguous. But limited — you can only grow or shrink from one end. You can't free memory in the middle and return it to the OS.


mmap anonymous: the modern way

For large allocations (glibc's threshold is typically 128KB), the allocator skips brk entirely and asks the kernel for pages directly:

#include <sys/mman.h>

void *p = mmap(NULL, 1048576,        // 1 MB
               PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS,
               -1, 0);
// use p...
munmap(p, 1048576);                  // return to kernel

mmap returns memory from anywhere in the address space — it doesn't need to be contiguous with the heap. And when you munmap, the pages go back to the kernel immediately. No fragmentation problem at the kernel level.

💡 Fun Fact

When you malloc(1000000) in glibc (about 1MB), it doesn't use brk. It calls mmap internally. You can observe this with strace:

$ strace -e brk,mmap ./your_program
mmap(NULL, 1003520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f...

Small allocations use brk. Large ones use mmap. The crossover point is tunable (mallopt(M_MMAP_THRESHOLD, ...)).


Fragmentation: the heap's worst enemy

Here's where the heap gets ugly. Watch this allocation sequence:

1. Allocate A (32 bytes), B (64 bytes), C (32 bytes):

   ┌────┬──────────┬────┐
   │ A  │    B     │ C  │
   │ 32 │    64    │ 32 │
   └────┴──────────┴────┘

2. Free B:

   ┌────┬──────────┬────┐
   │ A  │  (free)  │ C  │
   │ 32 │    64    │ 32 │
   └────┴──────────┴────┘

3. Try to allocate D (96 bytes):

   ┌────┬──────────┬────┬────────────┐
   │ A  │  (free)  │ C  │     D      │
   │ 32 │    64    │ 32 │     96     │
   └────┴──────────┴────┴────────────┘
   ^      ^^^^^^^^
   │      64 bytes wasted — big enough
   │      for some things, but not for D

The 64-byte hole can never be used for anything that needs more than 64 bytes. Over time, your heap becomes Swiss cheese — lots of small holes, no large contiguous blocks. Total free memory might be 10MB, but the largest contiguous block is 64 bytes.

This is external fragmentation, and it's one of the hardest problems in allocator design.

There's also internal fragmentation: you ask for 100 bytes, the allocator gives you 128 (rounded up for alignment). Those 28 extra bytes are wasted.

Internal fragmentation:

   ┌──────────────────────┐
   │ Your 100 bytes │ pad │  ← allocator gives you a 128-byte block
   │                │ 28  │     28 bytes wasted inside the block
   └──────────────────────┘

C's four deadly heap bugs

The heap in C is a minefield. Here are the four classic ways to corrupt it, each with code and a diagram.

1. Use-after-free

int *p = malloc(sizeof(int));
*p = 42;
free(p);          // p is now a dangling pointer
*p = 99;          // BUG: writing to freed memory
After malloc:                  After free:               After *p = 99:

┌──────────┐                  ┌──────────┐              ┌──────────┐
│ p → [42] │  allocated       │ p → [??] │  freed       │ p → [99] │  CORRUPTED
└──────────┘                  └──────────┘              └──────────┘
                              Allocator may have         Someone else may
                              recycled this block        own this memory now

The pointer p still holds the old address, but that memory may now belong to a different allocation. Writing to it silently corrupts someone else's data. Use-after-free is one of the most commonly exploited vulnerability classes in C programs.

2. Double-free

int *p = malloc(sizeof(int));
free(p);
free(p);          // BUG: freeing the same block twice
Free list before:     After free(p):        After free(p) again:

HEAD → [blk_A]       HEAD → [p] → [blk_A]  HEAD → [p] → [p] → [blk_A]
                                                    ^^^^^^^^^^^^
                                              Circular! Allocator metadata
                                              is now corrupted.

Double-free corrupts the allocator's internal free list. The next two malloc calls may return the same pointer, causing two parts of your program to stomp on each other's data.

3. Buffer overflow (heap)

int *p = malloc(10 * sizeof(int));  // 40 bytes
for (int i = 0; i <= 10; i++) {     // BUG: off-by-one, writes 11 elements
    p[i] = i;
}
Heap memory layout:

┌──────────────────────────────┬──────────────────────┐
│ p's allocation (40 bytes)    │ allocator metadata   │
│ [0][1][2]...[9]              │ [size|flags|next]    │
└──────────────────────────────┴──────────────────────┘
                               ^^^^^^^^^^^^^^^^
                               p[10] writes HERE
                               Corrupts the allocator's
                               bookkeeping for the next block

You wrote past the end of your allocation. The bytes that follow are typically allocator metadata for the next block — size, flags, free-list pointers. Corrupting them means the next malloc or free behaves unpredictably. The crash often happens far from the bug, making it nightmarish to debug.

4. Memory leak

void process_data() {
    int *data = malloc(1024 * 1024);  // 1 MB
    // ... use data ...
    return;  // BUG: forgot to call free(data)
}
After calling process_data() 1000 times:

Heap:
┌──────┬──────┬──────┬──────┬──────┬──────┬───   ─┬──────┐
│ 1 MB │ 1 MB │ 1 MB │ 1 MB │ 1 MB │ 1 MB │ ...  │ 1 MB │
│ LOST │ LOST │ LOST │ LOST │ LOST │ LOST │ LOST │ LOST │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
              ← 1 GB of memory, no way to free it →

Process RSS grows without bound. Eventually: OOM killer.

No pointer to the memory means no way to free it. The memory is allocated but unreachable. In a long-running server, this is death by a thousand cuts.


Rust prevents all four

Here's the remarkable thing: Rust's ownership system makes all four bugs compile-time errors.

Use-after-free: impossible

#![allow(unused)]
fn main() {
let p = Box::new(42);
drop(p);            // explicitly free
println!("{}", *p); // COMPILE ERROR: use of moved value `p`
}
error[E0382]: borrow of moved value: `p`
 --> src/main.rs:4:20
  |
2 |     let p = Box::new(42);
  |         - move occurs because `p` has type `Box<i32>`
3 |     drop(p);
  |          - value moved here
4 |     println!("{}", *p);
  |                    ^^ value borrowed here after move

After drop, the compiler knows p is gone. You can't use it. Period.

Double-free: impossible

#![allow(unused)]
fn main() {
let p = Box::new(42);
drop(p);
drop(p);  // COMPILE ERROR: use of moved value `p`
}

Same error. After the first drop, p is moved. You can't drop it again.

Buffer overflow: caught at runtime

#![allow(unused)]
fn main() {
let v = vec![0; 10];
println!("{}", v[10]); // RUNTIME PANIC: index out of bounds
}
thread 'main' panicked at 'index out of bounds: the len is 10 but the index is 10'

Rust bounds-checks every array and vector access in safe code. You can opt out with get_unchecked() in unsafe blocks, but the default is safety.

Memory leak: prevented by Drop

#![allow(unused)]
fn main() {
fn process_data() {
    let data = vec![0u8; 1024 * 1024]; // 1 MB on the heap
    // ... use data ...
}   // data is dropped here — Vec's Drop impl calls free()
}

When data goes out of scope, Rust calls its Drop implementation, which frees the heap memory. You didn't write free. You didn't need to remember. The compiler inserted it for you.

🧠 What do you think happens?

Rust makes leaking memory safe — you can intentionally leak with Box::leak() or mem::forget(). Why would Rust allow this? Because leaking memory doesn't cause undefined behavior — it's wasteful, but it can't corrupt other data or violate memory safety. The safety guarantee is about soundness, not resource efficiency.


Side by side: the bug and the save

Bug               C code               What happens             Rust equivalent      What happens
────────────────────────────────────────────────────────────────────────────────────────────────
Use-after-free    free(p); *p = 1;     Silent corruption        drop(p); *p = 1;     Compile error
Double-free       free(p); free(p);    Allocator corruption     drop(p); drop(p);    Compile error
Buffer overflow   p[11] on size-10     Silent write past end    v[11] on size-10     Panic with message
Memory leak       Forget to free       Memory grows forever     Scope exit           Drop runs automatically

Seeing the allocator in action

You can watch malloc call the kernel using strace:

$ strace -e brk,mmap,munmap ./heap_intro
brk(NULL)                       = 0x55a8b7e00000
brk(0x55a8b7e21000)             = 0x55a8b7e21000
...

That first brk(NULL) queries the current program break. The second moves it up by 0x21000 bytes (132KB). That's the allocator requesting its initial arena from the kernel — even though you only asked for 100 bytes.

For a larger allocation:

void *big = malloc(256 * 1024);  // 256 KB
$ strace -e brk,mmap ./big_alloc
mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f...

256KB > 128KB threshold, so malloc chose mmap instead of brk.


Heap metadata: what the allocator tracks

Every malloc block has hidden metadata. The allocator stores bookkeeping right next to your data:

What malloc(100) actually looks like in memory:

     ┌─────────────────┬──────────────────────────────────┐
     │  Chunk header    │  Your 100 bytes (user data)      │
     │  (16 bytes)      │                                  │
     │  size | flags    │                                  │
     └─────────────────┴──────────────────────────────────┘
     ^                  ^
     │                  └── pointer returned by malloc
     └── actual start of the allocation

The pointer you get back is PAST the header.

This is why buffer overflows are so catastrophic — write past your allocation and you corrupt the next chunk's header. The allocator trusts its own metadata. If you corrupt it, all bets are off.

💡 Fun Fact

glibc's malloc stores the size of the current chunk AND uses the low bits of the size field as flags (since chunks are always aligned to 16 bytes, the bottom 4 bits are free for flags). One bit indicates "previous chunk is in use." This clever trick is why glibc malloc can coalesce adjacent free blocks efficiently — but it's also why heap corruption is so devastating.


Valgrind: your heap safety net in C

Since C can't catch heap bugs at compile time, you use runtime tools. Valgrind is the classic:

// save as uaf.c — compile: gcc -g -o uaf uaf.c
#include <stdlib.h>

int main() {
    int *p = malloc(sizeof(int));
    *p = 42;
    free(p);
    *p = 99;  // use-after-free
    return 0;
}
$ valgrind ./uaf
==12345== Invalid write of size 4
==12345==    at 0x401156: main (uaf.c:7)
==12345==  Address 0x4a47040 is 0 bytes inside a block of size 4 free'd
==12345==    at 0x483CA3F: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12345==    by 0x40114D: main (uaf.c:6)

Valgrind tells you exactly what happened, where, and when the block was freed. It's the closest C gets to Rust's compile-time guarantees — but it only catches bugs you actually trigger at runtime, and it makes your program 20-50x slower.


Rust's allocator: same mechanism, different interface

Rust uses the system allocator by default (glibc malloc on Linux). Box::new(42) calls malloc internally. drop(box_value) calls free.

Rust                          C equivalent
─────────────────────────────────────────────
Box::new(42)                  malloc(4); *p = 42;
Vec::with_capacity(100)       malloc(100 * elem_size)
String::from("hello")        malloc(5); memcpy(...)
drop(x)                       free(x)

Same allocator. Same brk/mmap calls. Same heap metadata. The only difference is that Rust's type system ensures you call free exactly once, at exactly the right time.

You can even swap in a different allocator in Rust:

#![allow(unused)]
fn main() {
use std::alloc::System;

#[global_allocator]
static GLOBAL: System = System;  // use the system allocator explicitly

// or use jemalloc, mimalloc, etc. via crates
}

🔧 Task

  1. Write a C program that demonstrates use-after-free:

    int *p = malloc(sizeof(int));
    *p = 42;
    free(p);
    printf("After free: %d\n", *p);  // What prints?
    *p = 99;
    

    Compile and run it normally. Does it crash? (Probably not — that's what makes it dangerous.) Now run it under Valgrind: valgrind ./uaf. Observe the report.

  2. Try the same pattern in Rust:

    #![allow(unused)]
    fn main() {
    let p = Box::new(42);
    drop(p);
    println!("{}", *p);
    }

    Does it compile? Read the compiler error carefully — it tells you exactly what went wrong.

  3. Use strace -e brk,mmap on a program that does:

    • malloc(100) — observe brk
    • malloc(256 * 1024) — observe mmap
    • Why the different system calls?
  4. Challenge: Write a C program that allocates and frees memory in a pattern that creates fragmentation. Allocate 1000 blocks of 1KB, free the even-numbered ones, then try to allocate a single 500KB block. Does it fit in the holes? Use /proc/self/maps to observe the heap size before and after.

Global Data and Read-Only Memory

Type this right now

// save as globals.c — compile: gcc -o globals globals.c
#include <stdio.h>
#include <signal.h>

int initialized = 42;               // .data
int uninitialized;                   // .bss
const int constant = 99;            // .rodata
const char *greeting = "hello";     // pointer in .data, string in .rodata

int main() {
    printf(".data   (initialized):   %p  value=%d\n",
           (void *)&initialized, initialized);
    printf(".bss    (uninitialized):  %p  value=%d\n",
           (void *)&uninitialized, uninitialized);
    printf(".rodata (constant):      %p  value=%d\n",
           (void *)&constant, constant);
    printf(".rodata (string):        %p  value=\"%s\"\n",
           (void *)greeting, greeting);
    return 0;
}
$ gcc -o globals globals.c && ./globals
.data   (initialized):   0x55d3a7004010  value=42
.bss    (uninitialized):  0x55d3a7004018  value=0
.rodata (constant):      0x55d3a7002004  value=99
.rodata (string):        0x55d3a7002008  value="hello"

Notice: .data and .bss addresses are close together (both writable data). .rodata addresses are lower, near the code. The layout matches the diagram from Chapter 6.


.data: initialized globals

The .data section holds global and static variables with explicit non-zero initial values.

// C
int counter = 100;             // .data — stored in the binary
static int module_state = 5;   // .data — file-scoped, still in binary
#![allow(unused)]
fn main() {
// Rust — note: an immutable `static` usually lands in .rodata;
// it takes `static mut` or interior mutability to end up in .data
static COUNTER: i32 = 100;
static MODULE_STATE: i32 = 5;
}

These values are baked into the ELF binary. When you hexdump the binary, you'll find the bytes 64 00 00 00 (100 in little-endian) sitting right there in the file. The OS loader copies them into memory at process start.

ELF binary on disk:
┌───────────────────────────────────────────────────────────────────────────┐
│ ... ELF header ... │ .text │ .rodata │ .data: [64 00 00 00] [05 00 00 00] │
└───────────────────────────────────────────────────────────────────────────┘
                                              ^^^^^^^^^^^^^^^^^^
                                              counter = 100, module_state = 5
                                              literally stored in the file

Permissions: rw- — readable and writable, because your program modifies globals at runtime.


.bss: the zero optimization

// C
int zeroed_global;              // .bss — implicitly zero in C
static int big_buffer[100000]; // .bss — 400,000 bytes of zeros
#![allow(unused)]
fn main() {
// Rust
static ZEROED: i32 = 0;                        // .bss (compiler sees it's zero)
static BIG_BUFFER: [i32; 100000] = [0; 100000]; // .bss
}

Here's the key insight: the .bss section takes zero bytes on disk.

With .data, storing 400,000 bytes of zeros:
┌──────────────────────────────────────────────────┐
│ ELF header │ .text │ .data: [00 00 00 ... 400KB] │
└──────────────────────────────────────────────────┘
Binary size: ~400 KB larger

With .bss:
┌─────────────────────────────────────────────────────────┐
│ ELF header │ .text │ .bss header: "400000 bytes needed" │
└─────────────────────────────────────────────────────────┘
Binary size: just a few extra bytes for the header

The ELF file records "I need 400,000 bytes of BSS" but doesn't store the actual zeros. At load time, the OS allocates the memory and zeroes it. This is why .bss exists as a separate section — it's a size optimization that matters enormously for real programs.

💡 Fun Fact

The name "BSS" comes from an old IBM 704 assembler directive: "Block Started by Symbol." It's been used since the 1950s. Seventy years later, your Linux kernel still uses the same concept to avoid storing zeros on disk.

Verify it yourself:

$ readelf -S globals | grep -E "\.data|\.bss"
  [24] .data  PROGBITS  0000000000004000  003000  000010  00  WA  0  0  8
  [25] .bss   NOBITS    0000000000004010  003010  000004  00  WA  0  0  4

.data has type PROGBITS (actual bits in the program file). .bss has type NOBITS (nothing on disk).


.rodata: string literals and constants

// C
const char *msg = "Hello, world!";   // the string is in .rodata
const int lookup[] = {1, 1, 2, 3, 5, 8, 13, 21};  // .rodata
#![allow(unused)]
fn main() {
// Rust
let msg: &str = "Hello, world!";     // string literal → .rodata
const LOOKUP: [i32; 8] = [1, 1, 2, 3, 5, 8, 13, 21]; // may be in .rodata
}

The .rodata section holds data that should never be modified: string literals, constant arrays, jump tables for switch statements. Its permissions are r-- — readable, not writable, not executable.

This leads to one of C's most surprising crashes:

char *s = "hello";  // s points to .rodata
s[0] = 'H';         // SIGSEGV! Writing to read-only memory
Memory layout:

.rodata (r-- permissions):
┌─────────────────────────┐
│ 'h' 'e' 'l' 'l' 'o' \0 │ ← read-only page
└─────────────────────────┘
  ^
  s points here

s[0] = 'H'  →  CPU tries to write to a read-only page
             →  MMU raises a page fault
             →  Kernel sends SIGSEGV
             →  Program crashes

The string "hello" is a literal — it lives in .rodata, which is mapped read-only. The pointer s is in .data (it's a writable global), but the string it points to is not writable.

Compare with:

char s[] = "hello";  // s is a char ARRAY on the stack — copy of the string
s[0] = 'H';          // Fine! The array is writable stack memory

🧠 What do you think happens?

In C, what's the difference between char *s = "hello" and char s[] = "hello"? The first is a pointer to read-only memory. The second is a stack array initialized with a copy of the string. One crashes when you modify it, the other works fine. The syntax looks almost identical, but the memory layout is completely different.


Rust: no surprises here

In Rust, string literals are &'static str — a reference to data in .rodata. The type system makes the immutability obvious:

#![allow(unused)]
fn main() {
let s: &str = "hello";   // immutable reference to .rodata
// s is &str — there's no way to get a &mut str from a string literal
// The type TELLS you it's read-only
}

You can't accidentally modify a string literal in Rust because &str is an immutable reference. The compiler won't let you write through it. The C trap simply doesn't exist.


Rust's global variable story

Rust has several ways to declare global data, each with different properties:

#![allow(unused)]
fn main() {
// 1. const — compile-time constant, inlined at every use site
const MAX_SIZE: usize = 1024;
// Not a memory location — the value 1024 is copied wherever MAX_SIZE appears

// 2. static — true global variable, has a fixed address in .data or .bss
static COUNTER: i32 = 0;
// Cannot be mutated without unsafe or interior mutability

// 3. static mut — mutable global, requires unsafe to access
static mut DANGER: i32 = 0;
unsafe { DANGER += 1; }  // You asked for it

// 4. LazyLock — initialized on first access (like lazy_static!)
use std::sync::LazyLock;
static CONFIG: LazyLock<String> = LazyLock::new(|| {
    std::fs::read_to_string("/etc/myapp.conf").unwrap()
});
// First access initializes it. Thread-safe. No unsafe.
}

Kind        Section      Mutable?                 Thread-safe?  When initialized
──────────  ───────────  ───────────────────────  ────────────  ────────────────
const       Inlined      No                       N/A           Compile time
static      .data/.bss   No                       Yes           Load time
static mut  .data/.bss   Yes (unsafe)             No            Load time
LazyLock    .bss + heap  Via interior mutability  Yes           First access

💡 Fun Fact

static mut is so dangerous that the Rust community is actively discussing deprecating it. Every access requires unsafe, and it's a common source of data races. The recommended alternatives are atomics (AtomicI32 and friends), locks (Mutex<T>), or lazy initialization (LazyLock<T>) — all of which allow safe shared state without unsafe.


Seeing sections with readelf

You can inspect exactly what sections your binary contains:

$ gcc -o globals globals.c
$ readelf -S globals
Section Headers:
  [Nr] Name      Type       Address          Off    Size   ES Flg
  ...
  [16] .text     PROGBITS   0000000000001060 001060 000185 00  AX
  [18] .rodata   PROGBITS   0000000000002000 002000 000040 00   A
  [24] .data     PROGBITS   0000000000004000 003000 000010 00  WA
  [25] .bss      NOBITS     0000000000004010 003010 000004 00  WA

The flags tell you everything:

  • A = allocated (loaded into memory)
  • X = executable
  • W = writable

  • .text:   AX (allocated + executable) — code
  • .rodata: A  (allocated, not writable, not executable) — read-only data
  • .data:   WA (writable + allocated) — read-write globals
  • .bss:    WA (writable + allocated) — but NOBITS on disk


🔧 Task

  1. Write a C program with:

    • An initialized global (int x = 42;)
    • An uninitialized global (int y;)
    • A large uninitialized array (int big[100000];)
    • A string literal ("hello")
    • A const array (const int fib[] = {1,1,2,3,5,8};)
  2. Compile it: gcc -o sections sections.c

  3. Run readelf -S sections and find .data, .bss, .rodata. Note the sizes.

  4. Predict: .bss should be at least 400,000 bytes (100,000 ints * 4 bytes). Is it? Check that the binary file size did NOT grow by 400KB: ls -la sections.

  5. Try modifying the string literal at runtime: char *s = "hello"; s[0] = 'H'; Compile and run. Observe the SIGSEGV. Then try the same modification on a char[] array instead. It works. Explain why in terms of which section each lives in.

  6. Do the same in Rust. Use static, const, and a string literal. Compile with rustc -o rust_sections your_file.rs and inspect with readelf -S rust_sections.

Undefined Behavior: C's Silent Killer

Type this right now

// save as ub_demo.c — compile TWICE with different optimization levels
#include <stdio.h>
#include <limits.h>

int main() {
    int x = INT_MAX;
    printf("x = %d\n", x);
    printf("x + 1 = %d\n", x + 1);

    if (x + 1 > x) {
        printf("Overflow detected? x + 1 > x is TRUE\n");
    } else {
        printf("No overflow? x + 1 > x is FALSE\n");
    }
    return 0;
}
$ gcc -O0 -o ub_O0 ub_demo.c && ./ub_O0
x = 2147483647
x + 1 = -2147483648
No overflow? x + 1 > x is FALSE

$ gcc -O2 -o ub_O2 ub_demo.c && ./ub_O2
x = 2147483647
x + 1 = -2147483648
Overflow detected? x + 1 > x is TRUE

Read that again. The same code, compiled with different flags, produces opposite results. The -O0 build wraps around and gives FALSE. The -O2 build says TRUE — the compiler removed the comparison entirely because signed overflow is undefined behavior, so the compiler assumes it cannot happen, and therefore x + 1 > x must always be true.

This isn't a bug in GCC. This is exactly what the C standard permits. Welcome to undefined behavior.


What undefined behavior actually means

The C standard defines three categories of problematic code:

  • Implementation-defined: The behavior varies by compiler, but the compiler must document what it does (e.g., size of int, right-shifting signed numbers).
  • Unspecified: The compiler can choose among several options, but doesn't have to tell you (e.g., order of evaluation of function arguments).
  • Undefined: The standard imposes no requirements whatsoever.

That last one is what kills you. When your program has undefined behavior, the compiler is allowed to:

  • Produce the result you expected
  • Produce a different result
  • Crash
  • Delete your code
  • Make demons fly out of your nose (the community joke)
  • Optimize as if the UB can never happen

That last point is the real danger. Modern optimizing compilers actively exploit undefined behavior. They don't just ignore your bug — they use it to justify removing your safety checks.


Example 1: signed integer overflow

The C standard says signed integer overflow is undefined. The compiler exploits this:

int check_overflow(int x) {
    if (x + 1 > x)        // compiler: "signed overflow can't happen"
        return 1;          //           "so x + 1 is always > x"
    else                   //           "this branch is dead code"
        return 0;          //           "I'll remove it"
}

At -O2, GCC compiles this to:

check_overflow:
    mov eax, 1             ; just return 1, always
    ret

The entire if statement is gone. The compiler reasoned: "Signed overflow is UB. I'm allowed to assume the programmer never does UB. Therefore x + 1 is always greater than x. Therefore the function always returns 1."

The logic is airtight given the assumption. The assumption is wrong for INT_MAX. But the standard says a conforming program never computes INT_MAX + 1, so the compiler is technically correct.

🧠 What do you think happens?

What if you change int to unsigned int in the example above? Unsigned overflow IS defined in C — it wraps around modulo 2^n. The compiler can no longer optimize away the check. Try it.


Example 2: null pointer "optimization"

void process(int *ptr) {
    int value = *ptr;       // dereference ptr — if ptr is NULL, this is UB
    if (ptr == NULL) {      // null check AFTER dereference
        printf("ptr is NULL!\n");
        return;
    }
    printf("value = %d\n", value);
}

A human reads this and thinks "the null check is there for safety." The compiler reads it differently:

Compiler's reasoning:
1. Line 2 dereferences ptr
2. If ptr were NULL, that would be UB
3. I'm allowed to assume UB doesn't happen
4. Therefore ptr is NOT NULL at line 2
5. Therefore ptr is NOT NULL at line 3
6. Therefore the NULL check always fails
7. I'll remove it

At -O2:

process:
    mov    eax, DWORD PTR [rdi]    ; dereference ptr (no null check)
    ; the if (ptr == NULL) block is GONE
    mov    esi, eax
    lea    rdi, .LC0               ; "value = %d\n"
    jmp    printf

The null check was deleted. If ptr is NULL, the program crashes with no diagnostic. The "safety" code you carefully wrote is not in the binary.

This is not a contrived example. The Linux kernel had exactly this bug in a network driver (CVE-2009-1897). A null check was removed by GCC because a dereference appeared earlier in the function.


Example 3: use-after-free, the optimizer's playground

#include <stdlib.h>
#include <stdio.h>

int main() {
    int *p = malloc(sizeof(int));
    *p = 42;
    free(p);

    // UB: accessing freed memory
    // The compiler may reuse p's register for something else,
    // or the allocator may recycle the memory.
    printf("value = %d\n", *p);
    return 0;
}

This might print 42, or 0, or 1735289204, or crash, depending on:

  • Optimization level
  • Allocator implementation
  • Whether another thread allocated between free and *p
  • Phase of the moon (only slightly joking)

The insidious part: it might work perfectly in testing and crash only in production. UB doesn't guarantee a crash. It guarantees nothing.


Example 4: uninitialized variables

int foo() {
    int x;      // uninitialized — reading it is UB
    return x;   // what value?
}

You might expect "whatever was on the stack." But the compiler is allowed to assume you never read uninitialized memory. This means:

#include <stdio.h>

void bar(void) {
    int x;              // uninitialized — reading it is UB
    if (x == 0) {
        printf("zero\n");
    }
    if (x != 0) {
        printf("nonzero\n");
    }
}

The compiler can print both, neither, or one of these messages. It's not required to be consistent — x doesn't have to have the same value in both checks. GCC and Clang have both been observed making different choices at different optimization levels.

💡 Fun Fact

The phrase "nasal demons" comes from a 1992 comp.std.c Usenet post. Someone argued that undefined behavior could cause "demons to fly out of your nose." It became a running joke, but it captures a real truth: the standard truly places NO constraints on what happens. The joke persists because the reality is hard to believe.


UB on Godbolt: see it live

You can see these optimizations yourself at godbolt.org. Paste the signed overflow example, select x86-64 GCC with -O2, and watch the assembly. The comparison vanishes.

Try these experiments:

  1. Change -O2 to -O0 — the comparison reappears
  2. Change int to unsigned int — the comparison stays even at -O2
  3. Add -fwrapv (tells GCC to treat signed overflow as wrapping) — the comparison stays

-fwrapv is GCC's escape hatch: it makes signed overflow defined (as two's complement wrapping). Some projects (including the Linux kernel) compile with -fwrapv to eliminate this entire class of UB.


Rust's answer: no undefined behavior in safe code

Rust makes a bold guarantee: safe Rust has no undefined behavior.

This isn't a suggestion or a best practice. It's a hard property of the language. The compiler rejects programs that could exhibit UB, or inserts runtime checks where compile-time prevention isn't possible.

Signed integer overflow:

fn main() {
    let x: i32 = i32::MAX;
    let y = x + 1;   // In debug: PANIC at runtime
    println!("{}", y);
}
$ cargo run
thread 'main' panicked at 'attempt to add with overflow'

$ cargo run --release
# In release mode: wraps to -2147483648 (defined behavior, not UB)

Rust made a choice: in debug mode, overflow panics (catches bugs). In release mode, overflow wraps (for performance). Either way, the behavior is defined. The compiler cannot exploit it for optimizations that break your code.

Null pointers:

#![allow(unused)]
fn main() {
// Rust doesn't have null pointers.
// Option<&T> is the equivalent, and you MUST check it:
fn process(ptr: Option<&i32>) {
    match ptr {
        Some(value) => println!("value = {}", value),
        None => println!("ptr is None!"),
    }
}
// The compiler enforces the check. You can't dereference without matching.
}

Uninitialized variables:

fn main() {
    let x: i32;
    println!("{}", x);  // COMPILE ERROR: use of possibly-uninitialized variable
}
error[E0381]: used binding `x` isn't initialized
 --> src/main.rs:3:20
  |
2 |     let x: i32;
  |         - binding declared here but left uninitialized
3 |     println!("{}", x);
  |                    ^ `x` used here but it isn't initialized

No guessing. No "whatever was on the stack." The compiler rejects it outright.


The unsafe boundary

Rust does allow operations that could cause UB — but only inside unsafe blocks:

fn main() {
    let x: i32;

    // This is safe Rust — compiler prevents UB
    // println!("{}", x);  // won't compile

    unsafe {
        // In unsafe, you can do things that might cause UB
        let ptr: *const i32 = 0x1234 as *const i32;
        // let val = *ptr;  // UB if address is invalid
    }
}

The unsafe keyword is:

  • Opt-in: You must explicitly ask for it
  • Explicit: It marks exactly which code has elevated risk
  • Auditable: You can grep for unsafe in any codebase
  • Contained: The responsibility for correctness is localized

The idea is that 95% of your code is safe Rust (no UB possible), and 5% is unsafe (UB is your problem). You audit the 5%. In C, you audit 100%.

🧠 What do you think happens?

If you have 10,000 lines of Rust and UB occurs, where do you look? You grep for unsafe — maybe 50 lines. In C, if you have 10,000 lines and UB occurs, where do you look? Everywhere. That's the practical value of the safe/unsafe boundary.


UB in unsafe Rust: still possible

unsafe Rust can still invoke UB. The most common causes:

#![allow(unused)]
fn main() {
unsafe {
    // 1. Dereferencing a raw pointer to invalid memory
    let ptr: *const i32 = std::ptr::null();
    let _val = *ptr;  // UB: null dereference

    // 2. Creating an invalid reference
    let _ref: &i32 = &*ptr;  // UB: references must never be null

    // 3. Data races
    // Two threads writing to the same memory without synchronization

    // 4. Breaking aliasing rules
    // Having &T and &mut T to the same data simultaneously

    // 5. Calling a function with wrong ABI or invalid arguments
}
}

Unsafe Rust is roughly as dangerous as C for the code inside the unsafe block. The difference is that the blast radius is contained — safe code can rely on invariants that unsafe code must uphold.


The complete comparison

Undefined Behavior in C               What happens in C                         Rust equivalent               What happens in Rust
────────────────────────────────────  ────────────────────────────────────────  ────────────────────────────  ──────────────────────────────────────
Signed overflow: INT_MAX + 1          Compiler removes your checks              i32::MAX + 1                  Debug: panic. Release: wraps (defined)
Null deref: *NULL                     Compiler removes null checks              No null pointers              Option<&T> forces you to check
Use-after-free: free(p); *p           Silent corruption                         drop(p); *p                   Compile error
Uninitialized read: int x; return x;  Anything — optimizer goes wild            let x: i32; x                 Compile error
Buffer overflow: a[11] on 10 ints     Silent corruption                         a[11] on 10 elements          Panic: index out of bounds
Double-free: free(p); free(p)         Allocator corruption                      drop(p); drop(p)              Compile error
Data race: two threads, no sync       Torn reads, corruption                    &mut T aliasing rules         Compile error
Invalid enum value                    Optimizer makes wrong branch assumptions  Only creatable via unsafe     Safe code can't create one

Compiler flags that help

If you must write C, these flags make UB less likely to bite you:

# Compile-time warnings:
gcc -Wall -Wextra -Wpedantic -Werror

# Runtime sanitizers (debug builds):
gcc -fsanitize=undefined     # UBSan: catches UB at runtime
gcc -fsanitize=address       # ASan: catches memory bugs
gcc -fsanitize=thread        # TSan: catches data races

# UB-safe overflow handling:
gcc -fwrapv                  # Signed overflow wraps (like unsigned)
gcc -ftrapv                  # Signed overflow traps (abort)

UBSan in action:

$ gcc -fsanitize=undefined -o ub_demo ub_demo.c && ./ub_demo
ub_demo.c:7:22: runtime error: signed integer overflow:
  2147483647 + 1 cannot be represented in type 'int'

UBSan caught it. But remember: sanitizers only catch UB that actually executes. Code paths you don't test can still harbor UB. Rust's approach — preventing UB at compile time — catches it whether you test that path or not.

💡 Fun Fact

The LLVM optimizer (used by both Clang and rustc) has a concept called "poison values." When you compute something with UB (like signed overflow), the result is "poison" — a special marker that infects everything it touches. If a poison value reaches a branch condition, both branches become valid. If it reaches a store, the stored value is undefined. This is the formal mechanism by which UB propagates through your program.


🔧 Task

  1. Compile the signed overflow example from the start of this chapter at -O0 and -O2:

    gcc -O0 -o ub_O0 ub_demo.c && ./ub_O0
    gcc -O2 -o ub_O2 ub_demo.c && ./ub_O2
    

    Observe the different output. The compiler is not broken — it's exploiting UB.

  2. Add -fwrapv to the -O2 build:

    gcc -O2 -fwrapv -o ub_wrap ub_demo.c && ./ub_wrap
    

    Does the output match -O0 now? Why?

  3. Compile with UBSan:

    gcc -fsanitize=undefined -o ub_san ub_demo.c && ./ub_san
    

    Read the sanitizer's report carefully.

  4. Write the equivalent in Rust:

    fn main() {
        let x: i32 = i32::MAX;
        let y = x + 1;
        println!("{}", y);
    }

    Run in debug mode (cargo run) and release mode (cargo run --release). Compare. Neither is undefined behavior — one panics, the other wraps. Both are specified outcomes.

  5. Challenge: Find the null-pointer example on Godbolt. Compile with -O0 and -O2. At -O2, confirm the null check is deleted from the assembly. Then add -fno-delete-null-pointer-checks and verify it comes back.

ELF: Dissecting Your Executable

Type this right now

// save as hello.c, compile: gcc -o hello hello.c
#include <stdio.h>
int main() {
    printf("Hello, ELF!\n");
    return 0;
}
xxd hello | head -4
00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............
00000010: 0300 3e00 0100 0000 6010 0000 0000 0000  ..>.....`.......
00000020: 4000 0000 0000 0000 9839 0000 0000 0000  @........9......
00000030: 0000 0000 4000 3800 0d00 4000 1f00 1e00  ....@.8...@.....

That's the first 64 bytes of your compiled program. Every single byte has meaning. By the end of this chapter, you'll be able to read them like a sentence.


The magic bytes

Look at the very first four bytes: 7f 45 4c 46.

  • 0x7f — a non-printable byte, chosen specifically so ELF files can't be confused with text
  • 0x45 = E
  • 0x4c = L
  • 0x46 = F

Every ELF file on your system — every binary, every .so, every .o — starts with \x7fELF.

# Try it yourself — check /bin/ls
xxd /bin/ls | head -1

The kernel checks these four bytes first. If they don't match, execve() fails immediately.

💡 Fun Fact: The 0x7f byte was chosen by the original Unix System V designers because it's the ASCII DEL character — the highest 7-bit ASCII value, and non-printable. It makes it nearly impossible to accidentally create a file that "looks like" an ELF binary.


The ELF header: 64 bytes that describe everything

The ELF header is a fixed-size structure sitting at offset 0. On a 64-bit system, it's exactly 64 bytes. Here's the layout:

Offset  Size  Field               What it means
──────  ────  ──────────────────  ──────────────────────────────────────
0x00    4     e_ident[EI_MAG]     Magic: 7f 45 4c 46 (\x7fELF)
0x04    1     e_ident[EI_CLASS]   Class: 1=32-bit, 2=64-bit
0x05    1     e_ident[EI_DATA]    Endianness: 1=little, 2=big
0x06    1     e_ident[EI_VERSION] ELF version (always 1)
0x07    1     e_ident[EI_OSABI]   OS/ABI: 0=UNIX System V
0x08    8     e_ident[EI_PAD]     Padding (zeros)
0x10    2     e_type              Type: 1=relocatable 2=exec 3=shared
0x12    2     e_machine           Machine: 0x3e=x86-64, 0xb7=aarch64
0x14    4     e_version           ELF version (again, always 1)
0x18    8     e_entry             Entry point address
0x20    8     e_phoff             Program header table offset
0x28    8     e_shoff             Section header table offset
0x30    4     e_flags             Processor-specific flags
0x34    2     e_ehsize            ELF header size (64 for 64-bit)
0x36    2     e_phentsize         Program header entry size
0x38    2     e_phnum             Number of program headers
0x3a    2     e_shentsize         Section header entry size
0x3c    2     e_shnum             Number of section headers
0x3e    2     e_shstrndx          Section header string table index

That's it. 64 bytes. And from them, the kernel knows everything it needs to begin loading your program.


Reading it the easy way: readelf -h

readelf -h hello
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Position-Independent Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x1060
  Start of program headers:          64 (bytes into file)
  Start of section headers:          14744 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         13
  Size of section headers:           64 (bytes)
  Number of section headers:         31
  Section header string table index: 30

Every field maps directly to the hex bytes you saw in xxd. The entry point 0x1060 is the virtual address where execution begins — not main, but _start, the C runtime startup code that calls main.

🧠 What do you think happens? Why is the Type DYN (shared object) instead of EXEC? Modern GCC produces Position-Independent Executables by default. The binary can be loaded at any base address. This enables ASLR — we'll cover this in Chapter 14.


The big picture: ELF file layout

┌──────────────────────────────┐  offset 0
│         ELF Header           │  64 bytes — the "table of contents"
│  (magic, type, entry point)  │
├──────────────────────────────┤  offset 64 (0x40)
│    Program Header Table      │  tells the LOADER how to map segments
│  (array of segment entries)  │
├──────────────────────────────┤
│                              │
│        .text section         │  your compiled machine code
│                              │
├──────────────────────────────┤
│       .rodata section        │  read-only data ("Hello, ELF!\n")
├──────────────────────────────┤
│        .data section         │  initialized global variables
├──────────────────────────────┤
│         .bss section         │  uninitialized globals (zero bytes on disk)
├──────────────────────────────┤
│      .symtab section         │  symbol table
├──────────────────────────────┤
│      .strtab section         │  string table (symbol names)
├──────────────────────────────┤
│   .debug_* sections (if -g)  │  DWARF debug info
├──────────────────────────────┤
│     ... other sections ...   │
├──────────────────────────────┤
│    Section Header Table      │  tells the LINKER about each section
│  (array of section entries)  │
└──────────────────────────────┘

The ELF header points to the program header table (for the loader) and the section header table (for the linker and debugger). Everything else sits between them.


Program headers: what the loader sees

readelf -l hello

This shows the segments — the chunks the kernel maps into memory. Each segment has a type, permissions, a file offset, and a virtual address. We'll dissect these in the next chapter.

Section headers: what the linker sees

readelf -S hello

This shows the sections — the fine-grained pieces the compiler and linker work with. You'll see .text, .data, .rodata, .bss, .symtab, and many more.


C vs Rust: why the Rust binary is bigger

Let's compile the same program in Rust:

// save as hello.rs, compile: rustc hello.rs
fn main() {
    println!("Hello, ELF!");
}
$ ls -la hello         # C binary
-rwxr-xr-x 1 user user   15960 Feb 19 10:00 hello

$ ls -la hello_rs      # Rust binary (renamed for comparison)
-rwxr-xr-x 1 user user 4374432 Feb 19 10:01 hello_rs

The Rust binary is ~270x larger. Why?

readelf -S hello    | wc -l    # C:    ~31 sections
readelf -S hello_rs | wc -l    # Rust: ~43 sections

Three reasons:

  1. Static linking of the standard library. Rust statically links libstd by default. The C binary dynamically links libc. All that code for println!, formatting, panic handling, and the Rust runtime gets baked in.

  2. More debug info. Rust emits richer debug sections (.debug_info, .debug_abbrev, .debug_line, .debug_str, etc.) even without explicit -g.

  3. Monomorphized generics. Rust generates specialized code for each concrete type used with generics. println! alone pulls in a substantial amount of formatting machinery.


Strip it down

$ strip hello_rs -o hello_rs_stripped
$ ls -la hello_rs hello_rs_stripped
-rwxr-xr-x 1 user user 4374432 Feb 19 10:01 hello_rs
-rwxr-xr-x 1 user user  311496 Feb 19 10:02 hello_rs_stripped

strip removes symbol tables and debug info — sections the runtime doesn't need. The binary still runs, but you can no longer debug it with meaningful function names.

$ strip hello -o hello_stripped
$ ls -la hello hello_stripped
-rwxr-xr-x 1 user user 15960 Feb 19 10:00 hello
-rwxr-xr-x 1 user user 14408 Feb 19 10:02 hello_stripped

The C binary barely shrinks — it was already small because it delegates everything to the shared libc.so.

💡 Fun Fact: Production Rust binaries are typically compiled with cargo build --release, which enables optimizations and can be further reduced with strip, lto = true in Cargo.toml, and opt-level = "z" for size optimization. A release + stripped Rust hello world can get down to ~300 KB.


Finding your function: the symbol table

readelf -s hello | grep main
    34: 0000000000001149    35 FUNC    GLOBAL DEFAULT   16 main

There it is. main lives at virtual address 0x1149, is 35 bytes long, is a function (FUNC), has global visibility, and sits in section index 16 (which is .text).

Now for Rust:

readelf -s hello_rs | grep 'main'
  2156: 0000000000008280    47 FUNC    GLOBAL DEFAULT   14 main
  5731: 00000000000082b0   103 FUNC    LOCAL  DEFAULT   14 hello_rs::main

Rust has two entries: the C-compatible main that the C runtime calls, and hello_rs::main which is your actual Rust function. The first one is a thin wrapper that calls the second.


The entry point is NOT main

Remember the entry point from readelf -h? It was 0x1060, but main is at 0x1149. What's at 0x1060?

readelf -s hello | grep ' _start'
     1: 0000000000001060     0 FUNC    GLOBAL DEFAULT   16 _start

_start is the real entry point. It's provided by the C runtime (crt1.o). It sets up the initial stack and registers, then hands off to __libc_start_main, which initializes the C library, calls main, and calls exit() with main's return value.

Kernel jumps here
       │
       v
   _start          (from crt1.o)
       │
       ├── __libc_start_main()
       │       │
       │       ├── Initialize libc
       │       ├── Call constructors
       │       ├── Call main()  ◄── YOUR CODE
       │       ├── Call destructors
       │       └── Call exit()
       │
       └── (never reached)

Rust's entry point

Rust has its own startup sequence, but it still begins with _start and ends up calling the system's C runtime:

_start → __libc_start_main → main (Rust shim) → std::rt::lang_start
                                                       │
                                                       └── your fn main()

Same hardware. Same ELF. Same kernel. Different runtime path to reach your code.


🔧 Task: Read the ELF header by hand

  1. Compile a C program: gcc -o hello hello.c
  2. Hex dump the first 64 bytes: xxd hello | head -4
  3. Using the field table from this chapter, identify each field manually:
    • Bytes 0-3: Magic number. What are they?
    • Byte 4: Class. Is it 32-bit or 64-bit?
    • Byte 5: Endianness. Little or big?
    • Bytes 16-17: ELF type. What type is it?
    • Bytes 18-19: Machine. What architecture?
    • Bytes 24-31: Entry point. What address? (Remember: little-endian!)
  4. Verify your answers with readelf -h hello.
  5. Repeat with a Rust binary. Do the fields differ?

This is how forensic analysts and reverse engineers read binaries — byte by byte.

Sections vs Segments: Two Views of One File

Type this right now

# Compile a simple C program
gcc -o hello hello.c

# Two different views of the same binary:
echo "=== SECTIONS (compiler/linker view) ==="
readelf -S hello | head -40

echo ""
echo "=== SEGMENTS (loader/kernel view) ==="
readelf -l hello

Run both commands. You'll see two completely different lists describing the same file. By the end of this chapter, you'll understand why both exist and how they relate.


The key insight

An ELF file has two parallel indexing systems:

  • Sections — the compiler's and linker's view. Fine-grained. Named. Used during compilation and linking.
  • Segments — the loader's and kernel's view. Coarse-grained. Permission-based. Used when mapping the binary into memory.
         Compile time                         Run time
         ───────────                          ────────
    ┌─────────────────────┐            ┌──────────────────┐
    │  Section Header     │            │  Program Header   │
    │  Table              │            │  Table             │
    │                     │            │                    │
    │  .text              │            │  LOAD (r-x)        │
    │  .rodata            │──────────► │  (code + rodata)   │
    │  .data              │            │                    │
    │  .bss               │──────────► │  LOAD (rw-)        │
    │  .symtab            │            │  (data + bss)      │
    │  .strtab            │            │                    │
    │  .debug_*           │            │  (not loaded)      │
    └─────────────────────┘            └──────────────────┘
      Many small pieces                  Few big chunks

The linker needs sections so it can merge .text from file A with .text from file B. The kernel doesn't care about any of that — it just needs to know which bytes go to which addresses with which permissions.


Sections: the full catalog

Here are the sections you'll encounter most often:

Section     Contents                         Permissions
─────────   ──────────────────────────────   ───────────
.text       Machine code (your functions)    r-x
.rodata     Read-only data (string literals) r--
.data       Initialized global variables     rw-
.bss        Uninitialized globals (zeroed)   rw-
.symtab     Symbol table (function names)    ---  (not loaded)
.strtab     String table (symbol names)      ---  (not loaded)
.dynsym     Dynamic symbol table             r--
.dynstr     Dynamic string table             r--
.plt        Procedure Linkage Table          r-x
.got        Global Offset Table              rw-
.debug_*    DWARF debug information          ---  (not loaded)
.rel.text   Relocations for .text            ---  (not loaded)
.init       Startup code                     r-x
.fini       Cleanup code                     r-x

Not all sections get loaded into memory. .symtab, .strtab, and .debug_* exist only in the file — the kernel ignores them entirely. They're for the linker, debugger, and tools like nm.

🧠 What do you think happens? If .bss holds uninitialized globals that are all zero, how many bytes does it occupy in the file? Answer: zero. The section header records its size, but no actual bytes are stored. The kernel allocates and zeroes the memory at load time. This is why .bss saves disk space.


Segments: what the kernel actually maps

readelf -l hello
Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x0002d8 0x0002d8 R   0x8
  INTERP         0x000318 0x0000000000000318 0x0000000000000318 0x00001c 0x00001c R   0x1
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x000628 0x000628 R   0x1000
  LOAD           0x001000 0x0000000000001000 0x0000000000001000 0x000185 0x000185 R E 0x1000
  LOAD           0x002000 0x0000000000002000 0x0000000000002000 0x000114 0x000114 R   0x1000
  LOAD           0x002db8 0x0000000000003db8 0x0000000000003db8 0x000258 0x000260 RW  0x1000
  DYNAMIC        0x002dc8 0x0000000000003dc8 0x0000000000003dc8 0x0001f0 0x0001f0 RW  0x8
  NOTE           0x000338 0x0000000000000338 0x0000000000000338 0x000030 0x000030 R   0x8
  NOTE           0x000368 0x0000000000000368 0x0000000000000368 0x000044 0x000044 R   0x4
  GNU_EH_FRAME   0x00200c 0x000000000000200c 0x000000000000200c 0x00003c 0x00003c R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x002db8 0x0000000000003db8 0x0000000000003db8 0x000248 0x000248 R   0x1

The key segment types:

Type        Purpose
─────────   ─────────────────────────────────────────────────────────
LOAD        Map these bytes into memory. This is the main event.
INTERP      Path to the dynamic linker (/lib64/ld-linux-x86-64.so.2)
DYNAMIC     Dynamic linking information (needed libraries, symbol tables)
NOTE        Metadata (build ID, ABI tag)
GNU_STACK   Stack permissions (notably: no execute)
GNU_RELRO   Mark GOT as read-only after relocation (security)

The mapping: sections merge into segments

This is where the two views connect. readelf -l also shows you which sections fall into each segment:

 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.gnu.property .note.gnu.build-id .note.ABI-tag
          .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
   03     .init .plt .plt.got .plt.sec .text .fini
   04     .rodata .eh_frame_hdr .eh_frame
   05     .init_array .fini_array .dynamic .got .data .bss
   06     .dynamic
   07     .note.gnu.property
   08     .note.gnu.build-id .note.ABI-tag
   09     .eh_frame_hdr
   10     (empty — GNU_STACK has no sections)
   11     .init_array .fini_array .dynamic .got

Now the picture becomes clear:

SECTIONS (file)                          SEGMENTS (memory)
─────────────────                        ─────────────────

┌─────────────┐
│   .init     │───┐
├─────────────┤   │
│   .plt      │───┤
├─────────────┤   ├──►  LOAD segment 03 (R E)
│   .text     │───┤     Executable code
├─────────────┤   │
│   .fini     │───┘
├─────────────┤
│   .rodata   │───┬──►  LOAD segment 04 (R)
├─────────────┤   │     Read-only data
│  .eh_frame  │───┘
├─────────────┤
│   .data     │───┐
├─────────────┤   ├──►  LOAD segment 05 (RW)
│   .bss      │───┘     Read-write data
├─────────────┤
│  .symtab    │         NOT LOADED
├─────────────┤         (debug/linker use only)
│  .strtab    │         NOT LOADED
├─────────────┤
│  .debug_*   │         NOT LOADED
└─────────────┘

Multiple sections with the same permissions merge into one segment. The kernel doesn't need to know the difference between .init and .text — both are executable code, so they get one mmap() call with PROT_READ | PROT_EXEC.


Why two views?

The split exists because compilation and execution have different needs.

The linker needs to:

  • Merge .text from dozens of object files into one .text
  • Merge .data from dozens of object files into one .data
  • Apply relocations section by section
  • Keep debug info separate from code

The kernel needs to:

  • Map as few memory regions as possible (fewer page table entries)
  • Set permissions per region (read, write, execute)
  • Know the entry point
  • Know where the dynamic linker is

Sections are the how (fine-grained building blocks). Segments are the what (what goes where in memory with what permissions).

💡 Fun Fact: A fully stripped, statically linked binary can have zero sections and still run. The kernel only reads program headers. You can literally delete the section header table with a hex editor and the binary works fine. Tools like readelf -S will complain, but the kernel won't even notice.


Seeing your code in .text

objdump -d hello | grep -A 15 '<main>'
0000000000001149 <main>:
    1149:   f3 0f 1e fa             endbr64
    114d:   55                      push   rbp
    114e:   48 89 e5                mov    rbp,rsp
    1151:   48 8d 05 ac 0e 00 00    lea    rax,[rip+0xeac]  # 2004 <_IO_stdin_used+0x4>
    1158:   48 89 c7                mov    rdi,rax
    115b:   e8 f0 fe ff ff          call   1050 <puts@plt>
    1160:   b8 00 00 00 00          mov    eax,0x0
    1165:   5d                      pop    rbp
    1166:   c3                      ret

Address 0x1149 — that's in the .text section, which is inside the LOAD segment with R E permissions. The lea instruction at 0x1151 references address 0x2004 — that's the "Hello, ELF!\n" string in .rodata, inside the read-only LOAD segment.

Everything maps. Addresses in the code point to addresses in specific sections, which live in specific segments, which get specific permissions.


Rust comparison

// save as hello.rs, compile: rustc hello.rs
fn main() {
    println!("Hello, ELF!");
}
readelf -l hello_rs | grep LOAD
  LOAD  0x000000 0x0000000000000000 ... R   0x1000
  LOAD  0x009000 0x0000000000009000 ... R E 0x1000
  LOAD  0x05b000 0x000000000005b000 ... R   0x1000
  LOAD  0x07e458 0x000000000007f458 ... RW  0x1000

Same pattern: read-only, executable, read-only data, read-write. The Rust binary just has more of everything because it includes the standard library.


🔧 Task: Map sections to segments by hand

  1. Compile: gcc -o hello hello.c
  2. Run readelf -S hello — note the address and flags of each section
  3. Run readelf -l hello — note the address range and flags of each segment
  4. For each section, determine which segment contains it:
    • Does the section's address fall within the segment's [VirtAddr, VirtAddr+MemSiz) range?
    • Do the permissions match? (A writable section should be in a writable segment)
  5. Verify your mapping against the "Section to Segment mapping" output at the bottom of readelf -l
  6. Find a section that is NOT in any segment. Why isn't it loaded?

This exercise makes the two-view model concrete. Once you can do this mapping by hand, the linker and loader will never be mysterious again.

Compilation and Linking: Source to Binary

Type this right now

cat > greet.c << 'EOF'
#include <stdio.h>
void greet(const char *name) { printf("Hello, %s!\n", name); }
EOF

cat > main.c << 'EOF'
void greet(const char *name);
int main() { greet("world"); return 0; }
EOF

gcc -c greet.c -o greet.o
gcc -c main.c  -o main.o
gcc greet.o main.o -o hello

echo "=== main.o symbols ===" && nm main.o
echo "=== After linking ==="  && nm hello | grep -E 'main|greet'
=== main.o symbols ===
                 U greet
0000000000000000 T main
=== After linking ===
0000000000001149 T greet
0000000000001172 T main

main.o has an undefined symbol greet (the U). After linking, it has a real address. That's the entire purpose of the linker — resolving references between files.


The C compilation pipeline

  hello.c          Your source code
     │
     │  gcc -E      Preprocessor (#include, #define, #ifdef)
     v
  hello.i          Pure C, all macros expanded
     │
     │  gcc -S      Compiler (cc1) — C to assembly
     v
  hello.s          Human-readable assembly
     │
     │  gcc -c      Assembler (as) — assembly to machine code
     v
  hello.o          Object file (relocatable ELF)
     │
     │  gcc          Linker (ld) — resolves symbols, assigns addresses
     v
  a.out            Executable (final ELF)

You can stop at any stage. Let's see each output.

Preprocessing (gcc -E): A hello.c with #include <stdio.h> expands to ~700 lines. Every header pasted in, every macro expanded. The output is pure C.

Compilation (gcc -S):

main:
    endbr64
    pushq   %rbp
    movq    %rsp, %rbp
    leaq    .LC0(%rip), %rdi
    call    puts@PLT
    movl    $0, %eax
    popq    %rbp
    ret
.LC0:
    .string "Hello, ELF!"

Notice puts@PLT — the compiler doesn't know where puts lives. It emits a reference the linker will resolve.

Assembly (gcc -c):

hello.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped

The type is relocatable — machine code with provisional addresses starting at zero.

Linking (gcc):

hello: ELF 64-bit LSB pie executable, x86-64, dynamically linked,
       interpreter /lib64/ld-linux-x86-64.so.2, ...

The linker resolves all symbols, assigns final addresses, produces the executable.


The Rust pipeline

  hello.rs         Your source code
     │
     │  rustc        Parser → AST → HIR (desugared, type-checked)
     v                            → MIR (borrow-checked, monomorphized)
  LLVM IR          LLVM's intermediate representation
     │
     │  LLVM        Machine code generation
     v
  hello.o          Object file (same format as C!)
     │
     │  linker      Same system linker (ld / lld)
     v
  hello            Executable (final ELF)

After LLVM, Rust and C produce the same kind of object file. They use the same linker.

💡 Fun Fact: You can mix C and Rust in one binary. Compile C to .o, Rust to .o, link them together. The linker doesn't care what language produced the object file — it only sees symbols, sections, and relocations.


Object files: code with holes

objdump -d main.o
0000000000000000 <main>:
   0:   f3 0f 1e fa             endbr64
   4:   55                      push   rbp
   5:   48 89 e5                mov    rbp,rsp
   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]    # address TBD
   f:   48 89 c7                mov    rdi,rax
  12:   e8 00 00 00 00          call   17 <main+0x17>   # address TBD
  17:   b8 00 00 00 00          mov    eax,0x0
  1c:   5d                      pop    rbp
  1d:   c3                      ret

See the zeros at offsets 0x8 and 0x12? Those are holes. The lea needs the string address. The call needs greet's address. Neither is known yet.

Relocation entries tell the linker where to fill in:

readelf -r main.o
Relocation section '.rela.text' at offset 0x1e8 contains 2 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000000b  000500000002 R_X86_64_PC32     0000000000000000 .rodata - 4
000000000013  000600000004 R_X86_64_PLT32    0000000000000000 greet - 4

Two relocations. Two holes. The linker fills them.


Symbol types: the nm command

Symbol  Meaning
──────  ────────────────────────────────────
  T     Text (code) — defined in this file
  D     Data — initialized global variable
  B     BSS — uninitialized global variable
  R     Read-only data
  U     Undefined — referenced but not here
  W     Weak — can be overridden

main.o defines main (T) but needs greet (U). greet.o defines greet (T) but needs printf (U). The linker connects all the U's to the T's.

🧠 What do you think happens? What if a symbol is U in every object file and never T? The linker emits: undefined reference to 'greet'. The most common linker error. No file provided a definition.


Linking: three jobs

1. Symbol resolution — match undefined references to definitions:

main.o                     greet.o
┌────────────────┐         ┌────────────────┐
│ T main         │         │ T greet        │
│ U greet ───────┼────────►│                │
│                │         │ U printf ──────┼──► libc.so
└────────────────┘         └────────────────┘

2. Relocation — fill in placeholder addresses:

Before (main.o):   call 0x00000000    # placeholder
After  (hello):    call 0x00001149    # actual address of greet

3. Section merging — combine sections from all object files:

main.o  .text ──┐
                ├──► final .text
greet.o .text ──┘

Static vs dynamic linking

Static linking:                    Dynamic linking (default):

┌──────────────────────┐          ┌───────────────┐
│     hello_static     │          │    hello      │
│  main()              │          │  main()       │
│  greet()             │          │  greet()      │
│  printf()            │          │  printf ──────┼──► libc.so.6
│  ... all of libc ... │          └───────────────┘
└──────────────────────┘            ~16 KB, needs .so
  ~880 KB, no dependencies
gcc -static main.o greet.o -o hello_static
ldd hello_static    # "not a dynamic executable"
ldd hello           # lists libc.so.6, ld-linux, vdso

Static: everything baked in. Dynamic: smaller binary, resolved at runtime.


Rust's linking story

Rust statically links its own standard library but dynamically links the system C library — a hybrid approach.

ldd hello_rs   # shows libc.so.6, libgcc_s.so.1, ld-linux

💡 Fun Fact: Fully static Rust: rustup target add x86_64-unknown-linux-musl && rustc --target x86_64-unknown-linux-musl hello.rs. Zero dynamic dependencies. Runs on any Linux.


PLT/GOT: dynamic linking at a glance

When your binary calls printf dynamically, it uses two structures:

  • PLT (Procedure Linkage Table) — code trampolines
  • GOT (Global Offset Table) — writable address slots

Your code               PLT                    GOT
─────────              ─────                   ────
call puts@PLT ──────► puts@PLT:              ┌──────────────┐
                       jmp [GOT[puts]]───────►│ address of   │
                       ...                    │ puts in libc │
                                              └──────────────┘

On the first call, the GOT entry points to a resolver that finds the real puts, patches the GOT, and jumps there. Subsequent calls go directly through the patched GOT. Details in Chapter 14.


🔧 Task: Watch symbols resolve across files

  1. Create math.c:
    int add(int a, int b) { return a + b; }
    int multiply(int a, int b) { return a * b; }
    
  2. Create app.c:
    #include <stdio.h>
    int add(int a, int b);
    int multiply(int a, int b);
    int main() {
        printf("%d %d\n", add(3, 4), multiply(5, 6));
        return 0;
    }
    
  3. Compile separately: gcc -c math.c and gcc -c app.c
  4. Run nm math.o — add and multiply should be T (defined)
  5. Run nm app.o — add and multiply should be U (undefined)
  6. Link: gcc math.o app.o -o app
  7. Run nm app | grep -E 'add|multiply' — both now T with real addresses
  8. Run ./app — output: 7 30
  9. Bonus: Delete math.o, try linking with just app.o. The error tells you exactly which symbols are missing.

Loading: Binary Becomes Process

Type this right now

// save as hello.c, compile: gcc -o hello hello.c
#include <stdio.h>
int main() {
    printf("main is at: %p\n", (void *)main);
    return 0;
}
for i in 1 2 3 4 5; do ./hello; done
main is at: 0x55a3f1c00149
main is at: 0x564e28a00149
main is at: 0x55f84dc00149
main is at: 0x558b7e200149
main is at: 0x563420200149

Every run, main is at a different address. The binary on disk hasn't changed. The loader is moving things around. On purpose.


One syscall to rule them all: execve()

strace -f ./hello 2>&1 | head -15
execve("./hello", ["./hello"], 0x7ffd5c6b4e50 /* 55 vars */) = 0
brk(NULL)                               = 0x55a4b2e33000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2c8a3f1000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 2125824, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f2c8a000000
mmap(0x7f2c8a028000, 1531904, PROT_READ|PROT_EXEC, ...) = 0x7f2c8a028000
mmap(0x7f2c8a1f4000, 24576, PROT_READ|PROT_WRITE, ...) = 0x7f2c8a1f4000
...

That first line — execve() — is where a file on disk becomes a running process.


What the kernel does

execve("./hello", ...)
         │
         v
  1. Read first bytes → check magic: 7f 45 4c 46 = ELF? Yes.
         │
         v
  2. Read ELF header → entry point, program header offset
         │
         v
  3. For each LOAD segment:
     └── mmap(vaddr, filesz, perms, MAP_PRIVATE|MAP_FIXED, fd, offset)
         (if memsz > filesz, zero-fill the extra — that's .bss)
         │
         v
  4. INTERP segment found? Load the dynamic linker too.
         │
         v
  5. Set up stack: argc, argv[], envp[], auxv[]
         │
         v
  6. Set instruction pointer → jump to entry point
     (dynamic binary: ld-linux entry; static binary: _start)

The kernel doesn't "run" your program. It prepares the address space and jumps to the entry point. User-space code takes over from there.


ELF segments become virtual memory

ELF file on disk                        Virtual address space
────────────────                        ─────────────────────

┌─────────────────┐
│   ELF header    │                     (not mapped)
├─────────────────┤
│  LOAD (R)       │ ────mmap()────────► Read-only (headers, notes)
├─────────────────┤
│  LOAD (R-X)     │ ────mmap()────────► Executable code (.text)
├─────────────────┤
│  LOAD (R)       │ ────mmap()────────► Read-only data (.rodata)
├─────────────────┤
│  LOAD (RW)      │ ────mmap()────────► Read-write data (.data+.bss)
├─────────────────┤
│  .symtab        │     NOT MAPPED      [heap]  ────────►
│  .debug_*       │                     ...
└─────────────────┘                     [stack] ◄────────

Each LOAD segment becomes one mmap() call. ELF permissions (R, RW, RX) become mmap flags (PROT_READ, PROT_WRITE, PROT_EXEC).

🧠 What do you think happens? The .bss segment has memsz > filesz. The extra bytes don't exist in the file. The kernel allocates them in memory and fills with zeros. Uninitialized globals become zero without wasting disk space.


The dynamic linker

readelf -l hello | grep INTERP -A 1
  INTERP  0x000318 0x0000000000000318 ... 0x00001c R 0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]

This tells the kernel: "Load this program first." The kernel loads the dynamic linker into memory and jumps to its entry point. The dynamic linker then:

  1. Reads the DYNAMIC segment of your binary
  2. Opens and mmaps every shared library (libc.so.6, etc.)
  3. Resolves symbols between binary and libraries
  4. Jumps to your binary's _start

Kernel → ld-linux → _start → __libc_start_main → main()
                                                    │
                                                YOUR CODE

PLT and GOT: lazy binding

When your binary calls puts from libc, it uses:

  • PLT (Procedure Linkage Table) — small code stubs
  • GOT (Global Offset Table) — writable address slots

First call (slow path):

call puts@PLT
       │
       v
  puts@PLT:
    jmp [GOT[puts]] ──► GOT[puts] = address of resolver (not puts yet!)
    push index                │
    jmp PLT[0]                v
                        _dl_runtime_resolve:
                          1. Look up "puts" in libc
                          2. Patch GOT[puts] = real puts address
                          3. Jump to puts

Second call (fast path):

call puts@PLT
       │
       v
  puts@PLT:
    jmp [GOT[puts]] ──► GOT[puts] = 0x7f...puts (real address!)
                               │
                               v
                          puts() in libc — direct, no resolver

First call resolves the symbol. Every subsequent call is a single indirect jump.

💡 Fun Fact: This is "lazy binding" — symbols resolved on first use. Force eager binding with LD_BIND_NOW=1 ./hello to resolve everything before main runs.


ASLR: why the address changes

Run 1:  base = 0x55a3f1c00000  →  main = 0x55a3f1c00149
Run 2:  base = 0x564e28a00000  →  main = 0x564e28a00149
Run 3:  base = 0x55f84dc00000  →  main = 0x55f84dc00149
                                          ^^^^^^^^^^^
                                           Random!
                                                ^^^
                                             Always 149

The last three hex digits are always 149 — the offset of main within the binary. What changes is the base address. This is Address Space Layout Randomization (ASLR).

The kernel randomizes the base address of the executable, every shared library, the stack, the heap, and the mmap region.

Without ASLR:                    With ASLR:
┌──────────┐ 0x400000           ┌──────────┐ 0x55a3f1c00000
│  code    │                    │  code    │
├──────────┤ 0x600000           ├──────────┤ 0x55a3f1e00000
│  data    │                    │  data    │
│  ...     │                    │  ...     │
│  stack   │ 0x7ffffffde000     │  stack   │ 0x7ffd5c680000
└──────────┘                    └──────────┘
  Same every run.                 Different every run.
  Attacker knows all.             Attacker must guess.

Without ASLR, an attacker who knows your binary can predict every address. With 28-bit randomization, there are ~268 million possible base locations.


PIE: Position Independent Executable

ASLR only works if the code doesn't depend on fixed addresses. PIE uses rip-relative addressing:

# PIE (works at any base):
lea rax, [rip+0xeac]    # relative to current instruction

# Non-PIE (fixed base only):
mov rax, 0x402004        # hardcoded absolute address
readelf -h hello | grep Type       # DYN = PIE
gcc -no-pie -o hello_nopie hello.c
readelf -h hello_nopie | grep Type # EXEC = fixed address, no code ASLR

The Rust version

// save as hello_addr.rs, compile: rustc hello_addr.rs
fn main() {
    let stack_var = 42;
    println!("main:  {:p}", main as fn() as *const ());
    println!("stack: {:p}", &stack_var as *const i32);
}
for i in 1 2 3 4 5; do ./hello_addr; done

Both code and stack randomized. Same ASLR, same kernel, regardless of language.


Disabling ASLR: proof it's real

setarch $(uname -m) -R ./hello
setarch $(uname -m) -R ./hello
setarch $(uname -m) -R ./hello
main is at: 0x555555555149
main is at: 0x555555555149
main is at: 0x555555555149

Same address every time. The 0x555555554000 base (main's offset 0x1149 on top of it) is the kernel's fixed no-ASLR default for PIE binaries — the addresses GDB users recognize, since GDB disables ASLR by default.

Check the system-wide setting: cat /proc/sys/kernel/randomize_va_space — 0=off, 1=partial, 2=full (default).


The complete picture

  hello (ELF on disk)
         │
         │  execve()
         v
  ┌─ Kernel ────────────────────────────────────┐
  │  Read ELF header + program headers           │
  │  mmap LOAD segments (with ASLR offset)       │
  │  Load dynamic linker (from INTERP)           │
  │  Set up stack (argc, argv, envp, auxv)       │
  │  Jump to ld-linux entry point                │
  └──────────────────────────────────────────────┘
         │
         v
  ┌─ Dynamic linker (ld-linux) ─────────────────┐
  │  Load shared libraries (libc.so, etc.)       │
  │  Set up PLT/GOT, process relocations         │
  │  Jump to binary's _start                     │
  └──────────────────────────────────────────────┘
         │
         v
  ┌─ C Runtime (_start) ─────────────────────────┐
  │  __libc_start_main → main(argc, argv, envp)  │
  │  exit(return_value)                          │
  └──────────────────────────────────────────────┘
         │
         v
     YOUR CODE RUNS

From execve() to main(), every step is an ELF field read, an mmap() call, or a symbol resolution. Nothing magic.


🔧 Task: Observe ASLR yourself

  1. Compile:
    #include <stdio.h>
    int global = 42;
    int main() {
        int local = 7;
        printf("main:   %p\n", (void *)main);
        printf("global: %p\n", (void *)&global);
        printf("local:  %p\n", (void *)&local);
        return 0;
    }
    
  2. Run five times. All three addresses change each run.
  3. Notice: the offset between main and global stays constant (same binary). The offset between local (stack) and main (code) varies — randomized independently.
  4. Disable ASLR: setarch $(uname -m) -R ./a.out — same addresses every time.
  5. Bonus: strace ./a.out 2>&1 | grep mmap — count the mmap calls. Each one creates a memory region. That's the loader in action.

Virtual Memory: The Grand Illusion

Type this right now

// save as vmaddr.c — compile: gcc -o vmaddr vmaddr.c
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main() {
    int x = 42;
    pid_t pid = fork();

    if (pid == 0) {
        // Child process
        printf("[child] &x = %p, x = %d\n", (void *)&x, x);
        x = 99;
        printf("[child] &x = %p, x = %d  (after modification)\n", (void *)&x, x);
    } else {
        wait(NULL);
        printf("[parent] &x = %p, x = %d  (unchanged!)\n", (void *)&x, x);
    }
    return 0;
}
$ gcc -o vmaddr vmaddr.c && ./vmaddr
[child] &x = 0x7ffd3a4b1c2c, x = 42
[child] &x = 0x7ffd3a4b1c2c, x = 99  (after modification)
[parent] &x = 0x7ffd3a4b1c2c, x = 42  (unchanged!)

Same address. Different values. That should break your brain a little. Both processes see 0x7ffd3a4b1c2c, but they're looking at different physical memory. Welcome to virtual memory.


The problem

Your system is running hundreds of processes right now. Each one believes it owns a vast, private stretch of memory — on x86-64, up to 128 TB of user-space addresses. But you probably have 16 GB of RAM. Maybe 32 if you're fancy.

    Process A thinks:  "I have 128 TB to myself"
    Process B thinks:  "I have 128 TB to myself"
    Process C thinks:  "I have 128 TB to myself"
    ...
    Process Z thinks:  "I have 128 TB to myself"

    Physical RAM:       16 GB total. That's it.

How is this possible? The same way a magician makes one card look like fifty. Indirection.


The solution: one layer of translation

Every address your program uses is a virtual address. It is never placed directly on the memory bus. Instead, hardware translates it to a physical address before the memory access happens.

    Your C code: int *p = (int *)0x4000;
                         │
                         ▼
              ┌─────────────────────┐
              │   Virtual Address   │
              │      0x4000         │
              └────────┬────────────┘
                       │
                       ▼
              ┌─────────────────────┐
              │        MMU          │  ◄── Hardware inside the CPU
              │   (translates via   │
              │    page tables)     │
              └────────┬────────────┘
                       │
                       ▼
              ┌─────────────────────┐
              │  Physical Address   │
              │     0x7A2000        │  ◄── Actual RAM location
              └─────────────────────┘

Your program never sees physical addresses. It never needs to. The translation happens on every single memory access — every load, every store, every instruction fetch.


Every process gets its own map

This is the key insight. Process A and Process B can both use virtual address 0x4000. The MMU consults different page tables for each process, so 0x4000 lands at different physical locations.

  Process A                                    Process B
  ─────────                                    ─────────
  Virtual: 0x4000                              Virtual: 0x4000
       │                                            │
       ▼                                            ▼
  ┌──────────────┐                            ┌──────────────┐
  │ A's Page     │                            │ B's Page     │
  │ Table        │                            │ Table        │
  │              │                            │              │
  │ 0x4000 ──────┼──┐                         │ 0x4000 ──────┼──┐
  └──────────────┘  │                         └──────────────┘  │
                    │                                           │
                    ▼                                           ▼
       ┌────────────────────────────────────────────────────────────┐
       │                    Physical RAM                            │
       │                                                            │
       │   Frame 0x1A2  ◄─── A's data        Frame 0x5F7 ◄─── B's │
       │   ┌──────────┐                      ┌──────────┐          │
       │   │ x = 42   │                      │ x = 99   │          │
       │   └──────────┘                      └──────────┘          │
       └────────────────────────────────────────────────────────────┘

Same virtual address. Different physical frames. Complete isolation.


Three gifts from virtual memory

1. Isolation. Process A cannot read or write Process B's memory. There is no virtual address in A's page table that maps to B's physical frames. The hardware enforces this — not the OS, not a runtime check, the MMU itself refuses the translation.

2. Convenience. Every process can use the same virtual address layout: code near the bottom, heap growing up, stack at the top. The linker doesn't need to know where in physical RAM the program will land. It just targets the standard virtual layout.

3. Overcommit. The OS can promise more memory than physically exists. malloc(1 GB) succeeds even with 2 GB of RAM — because no physical RAM is allocated until you actually touch each page. We'll see how in Chapter 17.

🧠 What do you think happens?

If 50 processes each malloc(1 GB), is that 50 GB of RAM consumed? What if none of them ever write to the memory? What if they all write to every byte simultaneously?


The MMU: translation in hardware

The Memory Management Unit lives inside the CPU die. It is not a separate chip. It is not software. It is transistors that execute the page-table walk on every memory access.

    ┌──────────────────────────────────────────────┐
    │                   CPU                         │
    │                                               │
    │   ┌──────────┐    ┌──────────┐               │
    │   │  Core 0  │    │  Core 1  │    ...        │
    │   │          │    │          │               │
    │   │  ┌─────┐ │    │  ┌─────┐ │               │
    │   │  │ TLB │ │    │  │ TLB │ │  ◄── Cache of │
    │   │  └──┬──┘ │    │  └──┬──┘ │     recent    │
    │   │     │    │    │     │    │   translations│
    │   │  ┌──┴──┐ │    │  ┌──┴──┐ │               │
    │   │  │ MMU │ │    │  │ MMU │ │  ◄── Walks    │
    │   │  └─────┘ │    │  └─────┘ │     page      │
    │   └──────────┘    └──────────┘     tables     │
    └──────────────────────────────────────────────┘

Who sets up the mapping? The operating system kernel. It writes entries into the page tables in physical memory. It sets the CR3 register to point to the root of each process's page table.

Who enforces the mapping? The hardware. Every memory access goes through the MMU. If the page table says "no access," the CPU raises a page fault exception — even the kernel can't bypass the MMU without disabling it entirely (which no modern OS does).


Memory-mapped files

The same translation mechanism can point virtual pages at file contents on disk instead of anonymous RAM. This is mmap().

#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    int fd = open("/etc/hostname", O_RDONLY);
    char *data = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    printf("Hostname: %s\n", data);  // Reading the file by just reading memory!
    munmap(data, 4096);
    close(fd);
    return 0;
}

No read() call. You access the file like it's a normal array in memory. When you touch a page, the kernel loads the file's contents into a physical frame and maps it in. This is how the kernel loads your program's .text section — it memory-maps the ELF binary.

💡 Fun Fact: Shared libraries (.so files) are memory-mapped once and shared between every process that uses them. If 100 processes use libc.so, there is only ONE copy of its .text section in physical RAM, mapped into 100 different virtual address spaces.


Copy-on-write: the fork() trick

When you call fork(), the kernel does NOT copy the parent's entire memory. That would be absurdly expensive. Instead:

  1. Clone the page tables — child gets the same virtual-to-physical mapping as parent
  2. Mark ALL pages read-only in both parent and child
  3. Both processes continue running, sharing the exact same physical frames

When either process writes to a page:

  4. The CPU raises a page fault (the page is marked read-only)
  5. The kernel sees it's a copy-on-write page
  6. The kernel copies just that one page to a new physical frame
  7. Updates the writer's page table to point to the new copy, marks it writable
  8. The other process still points to the original frame

    After fork() — before any writes:

    Parent page table          Physical RAM            Child page table
    ┌────────────┐          ┌──────────────┐          ┌────────────┐
    │ 0x4000 ────┼────RO───►│ Frame 0x1A2  │◄───RO────┼──── 0x4000 │
    │ 0x5000 ────┼────RO───►│ Frame 0x1A3  │◄───RO────┼──── 0x5000 │
    │ 0x6000 ────┼────RO───►│ Frame 0x1A4  │◄───RO────┼──── 0x6000 │
    └────────────┘          └──────────────┘          └────────────┘

    Child writes to 0x5000:

    Parent page table          Physical RAM            Child page table
    ┌────────────┐          ┌──────────────┐          ┌────────────┐
    │ 0x4000 ────┼────RO───►│ Frame 0x1A2  │◄───RO────┼──── 0x4000 │
    │ 0x5000 ────┼────RO───►│ Frame 0x1A3  │          │ 0x5000 ────┼──┐
    │ 0x6000 ────┼────RO───►│ Frame 0x1A4  │◄───RO────┼──── 0x6000 │  │
    └────────────┘          │ Frame 0x2B7  │◄───RW────────────────────┘
                            └──────────────┘
                              (copied page)

Only the page that was written gets duplicated. If a child process calls exec() immediately after fork() (which is common), most pages are never written — so almost nothing gets copied.


Rust's perspective

Rust doesn't have fork() in its standard library — partly because fork is fundamentally unsafe in multithreaded programs. But the virtual memory system works identically underneath.

// Rust uses the same virtual address space layout

fn main() {
    let stack_var = 42;
    let heap_var = Box::new(99);

    println!("Stack: {:p}", &stack_var);    // High address
    println!("Heap:  {:p}", &*heap_var);    // Lower address
    println!("Code:  {:p}", main as *const ());  // Low address

    // Same /proc/self/maps underneath
    let maps = std::fs::read_to_string("/proc/self/maps").unwrap();
    for line in maps.lines().take(5) {
        println!("{}", line);
    }
}

Every concept in this chapter — page tables, the MMU, isolation between processes — applies to Rust programs identically. Rust's safety guarantees operate on top of virtual memory, not instead of it.


🔧 Task: Observe copy-on-write in action

Write this program in C. Before running, predict what you'll see:

// save as cow.c — compile: gcc -o cow cow.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main() {
    int *data = malloc(sizeof(int));
    *data = 42;

    printf("Before fork: &data[0] = %p, value = %d\n", (void *)data, *data);

    pid_t pid = fork();
    if (pid == 0) {
        // Child
        printf("[child] &data[0] = %p, value = %d\n", (void *)data, *data);
        *data = 99;  // This triggers copy-on-write!
        printf("[child] &data[0] = %p, value = %d (modified)\n", (void *)data, *data);
        free(data);
        _exit(0);
    } else {
        wait(NULL);
        printf("[parent] &data[0] = %p, value = %d (still original!)\n",
               (void *)data, *data);
        free(data);
    }
    return 0;
}

What to observe:

  1. The address printed by parent and child is identical — same virtual address.
  2. The child modifies *data to 99, but the parent still sees 42.
  3. The addresses never change. Only the physical backing changes, invisibly.

Now try adding this before and after the child's modification:

char cmd[64];
snprintf(cmd, sizeof(cmd), "grep -A12 'heap' /proc/%d/smaps", getpid());
system(cmd);

Watch the Private_Dirty field increase after the write — that's the copy-on-write page becoming a private copy.

Pages and Page Tables

Type this right now

// save as pagemath.c — compile: gcc -o pagemath pagemath.c
#include <stdio.h>
#include <stdint.h>

int main() {
    uint64_t addr = 0x00007FFE12345678ULL;

    uint64_t offset  = addr & 0xFFF;           // Bits [11:0]
    uint64_t pt_idx  = (addr >> 12) & 0x1FF;   // Bits [20:12]
    uint64_t pd_idx  = (addr >> 21) & 0x1FF;   // Bits [29:21]
    uint64_t pdpt_idx = (addr >> 30) & 0x1FF;  // Bits [38:30]
    uint64_t pml4_idx = (addr >> 39) & 0x1FF;  // Bits [47:39]

    printf("Virtual address:  0x%016lx\n", addr);
    printf("PML4  index:      %3lu  (0x%03lx)\n", pml4_idx, pml4_idx);
    printf("PDPT  index:      %3lu  (0x%03lx)\n", pdpt_idx, pdpt_idx);
    printf("PD    index:      %3lu  (0x%03lx)\n", pd_idx, pd_idx);
    printf("PT    index:      %3lu  (0x%03lx)\n", pt_idx, pt_idx);
    printf("Page  offset:     %3lu  (0x%03lx)\n", offset, offset);

    return 0;
}
$ gcc -o pagemath pagemath.c && ./pagemath
Virtual address:  0x00007ffe12345678
PML4  index:      255  (0x0ff)
PDPT  index:      504  (0x1f8)
PD    index:      145  (0x091)
PT    index:      325  (0x145)
Page  offset:     1656  (0x678)

You just decomposed a virtual address into the exact indices the CPU uses to walk the page table. Every memory access your program makes involves this decomposition — in hardware, in nanoseconds.


What is a page?

Memory is managed in fixed-size chunks called pages. On x86-64, the standard page size is 4 KB (4096 bytes, or 0x1000 in hex).

    One page = 4096 bytes = 4 KB

    ┌─────────────────────────────────┐  Byte 0
    │                                 │
    │         4096 bytes              │
    │    of contiguous memory         │
    │                                 │
    └─────────────────────────────────┘  Byte 4095

Why 4 KB? It's a hardware compromise:

  • Too small (e.g., 64 bytes): a 48-bit address space would need 2^42 (over four trillion) page table entries. The tables themselves would consume all your RAM.
  • Too big (e.g., 16 MB): you'd waste memory. Allocating 1 byte would reserve 16 MB. ("Internal fragmentation.")
  • 4 KB: a reasonable middle ground, chosen in the 1970s and hardened into silicon ever since.

Virtual memory and physical memory are both divided into pages. Virtual pages are just called "pages." Physical pages are called frames. The page table maps pages to frames.


How a virtual address is split

A 48-bit virtual address isn't treated as one big number. It's split into fields, each serving as an index into a different level of the page table.

    63        48 47    39 38    30 29    21 20    12 11       0
    ┌──────────┬────────┬────────┬────────┬────────┬──────────┐
    │  Sign    │ PML4   │ PDPT   │  PD    │  PT    │  Offset  │
    │ extension│ Index  │ Index  │ Index  │ Index  │ (12 bits)│
    │ (16 bits)│(9 bits)│(9 bits)│(9 bits)│(9 bits)│          │
    └──────────┴────────┴────────┴────────┴────────┴──────────┘
         │         │        │        │        │          │
         │         │        │        │        │          └─► Byte within
         │         │        │        │        │              the 4KB page
         │         │        │        │        │
         │         │        │        │        └─► Which entry in the
         │         │        │        │            Page Table (PT)
         │         │        │        │
         │         │        │        └─► Which entry in the
         │         │        │            Page Directory (PD)
         │         │        │
         │         │        └─► Which entry in the
         │         │            Page Directory Pointer Table (PDPT)
         │         │
         │         └─► Which entry in the
         │             Page Map Level 4 (PML4)
         │
         └─► Must be copies of bit 47 (canonical address requirement)

9 bits = 512 possible values. So each table level has 512 entries. 12 bits of offset = 2^12 = 4096 = one page.

💡 Fun Fact: AMD designed x86-64 (it debuted as AMD64) with 48 bits of virtual address (256 TB). Recent processors support 5-level paging with 57 bits (128 PB). The structure is identical — they just add a PML5 level with another 9-bit index.


The page table entry (PTE)

Each entry in a page table is 8 bytes (64 bits). It contains the physical frame address plus control flags:

    63  62       52 51                       12 11  9 8 7 6 5 4 3 2 1 0
    ┌──┬──────────┬───────────────────────────┬─────┬─┬─┬─┬─┬─┬───┬─┬─┐
    │NX│ Available │  Physical Frame Address   │ Avl │G│ │D│A│ │U/S│R│P│
    │  │ (for OS) │       (40 bits)            │     │ │ │ │ │ │   │W│ │
    └──┴──────────┴───────────────────────────┴─────┴─┴─┴─┴─┴─┴───┴─┴─┘
     │                      │                              │   │  │  │
     │                      │                              │   │  │  └─ Present: page in RAM?
     │                      │                              │   │  └─── Read/Write: writable?
     │                      │                              │   └───── User/Supervisor: user-
     │                      │                              │          accessible?
     │                      │                              └───── Accessed: been read?
     │                      │                                     Dirty: been written?
     │                      │
     │                      └─ Physical frame number (left-shifted by 12
     │                         to get the physical address of the frame)
     │
     └─ NX (No Execute): if set, CPU will refuse to execute
        code from this page — defense against code injection

The Present bit is crucial. If it's 0, the page is not in RAM. Accessing it triggers a page fault. The kernel then decides what to do — load from swap, demand-allocate, or kill the process with SIGSEGV.


Why a single-level page table is impossible

Let's do the math for one flat table.

    48-bit virtual address space
    ÷ 4 KB page size (12-bit offset)
    = 2^36 virtual pages
    × 8 bytes per PTE
    = 2^39 bytes = 512 GB per process

    That's the size of the table alone. For ONE process.
    With 100 processes, you'd need 50 TB of RAM just for tables.

Clearly, this doesn't work. The solution: multi-level tables — only allocate the table pages that actually contain mappings.


The four-level page table walk

Here's the complete walk. The CPU's CR3 register holds the physical address of the PML4 table for the currently running process.

    CR3 register
    ┌──────────────────┐
    │ PML4 phys addr   │
    └────────┬─────────┘
             │
             ▼
    ┌──────────────────────────────────────────────┐
    │ PML4 Table (512 entries × 8 bytes = 4 KB)    │
    │                                              │
    │  [0]  [1]  [2] ... [pml4_idx] ... [511]     │
    │                        │                     │
    └────────────────────────┼─────────────────────┘
                             │  bits [47:39] of VA
                             ▼
    ┌──────────────────────────────────────────────┐
    │ PDPT Table (512 entries × 8 bytes = 4 KB)    │
    │                                              │
    │  [0]  [1]  [2] ... [pdpt_idx] ... [511]     │
    │                        │                     │
    └────────────────────────┼─────────────────────┘
                             │  bits [38:30] of VA
                             ▼
    ┌──────────────────────────────────────────────┐
    │ Page Directory (512 entries × 8 bytes = 4 KB)│
    │                                              │
    │  [0]  [1]  [2] ... [pd_idx]   ... [511]     │
    │                        │                     │
    └────────────────────────┼─────────────────────┘
                             │  bits [29:21] of VA
                             ▼
    ┌──────────────────────────────────────────────┐
    │ Page Table (512 entries × 8 bytes = 4 KB)    │
    │                                              │
    │  [0]  [1]  [2] ... [pt_idx]   ... [511]     │
    │                        │                     │
    └────────────────────────┼─────────────────────┘
                             │  bits [20:12] of VA
                             ▼
    ┌──────────────────────────────────────────────┐
    │ Physical Frame (4096 bytes)                  │
    │                                              │
    │  byte[0] ... byte[offset] ... byte[4095]    │
    │                    │                         │
    └────────────────────┼─────────────────────────┘
                         │  bits [11:0] of VA
                         ▼
                    Target Byte

Four memory reads to translate one virtual address. Each level is itself a 4 KB page in physical memory, containing 512 eight-byte entries.

🧠 What do you think happens?

At level 2 (PD), the CPU reads entry pd_idx and finds the Present bit is 0. What happens? Does the CPU try the next entry? Does it guess? Or does it immediately trap to the kernel?


Why this saves memory

A process that uses only a small range of addresses needs very few table pages:

    A process using addresses 0x400000 - 0x410000 (64 KB):

    PML4: 1 page  (always exists, pointed to by CR3)
    PDPT: 1 page  (only entry [0] is populated)
    PD:   1 page  (only entry [2] is populated)
    PT:   1 page  (only entries [0]-[15] are populated)

    Total: 4 pages × 4 KB = 16 KB of page tables for a 64 KB mapping

    A flat table would need: 512 GB
    Savings: 99.999997%

Levels that aren't needed simply don't exist. Their parent's entry has Present=0, and that's that.


The TLB: making it fast

Four memory accesses per translation would be crippling. At ~100 ns per RAM access, that's 400 ns per memory reference, or about 2.5 million memory accesses per second. A modern CPU executes 4+ billion instructions per second.

The fix: the Translation Lookaside Buffer (TLB). It's a small, fast cache of recent virtual-to-physical translations.

    Virtual address 0x7FFE12345678
           │
           ▼
    ┌─────────────────────────────────────┐
    │              TLB                     │
    │  ┌──────────────┬─────────────────┐ │
    │  │ Virtual Page │ Physical Frame  │ │
    │  ├──────────────┼─────────────────┤ │
    │  │ 0x7FFE12345  │ 0x3A201  ──────────── HIT! (~1 cycle)
    │  │ 0x55A3B2C02  │ 0x1F800        │ │
    │  │ 0x7F8A12040  │ 0x22100        │ │
    │  │     ...      │    ...         │ │
    │  └──────────────┴─────────────────┘ │
    │                                     │
    │  ~64-2048 entries (varies by CPU)   │
    └─────────────────────────────────────┘
           │
           │ MISS → full 4-level walk (~100 ns)
           ▼

TLB hit: ~1 cycle. Translation is essentially free. TLB miss: full 4-level page table walk. Hundreds of cycles.

Most programs have excellent TLB hit rates (>99%) because of locality — they access the same pages repeatedly.


Context switches flush the TLB

When the OS switches from Process A to Process B, it loads B's page table base into CR3. This invalidates the entire TLB — because A's translations are wrong for B.

    Running Process A:  TLB full of A's translations  (fast!)
           │
           │  Context switch → load B's CR3
           ▼
    Running Process B:  TLB is EMPTY  (cold start, slow!)
           │
           │  First ~1000 memory accesses → all TLB misses
           ▼
    Running Process B:  TLB warming up...              (getting faster)

This is one reason context switches are expensive. The process that resumes starts with a cold TLB and suffers many misses until it warms up. Modern CPUs support PCIDs (Process Context Identifiers), which tag TLB entries with an address-space ID so the kernel can skip the full flush. The TLB is still small, though, so the warm-up cost never fully disappears.


Huge pages: fewer translations, fewer misses

Standard 4 KB pages mean a 1 GB dataset spans 262,144 pages. That's a lot of TLB entries. The TLB might only hold 2,048 — so you're constantly evicting and reloading translations.

Huge pages use larger page sizes:

    Page Size    Offset Bits    Levels Walked    TLB entries for 1 GB
    ─────────    ───────────    ─────────────    ────────────────────
    4 KB         12             4                262,144
    2 MB         21             3 (skip PT)      512
    1 GB         30             2 (skip PT+PD)   1

With 2 MB pages, the PD entry points directly to a 2 MB physical frame — the PT level is skipped. With 1 GB pages, even the PD level is skipped.

// Allocating huge pages in C (Linux)
#include <sys/mman.h>

void *p = mmap(NULL, 2 * 1024 * 1024,      // 2 MB
               PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
               -1, 0);
// Returns MAP_FAILED (errno ENOMEM) unless huge pages were reserved first,
// e.g.:  echo 16 | sudo tee /proc/sys/vm/nr_hugepages

💡 Fun Fact: Database servers like PostgreSQL and Redis often use huge pages. A database with a 32 GB buffer pool would need 8 million TLB entries with 4 KB pages. With 2 MB huge pages, that drops to 16,384. The performance difference is measurable — 5-10% on memory-heavy workloads.


Rust: same hardware, same rules

fn main() {
    let addr: u64 = 0x00007FFE12345678;

    let offset   = addr & 0xFFF;
    let pt_idx   = (addr >> 12) & 0x1FF;
    let pd_idx   = (addr >> 21) & 0x1FF;
    let pdpt_idx = (addr >> 30) & 0x1FF;
    let pml4_idx = (addr >> 39) & 0x1FF;

    println!("Virtual address:  {:#018x}", addr);
    println!("PML4  index:      {:3} ({:#05x})", pml4_idx, pml4_idx);
    println!("PDPT  index:      {:3} ({:#05x})", pdpt_idx, pdpt_idx);
    println!("PD    index:      {:3} ({:#05x})", pd_idx, pd_idx);
    println!("PT    index:      {:3} ({:#05x})", pt_idx, pt_idx);
    println!("Page  offset:     {:3} ({:#05x})", offset, offset);
}

The page table structure doesn't care whether the process was written in C, Rust, Python, or assembly. It's hardware. Every byte your Rust program touches goes through the same four-level walk (or TLB hit).


🔧 Task: Walk a page table by hand

Take virtual address 0x00007FFE12345678. Work through the walk on paper:

  1. Split the address into its component fields:

    • Bits [47:39] → PML4 index = ?
    • Bits [38:30] → PDPT index = ?
    • Bits [29:21] → PD index = ?
    • Bits [20:12] → PT index = ?
    • Bits [11:0] → Page offset = ?
  2. Convert to binary first if it helps:

    0x00007FFE12345678
    = 0000 0000 0000 0000 0111 1111 1111 1110
      0001 0010 0011 0100 0101 0110 0111 1000
    
    Split:
    [47:39] = 0_1111_1111 = 0xFF = 255
    [38:30] = 1_1111_1000 = 0x1F8 = 504
    [29:21] = 0_1001_0001 = 0x091 = 145
    [20:12] = 1_0100_0101 = 0x145 = 325  (not 0x345 = 837: the 9-bit mask keeps only bits [20:12])
    [11:0]  = 0110_0111_1000 = 0x678 = 1656
    
  3. Verify your answers by running the C program from the top of this chapter.

  4. Try a different address: your own stack variable. Print its address, then decompose it. Does the PML4 index match what you'd expect for a stack address (high in the user-space range)?

  5. Check the real page tables (Linux, needs root):

    # Dump raw pagemap entries for a process
    $ sudo cat /proc/<pid>/pagemap | xxd | head
    

    The pagemap file lets you look up the physical frame for any virtual page. See Documentation/admin-guide/mm/pagemap.rst in the kernel source for the format.

Page Faults: When Things Get Interesting

Type this right now

// save as lazy.c — compile: gcc -o lazy lazy.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
    // Ask for 100 MB. Does the OS actually give us 100 MB of RAM?
    size_t size = 100 * 1024 * 1024;
    char *p = malloc(size);

    printf("malloc returned: %p\n", (void *)p);
    printf("Now check: ps -o pid,vsz,rss -p %d\n", getpid());
    printf("Press Enter BEFORE touching memory...\n");
    getchar();

    // Touch every page (write one byte per 4 KB page)
    for (size_t i = 0; i < size; i += 4096) {
        p[i] = 'A';
    }

    printf("Memory touched. Press Enter to check again...\n");
    getchar();

    free(p);
    return 0;
}

Run it in one terminal. In another terminal, check RSS before and after:

$ ps -o pid,vsz,rss -p $(pgrep lazy)
  PID    VSZ   RSS
 1234 203456  2340      ← Before: VSZ is large, RSS is tiny!

# (press Enter in the first terminal)

$ ps -o pid,vsz,rss -p $(pgrep lazy)
  PID    VSZ   RSS
 1234 203456 104820     ← After: RSS jumped ~100 MB

VSZ (virtual size) was always large — the address space was mapped. RSS (resident set size) was tiny until you touched the pages. Physical RAM was allocated one page at a time, on demand, via page faults.


A page fault is NOT an error

This is the most misunderstood concept in systems programming. A page fault is a CPU exception that says: "I tried to translate this virtual address, and the page table says I can't proceed. Kernel, please help."

The kernel then decides what to do:

    CPU executes: mov [0x7F4000], eax
         │
         ▼
    MMU walks page table → PTE has Present=0 (or wrong permissions)
         │
         ▼
    CPU raises Page Fault Exception (#PF)
         │
         ▼
    Kernel's page fault handler runs
         │
         ├──► Minor fault?  → Map a physical frame, resume
         │
         ├──► Major fault?  → Load from disk, map frame, resume
         │
         └──► Invalid?      → Send SIGSEGV → process dies

The four page-fault scenarios

1. Minor fault — demand paging

You called malloc(100 MB). The kernel said "sure" and set up virtual address mappings but left every PTE with Present=0. No physical RAM was allocated.

When you first touch a page:

    Your code:  p[0] = 'A';
                   │
                   ▼
    Virtual address 0x7F4000 → MMU walks table → Present=0
                   │
                   ▼
    Page Fault (minor) → kernel allocates a physical frame
                   │     from the free page pool, zeros it,
                   │     updates the PTE: Present=1, Frame=0x1A200
                   │
                   ▼
    CPU retries the instruction → MMU walks table → Present=1 → success!
    p[0] is now 'A' in frame 0x1A200

This happens once per page, then never again (for that page). The process never notices — the retry is automatic. This is why malloc(1 GB) succeeds on a 4 GB system — physical RAM is only committed when pages are actually accessed.

🧠 What do you think happens?

You call malloc(1 TB) on a machine with 16 GB of RAM. malloc returns a valid pointer. You then try to touch every page. At some point, the kernel runs out of physical frames. What happens? (Hint: look up the "OOM killer.")

2. Minor fault — copy-on-write

After fork(), parent and child share pages marked read-only. When either writes:

    Child writes to shared page at 0x5000
         │
         ▼
    MMU: page is present but marked read-only → Page Fault
         │
         ▼
    Kernel: "Ah, this is a copy-on-write page."
         │
         ├── Allocate new physical frame
         ├── Copy contents from original frame
         ├── Update child's PTE: point to new frame, mark writable
         └── Resume child's instruction
         │
         ▼
    Child's write succeeds. Parent's page is unaffected.

Still a minor fault — no disk I/O. Just a memory copy and a PTE update.

3. Major fault — loading from disk

The page was once in RAM but got swapped out to disk (because the system was low on memory). The PTE has Present=0 but contains a swap entry telling the kernel where on disk the data lives.

    Access to page at 0x8000 → Present=0, swap entry = disk sector 42501
         │
         ▼
    Kernel: "This page was swapped out."
         │
         ├── Allocate a free physical frame
         ├── Read 4 KB from swap disk into the frame     ◄── SLOW! ~5-10 ms
         ├── Update PTE: Present=1, Frame=new_frame
         └── Resume instruction
         │
         ▼
    Access succeeds. But it cost milliseconds, not nanoseconds.

Speed comparison:

    TLB hit:           ~1 ns
    Minor page fault:  ~1-10 μs    (1,000× slower)
    Major page fault:  ~5-10 ms    (5,000,000× slower than TLB hit)
                                    (1,000× slower than minor fault)

Major faults are the reason your system feels sluggish when it starts swapping. A program that would normally take 1 second can take hours if most of its accesses trigger major faults.

4. Invalid fault — you messed up

The virtual address has no mapping at all. No PTE. No swap entry. Nothing.

    Access to 0xDEADBEEF → no mapping exists
         │
         ▼
    Kernel: "This address is not valid for this process."
         │
         ▼
    Kernel sends SIGSEGV to the process
         │
         ▼
    Default handler: print "Segmentation fault", dump core, exit

This is the one that kills your program. We'll cover it in detail in Chapter 18.


Watching page faults happen

Linux tracks page faults per process. You can see them:

$ /usr/bin/time -v ./lazy 2>&1 | grep -i fault
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 25612

25,612 minor faults for 100 MB makes sense: 100 MB / 4 KB = 25,600 pages (plus a few for the program itself, stack, libraries).

You can also watch in real time with perf:

$ perf stat -e page-faults,minor-faults,major-faults ./lazy
     25,614      page-faults
     25,614      minor-faults
          0      major-faults

mmap: the Swiss army knife

mmap() is the system call that creates virtual address mappings. Everything runs through it:

    malloc (large allocs)  → calls mmap(MAP_ANONYMOUS | MAP_PRIVATE)
    Loading shared libs    → kernel calls mmap(MAP_PRIVATE, fd)
    Reading files          → you call mmap(MAP_PRIVATE, fd)
    Shared memory          → mmap(MAP_SHARED | MAP_ANONYMOUS)
    Copy-on-write fork     → kernel manipulates existing mappings

Here's mmap used to read a file:

// save as mmapread.c — compile: gcc -o mmapread mmapread.c
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("/etc/passwd", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // Map the entire file into our address space
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // We can close fd — the mapping keeps the file accessible

    // Access the file like a normal array
    printf("First 80 bytes:\n%.80s\n", data);

    munmap(data, st.st_size);
    return 0;
}

And in Rust:

use std::fs::File;
use std::os::unix::io::AsRawFd;

fn main() {
    let file = File::open("/etc/passwd").unwrap();
    let len = file.metadata().unwrap().len() as usize;

    // Using the memmap2 crate (add to Cargo.toml: memmap2 = "0.9")
    // let mmap = unsafe { memmap2::Mmap::map(&file).unwrap() };
    // println!("{}", std::str::from_utf8(&mmap[..80]).unwrap());

    // Or with raw mmap via the libc crate (Cargo.toml: libc = "0.2"):
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ,
            libc::MAP_PRIVATE,
            file.as_raw_fd(),
            0,
        )
    };
    assert_ne!(ptr, libc::MAP_FAILED, "mmap failed");
    let data = unsafe { std::slice::from_raw_parts(ptr as *const u8, len) };
    println!("First 80 bytes:\n{}", std::str::from_utf8(&data[..80]).unwrap());
    unsafe { libc::munmap(ptr, len); }
}

💡 Fun Fact: When the kernel loads your ELF binary at exec() time, it doesn't read the whole file into RAM. It mmaps the segments. Your .text section is demand-paged — functions that are never called are never loaded from disk.


The page fault flow in full

Here's the complete picture of what happens when the CPU can't translate an address:

    CPU executes instruction that accesses virtual address VA
         │
         ▼
    MMU checks TLB ─── Hit? ──► Translate, access physical memory. Done.
         │
         No (TLB miss)
         │
         ▼
    MMU walks 4-level page table
         │
         ├── Present=1 and permissions OK? ──► Load into TLB, access memory. Done.
         │
         └── Present=0 or permission violation?
              │
              ▼
         CPU stores the fault address in the CR2 register
         CPU raises #PF exception (interrupt 14)
         CPU switches to kernel mode
              │
              ▼
         Kernel page fault handler (arch/x86/mm/fault.c)
              │
              ├── Is VA in a valid VMA? (vm_area_struct)
              │    │
              │    No ──► Send SIGSEGV (invalid access)
              │    │
              │    Yes
              │    ▼
              ├── Was it a write to a read-only COW page?
              │    │
              │    Yes ──► Allocate frame, copy page, update PTE, resume
              │    │
              │    No
              │    ▼
              ├── Is there a swap entry?
              │    │
              │    Yes ──► Read from swap (major fault), map frame, resume
              │    │
              │    No
              │    ▼
              ├── Is it a demand-zero page (anonymous)?
              │    │
              │    Yes ──► Allocate zeroed frame, map it, resume (minor fault)
              │    │
              │    No
              │    ▼
              └── Is it a file-backed mapping?
                   │
                   Yes ──► Read from file (major fault), map frame, resume
                   │
                   No ──► Send SIGSEGV

🔧 Task: Watch demand paging in /proc/self/smaps

// save as demand.c — compile: gcc -o demand demand.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

void show_rss() {
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/status", getpid());
    FILE *f = fopen(path, "r");
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmRSS", 5) == 0 || strncmp(line, "VmSize", 6) == 0) {
            printf("  %s", line);
        }
    }
    fclose(f);
}

int main() {
    printf("Before malloc:\n");
    show_rss();

    char *p = malloc(100 * 1024 * 1024);  // 100 MB
    printf("\nAfter malloc (before touching):\n");
    show_rss();

    // Touch every page
    for (size_t i = 0; i < 100 * 1024 * 1024; i += 4096) {
        p[i] = 1;
    }
    printf("\nAfter touching every page:\n");
    show_rss();

    free(p);
    printf("\nAfter free:\n");
    show_rss();

    return 0;
}
$ ./demand
Before malloc:
  VmSize:     2580 kB
  VmRSS:      1024 kB

After malloc (before touching):
  VmSize:   105060 kB     ← Virtual size jumped 100 MB
  VmRSS:      1040 kB     ← Physical memory: basically unchanged!

After touching every page:
  VmSize:   105060 kB     ← Virtual size: same
  VmRSS:   103420 kB     ← RSS jumped ~100 MB. NOW the RAM is used.

After free:
  VmSize:     2580 kB     ← Virtual mapping released
  VmRSS:      1040 kB     ← Physical RAM returned to the OS

Key insight: VmSize reflects the virtual address space. VmRSS reflects physical RAM. malloc only affects VmSize. Actually touching the memory triggers page faults, which allocate physical frames and increase VmRSS.

Now run it again under perf stat:

$ perf stat -e minor-faults,major-faults ./demand
     25,630      minor-faults     ← ~25,600 pages = 100 MB / 4 KB
          0      major-faults

Every single one of those 25,600 minor faults was the kernel giving you one physical frame. No disk I/O. Just pure demand paging.

Segmentation Faults: The Complete Guide

Type this right now

// save as crash.c — compile: gcc -g -O0 -o crash crash.c
#include <stdio.h>

int main() {
    int *p = NULL;
    printf("About to dereference NULL...\n");
    *p = 42;
    printf("This line never runs.\n");
    return 0;
}
$ ./crash
About to dereference NULL...
Segmentation fault (core dumped)

$ dmesg | tail -1
crash[12345]: segfault at 0 ip 00005555555551a2 sp 00007fffffffde10 error 6 in crash[555555555000+1000]

That kernel log line tells you everything: the fault happened at address 0 (NULL), the instruction pointer was 0x5555555551a2, and error code 6 means "user-mode write to a non-present page."

Now you know where it crashed, what it was doing, and why.


What ACTUALLY happens

A segfault is not magic. It's a precise chain of hardware and software events:

    1. Your code: *p = 42;  (where p = NULL = 0x0)
           │
           ▼
    2. CPU: "Load effective address 0x0, store value 42"
           │
           ▼
    3. MMU: Walk page table for address 0x0
           │
           ▼
    4. MMU: PTE for page 0 has Present=0 (page 0 is intentionally unmapped)
           │
           ▼
    5. CPU: Store fault address (0x0) in CR2 register
           CPU: Raise #PF exception (interrupt vector 14)
           CPU: Switch to Ring 0 (kernel mode)
           CPU: Jump to kernel's page fault handler
           │
           ▼
    6. Kernel: Check if address 0x0 belongs to any VMA for this process
           │
           ▼
    7. Kernel: No valid VMA → this is an invalid access
           │
           ▼
    8. Kernel: Send SIGSEGV (signal 11) to the process
           │
           ▼
    9. Process: Default SIGSEGV handler runs
           → Print "Segmentation fault"
           → Generate core dump (if ulimit allows)
           → Terminate with exit code 139 (128 + 11)

The CPU doesn't know what a segfault is. It only knows page faults. The kernel decides whether a page fault is recoverable (minor/major fault) or fatal (segfault).


Cause #1: NULL pointer dereference

The most common segfault. Page zero is intentionally unmapped on every modern OS. This turns NULL dereference from silent corruption into an immediate crash.

In C

// null.c — compile: gcc -g -O0 -o null null.c
#include <stdio.h>
#include <stdlib.h>

struct Node {
    int value;
    struct Node *next;
};

int main() {
    struct Node *head = NULL;
    // Forgot to allocate! Accessing through NULL pointer:
    head->value = 42;  // CRASH: writing to address 0x0
    return 0;
}
    Memory at dereference:

    head ──────► 0x0000000000000000 (NULL)
                        │
                        ▼
              ┌─────────────────────┐
              │    Page 0 (4 KB)    │
              │                     │
              │  NOT MAPPED         │  ◄── Present=0 in page table
              │  Access here =      │
              │  immediate #PF      │
              └─────────────────────┘

In Rust

fn main() {
    let head: Option<Box<i32>> = None;

    // This won't compile — Rust forces you to handle None:
    // let val = *head;  // ERROR: cannot dereference Option

    // You must explicitly handle it:
    match head {
        Some(val) => println!("Value: {}", val),
        None => println!("No value!"),
    }
}

Rust's Option<T> makes NULL impossible in safe code. There is no null pointer — there's Some(value) or None, and the compiler forces you to handle both. Even .unwrap() is an explicit, visible choice that panics on None rather than dereferencing garbage.

💡 Fun Fact: The first 64 KB of address space (pages 0-15) are unmapped on Linux. This catches not just NULL dereference but also NULL + small_offset, like ((struct s*)NULL)->field. This range is controlled by /proc/sys/vm/mmap_min_addr.


Cause #2: Stack overflow

Every thread has a fixed-size stack (default: 8 MB on Linux). Below the stack is a guard page — an unmapped page that triggers a fault if the stack grows too far.

In C

// stackoverflow.c — compile: gcc -g -O0 -o stackoverflow stackoverflow.c
#include <stdio.h>

void recurse(int depth) {
    char buffer[4096];  // 4 KB per frame — eat stack fast
    buffer[0] = 'A';
    printf("Depth: %d, &buffer = %p\n", depth, (void *)buffer);
    recurse(depth + 1);
}

int main() {
    recurse(0);
    return 0;
}
    Stack layout during deep recursion:

    0x7FFFFFFFE000 ┌────────────────────┐  ◄── Stack top
                   │ main() frame       │
                   ├────────────────────┤
                   │ recurse(0) frame   │
                   │   buffer[4096]     │
                   ├────────────────────┤
                   │ recurse(1) frame   │
                   │   buffer[4096]     │
                   ├────────────────────┤
                   │        ...         │
                   │   ~2000 frames     │
                   │        ...         │
                   ├────────────────────┤
                   │ recurse(2047)      │
    0x7FFFFF7FE000 ├────────────────────┤  ◄── Stack bottom (8 MB limit)
                   │   GUARD PAGE       │  ◄── Unmapped! Touching = SIGSEGV
                   ├────────────────────┤
                   │   (unmapped)       │
                   └────────────────────┘

In Rust

fn recurse(depth: u64) {
    let buffer = [0u8; 4096];
    println!("Depth: {}, addr: {:p}", depth, &buffer);
    recurse(depth + 1);
}

fn main() {
    recurse(0);
}
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
Aborted (core dumped)

Rust detects the stack overflow and prints a clear message instead of just "Segmentation fault." But the underlying mechanism is identical — the guard page triggers a fault, and the runtime catches it.


Cause #3: Writing to read-only memory

String literals in C live in the .rodata section, which is mapped read-only (r--p).

In C

// rodata.c — compile: gcc -g -O0 -o rodata rodata.c
#include <stdio.h>

int main() {
    char *s = "Hello";  // s points into .rodata (read-only)
    s[0] = 'h';         // CRASH: writing to read-only page
    return 0;
}
    Memory layout:

    .rodata section (mapped with permissions: r--p)
    ┌───────────────────────────────┐
    │ 'H' 'e' 'l' 'l' 'o' '\0'   │
    │  ▲                           │
    │  │                           │
    │  s points here               │
    └───────────────────────────────┘

    PTE flags: Present=1, Read/Write=0 (read-only)

    CPU tries to WRITE → MMU: "Present=1 but Write=0" → #PF
    Kernel: "Valid mapping but wrong permissions" → SIGSEGV

In Rust

fn main() {
    let s = "Hello";  // &str — immutable by definition
    // s[0] = 'h';    // Won't compile: &str is not mutable
    // There's no way to accidentally write to a string literal in safe Rust.

    // Even with a mutable String, the original literal stays safe:
    let mut owned = String::from("Hello");
    owned.replace_range(0..1, "h");  // This modifies a HEAP copy
    println!("{}", owned);  // "hello"
}

Rust's type system distinguishes &str (immutable reference to string data) from String (owned, heap-allocated, mutable). You can't accidentally modify a string literal.


Cause #4: Use-after-free

In C

// uaf.c — compile: gcc -g -O0 -o uaf uaf.c
#include <stdio.h>
#include <stdlib.h>

int main() {
    int *p = malloc(sizeof(int));
    *p = 42;
    printf("Before free: *p = %d\n", *p);

    free(p);
    // p still holds the old address, but the memory may be:
    // - returned to the allocator's free list
    // - unmapped entirely (for large allocations)
    // - reused by a later malloc

    *p = 99;  // UNDEFINED BEHAVIOR
    // Might segfault (if page was unmapped)
    // Might silently corrupt other data (if reused)
    // Might appear to "work" (if page still mapped but unused)
    printf("After free: *p = %d\n", *p);
    return 0;
}
    Before free(p):

    p ──────► ┌──────────────┐ 0x55a000 (heap)
              │    42        │ ◄── valid, allocated
              └──────────────┘

    After free(p):

    p ──────► ┌──────────────┐ 0x55a000 (heap)
              │  free list   │ ◄── returned to allocator
              │  metadata    │    (or possibly unmapped)
              └──────────────┘

    Writing *p = 99 here either:
    - Overwrites free-list metadata → heap corruption
    - Hits an unmapped page → SIGSEGV
    - Appears to work → ticking time bomb

The scariest part: use-after-free might not crash immediately. It might corrupt the heap silently and crash minutes later in a completely unrelated function. This is why use-after-free is consistently among the most exploited vulnerability classes in C/C++ code.

In Rust

fn main() {
    let p = Box::new(42);
    drop(p);    // Explicitly free
    // println!("{}", p);  // COMPILE ERROR: value used after move
    // The compiler will not let this happen. Period.
}
error[E0382]: borrow of moved value: `p`
 --> src/main.rs:4:20
  |
2 |     let p = Box::new(42);
  |         - move occurs because `p` has type `Box<i32>`
3 |     drop(p);
  |          - value moved here
4 |     println!("{}", p);
  |                    ^ value borrowed here after move

The borrow checker tracks ownership. After drop(p), the variable p is consumed. Any attempt to use it is a compile-time error. Not a runtime check. Not a sanitizer. The program never compiles.


Cause #5: Buffer overflow

In C

// overflow.c — compile: gcc -g -O0 -o overflow overflow.c
#include <stdio.h>

int main() {
    int arr[10];
    // Write way past the end of the array
    arr[1000000] = 42;  // 4 MB past the end — likely in unmapped space
    return 0;
}
    Stack layout (higher addresses at the top — a positive index
    moves UP from arr):

                   ┌────────────────────┐
                   │  arr[1000000]      │  ◄── ~4 MB above arr, far past
                   │  THIS IS UNMAPPED  │      the stack top → SIGSEGV
                   ├────────────────────┤
                   │        ...         │
                   │  (unmapped gap)    │
    0x7FFFFFFFE000 ├────────────────────┤  ◄── stack top
                   │ (other stack data) │
    0x7FFFFFFFDE28 ├────────────────────┤
                   │ arr[0] ... arr[9]  │  40 bytes (10 × 4)
    0x7FFFFFFFDE00 └────────────────────┘  ◄── arr

Small overflows (e.g., arr[10] or arr[20]) might NOT segfault — they silently overwrite adjacent stack data. This is how stack buffer overflows lead to arbitrary code execution. Only when you go far enough to land in an unmapped page does the hardware catch it.

In Rust

fn main() {
    let arr = [0i32; 10];
    let idx = 1_000_000;
    println!("{}", arr[idx]);  // Panic at runtime (bounds check)
}
thread 'main' panicked at 'index out of bounds: the len is 10 but the index is 1000000'

Rust inserts bounds checks on every array/slice access. The program panics with a clear message instead of corrupting memory. The panic is not a segfault — it's a controlled unwinding or abort.

🧠 What do you think happens?

In C, arr[11] = 42; when arr has 10 elements. Does it always segfault? Usually? Rarely? Why is the answer "it depends"?


Cause #6: Wild / uninitialized pointer

In C

// wild.c — compile: gcc -g -O0 -o wild wild.c
int main() {
    int *p;     // Uninitialized — contains whatever was on the stack
    *p = 42;    // Dereferences a garbage address
    return 0;
}
    p contains random stack data:

    p ──────► 0x??????????? (whatever bytes were on the stack)
                    │
                    ▼
              Could be:
              • 0x0000000000000000 → NULL deref → SIGSEGV
              • 0x00007FFF12340000 → might be mapped → silent corruption!
              • 0x0000DEADBEEF0000 → unmapped → SIGSEGV
              • 0xFFFF800000000000 → kernel space → SIGSEGV

    It's random. The behavior changes between runs, compilers,
    and optimization levels. This is undefined behavior.

In Rust

fn main() {
    let p: *mut i32;
    // unsafe { *p = 42; }  // COMPILE ERROR: use of possibly uninitialized `p`
    // Rust requires all variables to be initialized before use.
    // Even raw pointers must be given a value.
}
error[E0381]: used binding `p` isn't initialized
 --> src/main.rs:3:16
  |
2 |     let p: *mut i32;
  |         - binding declared here but left uninitialized
3 |     unsafe { *p = 42; }
  |                ^ `p` used here but it isn't initialized

Rust's compiler tracks initialization. You cannot use a variable — of any type, including raw pointers — until it has been assigned a value.


Summary: six causes at a glance

    Cause                  C behavior            Rust behavior
    ─────────────────────  ────────────────────   ─────────────────────────
    1. NULL deref          SIGSEGV               Option<T> — compile error
    2. Stack overflow      SIGSEGV (guard page)  Detected, clear message
    3. Write to rodata     SIGSEGV               &str is immutable — compile error
    4. Use-after-free      SIGSEGV or corruption Borrow checker — compile error
    5. Buffer overflow     SIGSEGV or corruption Bounds check — panic
    6. Wild pointer        SIGSEGV or corruption Must initialize — compile error

Rust eliminates four of these at compile time, detects one at runtime with a clear message, and handles the last (stack overflow) with a runtime check. In safe Rust, segfaults from your code are essentially impossible.


Debugging segfaults

1. dmesg — what the kernel saw

$ dmesg | tail -3
[12345.678] crash[9876]: segfault at 0 ip 00005555555551a2
    sp 00007fffffffde10 error 6 in crash[555555555000+1000]

Fields:

  • at 0: the faulting virtual address (0 = NULL)
  • ip 00005555555551a2: instruction pointer — what instruction caused the fault
  • error 6: error code bits (6 = user-mode write to non-present page)

Error code bits:

    Bit 0: 0=non-present page, 1=protection violation
    Bit 1: 0=read, 1=write
    Bit 2: 0=kernel mode, 1=user mode
    Bit 3: 1=reserved bit violation
    Bit 4: 1=instruction fetch (NX violation)

    Error 6 = 0b110 = user-mode(1) + write(1) + non-present(0)
    Error 4 = 0b100 = user-mode(1) + read(0) + non-present(0)
    Error 7 = 0b111 = user-mode(1) + write(1) + protection(1)

2. Core dumps

$ ulimit -c unlimited       # Enable core dumps
$ ./crash
Segmentation fault (core dumped)
$ gdb ./crash core
(gdb) bt                    # Backtrace — where exactly it crashed
#0  0x00005555555551a2 in main () at crash.c:5
(gdb) info registers        # CPU state at crash time
(gdb) print p               # The guilty pointer
$1 = (int *) 0x0

3. addr2line

$ addr2line -e crash 0x00005555555551a2
/home/user/crash.c:5

Converts an instruction address to a source file and line number. Requires -g (debug info) during compilation.

4. AddressSanitizer (the nuclear option)

$ gcc -g -fsanitize=address -o crash crash.c
$ ./crash
=================================================================
==9876==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000
    #0 0x555555555192 in main /home/user/crash.c:5

ASan catches things the kernel can't — like small buffer overflows that stay within mapped pages. It adds ~2x memory overhead and ~2x slowdown, but it catches almost everything.


🔧 Task: Trigger and debug all six types

Create six C programs, one for each cause. For each one:

  1. Compile with gcc -g -O0
  2. Run it, confirm the segfault
  3. Check dmesg | tail -1 for the kernel's report
  4. Run under GDB:
    $ gdb ./program
    (gdb) run
    (gdb) bt
    (gdb) info registers
    (gdb) print <the pointer variable>
    
  5. Note the faulting address and error code
// 1_null.c
int main() { *(int *)0 = 42; return 0; }

// 2_stack.c
void f() { f(); }
int main() { f(); return 0; }

// 3_rodata.c
int main() { char *s = "hello"; s[0] = 'H'; return 0; }

// 4_uaf.c
#include <stdlib.h>
int main() { int *p = malloc(4); free(p); *p = 42; return 0; }

// 5_overflow.c
int main() { int a[10]; a[1000000] = 42; return 0; }

// 6_wild.c
int main() { int *p; *p = 42; return 0; }

Bonus: Compile each with -fsanitize=address and compare the output. ASan gives far more detail than a raw segfault.

Bonus 2: Write the Rust equivalent of each. See which ones the compiler refuses to compile and read the error messages carefully — they're telling you exactly what C doesn't.

Signals and Process Lifecycle

Type this right now

// save as catcher.c — compile: gcc -o catcher catcher.c
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

void handler(int sig) {
    // Note: printf is not async-signal-safe; fine for a demo, but see
    // "Writing a proper signal handler" below for the right pattern.
    printf("\nCaught signal %d (SIGINT)! Not dying today.\n", sig);
}

int main() {
    signal(SIGINT, handler);
    printf("Try pressing Ctrl+C... (PID: %d)\n", getpid());

    for (int i = 0; i < 30; i++) {
        printf("Running... %d\n", i);
        sleep(1);
    }
    printf("Finished normally.\n");
    return 0;
}
$ ./catcher
Try pressing Ctrl+C... (PID: 12345)
Running... 0
Running... 1
^C
Caught signal 2 (SIGINT)! Not dying today.
Running... 2
Running... 3
^C
Caught signal 2 (SIGINT)! Not dying today.
Running... 4

You pressed Ctrl+C twice. Normally that kills the process. But your signal handler intercepted it. The process kept running. You just took control of how your program responds to external events.

Now try kill -9 12345 from another terminal. That sends SIGKILL. No handler can catch it. The process dies immediately.


What is a signal?

A signal is an asynchronous notification delivered to a process. It's the kernel's way of saying "something happened that you might care about."

    ┌──────────────────────────────────────────────────────┐
    │                  Sources of Signals                  │
    │                                                      │
    │  Keyboard:  Ctrl+C → SIGINT     Ctrl+\ → SIGQUIT     │
    │  Hardware:  Bad address → SIGSEGV   Bad math → SIGFPE│
    │  Kernel:    Child exited → SIGCHLD   Timer → SIGALRM │
    │  Other process: kill(pid, SIGTERM)                   │
    │  Your code: raise(SIGABRT), abort()                  │
    └──────────────────────────────────────────────────────┘
              │
              ▼
    ┌──────────────────────────────────────────────────────┐
    │              Kernel sets signal pending               │
    │              on the target process                    │
    └────────────────────────┬─────────────────────────────┘
                             │
                             ▼
    ┌──────────────────────────────────────────────────────┐
    │  Next time process returns to user space:            │
    │  → kernel checks pending signals                     │
    │  → delivers the signal by running the handler        │
    └──────────────────────────────────────────────────────┘

Signals are not interrupts. They don't execute immediately when sent. The kernel marks a signal as pending, and the process sees it the next time it transitions from kernel mode back to user mode (after a syscall, or after being scheduled).


The important signals

    Signal     Number   Default Action    Meaning
    ─────────  ──────   ──────────────    ─────────────────────────────
    SIGINT       2      Terminate         Ctrl+C — polite interrupt
    SIGQUIT      3      Core dump         Ctrl+\ — quit with dump
    SIGABRT      6      Core dump         abort() called
    SIGFPE       8      Core dump         Arithmetic error (div by 0)
    SIGKILL      9      Terminate         Unconditional kill (CAN'T CATCH)
    SIGSEGV     11      Core dump         Invalid memory access
    SIGPIPE     13      Terminate         Write to pipe with no reader
    SIGTERM     15      Terminate         Polite "please exit"
    SIGCHLD     17      Ignore            Child process changed state
    SIGSTOP     19      Stop process      Pause (CAN'T CATCH)
    SIGCONT     18      Continue          Resume stopped process
    SIGUSR1     10      Terminate         User-defined
    SIGUSR2     12      Terminate         User-defined
    SIGBUS       7      Core dump         Bus error (misaligned access)

Two signals cannot be caught or ignored: SIGKILL (9) and SIGSTOP (19). These are the kernel's absolute authority — no process can resist them.

💡 Fun Fact: kill -9 is called "kill dash nine" and has become part of programmer folklore. There's even a haiku: "No, your process is / not important. SIGKILL / does not negotiate."


Signal delivery mechanics

Here's what happens when signal delivery is triggered:

    Process in user space, executing normally
         │
         │  Syscall (read, write, etc.) or timer interrupt
         ▼
    Process enters kernel mode
         │
         │  Kernel does its work (I/O, scheduling, etc.)
         │
         │  Before returning to user space, kernel checks:
         │  "Are there any pending signals for this process?"
         │
         ├── No pending signals → return to user space normally
         │
         └── Yes, signal S is pending:
              │
              ├── Is there a custom handler for S?
              │    │
              │    Yes → Modify user-space stack:
              │    │     push signal frame (saved registers)
              │    │     set instruction pointer to handler function
              │    │     return to user space → handler runs
              │    │     handler returns → sigreturn syscall
              │    │     kernel restores original registers
              │    │     process resumes where it was interrupted
              │    │
              │    No → Execute default action:
              │         Terminate, core dump, stop, or ignore
              │
              └── Is S blocked (signal mask)?
                   Yes → remains pending, delivered later

The kernel modifies the process's user-space stack to run the handler. This is why signal handlers must be careful — they're running on the interrupted code's stack.


Writing a proper signal handler

signal() is the simple API, but sigaction() is what you should actually use:

// save as proper_handler.c — compile: gcc -o proper_handler proper_handler.c
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <stdlib.h>

volatile sig_atomic_t got_signal = 0;

void handler(int sig) {
    // RULE: only call async-signal-safe functions here!
    // printf is NOT safe. write() IS safe.
    got_signal = sig;
}

int main() {
    struct sigaction sa;
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;

    sigaction(SIGINT, &sa, NULL);
    sigaction(SIGTERM, &sa, NULL);

    printf("PID: %d — send me SIGINT or SIGTERM\n", getpid());

    while (!got_signal) {
        printf("Working...\n");
        sleep(1);
    }

    printf("Received signal %d. Cleaning up...\n", got_signal);
    // Do your cleanup here: close files, flush buffers, etc.
    return 0;
}

Critical rule: Inside a signal handler, you can only call async-signal-safe functions. printf(), malloc(), and most of the standard library are NOT safe. Use write() if you must output something. The safe pattern is: set a flag in the handler, check it in your main loop.


Rust and signals

Rust's standard library has no built-in signal handling. For failures inside the language, such as a bounds-check violation, Rust raises a panic rather than a signal:

fn main() {
    let v = vec![1, 2, 3];
    println!("{}", v[10]);  // Panics — not a signal, a Rust panic
}
thread 'main' panicked at 'index out of bounds: the len is 3 but the index is 10'
note: run with `RUST_BACKTRACE=1` for a backtrace

For actual Unix signal handling, use the signal-hook crate:

// Cargo.toml: signal-hook = "0.3"
use signal_hook::consts::SIGINT;
use signal_hook::iterator::Signals;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut signals = Signals::new(&[SIGINT])?;

    println!("Press Ctrl+C...");
    for sig in signals.forever() {
        match sig {
            SIGINT => {
                println!("Caught SIGINT! Exiting gracefully.");
                break;
            }
            _ => unreachable!(),
        }
    }
    Ok(())
}

SIGSEGV from unsafe code still kills a Rust process the same way it kills a C process. The Rust runtime does not catch segfaults.


Process lifecycle

Every process on your system follows this lifecycle:

    Parent process
         │
         │  fork()
         ├──────────────────────────┐
         │                          │
         │ Parent continues         │ Child is a COPY of parent
         │                          │
         │                          │  exec() — optional
         │                          │  Replace child's memory with
         │                          │  a new program (e.g., /bin/ls)
         │                          │
         │                          │  ... child runs ...
         │                          │
         │                          │  exit(status)
         │                          │  Child terminates
         │                          │
         │                          ▼
         │                   ┌──────────────┐
         │                   │   ZOMBIE     │  Child's entry stays in
         │                   │  (defunct)   │  process table until parent
         │                   └──────┬───────┘  calls wait()
         │                          │
         │  wait(&status)           │
         │  Parent collects ◄───────┘
         │  child's exit status
         │
         ▼
    Child's process table entry is finally freed

fork(): clone the process

pid_t pid = fork();
// After this line, TWO processes are running
if (pid == 0) {
    // Child process — fork returned 0
    printf("I'm the child, PID %d\n", getpid());
} else {
    // Parent process — fork returned child's PID
    printf("I'm the parent, child is %d\n", pid);
}

exec(): replace with a new program

// In the child:
execvp("ls", (char *[]){"ls", "-la", NULL});
// If exec succeeds, this line NEVER runs — the entire address space is replaced
perror("exec failed");

wait(): collect the child's exit status

int status;
pid_t child = wait(&status);
if (WIFEXITED(status)) {
    printf("Child %d exited with code %d\n", child, WEXITSTATUS(status));
}

Zombie processes

If a child exits but the parent never calls wait(), the child becomes a zombie. It has no memory, no open files, no running code — but its process table entry remains so the parent can eventually collect the exit status.

$ ps aux | grep Z
USER  PID   ... STAT COMMAND
user  5678  ... Z    [child] <defunct>

Zombies consume almost no resources (just one entry in the process table), but if a parent spawns thousands of children without waiting, you can exhaust the PID space.

Solution: Call wait() or waitpid() for every child. Or set SIGCHLD to SIG_IGN — this tells the kernel to automatically reap children:

signal(SIGCHLD, SIG_IGN);  // Auto-reap children. No zombies.

🧠 What do you think happens?

If a parent process exits while a child is still running, what happens to the child? Who becomes its new parent? (Hint: it's PID 1.)


Core dumps

When a process crashes with SIGSEGV, SIGABRT, or SIGQUIT (among others), the kernel can write the process's entire memory image to a file: the core dump.

$ ulimit -c unlimited          # Enable core dumps
$ ./crash
Segmentation fault (core dumped)
$ file core
core: ELF 64-bit LSB core file, x86-64
$ gdb ./crash core
(gdb) bt                       # Full backtrace at crash time
(gdb) info registers           # All register values
(gdb) x/10x $rsp               # Stack contents
(gdb) print *pointer_var       # Examine variables

A core dump contains: all mapped memory regions (stack, heap, data), register values for every thread, signal information, and the memory map. It's a complete snapshot of the dying process.


🔧 Task: Build a Ctrl+C counter

Write a C program that:

  1. Installs a SIGINT handler using sigaction()
  2. Counts how many times Ctrl+C is pressed
  3. After 3 presses, prints "OK fine, exiting." and terminates
  4. Uses only volatile sig_atomic_t for the counter (signal safety)
  5. Uses write() in the handler, not printf()
// save as counter.c — compile: gcc -o counter counter.c
#include <signal.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

volatile sig_atomic_t count = 0;

void handler(int sig) {
    count++;
    const char *msg = "Caught SIGINT!\n";
    write(STDOUT_FILENO, msg, 15);
}

int main() {
    struct sigaction sa = { .sa_handler = handler };
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    printf("Press Ctrl+C three times to quit (PID: %d)\n", getpid());
    while (count < 3) {
        pause();  // Sleep until a signal arrives
    }
    printf("OK fine, exiting after %d SIGINTs.\n", count);
    return 0;
}

Bonus: Send other signals from another terminal and observe the behavior:

$ kill -SIGTERM $(pgrep counter)    # Not caught — default action kills
$ kill -SIGUSR1 $(pgrep counter)    # Not caught — default action kills
$ kill -SIGSTOP $(pgrep counter)    # Pauses the process (can't catch)
$ kill -SIGCONT $(pgrep counter)    # Resumes it

The Allocator: What malloc Really Does

Type this right now

$ ltrace -e malloc,free ./your_program 2>&1 | head -20

Don't have a program handy? Use this:

// save as allocs.c — compile: gcc -o allocs allocs.c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main() {
    char *a = malloc(32);
    char *b = malloc(128);
    char *c = malloc(256 * 1024);  // 256 KB — big allocation!

    printf("a = %p (32 bytes)\n", (void *)a);
    printf("b = %p (128 bytes)\n", (void *)b);
    printf("c = %p (256 KB)\n", (void *)c);

    free(a);
    free(b);
    free(c);
    return 0;
}
$ ltrace -e malloc,free ./allocs
malloc(32)                         = 0x55a000
malloc(128)                        = 0x55a030
malloc(262144)                     = 0x7f4a00000000
free(0x55a000)
free(0x55a030)
free(0x7f4a00000000)

Notice something? a and b are close together (in the heap/brk region). But c is in a completely different address range. That's because the allocator uses two different strategies depending on the allocation size.

Now watch the system calls:

$ strace -e brk,mmap,munmap ./allocs 2>&1 | grep -E "^(brk|mmap|munmap)"
brk(NULL)                         = 0x55a000
brk(0x57b000)                     = 0x57b000     ← extend heap
mmap(NULL, 266240, ..., MAP_PRIVATE|MAP_ANONYMOUS, ...) = 0x7f4a00000000  ← big alloc
munmap(0x7f4a00000000, 266240)    = 0             ← big free

malloc called brk for the small ones. mmap for the big one. Two paths to the kernel.


malloc is NOT a system call

This surprises people. malloc() is a library function — it runs entirely in user space. It maintains its own data structures, its own free lists, its own bookkeeping. It only calls the kernel when it needs more memory from the OS.

    Your code
    ─────────
    ptr = malloc(32);
         │
         ▼
    ┌──────────────────────────────────────────────┐
    │           C Library (glibc, musl, etc.)       │
    │                                               │
    │  malloc() implementation:                     │
    │    1. Check free list — is there a free chunk │
    │       that fits?                              │
    │       YES → carve it out, return pointer      │
    │       NO  → ask the kernel for more memory    │
    │             (brk or mmap)                     │
    │                                               │
    │  free() implementation:                       │
    │    1. Mark chunk as free                      │
    │    2. Add to free list                        │
    │    3. Maybe coalesce with neighbors           │
    │    4. Maybe return memory to kernel            │
    └──────────────────────────────────────────────┘
         │
         │ Only when needed:
         ▼
    ┌──────────────────────────────────────────────┐
    │           Kernel                              │
    │   brk()  — extend the heap (contiguous)      │
    │   mmap() — map pages anywhere                │
    └──────────────────────────────────────────────┘

Two ways to get memory from the kernel

brk / sbrk: the heap

The classic heap is a contiguous region that grows upward. brk() moves the "program break" — the boundary between allocated and unallocated heap space.

    Before malloc:

    0x555555570000 ┌──────────────────┐
                   │ .data, .bss      │
    0x555555580000 ├──────────────────┤ ◄── program break (brk)
                   │                  │
                   │ (unallocated)    │
                   │                  │
                   └──────────────────┘

    After malloc(32) + malloc(128):

    0x555555570000 ┌──────────────────┐
                   │ .data, .bss      │
    0x555555580000 ├──────────────────┤ ◄── old brk
                   │ [chunk: 32 bytes]│
                   │[chunk: 128 bytes]│
    0x55555559B000 ├──────────────────┤ ◄── new brk (moved up)
                   │ (unallocated)    │
                   └──────────────────┘

brk is fast (just moving a pointer in kernel data structures) but only works for contiguous growth.

mmap anonymous: pages anywhere

For large allocations (typically >128 KB), the allocator uses mmap with MAP_ANONYMOUS to get pages at an arbitrary virtual address:

    mmap region (somewhere in the address space):

    0x7f4a00000000 ┌──────────────────────┐
                   │                      │
                   │  256 KB              │  ◄── mmap'd directly
                   │  (64 pages)          │
                   │                      │
    0x7f4a00040000 └──────────────────────┘

Advantages of mmap for large allocations:

  • Can be returned to the OS immediately via munmap(). With brk, you can only shrink the heap from the top — a free block in the middle stays allocated.
  • No fragmentation of the brk region — large blocks don't interfere with small ones.

Inside a heap chunk

When you malloc(32), the allocator actually allocates more than 32 bytes. Every chunk has a header containing metadata:

    What malloc returns          What's actually in memory
    ──────────────────           ────────────────────────────

    chunk start ──────────────►  ┌─────────────────────────────┐
                                 │ prev_size (8 bytes)         │ ◄── only used if
                                 │                             │     previous chunk is free
                                 ├─────────────────────────────┤
                                 │ size + flags (8 bytes)      │ ◄── chunk header
                                 │ bits: [size | A | M | P]   │
    ptr points HERE ──────────►  ├─────────────────────────────┤
                                 │                             │
                                 │ User data (32 bytes)        │ ◄── what you requested
                                 │                             │
                                 ├─────────────────────────────┤
                                 │ (alignment padding)         │
                                 └─────────────────────────────┘

    Total chunk size: 48 bytes for a 32-byte request
    Overhead: 16 bytes (prev_size + size header)

The flags in the size field:

  • P (PREV_INUSE): is the previous chunk allocated?
  • M (IS_MMAPPED): was this chunk allocated via mmap?
  • A (NON_MAIN_ARENA): does this belong to a non-main arena? (threading)

The malloc pointer you get back is 16 bytes past the start of the actual chunk. When you call free(ptr), the allocator subtracts 16 to find the header.


The free list

When you free() a chunk, the allocator doesn't give the memory back to the kernel (usually). It adds the chunk to a free list — a linked list of available chunks, threaded through the chunks themselves.

    Heap after several malloc + free operations:

    ┌──────────────────────────────────────────────────────────┐
    │                                                          │
    │  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐        │
    │  │ALLOC   │  │ FREE   │  │ALLOC   │  │ FREE   │        │
    │  │ 64 B   │  │ 128 B  │  │ 32 B   │  │ 256 B  │        │
    │  │        │  │  ┌──┐  │  │        │  │  ┌──┐  │        │
    │  │        │  │  │FD├──┼──┼────────┼──┼─►│FD├──┼──► ... │
    │  │        │  │  │BK│◄─┼──┼────────┼──┼──│BK│  │        │
    │  │        │  │  └──┘  │  │        │  │  └──┘  │        │
    │  └────────┘  └────────┘  └────────┘  └────────┘        │
    │                                                          │
    └──────────────────────────────────────────────────────────┘

    FD = forward pointer (next free chunk)
    BK = backward pointer (previous free chunk)

    The free list is a doubly-linked list INSIDE the free chunks.
    No extra memory needed — the user data area stores the pointers.

malloc(32): the full sequence

    1. malloc(32) called

    2. Round up to minimum chunk size: 32 + 16 (header) = 48 bytes
       Align to 16 bytes: 48 bytes

    3. Check fastbin for 48-byte chunks
       ├── Found? → unlink from fastbin, return user pointer
       └── Not found? → continue

    4. Check small bin for 48-byte chunks
       ├── Found? → unlink from bin, return user pointer
       └── Not found? → continue

    5. Check unsorted bin — scan for any chunk that fits
       ├── Exact match? → return it
       ├── Larger chunk? → split it, return the piece, put remainder in bin
       └── Nothing? → continue

    6. Check larger bins for a bigger chunk to split
       ├── Found? → split, return piece, put remainder back
       └── Not found? → continue

    7. Ask the kernel for more memory
       brk() to extend the heap by at least 128 KB
       Carve a 48-byte chunk from the new space
       Return user pointer

free(ptr): what happens

    1. free(ptr) called

    2. Subtract 16 bytes to find chunk header
       Read the size field → this chunk is 48 bytes

    3. Is previous chunk free? (check PREV_INUSE bit)
       ├── Yes → coalesce: merge with previous chunk, update size
       └── No → skip

    4. Is next chunk free? (check next chunk's PREV_INUSE)
       ├── Yes → coalesce: merge with next chunk, update size
       └── No → set next chunk's PREV_INUSE = 0

    5. Add the (possibly coalesced) chunk to the appropriate free list
       Small chunk (< 512 bytes) → fastbin or smallbin
       Larger chunk → unsorted bin

    6. Was this chunk mmap'd? (IS_MMAPPED flag)
       ├── Yes → munmap() it immediately — returns pages to kernel
       └── No → it stays in the heap free list

Coalescing is critical. Without it, you'd end up with many small free chunks that can't satisfy larger requests, even though the total free space is large. This is external fragmentation.


Why double-free is catastrophic

// save as dfree.c — compile: gcc -o dfree dfree.c
#include <stdlib.h>

int main() {
    int *p = malloc(32);
    free(p);
    free(p);  // DOUBLE FREE — disaster
    return 0;
}
    After first free(p):

    Free list:  HEAD → [chunk at p] → ...
                         │
                         └─ FD points to next free chunk

    After second free(p):

    Free list:  HEAD → [chunk at p] → [chunk at p] → ...
                         │                  │
                         └──────────────────┘
                         (circular! chunk points to itself)

    Now malloc(32) returns p.
    Then malloc(32) returns p AGAIN.
    Two different parts of your program think they own the same memory.
    They overwrite each other's data. Heap corruption. Potential code execution.

This is why double-free is a security vulnerability, not just a bug. (Modern glibc adds integrity checks, such as tcache double-free detection, that abort the program in many cases; still, variants of this attack keep showing up in real-world exploits.)

🧠 What do you think happens?

You malloc(32), write to it, free() it, then immediately malloc(32) again. Do you get the same pointer back? Why or why not? (Try it.)


Thread safety: arenas

In a multithreaded program, multiple threads call malloc() simultaneously. A global lock would be a bottleneck. The solution: arenas.

    ┌────────────────────────────────────────────────────┐
    │                  Process                            │
    │                                                    │
    │   Thread 1          Thread 2          Thread 3     │
    │      │                 │                 │         │
    │      ▼                 ▼                 ▼         │
    │  ┌────────┐       ┌────────┐       ┌────────┐     │
    │  │Arena 0 │       │Arena 1 │       │Arena 2 │     │
    │  │(main)  │       │        │       │        │     │
    │  │ lock   │       │ lock   │       │ lock   │     │
    │  │ heap   │       │ heap   │       │ heap   │     │
    │  │ bins   │       │ bins   │       │ bins   │     │
    │  └────────┘       └────────┘       └────────┘     │
    │                                                    │
    │  Each arena has its own lock and free lists.       │
    │  Threads pick an arena (usually round-robin).      │
    │  Contention is reduced: threads rarely fight       │
    │  for the same lock.                                │
    └────────────────────────────────────────────────────┘

glibc creates up to 8 * num_cores arenas. Each thread is assigned to an arena and sticks with it (usually). The main arena uses brk(); secondary arenas use mmap().


Rust: same allocator underneath

Rust's default allocator is the system allocator — it literally calls malloc and free from your platform's C library.

fn main() {
    // Box::new calls the global allocator (malloc underneath)
    let x = Box::new(42);
    println!("x = {} at {:p}", x, &*x);
    // x dropped here → allocator calls free()

    // Vec grows by calling realloc or malloc+copy
    let mut v = Vec::new();
    for i in 0..10 {
        v.push(i);
        println!("len={}, cap={}, ptr={:p}", v.len(), v.capacity(), v.as_ptr());
    }
}
Output (addresses abbreviated; yours will differ):

len=1, cap=4, ptr=0x55a020       ← initial allocation
len=2, cap=4, ptr=0x55a020
len=3, cap=4, ptr=0x55a020
len=4, cap=4, ptr=0x55a020
len=5, cap=8, ptr=0x55a040       ← doubled! Allocated new buffer, freed old
len=6, cap=8, ptr=0x55a040
len=7, cap=8, ptr=0x55a040
len=8, cap=8, ptr=0x55a040
len=9, cap=16, ptr=0x55a060      ← doubled again!
len=10, cap=16, ptr=0x55a060

You can swap in a different allocator with #[global_allocator]:

use std::alloc::System;

#[global_allocator]
static GLOBAL: System = System;  // Explicitly use system allocator (the default)

// Or use jemalloc, mimalloc, etc. for better multithreaded performance

💡 Fun Fact: Firefox switched from the system allocator to jemalloc and saw measurable performance improvements. The allocator you choose matters — it affects fragmentation, multithreaded scaling, and memory overhead. For most programs, the default is fine. For high-performance servers, it's worth benchmarking alternatives.


🔧 Task: Watch the allocator in action

Step 1: ltrace — see malloc/free calls:

$ ltrace -e malloc,calloc,realloc,free ./allocs 2>&1 | head -20

Step 2: strace — see when it talks to the kernel:

$ strace -e brk,mmap,munmap ./allocs 2>&1 | head -20

Step 3: Write a program that does many small allocations, then many large ones, and observe the difference:

// save as alloc_patterns.c — compile: gcc -o alloc_patterns alloc_patterns.c
#include <stdio.h>
#include <stdlib.h>

int main() {
    printf("=== Small allocations (brk region) ===\n");
    void *ptrs[10];
    for (int i = 0; i < 10; i++) {
        ptrs[i] = malloc(64);
        printf("  malloc(64) = %p\n", ptrs[i]);
    }

    printf("\n=== Large allocation (mmap region) ===\n");
    void *big = malloc(1024 * 1024);  // 1 MB
    printf("  malloc(1MB) = %p\n", big);

    // Free in reverse order — watch coalescing
    printf("\n=== Freeing ===\n");
    free(big);
    for (int i = 9; i >= 0; i--) {
        free(ptrs[i]);
    }
    printf("Done.\n");
    return 0;
}

Run with strace -e brk,mmap,munmap and observe: small allocations trigger one brk() call (to extend the heap). The large allocation triggers mmap(). free(big) triggers munmap(). The small free()s trigger nothing — the memory stays in the process's free list.

Rust's Memory Story

Type this right now

// save as ownership.rs — compile: rustc ownership.rs
fn main() {
    let s1 = String::from("hello");
    let s2 = s1;  // s1 is MOVED to s2

    // println!("{}", s1);  // Uncomment this — the compiler will refuse.
    println!("{}", s2);     // Only s2 is valid now.
}
$ rustc ownership.rs
$ ./ownership
hello

Now uncomment the s1 line:

error[E0382]: borrow of moved value: `s1`
 --> ownership.rs:5:20
  |
2 |     let s1 = String::from("hello");
  |         -- move occurs because `s1` has type `String`
3 |     let s2 = s1;
  |              -- value moved here
4 |
5 |     println!("{}", s1);
  |                    ^^ value borrowed here after move

That's the borrow checker. It just prevented a use-after-free at compile time. In C, you'd get undefined behavior. In Rust, you get a compiler error with an explanation of exactly what went wrong.


Ownership: one owner, one lifetime

Every value in Rust has exactly one owner. When the owner goes out of scope, the value is dropped (freed). No garbage collector. No manual free. The compiler inserts drop calls at exactly the right places.

fn main() {
    {
        let s = String::from("hello");  // s owns the String
        println!("{}", s);
    }   // s goes out of scope → String::drop() called → heap memory freed

    // s does not exist here. No dangling pointer possible.
}

Compare with C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    {
        char *s = malloc(6);
        strcpy(s, "hello");
        printf("%s\n", s);
        // Forgot free(s)?  → memory leak
        // Remembered free(s), then used s later? → use-after-free
        free(s);
    }
    // s still exists as a dangling pointer. C doesn't care.
    return 0;
}
    Rust ownership model:

    ┌─────────────┐     owns     ┌──────────────────┐
    │ Variable s  │─────────────►│ Heap: "hello"    │
    │ (on stack)  │              │ (5 bytes, no NUL)│
    └─────────────┘              └──────────────────┘
         │                              │
         │ s goes out of scope          │
         ▼                              ▼
    s is gone              memory is freed (drop)
    No dangling ref        No leak

Move semantics

When you assign a String to another variable, ownership moves. The original variable becomes invalid. This prevents double-free.

fn main() {
    let s1 = String::from("hello");
    let s2 = s1;  // MOVE — s1 is invalidated

    // In memory:
    // s1's stack data (ptr, len, cap) was COPIED to s2
    // But s1 is now considered uninitialized by the compiler
    // The heap data was NOT copied — same pointer, one owner

    println!("{}", s2);
}
    Before move:
    ┌──── s1 ────┐                ┌──────────────────┐
    │ ptr ───────────────────────►│ 'h' 'e' 'l' 'l' │
    │ len: 5     │                │ 'o'              │
    │ cap: 5     │                └──────────────────┘
    └────────────┘

    After move (let s2 = s1):
    ┌──── s1 ────┐                ┌──────────────────┐
    │ INVALID    │                │ 'h' 'e' 'l' 'l' │
    │ (compiler  │    ┌──────────►│ 'o'              │
    │  forbids)  │    │           └──────────────────┘
    └────────────┘    │
                      │
    ┌──── s2 ────┐    │
    │ ptr ────────────┘
    │ len: 5     │
    │ cap: 5     │
    └────────────┘

For types that are cheap to copy (i32, f64, bool, char), Rust uses Copy instead of move. The value is duplicated, and both variables remain valid:

fn main() {
    let x = 42;
    let y = x;  // COPY — both x and y are valid
    println!("{} {}", x, y);  // Fine! Integers implement Copy.
}

💡 Fun Fact: The distinction between Copy and Move is a zero-cost abstraction. At the machine code level, both are a memcpy of the stack data. The difference is purely in what the compiler permits afterward. Move adds no runtime cost — it's just a compile-time rule.


Borrowing: references without ownership

Sometimes you want to use a value without taking ownership. That's borrowing.

fn print_length(s: &String) {  // borrows s — does not own it
    println!("Length: {}", s.len());
}   // s (the reference) goes out of scope, but the String is NOT dropped

fn main() {
    let s = String::from("hello");
    print_length(&s);   // lend s to the function
    println!("{}", s);  // s is still valid!
}

Two kinds of references:

    &T  — shared reference (read-only)
    • Multiple &T can exist simultaneously
    • Cannot modify the data
    • Like a "read lock"

    &mut T — exclusive reference (read-write)
    • Only ONE &mut T can exist at a time
    • No &T can coexist with &mut T
    • Like a "write lock"
fn main() {
    let mut s = String::from("hello");

    let r1 = &s;      // OK — shared reference
    let r2 = &s;      // OK — multiple shared refs allowed
    println!("{} {}", r1, r2);

    let r3 = &mut s;  // OK — r1 and r2 are no longer used after this point
    r3.push_str(" world");
    println!("{}", r3);
}

The compiler enforces these rules at compile time. No runtime overhead. No data races possible in safe code.

🧠 What do you think happens?

fn main() {
    let mut v = vec![1, 2, 3];
    let first = &v[0];   // Borrow an element
    v.push(4);           // Modify the vector
    println!("{}", first);
}

Does this compile? Why or why not? (Hint: what does push do if the vector needs to grow?)


Lifetimes: how long references are valid

The compiler tracks the lifetime of every reference — how long the borrowed data is valid.

fn longest<'a>(s1: &'a str, s2: &'a str) -> &'a str {
    if s1.len() > s2.len() { s1 } else { s2 }
}

The 'a says: "the returned reference is valid only as long as both s1 and s2 are" — in practice, the shorter of the two lifetimes.

fn broken() -> &str {
    let s = String::from("hello");
    &s  // ERROR: returning reference to local variable
}       // s is dropped here — reference would dangle
error[E0106]: missing lifetime specifier
error[E0515]: cannot return reference to local variable `s`

In C, the equivalent compiles (most compilers emit only a warning) and causes a use-after-free:

char *broken() {
    char s[] = "hello";  // Stack-allocated
    return s;            // Returns pointer to stack frame that's about to be freed
}                        // s is gone. Caller has a dangling pointer.

Smart pointers: Box, Vec, String

Box<T>: single heap allocation

fn main() {
    let x = Box::new(42);  // Allocates 4 bytes on the heap
    println!("x = {} at {:p}", x, &*x);
}   // x dropped → heap memory freed
    Stack                    Heap
    ┌──────────┐            ┌──────┐
    │ x: *ptr ─┼───────────►│  42  │
    │ (8 bytes)│            │(4 B) │
    └──────────┘            └──────┘

Box<T> is exactly one pointer wide. It's Rust's equivalent of malloc + free, but with automatic cleanup.

Vec<T>: growable array

fn main() {
    let mut v: Vec<i32> = Vec::new();
    println!("Empty: len={}, cap={}", v.len(), v.capacity());  // 0, 0

    v.push(1);  // Allocates
    v.push(2);
    v.push(3);
    v.push(4);
    println!("After 4: len={}, cap={}", v.len(), v.capacity());  // 4, 4

    v.push(5);  // Capacity exceeded → reallocate (double)
    println!("After 5: len={}, cap={}", v.len(), v.capacity());  // 5, 8
}
    Vec<i32> on the stack:       Heap buffer:
    ┌───────────────────┐       ┌───┬───┬───┬───┬───┬───┬───┬───┐
    │ ptr ──────────────┼──────►│ 1 │ 2 │ 3 │ 4 │ 5 │   │   │   │
    │ len: 5            │       └───┴───┴───┴───┴───┴───┴───┴───┘
    │ cap: 8            │         used (5)        unused (3)
    └───────────────────┘
         24 bytes on stack

When Vec grows past capacity, it allocates a new buffer (typically 2x), copies elements, and frees the old buffer. This is exactly what C's realloc does, but Rust's borrow checker ensures no references to the old buffer survive the reallocation.

String: it's a Vec<u8>

fn main() {
    let s = String::from("hello");
    // String is literally: struct String { vec: Vec<u8> }
    println!("len={}, cap={}, size_of={}", s.len(), s.capacity(),
             std::mem::size_of::<String>());
    // len=5, cap=5, size_of=24 (same as Vec: ptr + len + cap)

    let slice: &str = &s;  // &str is just (pointer, length) — no allocation
    println!("size_of &str = {}", std::mem::size_of::<&str>());  // 16
}

&str is a fat pointer: a pointer to UTF-8 bytes plus a length. It doesn't own anything. It can point into a String, a string literal (in .rodata), or a slice of any UTF-8 bytes.


Reference counting: Rc and Arc

When you need multiple owners, Rust provides reference-counted pointers:

use std::rc::Rc;

fn main() {
    let a = Rc::new(String::from("shared data"));
    let b = Rc::clone(&a);  // Increments reference count
    let c = Rc::clone(&a);  // Increments again

    println!("References: {}", Rc::strong_count(&a));  // 3
    println!("a = {}", a);
    println!("b = {}", b);
    println!("c = {}", c);
}   // c dropped (count→2), b dropped (count→1), a dropped (count→0) → data freed
    Stack                           Heap
    ┌─────┐                       ┌────────────────────────┐
    │  a ─┼──────────────────────►│ refcount: 3            │
    │     │                  ┌───►│ String: "shared data"  │
    │  b ─┼──────────────────┘ ┌─►│                        │
    │     │                    │  └────────────────────────┘
    │  c ─┼────────────────────┘
    └─────┘

Rc<T> is single-threaded only. For thread-safe reference counting, use Arc<T> (Atomic Rc):

use std::sync::Arc;
use std::thread;

fn main() {
    let data = Arc::new(vec![1, 2, 3]);

    let handles: Vec<_> = (0..3).map(|i| {
        let data = Arc::clone(&data);
        thread::spawn(move || {
            println!("Thread {}: {:?}", i, data);
        })
    }).collect();

    for h in handles { h.join().unwrap(); }
}

Arc uses atomic operations for the reference count, making it safe to share across threads. The cost: atomic increments/decrements are slower than normal increments (~5-20 ns vs ~1 ns).


The global allocator

Under the hood, Box::new, Vec::push, and String::from all call the global allocator. By default, this is your system's malloc/free.

// You can verify this: Rust's alloc calls end up as malloc
use std::alloc::{GlobalAlloc, Layout, System};

fn main() {
    unsafe {
        let layout = Layout::new::<[u8; 64]>();
        let ptr = System.alloc(layout);  // This calls malloc(64)
        println!("Allocated at: {:p}", ptr);
        System.dealloc(ptr, layout);     // This calls free(ptr)
    }
}

You can replace the allocator entirely:

// Using jemalloc (add jemallocator = "0.5" to Cargo.toml)
// use jemallocator::Jemalloc;
// #[global_allocator]
// static GLOBAL: Jemalloc = Jemalloc;

No-alloc: embedded Rust

For embedded systems (microcontrollers, kernels), you can't use a heap at all. Rust supports this with #![no_std]:

#![no_std]
#![no_main]

// No Vec, no String, no Box — nothing that allocates.
// Use fixed-size arrays, stack allocation, and the heapless crate:
//
//   use heapless::Vec;                      // Fixed-capacity Vec, stored inline
//   let mut v: Vec<i32, 16> = Vec::new();   // Max 16 elements, no heap

use core::panic::PanicInfo;

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}

This connects directly to embedded programming — when you're writing firmware for an STM32 or an ESP32, you have no OS, no heap, and every byte is precious. Rust's ownership model still works perfectly: stack allocation, static references, and compile-time guarantees.

💡 Fun Fact: The Linux kernel is starting to accept Rust code. Kernel code uses no heap allocator in the traditional sense — memory is managed through slab allocators and page allocators. Rust's #![no_std] + custom allocator support makes this possible.


🔧 Task: Trigger every borrow checker error

Write a Rust program (or multiple small programs) that intentionally triggers each of these compile errors. Read each error message carefully — they're some of the best error messages in any compiler.

// 1. Use after move
fn ex1() {
    let s = String::from("hello");
    let t = s;
    println!("{}", s);  // E0382: borrow of moved value
}

// 2. Multiple mutable references
fn ex2() {
    let mut s = String::from("hello");
    let r1 = &mut s;
    let r2 = &mut s;  // E0499: cannot borrow `s` as mutable more than once
    println!("{} {}", r1, r2);
}

// 3. Mutable + immutable reference
fn ex3() {
    let mut s = String::from("hello");
    let r1 = &s;
    let r2 = &mut s;  // E0502: cannot borrow as mutable because also borrowed as immutable
    println!("{} {}", r1, r2);
}

// 4. Dangling reference
fn ex4() -> &'static str {
    let s = String::from("hello");
    &s  // E0515: cannot return reference to local variable
}

// 5. Move out of borrowed content
fn ex5() {
    let v = vec![String::from("hello")];
    let s = v[0];  // E0507: cannot move out of index of `Vec<String>`
}

fn main() {
    // Uncomment each one at a time, try to compile, read the error.
    // ex1();
    // ex2();
    // ex3();
    // println!("{}", ex4());
    // ex5();
}

For each error:

  1. Read the error code (e.g., E0382)
  2. Run rustc --explain E0382 for a detailed explanation
  3. Fix the error using the compiler's suggestion
  4. Understand why the rule exists — what bug would it cause in C?

Data Structure Layout in Memory

Type this right now

// save as layout.c — compile: gcc -o layout layout.c
#include <stdio.h>
#include <stddef.h>

struct Bad {
    char a;     // 1 byte
    int b;      // 4 bytes
    char c;     // 1 byte
};

struct Good {
    int b;      // 4 bytes
    char a;     // 1 byte
    char c;     // 1 byte
};

int main() {
    printf("struct Bad:  sizeof = %zu\n", sizeof(struct Bad));
    printf("  offset of a: %zu\n", offsetof(struct Bad, a));
    printf("  offset of b: %zu\n", offsetof(struct Bad, b));
    printf("  offset of c: %zu\n", offsetof(struct Bad, c));

    printf("\nstruct Good: sizeof = %zu\n", sizeof(struct Good));
    printf("  offset of b: %zu\n", offsetof(struct Good, b));
    printf("  offset of a: %zu\n", offsetof(struct Good, a));
    printf("  offset of c: %zu\n", offsetof(struct Good, c));

    return 0;
}
$ gcc -o layout layout.c && ./layout
struct Bad:  sizeof = 12
  offset of a: 0
  offset of b: 4
  offset of c: 8

struct Good: sizeof = 8
  offset of b: 0
  offset of a: 4
  offset of c: 5

Same three fields. Different order. 4 bytes smaller. If you have a million of these structs, that's 4 MB wasted on invisible padding bytes. The compiler doesn't reorder C struct fields — it lays them out exactly as you declared them. It's on you.


Why alignment matters

Modern CPUs don't read arbitrary bytes from memory. They read in aligned chunks. A 4-byte int must start at an address divisible by 4. An 8-byte double must start at an address divisible by 8.

    Memory addresses:
    Byte:    0   1   2   3   4   5   6   7   8   9  10  11
             └── 4-byte int ┘└── 4-byte int ┘└── 4-byte int ┘

    int at address 0: ✓ aligned (0 % 4 == 0)
    int at address 4: ✓ aligned (4 % 4 == 0)
    int at address 1: ✗ misaligned! (1 % 4 ≠ 0)

What happens on misaligned access?

  • x86-64: works, but slower (may need two cache line reads instead of one)
  • ARM (older): hardware exception — your program crashes
  • RISC-V: implementation-defined — may work, may trap

The C compiler adds padding bytes between fields to ensure every field is properly aligned. These padding bytes contain garbage and waste space.


struct Bad: the layout problem

struct Bad {
    char a;     // 1 byte, alignment 1
    int b;      // 4 bytes, alignment 4
    char c;     // 1 byte, alignment 1
};
    Byte:  0     1     2     3     4     5     6     7     8     9    10    11
          ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
          │  a  │ pad │ pad │ pad │  b  │  b  │  b  │  b  │  c  │ pad │ pad │ pad │
          │     │     │     │     │     │     │     │     │     │     │     │     │
          └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
          ▲                       ▲                       ▲
          │                       │                       │
          a at offset 0           b at offset 4           c at offset 8
          (align 1: OK)           (align 4: 4%4=0 ✓)     (align 1: OK)

    Total: 12 bytes. But actual data is only 6 bytes.
    Waste: 6 bytes of padding (50%!)

    Why padding after c? The struct's alignment is max(1,4,1) = 4.
    sizeof must be a multiple of 4 so arrays of structs stay aligned.
    8 + 1 = 9, round up to 12.

struct Good: reordered fields

struct Good {
    int b;      // 4 bytes, alignment 4
    char a;     // 1 byte, alignment 1
    char c;     // 1 byte, alignment 1
};
    Byte:  0     1     2     3     4     5     6     7
          ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
          │  b  │  b  │  b  │  b  │  a  │  c  │ pad │ pad │
          │     │     │     │     │     │     │     │     │
          └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
          ▲                       ▲     ▲
          │                       │     │
          b at offset 0           a     c
          (align 4: 0%4=0 ✓)

    Total: 8 bytes. Same data, 4 bytes smaller.
    The trick: put larger-aligned fields first.

The golden rule in C: sort fields from largest alignment to smallest. double first, then int/long, then short, then char. This minimizes internal padding.

🧠 What do you think happens?

struct Mystery {
    char a;
    double b;
    char c;
    int d;
};

What's sizeof(struct Mystery)? Work it out by hand before compiling. (Hint: double has alignment 8. The struct's alignment is also 8.)


Rust: the compiler reorders for you

Rust's default struct layout (repr(Rust)) allows the compiler to reorder fields for optimal packing:

struct Bad {
    a: u8,      // 1 byte
    b: u32,     // 4 bytes
    c: u8,      // 1 byte
}

fn main() {
    println!("Size of Bad: {}", std::mem::size_of::<Bad>());
    println!("Align of Bad: {}", std::mem::align_of::<Bad>());
}
$ rustc layout.rs && ./layout
Size of Bad: 8
Align of Bad: 4

Even though the fields are declared in the "bad" order, Rust produces an 8-byte struct. The compiler silently reordered b before a and c.

    C layout (struct Bad):            Rust layout (same fields):
    ┌───┬───────┬───┬───────┐        ┌───────────┬───┬───┬─────┐
    │ a │padding│ b │c+pad  │        │     b     │ a │ c │ pad │
    └───┴───────┴───┴───────┘        └───────────┴───┴───┴─────┘
         12 bytes                          8 bytes

    Same fields. Same alignment rules. Smaller struct.

#[repr(C)]: when you need C-compatible layout

Sometimes you need the fields in a specific order:

  • FFI (Foreign Function Interface) — calling C from Rust or vice versa
  • Hardware register mappings — the bytes must match the hardware's expectation
  • Network protocols — the bytes go on the wire in a specific order
  • Memory-mapped I/O — addresses are fixed
#[repr(C)]
struct CCompatible {
    a: u8,
    b: u32,
    c: u8,
}

fn main() {
    println!("repr(C) size: {}", std::mem::size_of::<CCompatible>());   // 12
    println!("repr(Rust) size: {}", std::mem::size_of::<Bad>());        // 8 (Bad from the earlier example)
}

#[repr(C)] tells the compiler: "Lay out fields in declaration order, with C-style padding rules. Do not reorder." Now the Rust struct has the same layout as the C struct, byte for byte.

💡 Fun Fact: The Linux kernel's structures are defined in C with precise layouts that match hardware expectations. Any Rust code in the kernel that interacts with these structures must use #[repr(C)] to guarantee layout compatibility. Getting this wrong means reading the wrong field at the wrong offset — silent data corruption.


Rust enum layout

Rust enums are tagged unions: a discriminant (tag) that identifies the variant, plus the data for the active variant.

enum Message {
    Quit,                       // No data
    Move { x: i32, y: i32 },   // 8 bytes of data
    Write(String),              // 24 bytes of data (ptr + len + cap)
}

fn main() {
    println!("Size of Message: {}", std::mem::size_of::<Message>());
    println!("Size of String: {}", std::mem::size_of::<String>());
}
Size of Message: 32
Size of String: 24
    Enum layout (conceptual):

    ┌──────────────┬────────────────────────────────────┐
    │ Discriminant │              Payload               │
    │    (tag)     │ (large enough for biggest variant) │
    ├──────────────┼────────────────────────────────────┤
    │   0 (Quit)   │ (unused — 24 bytes of nothing)     │
    │   1 (Move)   │ x: i32, y: i32  (+16 B padding)    │
    │   2 (Write)  │ String (ptr, len, cap) = 24 B      │
    └──────────────┴────────────────────────────────────┘

    Total = align(discriminant) + size(largest variant)
          = 8 + 24 = 32 bytes

The size of the enum is the size of the discriminant plus the size of the largest variant. Every variant uses the same amount of space, even Quit which has no data. This is the cost of a tagged union.


Niche optimization: zero-cost Option

Here's where Rust gets clever. Option<&T> is the same size as &T:

fn main() {
    println!("Size of &i32:           {}", std::mem::size_of::<&i32>());
    println!("Size of Option<&i32>:   {}", std::mem::size_of::<Option<&i32>>());
    println!();
    println!("Size of Box<i32>:       {}", std::mem::size_of::<Box<i32>>());
    println!("Size of Option<Box<i32>>: {}", std::mem::size_of::<Option<Box<i32>>>());
}
Size of &i32:           8
Size of Option<&i32>:   8   ← SAME SIZE! No extra discriminant byte.

Size of Box<i32>:       8
Size of Option<Box<i32>>: 8   ← Also the same!

How? Niche optimization. A reference (&T) can never be null. So the compiler uses the null bit pattern (all zeros) to represent None. No extra tag needed.

    Option<&i32> layout:

    Some(&val):  ┌────────────────────────────────────┐
                 │  0x00007FFF12340000 (valid pointer) │
                 └────────────────────────────────────┘

    None:        ┌────────────────────────────────────┐
                 │  0x0000000000000000 (null = None)   │
                 └────────────────────────────────────┘

    Same 8 bytes. The "impossible" value (null) serves as the discriminant.

This works for any type that has an "impossible" bit pattern:

use std::num::NonZeroU32;

fn main() {
    println!("Size of u32:              {}", std::mem::size_of::<u32>());
    println!("Size of Option<NonZeroU32>: {}", std::mem::size_of::<Option<NonZeroU32>>());
    // Both are 4 bytes! Value 0 represents None.
}

In C, you'd represent an "optional pointer" as NULL — but nothing stops you from accidentally dereferencing it. In Rust, Option<&T> has the same representation as a C pointer (8 bytes, with null for "absent"), but the compiler forces you to check for None before accessing the value.


Examining layout: the tools

C

#include <stdio.h>
#include <stddef.h>
#include <stdalign.h>

struct Example {
    char a;
    double b;
    int c;
    char d;
};

int main() {
    printf("sizeof:  %zu\n", sizeof(struct Example));
    printf("alignof: %zu\n", alignof(struct Example));
    printf("offsetof a: %zu\n", offsetof(struct Example, a));
    printf("offsetof b: %zu\n", offsetof(struct Example, b));
    printf("offsetof c: %zu\n", offsetof(struct Example, c));
    printf("offsetof d: %zu\n", offsetof(struct Example, d));
    return 0;
}
sizeof:  24
alignof: 8
offsetof a: 0
offsetof b: 8
offsetof c: 16
offsetof d: 20
    Byte: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
         ┌──┬─────────────────────┬──────────────────────┬───────────┬──┬────────┐
         │a │      padding        │         b            │     c     │d │ padding│
         │  │  (7 bytes!)         │    (8 bytes)         │ (4 bytes) │  │ (3 B)  │
         └──┴─────────────────────┴──────────────────────┴───────────┴──┴────────┘

Rust

use std::mem;

struct Example {
    a: u8,
    b: f64,
    c: u32,
    d: u8,
}

fn main() {
    println!("size_of:  {}", mem::size_of::<Example>());
    println!("align_of: {}", mem::align_of::<Example>());

    // To see field offsets, compute them from an instance
    // (newer Rust also offers the std::mem::offset_of! macro):
    let e = Example { a: 0, b: 0.0, c: 0, d: 0 };
    let base = &e as *const _ as usize;
    println!("offset of a: {}", &e.a as *const _ as usize - base);
    println!("offset of b: {}", &e.b as *const _ as usize - base);
    println!("offset of c: {}", &e.c as *const _ as usize - base);
    println!("offset of d: {}", &e.d as *const _ as usize - base);
}
size_of:  16      ← Smaller! Rust reordered fields.
align_of: 8
offset of a: 12   ← a moved after c, packed with d
offset of b: 0    ← b (8 bytes, align 8) placed first
offset of c: 8    ← c (4 bytes, align 4) placed second
offset of d: 13   ← d packed right after a
    Rust layout (compiler reordered):
    Byte: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
         ┌──────────────────────┬───────────┬──┬──┬──────┐
         │         b            │     c     │a │d │ pad  │
         │    (8 bytes)         │ (4 bytes) │  │  │(2 B) │
         └──────────────────────┴───────────┴──┴──┴──────┘

    C layout: 24 bytes. Rust layout: 16 bytes. Same fields.

Packed structs: removing all padding

Sometimes you want no padding at all — typically for wire protocols or file formats:

// C: using __attribute__((packed))
struct __attribute__((packed)) Packed {
    char a;
    int b;
    char c;
};
// sizeof = 6. No padding. But misaligned access to b!
// Rust: using repr(packed)
#[repr(packed)]
struct Packed {
    a: u8,
    b: u32,
    c: u8,
}
// size_of = 6. No padding. Accessing b requires care.

Packed structs are dangerous: accessing b at an odd offset may cause a hardware exception on some architectures, or a slower misaligned access on x86. Rust refuses outright to let you take an ordinary reference to a misaligned field; you must copy the field to a local variable first, or go through a raw pointer.

🧠 What do you think happens?

You have a #[repr(packed)] struct in Rust and try to take &packed_struct.b where b is a u32 at offset 1. Does it compile? Does it crash? What does the compiler tell you?


🔧 Task: Compare C and Rust layouts

Step 1: Create this struct in both C and Rust:

// C version
struct Record {
    char type_flag;      // 1 byte
    double value;        // 8 bytes
    short count;         // 2 bytes
    char active;         // 1 byte
    int id;              // 4 bytes
};
// Rust version
struct Record {
    type_flag: u8,
    value: f64,
    count: i16,
    active: u8,
    id: i32,
}

Step 2: Print sizeof / size_of and all field offsets in both languages.

Step 3: Draw the byte-level layout diagram for each, marking padding bytes.

Step 4: Reorder the C struct fields to minimize padding. Verify the new size matches (or approaches) Rust's automatically optimized layout.

Step 5: Add #[repr(C)] to the Rust struct and confirm the size matches your C struct.

Expected results:

    C (original order):   sizeof = 24
    C (reordered):        sizeof = 16
    Rust (default):       size_of = 16
    Rust (repr(C)):       size_of = 24 (matches C original)

The compiler is a better struct packer than most humans — but only if you let it (Rust default) or think about it (C manual ordering).

Threads and Shared Memory

Type This First

Save this as race.c and compile with gcc -pthread -o race race.c:

#include <stdio.h>
#include <pthread.h>

int counter = 0;  // shared global

void *increment(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        counter++;  // looks innocent...
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Expected: 2000000\n");
    printf("Got:      %d\n", counter);
    return 0;
}

Run it five times. You will almost certainly get a different wrong answer each time.


Threads vs Processes

A process is an isolated world. Its own address space, its own page tables, its own file descriptors. When you fork(), the child gets a copy of everything.

A thread is different. Threads live inside a process. They share the same address space. Same heap. Same globals. Same code. Same file descriptors.

What each thread gets of its own: a stack and a set of registers. That's it.

+------------------------------------------------------+
|                   Process (PID 42)                   |
|                                                      |
|   +----------+  +----------+  +----------+           |
|   | Thread 1 |  | Thread 2 |  | Thread 3 |           |
|   |  Stack   |  |  Stack   |  |  Stack   |           |
|   |  regs    |  |  regs    |  |  regs    |           |
|   +----+-----+  +----+-----+  +----+-----+           |
|        |             |             |                 |
|        v             v             v                 |
|   +----------------------------------------------+   |
|   |             Shared Address Space             |   |
|   |                                              |   |
|   |  .text (code)    -- same code, all threads   |   |
|   |  .data (globals) -- same globals, all threads|   |
|   |  heap            -- same heap, all threads   |   |
|   |  mmap region     -- same mappings            |   |
|   +----------------------------------------------+   |
+------------------------------------------------------+

This sharing is what makes threads fast. No copying memory. No new page tables. Context-switching between threads in the same process is cheap.

It is also what makes threads dangerous.


The Data Race Problem

Look at counter++ in the program above. In C, that is one statement. But in machine code, it is three operations:

Thread A                    Thread B
--------                    --------
1. READ counter  (gets 5)
                            1. READ counter  (gets 5)
2. ADD 1         (now 6)
                            2. ADD 1         (now 6)
3. WRITE counter (writes 6)
                            3. WRITE counter (writes 6)

Result: counter = 6, not 7. One increment was LOST.

This is called a lost update. Two threads read the same value, compute the same result, and one write overwrites the other.

What do you think happens?

If you run the race program with 10 threads instead of 2, does the error get bigger or smaller? Why?

The counter++ operation is a read-modify-write. It is NOT atomic. The CPU does not execute it as one indivisible step. The OS can switch threads between any of those three micro-steps.


C Solution: Mutexes (Discipline-Based)

In C, you protect shared data with a mutex (mutual exclusion lock):

#include <stdio.h>
#include <pthread.h>

int counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *increment(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Expected: 2000000\n");
    printf("Got:      %d\n", counter);
    return 0;
}

Now the count is always exactly 2,000,000.

But notice the problem: the mutex and the counter are separate things. Nothing connects them. The compiler does not know that counter requires lock. You could forget to lock. You could lock the wrong mutex. You could access counter from a new function and never realize it needs protection.

The relationship between the lock and the data it protects exists only in the programmer's head.


Rust Solution: Mutex<T> Wraps the Data

In Rust, the mutex contains the data. You cannot touch the data without locking:

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let counter = Arc::new(Mutex::new(0));
    let mut handles = vec![];

    for _ in 0..2 {
        let counter = Arc::clone(&counter);
        let handle = thread::spawn(move || {
            for _ in 0..1_000_000 {
                let mut num = counter.lock().unwrap();
                *num += 1;
                // lock released here when `num` goes out of scope
            }
        });
        handles.push(handle);
    }

    for handle in handles {
        handle.join().unwrap();
    }

    println!("Expected: 2000000");
    println!("Got:      {}", *counter.lock().unwrap());
}

Mutex::new(0) wraps the integer. The only way to read or write the integer is to call .lock(), which returns a guard. When the guard is dropped, the lock is released.

You literally cannot access the data without locking. The type system enforces it.


Send and Sync: The Compiler Checks Your Threads

Rust has two marker traits that the compiler checks automatically:

  • Send: This type is safe to move to another thread.
  • Sync: This type is safe to share references between threads.

You never implement these by hand (usually). The compiler derives them from the types you use.

+-------------------+------+------+
| Type              | Send | Sync |
+-------------------+------+------+
| i32, String, Vec  |  Y   |  Y   |
| Mutex<T>          |  Y   |  Y   |  <-- designed for sharing
| Arc<T>            |  Y   |  Y   |  <-- atomic ref count
| Rc<T>             |  N   |  N   |  <-- NOT thread-safe
| Cell<T>           |  Y   |  N   |  <-- interior mutability, not Sync
| *mut T            |  N   |  N   |  <-- raw pointers
+-------------------+------+------+

Rc<T> uses a non-atomic reference count. If two threads increment it simultaneously, you get the same lost-update problem we saw with counter. So Rust marks Rc<T> as !Send. The compiler will refuse to let you move it to another thread.

Arc<T> uses atomic reference counting. It is safe to share. The compiler allows it.


The Compiler Catches Data Races

Try this in Rust — sharing an Rc across threads:

use std::rc::Rc;
use std::thread;

fn main() {
    let data = Rc::new(vec![1, 2, 3]);
    let data_clone = Rc::clone(&data);

    thread::spawn(move || {
        println!("{:?}", data_clone);
    });
}

This does NOT compile:

error[E0277]: `Rc<Vec<i32>>` cannot be sent between threads safely
   --> src/main.rs:8:5
    |
8   |     thread::spawn(move || {
    |     ^^^^^^^^^^^^^ `Rc<Vec<i32>>` cannot be sent between threads safely
    |
    = help: the trait `Send` is not implemented for `Rc<Vec<i32>>`

In C, the equivalent code compiles without any warning. It crashes at runtime. Maybe. Or worse — it corrupts memory silently and you only find out in production.

Fun Fact

The Rust compiler's thread-safety checks have no runtime cost. Send and Sync are "zero-sized" marker traits — they exist only at compile time. Your binary contains no trace of them.


Atomic Types: Lock-Free Concurrency

Sometimes a full mutex is overkill. For simple counters and flags, CPUs provide atomic instructions that complete as one indivisible step.

use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let counter = Arc::new(AtomicU32::new(0));
    let mut handles = vec![];

    for _ in 0..2 {
        let counter = Arc::clone(&counter);
        let handle = thread::spawn(move || {
            for _ in 0..1_000_000 {
                counter.fetch_add(1, Ordering::Relaxed);
            }
        });
        handles.push(handle);
    }

    for handle in handles {
        handle.join().unwrap();
    }

    println!("Got: {}", counter.load(Ordering::Relaxed));
}

fetch_add compiles to a single lock xadd instruction on x86. No mutex, no waiting, no deadlock possible.

C11 has atomics too (<stdatomic.h>), but again — nothing stops you from mixing atomic and non-atomic access to the same variable.


The Spectrum of Concurrency Safety

    C                                       Rust
    |                                         |
    v                                         v
  No guardrails          ------>          Compiler-enforced
  Data races compile                      Data races don't compile
  Mutex is separate                       Mutex wraps data
  Human discipline                        Type system discipline
  Bugs found at runtime                   Bugs found at compile time
  (or never found)                        (before your code ships)

This does not mean Rust concurrency is easy. Deadlocks are still possible. Logic bugs are still possible. But an entire class of bugs — data races — is eliminated at compile time.


Task

  1. Compile and run the race.c program at the top of this chapter. Run it 10 times. Record the different results.
  2. Try the Rc example in Rust. Read the compiler error carefully.
  3. Replace Rc with Arc and vec![1,2,3] with Mutex::new(0). Make the threaded counter work.
  4. Modify the C version to use 10 threads. How wrong does the count get?
  5. Try the atomic Rust version. Verify the count is always exactly 2,000,000.
  6. Bonus: Use time to compare the mutex version vs the atomic version. Which is faster? Why?

Where C Shines, Where Rust Shines

Type This First

Save this as add.c:

// add.c
int add(int a, int b) {
    return a + b;
}

And this as main.rs:

extern "C" {
    fn add(a: i32, b: i32) -> i32;
}

fn main() {
    unsafe {
        println!("C says: {}", add(3, 4));
    }
}

Compile and link them together:

$ gcc -c -o add.o add.c            # C source -> object file
$ ar rcs libadd.a add.o            # archive it as a static library
$ rustc main.rs -L . -l static=add --edition 2021
$ ./main
C says: 7

C and Rust, cooperating. Same process. Same address space. Same binary.


This Is Not a Flame War

Both C and Rust compile to native machine code. Both give you direct memory access. Both have zero runtime overhead — no garbage collector, no virtual machine.

The question is not "which is better." The question is: which tradeoffs fit your project?


The Honest Comparison

+--------------------+---------------------------+----------------------------------+
| Dimension          | C                         | Rust                             |
+--------------------+---------------------------+----------------------------------+
| Kernel / OS dev    | THE standard              | Growing (Linux modules, Redox)   |
|                    | (Linux, Windows, macOS)   |                                  |
+--------------------+---------------------------+----------------------------------+
| ABI stability      | C ABI is THE universal    | No stable ABI; FFI goes          |
|                    | interface between langs   | through the C ABI                |
+--------------------+---------------------------+----------------------------------+
| Legacy codebases   | 50+ years of code         | Excellent C interop              |
|                    |                           | (bindgen, extern "C")            |
+--------------------+---------------------------+----------------------------------+
| Compile speed      | Fast                      | Slower (borrow checker,          |
|                    |                           | monomorphization, LLVM)          |
+--------------------+---------------------------+----------------------------------+
| Runtime overhead   | Zero                      | Zero (same as C, no GC)          |
+--------------------+---------------------------+----------------------------------+
| Memory safety      | Programmer discipline     | Compiler-enforced                |
+--------------------+---------------------------+----------------------------------+
| Concurrency safety | Discipline + sanitizers   | Compiler-enforced (Send/Sync)    |
+--------------------+---------------------------+----------------------------------+
| Tooling            | make, cmake, varied       | cargo (build+test+doc+publish    |
|                    | editors, gdb              | unified), clippy, rustfmt        |
+--------------------+---------------------------+----------------------------------+
| Embedded (common)  | Everywhere, mature        | Great and growing (Embassy,      |
|                    |                           | probe-rs, RTIC)                  |
+--------------------+---------------------------+----------------------------------+
| Embedded (exotic   | Often the only option     | Needs LLVM target support;       |
|   / 8-bit)         |                           | no AVR-8 stability yet           |
+--------------------+---------------------------+----------------------------------+
| Error handling     | errno, return codes (-1)  | Result<T,E>, Option<T>,          |
|                    | no enforcement            | ? operator, exhaustive matching  |
+--------------------+---------------------------+----------------------------------+
| Package management | manual / conan / vcpkg    | cargo + crates.io built-in       |
+--------------------+---------------------------+----------------------------------+
| Learning curve     | Small language, large     | Steeper (ownership, lifetimes),  |
|                    | footgun surface           | but compiler teaches you         |
+--------------------+---------------------------+----------------------------------+

Where C Shines

Operating system kernels. Linux is 30+ million lines of C. The Windows kernel is C. The macOS kernel (XNU) is largely C, with C++ in its driver framework. These are not being rewritten. When you write a Linux driver, you write C (or now, optionally Rust for new modules).

ABI stability. When Python calls a shared library, it uses the C ABI. When Java uses JNI, it uses the C ABI. When Rust calls foreign code, it uses the C ABI. C is the lingua franca of systems interfaces.

Existing codebases. SQLite: ~150,000 lines of carefully audited C. OpenSSL, zlib, libpng, curl — the infrastructure of the internet is C. You don't rewrite what works.

Exotic hardware. Writing firmware for an 8-bit PIC microcontroller? A DSP with a custom architecture? C has a compiler for it. Rust needs LLVM to support the target.

Team expertise. If your team has 20 years of C experience and deep knowledge of its pitfalls, that expertise is real and valuable.


Where Rust Shines

When correctness matters. Safety-critical systems. Financial software. Aerospace. Medical devices. The cost of a bug is not just a crash — it is lives or millions of dollars.

Concurrent code. The Send/Sync system catches data races at compile time. Chapter 23 showed you this. In C, concurrent bugs hide for years.

New projects. No legacy to maintain? No existing C codebase to integrate with? Rust gives you the same performance with a dramatically smaller bug surface.

When bugs are expensive. Google reported that ~70% of Chromium security bugs are memory safety issues. Microsoft reported the same for Windows. Each CVE costs investigation, patching, disclosure, and reputation. Rust eliminates the entire class.

Long-running services. A web server that runs for months. A database. Memory leaks that build up over days? Use-after-free that triggers once per million requests? Rust catches these before you deploy.


The Sharp Knife Metaphor

C gives you a sharp knife with no guard. Rust gives you the same sharp knife with a guard you can remove (unsafe) when needed. The blade is equally sharp.

Both produce the same machine code. Both give you the same control. The difference is in what the compiler checks before you run.

   C programmer's workflow:
   Write code -> Compile -> Run -> Test -> Find bug in prod -> Debug

   Rust programmer's workflow:
   Write code -> Compile (fight borrow checker) -> It compiles!
   -> Run -> Fewer bugs in prod

The borrow checker fight is real. It can be frustrating. But every error the borrow checker throws is a bug you did not ship.


FFI: They Interoperate, Not Compete

Calling C from Rust

// Declare the C function signature
extern "C" {
    fn strlen(s: *const u8) -> usize;
}

fn main() {
    let s = b"hello\0";
    let len = unsafe { strlen(s.as_ptr()) };
    println!("Length: {}", len);  // 5
}

The unsafe block is required because Rust cannot verify the C function's memory safety. You are telling the compiler: "I have checked this myself."

Calling Rust from C

// lib.rs
#[no_mangle]
pub extern "C" fn rust_add(a: i32, b: i32) -> i32 {
    a + b
}
// main.c
#include <stdio.h>

extern int rust_add(int a, int b);

int main(void) {
    printf("Rust says: %d\n", rust_add(10, 20));
    return 0;
}
$ rustc --crate-type=staticlib lib.rs -o librust_add.a
$ gcc main.c -L. -lrust_add -lpthread -ldl -o main
$ ./main
Rust says: 30

Same ABI. Same calling convention. Same registers. The CPU does not know which language produced the instructions.


Same Function, Same Assembly

Here is add in C and Rust:

int add(int a, int b) { return a + b; }
pub fn add(a: i32, b: i32) -> i32 { a + b }

Compile both with optimizations and look at the assembly:

; Both produce exactly this (x86-64, -O2):
add:
    lea  eax, [rdi+rsi]
    ret

Same instruction. Same registers. Same binary. The language is a compile-time concept. At runtime, there is only machine code.

Fun Fact

You can verify this yourself on godbolt.org. Type the C version in one pane and the Rust version in another. With optimizations enabled, the assembly is often instruction-for-instruction identical.


When to Choose What

  Choose C when:                      Choose Rust when:
  +-------------------------------+   +-------------------------------+
  | Extending a C codebase        |   | Starting a new project        |
  | Targeting exotic hardware     |   | Correctness is critical       |
  | Maximum ABI compatibility     |   | Heavy concurrency             |
  | OS kernel work (tradition)    |   | Bugs are very expensive       |
  | Team deeply knows C           |   | Long-running services         |
  | Interfacing with C-only libs  |   | Want unified tooling (cargo)  |
  +-------------------------------+   +-------------------------------+

  Choose BOTH when:
  +-------------------------------+
  | Wrapping C libs in safe Rust  |
  | Adding Rust to a C project    |
  | Performance-critical + safe   |
  +-------------------------------+

What do you think happens?

If you write a function in C and the same function in Rust, and both compile to the same assembly — what is the "cost" of Rust's safety? Where does the safety checking actually happen?

The answer: the cost lives almost entirely at compile time. Ownership, borrowing, and Send/Sync checks are erased before the binary is produced; the main runtime exception is bounds checking on indexing, which the optimizer frequently removes. This is Rust's core promise.


Task

  1. Write a function int square(int x) in C and pub fn square(x: i32) -> i32 in Rust.
  2. Compile both with -O2 / --release and compare the assembly (use objdump -d or godbolt.org).
  3. Use extern "C" to call your C square from Rust. Print the result.
  4. Use #[no_mangle] pub extern "C" to call your Rust square from C. Print the result.
  5. Bonus: Write a C function with a deliberate buffer overflow. Wrap it in Rust with a safe API that checks bounds before calling the C function. This is the pattern real-world Rust/C interop uses.

The Bug Hall of Fame

Type This First

This is the essence of Heartbleed in about twenty lines. Save as heartbleed_demo.c:

#include <stdio.h>
#include <string.h>

char buffer[64] = "xx";  // non-zero initializer keeps this in .data...
char secret[]   = "MY_SECRET_KEY_12345";  // ...usually right before this secret

void heartbeat(int claimed_length) {
    // Copies 'claimed_length' bytes — but doesn't check
    // if the actual data is that long!
    char response[128];
    memcpy(response, buffer, claimed_length);
    printf("Response (%d bytes): ", claimed_length);
    for (int i = 0; i < claimed_length; i++)
        printf("%c", response[i] >= 32 ? response[i] : '.');
    printf("\n");
}

int main(void) {
    strcpy(buffer, "hi");            // actual payload: 2 bytes
    heartbeat(100);                  // but we claim 100 bytes!
    return 0;
}

Compile and run. You will see memory contents beyond what was sent; depending on how your compiler lays out these globals, that can include MY_SECRET_KEY_12345. This over-read is exactly how Heartbleed worked.


Why This Chapter Exists

These are not theoretical bugs. These are real vulnerabilities that affected billions of devices, cost billions of dollars, and compromised millions of passwords.

Every single one is a memory safety bug. Every single one would not compile in safe Rust.

This is not about blaming C. C was designed in 1972 for a world where programs were small and programmers were few. The question for today is: should the compiler catch these mistakes, or should we rely on human discipline at scale?


Heartbleed (CVE-2014-0160)

What happened. OpenSSL's TLS heartbeat extension had a buffer over-read. A client sends a heartbeat message with a payload and a claimed length. The server echoes back claimed_length bytes — without checking that the actual payload is that long.

The root cause in C:

// Simplified from the actual OpenSSL code:
memcpy(response, payload, claimed_length);
//                        ^^^^^^^^^^^^^^
// Never verified: claimed_length <= actual_payload_size

Impact. Any server running affected OpenSSL would leak up to 64KB of process memory per request. That memory contained passwords, private keys, session tokens. An estimated 17% of the internet's secure servers were vulnerable.

How Rust prevents it. In Rust, slices carry their length. &[u8] knows how many bytes it contains. You cannot memcpy past the end — the runtime panics (bounds check) or the compiler prevents it entirely.

    Client sends:          Server does:
    payload = "hi"         memcpy(buf, payload, 500)
    claimed_len = 500      ^^^^^^^^^^^^^^^^^^^^^^^^
                           Copies 500 bytes starting at "hi"
                           Next 498 bytes: whatever's in memory
                           Private keys, passwords, sessions...

sudo Baron Samedit (CVE-2021-3156)

What happened. A heap-based buffer overflow in sudo — the program that grants root access. Present for almost 10 years before discovery.

The root cause. When parsing command-line arguments in sudoedit mode, a backslash at the end of a string caused a write past the end of a heap buffer.

// Simplified pattern (the real check involved more state):
while (*from) {
    if (from[0] == '\\') {
        from++;            // skip the backslash...
        *to++ = *from++;   // ...and copy the escaped character
    } else {
        *to++ = *from++;
    }
}
// If the string ends with '\', the "escaped character" is the '\0'
// terminator itself: it gets copied, `from` advances past the end of
// the string, and the loop keeps reading and writing out of bounds.

Impact. Any local user could gain root access on nearly every Unix-like system. CVSSv3 score: 7.8.

How Rust prevents it. Ownership and bounds checking. A Vec<u8> in Rust will panic on out-of-bounds write, or better — the iterator-based approach would never produce an out-of-bounds index.


WannaCry / EternalBlue (CVE-2017-0144)

What happened. The EternalBlue exploit targeted a buffer overflow in Windows' SMBv1 implementation. The WannaCry ransomware used it to spread across networks.

The root cause. A buffer overflow in the Windows SMB driver (srv.sys). A specially crafted SMB transaction caused a pool buffer overflow in kernel memory.

Impact. Over 200,000 computers in 150 countries. Hospitals shut down. Factories stopped. Estimated damage: $4 billion or more.

How Rust prevents it. Safe Rust does not allow buffer overflows. Period. Slice access is bounds-checked. Vec access is bounds-checked. You would need unsafe to bypass this, and unsafe blocks are visible and auditable.


iOS Jailbreaks: Use-After-Free in WebKit

What happened. Many iOS jailbreaks (and zero-day exploits) have been based on use-after-free bugs in Safari's WebKit rendering engine.

The pattern:

Widget *w = create_widget();
destroy_widget(w);          // frees the memory
// ... more code ...
w->render();                // USE AFTER FREE
// The memory at w might now contain attacker-controlled data

Impact. Full device compromise. In the hands of nation-states, these exploits were used for surveillance. The Pegasus spyware used WebKit vulnerabilities.

How Rust prevents it. Ownership. When you free (drop) a value, the compiler will not let you use it again. The borrow checker ensures that no references outlive the data they point to.

let w = create_widget();
drop(w);
// w.render();  // COMPILE ERROR: use of moved value `w`

The Chromium Data

Google published the numbers for Chromium (Chrome's open-source base):

+---------------------------------------------+
| Chromium Security Bugs by Category           |
|                                              |
|  Memory safety:     ~70%  <-- THIS           |
|  Logic errors:      ~15%                     |
|  Other:             ~15%                     |
+---------------------------------------------+

Seventy percent. Not 7%. Seven-zero. The dominant source of security vulnerabilities in one of the most-used programs on Earth is memory safety.

Google's response: new Chromium components are increasingly written in Rust and memory-safe C++.


The Android Data

Android's security team reported a similar pattern:

Year    Memory Safety Bugs (%)   Rust Code (%)
2019           76%                   0%
2020           72%                   ~1%
2021           65%                   ~5%
2022           55%                  ~10%
2023           40%                  ~15%

As the percentage of new code written in Rust increased, the percentage of memory safety bugs decreased — even though the total codebase grew. The existing C/C++ code was not rewritten; new code was simply written in Rust.

Fun Fact

As of 2024, there have been ZERO memory safety vulnerabilities discovered in Android's Rust code. Not "few." Zero.


Linux Kernel: Rust for New Drivers

In 2022, Rust support was merged into the Linux kernel. The philosophy:

  • Existing C code stays as C. It works. It is well-tested.
  • New drivers and modules can be written in Rust.
  • The goal is preventing new bugs, not rewriting old code.

Linus Torvalds approved this not because C is bad, but because humans make mistakes, and the kernel cannot afford those mistakes.


The Bug Pattern Table

+---------------------+------------------+-----------------------------+
| Bug Type            | Frequency in     | Rust Prevention             |
|                     | C Codebases      |                             |
+---------------------+------------------+-----------------------------+
| Buffer overflow     | Very common      | Bounds checking on all      |
|   (read or write)   |                  | slice/array access          |
+---------------------+------------------+-----------------------------+
| Use-after-free      | Common           | Ownership: compiler tracks  |
|                     |                  | when values are dropped     |
+---------------------+------------------+-----------------------------+
| Double free         | Common           | Ownership: only one owner,  |
|                     |                  | dropped exactly once        |
+---------------------+------------------+-----------------------------+
| Null pointer deref  | Very common      | No null: Option<T> forces   |
|                     |                  | explicit handling           |
+---------------------+------------------+-----------------------------+
| Data race           | Common in        | Send/Sync: compiler refuses |
|                     | concurrent code  | unsafe sharing              |
+---------------------+------------------+-----------------------------+
| Uninitialized read  | Common           | All variables must be       |
|                     |                  | initialized before use      |
+---------------------+------------------+-----------------------------+
| Format string       | Occasional       | No format strings; macros   |
|                     |                  | are type-checked            |
+---------------------+------------------+-----------------------------+
| Integer overflow    | Common           | Panics in debug builds;     |
|                     |                  | wraps in release (use       |
|                     |                  | wrapping_* to be explicit)  |
+---------------------+------------------+-----------------------------+

Not Blaming C

C is one of the most important languages ever created. It powers operating systems, databases, embedded systems, and the infrastructure of the internet. Billions of devices run C code.

The bugs in this chapter are not C's fault. They are human mistakes. The question is not whether humans make mistakes — they do, always, inevitably — but whether the toolchain should catch those mistakes before they reach production.

What do you think happens?

If a large company rewrites all their C code in Rust, are they now bug-free? What kinds of bugs does Rust NOT prevent? (Hint: think about logic errors, deadlocks, and incorrect algorithms.)

Rust prevents memory safety bugs. It does not prevent wrong business logic, algorithmic errors, deadlocks, resource leaks from forgetting to close files, or plain old bad design.

But eliminating 70% of security vulnerabilities is a powerful argument.


Timeline of Awareness

2014  Heartbleed shocks the world
2017  WannaCry causes $4B+ damage
2019  Microsoft: 70% of CVEs are memory safety
2020  Google: 70% of Chromium bugs are memory safety
2021  Baron Samedit: sudo exploitable for 10 years
2022  Rust merged into Linux kernel
2023  Android: zero memory-safety bugs in Rust code
2024  White House recommends memory-safe languages

The industry is moving. Not because C is bad, but because the stakes are too high for human discipline alone.


Task

  1. Look up CVE-2014-0160 (Heartbleed) on any CVE database. Read the description.
  2. Find the actual OpenSSL patch that fixed it. The key change is a bounds check — identify the exact line.
  3. Write a 10-line Rust equivalent of the heartbeat function. Try to make it read out of bounds. Observe the panic.
  4. Look up CVE-2021-3156. Can you identify the buffer overflow pattern?
  5. Bonus: Search for "memory safety vulnerabilities" + your favorite open-source project. How many of the CVEs are buffer overflows or use-after-free?

The Toolbox

Type This First

Run this on any program from any previous chapter (or just use /bin/ls):

$ readelf -h /bin/ls
$ nm /bin/ls 2>/dev/null || echo "stripped"
$ file /bin/ls
$ size /bin/ls
$ strace -c ls /tmp 2>&1 | tail -20

Five tools, five different views of the same binary.


/proc: The Kernel Tells You Everything

Every running process has a directory under /proc/[pid]/. Not real files — the kernel generates them on demand.

/proc/[pid]/maps      Memory layout (virtual address ranges)
/proc/[pid]/smaps     Detailed per-mapping info (RSS, shared, private)
/proc/[pid]/status    Process summary (state, memory, threads)
/proc/[pid]/exe       Symlink to the actual executable
/proc/[pid]/fd/       Open file descriptors

$ sleep 1000 &
$ cat /proc/$!/maps
555555554000-555555556000 r--p 00000000 08:01 131074  /usr/bin/sleep
555555556000-555555558000 r-xp 00002000 08:01 131074  /usr/bin/sleep
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0       [stack]

Fun Fact

pmap is basically a pretty-printer for /proc/[pid]/maps.


Process Inspection: strace, ltrace, pmap

strace traces system calls — every interaction between your program and the kernel:

$ strace -e trace=write echo "hello"
write(1, "hello\n", 6)                 = 6

ltrace traces library calls (malloc, free, printf):

$ ltrace -e malloc+free ls /tmp
malloc(132)  = 0x55a1234
free(0x55a1234)

pmap shows the memory map with sizes and permissions: pmap -x <pid>.


Binary Analysis Tools

readelf -h a.out        ELF header (type, arch, entry point)
readelf -S a.out        Section headers (.text, .data, .bss...)
readelf -l a.out        Program headers (segments)
readelf -s a.out        Symbol table
readelf -d a.out        Dynamic section (.so dependencies)

objdump -d a.out        Disassembly
objdump -d -M intel a.out  Intel syntax (more readable)

nm a.out                Symbol list (T=text, D=data, B=bss, U=undefined)
size a.out              Section sizes (.text, .data, .bss)
strings a.out           Embedded string literals
file a.out              File type and architecture

Debugging: GDB Essentials

$ gcc -g -o prog prog.c && gdb ./prog
break main             Breakpoint at function
break prog.c:42        Breakpoint at line
run / run arg1         Start execution
next (n)               Step over
step (s)               Step into
continue (c)           Continue to next breakpoint
print x / print/x ptr  Print variable (decimal / hex)
bt                     Backtrace (call stack)
info registers         All register values
info proc mappings     Memory map
x/16xb 0x7fff...      Examine 16 bytes in hex
x/4xg $rsp            4 quad-words at stack pointer
watch counter          Break when variable changes

What do you think happens?

If you set a watchpoint with watch counter and two threads modify it, will GDB catch both? (Hint: hardware watchpoints are per-CPU.)


Memory Debugging

Valgrind (10-50x slowdown, very thorough):

$ valgrind --leak-check=full ./prog
==12345== Invalid read of size 4
==12345==    at 0x1091A2: main (prog.c:10)

Catches: leaks, use-after-free, buffer over-reads, uninitialized reads.

AddressSanitizer (2-3x slowdown, compile-time instrumentation):

$ gcc -g -fsanitize=address -o prog prog.c && ./prog
==12345==ERROR: AddressSanitizer: heap-buffer-overflow

Catches: buffer overflows, use-after-free, double-free.

UBSan (minimal overhead):

$ gcc -g -fsanitize=undefined -o prog prog.c && ./prog
prog.c:5: runtime error: signed integer overflow

Catches: signed overflow, null deref, misaligned access.


Rust-Specific Tools

cargo-geiger counts unsafe blocks in your code and dependencies:

$ cargo geiger
Functions  Expressions  Impls  Traits  Methods
2/5        14/60        0/0    0/0     1/3

Miri interprets your code and detects UB, even in unsafe:

$ cargo +nightly miri run
error: Undefined Behavior: dereferencing null pointer

cargo-bloat shows what makes your binary big:

$ cargo bloat --release
 5.3%  12.1%  3.2KiB  std::io::Write::write_fmt

godbolt.org — type C or Rust, see assembly instantly. Color-coded source-to-asm mapping. The single best tool for understanding compiler output.


Quick Reference: Problem to Tool

+-----------------------------------+---------------------------+
| Problem                           | Tool                      |
+-----------------------------------+---------------------------+
| "What's in this binary?"          | file, readelf -h          |
| "What sections does it have?"     | readelf -S, size          |
| "What symbols are exported?"      | nm, readelf -s            |
| "What's the assembly?"            | objdump -d, godbolt.org   |
| "What strings are embedded?"      | strings                   |
| "What syscalls does it make?"     | strace                    |
| "What libraries does it call?"    | ltrace, ldd               |
| "Where is its memory?"            | /proc/pid/maps, pmap      |
| "Why does it crash?"              | gdb, bt, info registers   |
| "Does it leak memory?"            | valgrind --leak-check     |
| "Does it overflow buffers?"       | -fsanitize=address        |
| "Does it have undefined behavior?"| -fsanitize=undefined, miri|
| "How much unsafe in my Rust?"     | cargo-geiger              |
| "Why is my Rust binary big?"      | cargo-bloat               |
+-----------------------------------+---------------------------+

Task

  1. Pick any program you compiled in a previous chapter.
  2. Run it through at least THREE tools from this chapter.
  3. For each tool, write down one thing you learned that you didn't know before.
  4. Bonus: Compile a buggy C program from Chapter 25 with -fsanitize=address. Does it catch the bug?
  5. Bonus: Run strace on ls, then on ls | cat. Does the write count change? Why?

Experiments

Ten guided labs. Each has C and Rust versions, clear steps, expected output. Do them. Reading about memory is not the same as seeing it.


Experiment 1: Print the Memory Layout

Verify the address space from Chapter 6.

// layout.c
#include <stdio.h>
#include <stdlib.h>
int global_init = 42;
int global_uninit;
int main(void) {
    int stack_var = 99;
    int *heap_var = malloc(sizeof(int));
    printf("Code  (main):     %p\n", (void *)main);
    printf("Data  (init):     %p\n", (void *)&global_init);
    printf("BSS   (uninit):   %p\n", (void *)&global_uninit);
    printf("Heap  (malloc):   %p\n", (void *)heap_var);
    printf("Stack (local):    %p\n", (void *)&stack_var);
    free(heap_var);
    system("cat /proc/self/maps");
    return 0;
}

// layout.rs — the same experiment in Rust
static GLOBAL: i32 = 42;
fn main() {
    let stack_var: i32 = 99;
    let heap_var = Box::new(0i32);
    println!("Code:  {:p}", main as *const ());
    println!("Data:  {:p}", &GLOBAL);
    println!("Heap:  {:p}", &*heap_var);
    println!("Stack: {:p}", &stack_var);
}

Verify: Code < Data < BSS < Heap < ... gap ... < Stack.


Experiment 2: Stack Buffer Overflow

// overflow.c — compile: gcc -fno-stack-protector -g -o overflow overflow.c
#include <stdio.h>
#include <string.h>
void vulnerable(void) {
    char buffer[8];
    memset(buffer, 'A', 32);  // 32 bytes into 8-byte buffer
    printf("After overflow\n");
}
int main(void) { vulnerable(); return 0; }
$ ./overflow
Segmentation fault
$ gdb ./overflow -ex run -ex bt
#0  0x4141414141414141 in ?? ()    <-- return addr overwritten

Rust equivalent panics cleanly at the boundary:

fn main() {
    let mut buf = [0u8; 8];
    for i in 0..32 { buf[i] = b'A'; }  // panics at i=8
}

Experiment 3: Fork and Copy-on-Write

// cow.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
int main(void) {
    int *data = malloc(4096);
    *data = 42;
    printf("Before fork: %p = %d\n", (void*)data, *data);
    pid_t pid = fork();
    if (pid == 0) {
        printf("Child before write: %p = %d\n", (void*)data, *data);
        *data = 99;  // triggers copy-on-write
        printf("Child after write:  %p = %d\n", (void*)data, *data);
        free(data); _exit(0);
    }
    wait(NULL);
    printf("Parent after child:  %p = %d\n", (void*)data, *data);
    free(data);
}

Same virtual address in both, different values. Different physical pages after CoW.

What do you think happens?

Both print the same pointer. How can the values differ? (Different page tables, different physical frames.)


Experiment 4: mmap a File

// mmap_file.c
#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>
int main(void) {
    int fd = open("test.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);
    write(fd, "Hello, mmap!\n", 13);
    char *m = mmap(NULL, 13, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    printf("Via mmap: %s", m);
    memcpy(m, "ZZZZZ", 5);
    msync(m, 13, MS_SYNC);
    munmap(m, 13);
    system("cat test.txt");  // prints "ZZZZZ, mmap!\n"
}

The file and the memory are the same thing. Writing to the pointer writes to disk.


Experiment 5: A 50-Line Bump Allocator

// bump.c — simplest possible malloc
#include <stdio.h>
#define HEAP_SIZE 1024
static char heap[HEAP_SIZE];
static size_t offset = 0;

void *bump_alloc(size_t size) {
    size_t aligned = (size + 7) & ~7;
    if (offset + aligned > HEAP_SIZE) return NULL;
    void *ptr = &heap[offset];
    offset += aligned;
    return ptr;
}

int main(void) {
    int *a = bump_alloc(sizeof(int)); *a = 42;
    int *b = bump_alloc(sizeof(int)); *b = 99;
    printf("a=%d at %p, b=%d at %p\n", *a, (void*)a, *b, (void*)b);
    printf("Used: %zu / %d bytes\n", offset, HEAP_SIZE);
}

// bump.rs — the same allocator with a safe interface
struct Bump { heap: [u8; 1024], offset: usize }
impl Bump {
    fn new() -> Self { Bump { heap: [0; 1024], offset: 0 } }
    fn alloc(&mut self, size: usize) -> Option<&mut [u8]> {
        let aligned = (size + 7) & !7;
        if self.offset + aligned > 1024 { return None; }
        let start = self.offset;
        self.offset += aligned;
        Some(&mut self.heap[start..start + size])
    }
}
fn main() {
    let mut a = Bump::new();
    let x = a.alloc(4).unwrap();
    x.copy_from_slice(&42i32.to_ne_bytes());
    println!("Allocated: {:?}", x);
    println!("Used: {}/1024", a.offset);
}
}

Rust returns Option — no null pointers, no forgetting to check.


Experiment 6: Trigger All 6 Segfault Types

// segfaults.c — compile: gcc -g -fno-stack-protector -o segfaults segfaults.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void null_deref(void)    { int *p = NULL; *p = 42; }
void stack_blow(void)    { stack_blow(); }
void use_after_free(void){ int *p = malloc(4); free(p); *p = 42; }
void write_rodata(void)  { char *s = "hello"; s[0] = 'H'; }
void exec_stack(void)    { unsigned char c[]={0xc3}; ((void(*)(void))c)(); }
void unmapped(void)      { *(int*)0xDEADBEEF = 42; }

int main(int argc, char **argv) {
    if (argc != 2) { printf("Usage: %s <1-6>\n", argv[0]); return 1; }
    switch(argv[1][0]) {
        case '1': null_deref(); break;    case '2': stack_blow(); break;
        case '3': use_after_free(); break; case '4': write_rodata(); break;
        case '5': exec_stack(); break;     case '6': unmapped(); break;
    }
}

Debug each with GDB: gdb ./segfaults -ex "run 1" -ex bt -ex "info registers rip".

For each case, note: which address faulted, what bt shows, which permission was violated (R/W/X).


Experiment 7: Compare ELF — C vs Rust "Hello World"

$ echo '#include <stdio.h>
int main(){ puts("hello"); }' > hello.c && gcc -o hello_c hello.c
$ echo 'fn main(){ println!("hello"); }' > hello.rs && rustc -o hello_rust hello.rs
$ ls -la hello_c hello_rust
$ size hello_c hello_rust
$ readelf -S hello_c | wc -l
$ readelf -S hello_rust | wc -l

Rust binary: 1-4 MB. C binary: ~16 KB. The difference: panic handling, unwinding tables, println! formatting. Try rustc -O then strip:

$ rustc -O -o hello_opt hello.rs && strip hello_opt
$ ls -la hello_c hello_rust hello_opt

Fun Fact

Most of a Rust binary's size is not your code. It is the standard library support for panics and formatting. Set panic = "abort" in Cargo.toml and the binary shrinks dramatically.
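A sketch of the relevant Cargo.toml knobs (tune per project; strip requires Cargo 1.59+):

```toml
# Cargo.toml — size-focused release profile
[profile.release]
panic = "abort"      # no unwinding tables or landing pads
strip = true         # drop symbols and debug info
opt-level = "z"      # optimize for size, not speed
lto = true           # cross-crate inlining, then dead-code elimination
codegen-units = 1    # better optimization at the cost of build time
```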


Experiment 8: Manual Linking

// math.c
int add(int a, int b) { return a + b; }
int mul(int a, int b) { return a * b; }
// main.c
#include <stdio.h>
extern int add(int, int);
extern int mul(int, int);
int main(void) { printf("3+4=%d, 3*4=%d\n", add(3,4), mul(3,4)); }
$ gcc -c math.c && gcc -c main.c
$ nm math.o        # T add, T mul  (defined)
$ nm main.o        # U add, U mul  (undefined — need linking)
$ gcc -o prog main.o math.o
$ nm prog | grep -E 'add|mul|main'

What do you think happens?

Link main.o without math.o. The linker error shows exactly how symbol resolution works.


Experiment 9: Cache Performance

// cache.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define N (16*1024*1024)
int main(void) {
    int *a = malloc(N * sizeof(int));
    for (int i = 0; i < N; i++) a[i] = i;
    clock_t t;
    volatile int sum = 0;
    t = clock();
    for (int i = 0; i < N; i++) sum += a[i];
    printf("Sequential: %.1f ms\n", 1000.0*(clock()-t)/CLOCKS_PER_SEC);

    srand(42);
    for (int i = N-1; i > 0; i--) {
        int j = rand() % (i+1);
        int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
    }
    sum = 0; t = clock();
    int idx = 0;
    for (int i = 0; i < N; i++) { sum += a[idx]; idx = abs(a[idx]) % N; }
    printf("Random:     %.1f ms\n", 1000.0*(clock()-t)/CLOCKS_PER_SEC);
    free(a);
}
$ gcc -O2 -o cache cache.c && ./cache
Sequential:  7.0 ms
Random:     85.0 ms     <-- 10x+ slower, same data, same operations

The difference is cache misses. Use perf stat -e cache-misses,cache-references ./cache to see the numbers.


Experiment 10: Handwritten ELF (No Compiler)

A valid executable in 165 bytes. Just write(1, "Hi\n", 3) and exit(0).

#!/usr/bin/env python3
# tiny_elf.py
import struct, os
code = bytes([
    0x48,0xc7,0xc0,0x01,0x00,0x00,0x00,  # mov rax, 1 (write)
    0x48,0xc7,0xc7,0x01,0x00,0x00,0x00,  # mov rdi, 1 (stdout)
    0x48,0x8d,0x35,0x15,0x00,0x00,0x00,  # lea rsi, [rip+21]
    0x48,0xc7,0xc2,0x03,0x00,0x00,0x00,  # mov rdx, 3
    0x0f,0x05,                             # syscall
    0x48,0xc7,0xc0,0x3c,0x00,0x00,0x00,  # mov rax, 60 (exit)
    0x48,0x31,0xff,                        # xor rdi, rdi
    0x0f,0x05,                             # syscall
    0x48,0x69,0x0a,                        # "Hi\n"
])
LOAD = 0x400000; EH = 64; PH = 56
ENTRY = LOAD + EH + PH; FSIZE = EH + PH + len(code)
ehdr = struct.pack('<4sBBBBBxxxxxxx', b'\x7fELF', 2, 1, 1, 0, 0)
ehdr += struct.pack('<HHIQQQIHHHHHH', 2,0x3E,1,ENTRY,EH,0,0,EH,PH,1,0,0,0)
phdr = struct.pack('<IIQQQQQQ', 1, 5, 0, LOAD, LOAD, FSIZE, FSIZE, 0x1000)
with open('tiny','wb') as f: f.write(ehdr + phdr + code)
os.chmod('tiny', 0o755)
print(f"Created 'tiny' ({FSIZE} bytes). Run: ./tiny")
$ python3 tiny_elf.py && ./tiny
Created 'tiny' (165 bytes). Run: ./tiny
Hi
$ file tiny
tiny: ELF 64-bit LSB executable, x86-64, statically linked, no section header

No compiler, no libc, no linker. Pure bytes that the kernel understands.


Task

Complete at least 5 of these 10 experiments. For each:

  1. Run the code exactly as written.
  2. Modify one thing and predict the result before running.
  3. Write down what surprised you.

The goal is not to memorize. It is to build intuition. When you have seen the stack grow downward, you never forget which way it grows.

Appendix A: x86-64 Register Reference

General-Purpose Registers

+--------+--------+------+-------+----------------------------------+
| 64-bit | 32-bit | 16b  | 8-bit | Primary Purpose                  |
+--------+--------+------+-------+----------------------------------+
| rax    | eax    | ax   | al/ah | Return value, syscall number     |
| rbx    | ebx    | bx   | bl/bh | Callee-saved (preserved)         |
| rcx    | ecx    | cx   | cl/ch | 4th integer argument             |
| rdx    | edx    | dx   | dl/dh | 3rd integer argument             |
| rsi    | esi    | si   | sil   | 2nd integer argument             |
| rdi    | edi    | di   | dil   | 1st integer argument             |
| rbp    | ebp    | bp   | bpl   | Frame pointer (callee-saved)     |
| rsp    | esp    | sp   | spl   | Stack pointer                    |
| r8     | r8d    | r8w  | r8b   | 5th integer argument             |
| r9     | r9d    | r9w  | r9b   | 6th integer argument             |
| r10    | r10d   | r10w | r10b  | Caller-saved temporary           |
| r11    | r11d   | r11w | r11b  | Caller-saved temporary           |
| r12    | r12d   | r12w | r12b  | Callee-saved                     |
| r13    | r13d   | r13w | r13b  | Callee-saved                     |
| r14    | r14d   | r14w | r14b  | Callee-saved                     |
| r15    | r15d   | r15w | r15b  | Callee-saved                     |
+--------+--------+------+-------+----------------------------------+

Writing to a 32-bit sub-register (e.g., eax) zero-extends into the full 64-bit register. Writing to a 16-bit or 8-bit sub-register does NOT zero-extend — the upper bits are preserved.


Special Registers

+--------+----------------------------------------------------------+
| rip    | Instruction pointer — address of the NEXT instruction    |
|        | You cannot write to it directly (use jmp/call/ret)       |
+--------+----------------------------------------------------------+
| rflags | Status flags, set by arithmetic/comparison instructions  |
|        | CF (bit 0)  — Carry flag (unsigned overflow)             |
|        | ZF (bit 6)  — Zero flag (result was zero)                |
|        | SF (bit 7)  — Sign flag (result was negative)            |
|        | OF (bit 11) — Overflow flag (signed overflow)            |
+--------+----------------------------------------------------------+

System V AMD64 ABI Calling Convention

This is the calling convention used on Linux, macOS, and most Unix-like systems.

Integer / Pointer Arguments

Argument #:   1st    2nd    3rd    4th    5th    6th    7th+
Register:     rdi    rsi    rdx    rcx    r8     r9     stack

Arguments beyond the 6th are passed on the stack, pushed right-to-left.

Floating-Point Arguments

Argument #:   1st    2nd    3rd    4th    5th    6th    7th    8th    9th+
Register:     xmm0   xmm1   xmm2   xmm3   xmm4   xmm5   xmm6   xmm7   stack

Return Values

Integer / pointer:   rax  (and rdx for 128-bit returns)
Floating-point:      xmm0 (and xmm1 for complex returns)

Callee-Saved Registers (Must Be Preserved Across Calls)

rbx, rbp, r12, r13, r14, r15

If a function uses any of these, it must save and restore them.

Caller-Saved Registers (May Be Destroyed By Calls)

rax, rcx, rdx, rsi, rdi, r8, r9, r10, r11

If you need values in these registers to survive a call, save them yourself.


Linux Syscall Convention

Syscall number:   rax
Arguments:        rdi    rsi    rdx    r10    r8     r9
Return value:     rax    (negative = -errno)
Instruction:      syscall

Clobbered:        rcx, r11 (overwritten by the kernel)

Note: the 4th argument uses r10, not rcx — this differs from the normal calling convention because syscall clobbers rcx (it stores the return address there).

Common Syscall Numbers (x86-64 Linux)

0   read        1   write       2   open        3   close
9   mmap        11  munmap      12  brk         57  fork
59  execve      60  exit        62  kill        231 exit_group

Full list: /usr/include/asm/unistd_64.h or ausyscall --dump.


Quick Reference Diagram

  Argument passing (System V AMD64):

  my_function(a, b, c, d, e, f, g, h)
              |  |  |  |  |  |  |  |
              v  v  v  v  v  v  v  v
             rdi rsi rdx rcx r8 r9 [stack] [stack]

  Return:    rax

Appendix B: Page Table Entry Bitfields

x86-64 Page Table Entry (PTE)

Each page table entry is 64 bits (8 bytes). Not all bits are used — the CPU ignores some, and the OS can repurpose them.

 63  62..52  51..12                11..9  8   7   6   5   4   3   2   1   0
+---+-------+----------------------+-----+---+---+---+---+---+---+---+---+---+
|NX | Avail | Physical Page Number | AVL | G |PAT| D | A |PCD|PWT|U/S|R/W| P |
+---+-------+----------------------+-----+---+---+---+---+---+---+---+---+---+

Bit-by-Bit Reference

+------+--------+-----------------------------------------------------------+
| Bit  | Name   | Meaning                                                   |
+------+--------+-----------------------------------------------------------+
|  0   | P      | Present. 1 = page is in physical memory.                  |
|      |        | 0 = not present; access triggers page fault (#PF).        |
+------+--------+-----------------------------------------------------------+
|  1   | R/W    | Read/Write. 1 = writable. 0 = read-only.                  |
|      |        | Write to read-only page triggers page fault.              |
+------+--------+-----------------------------------------------------------+
|  2   | U/S    | User/Supervisor. 1 = user-mode (ring 3) can access.       |
|      |        | 0 = kernel-only. User access triggers page fault.         |
+------+--------+-----------------------------------------------------------+
|  3   | PWT    | Page-level Write-Through. 1 = write-through caching.      |
|      |        | 0 = write-back caching (normal).                          |
+------+--------+-----------------------------------------------------------+
|  4   | PCD    | Page-level Cache Disable. 1 = caching disabled.           |
|      |        | Used for memory-mapped I/O (device registers).            |
+------+--------+-----------------------------------------------------------+
|  5   | A      | Accessed. Set by CPU when page is read or written.        |
|      |        | OS clears it periodically to track working set.           |
+------+--------+-----------------------------------------------------------+
|  6   | D      | Dirty. Set by CPU when page is written to.                |
|      |        | OS uses this to know which pages need writing to disk.    |
+------+--------+-----------------------------------------------------------+
|  7   | PS/PAT | Page Size (in PDE): 1 = large page (2MB or 1GB).          |
|      |        | PAT (in PTE): Page Attribute Table index bit.             |
+------+--------+-----------------------------------------------------------+
|  8   | G      | Global. 1 = don't flush from TLB on CR3 switch.           |
|      |        | Used for kernel pages shared across all processes.        |
+------+--------+-----------------------------------------------------------+
| 9-11 | AVL    | Available for OS use. Linux uses these for swap info,     |
|      |        | soft-dirty tracking, and other bookkeeping.               |
+------+--------+-----------------------------------------------------------+
|12-51 | PPN    | Physical Page Number. The upper bits of the physical      |
|      |        | address (shift left by 12 to get byte address).           |
+------+--------+-----------------------------------------------------------+
|52-62 | Avail  | Available / reserved. Some used by OS or hardware.        |
+------+--------+-----------------------------------------------------------+
|  63  | NX     | No-Execute. 1 = code execution forbidden on this page.    |
|      |        | Attempting to execute triggers page fault.                |
|      |        | Critical for W^X security (stack, heap are NX).          |
+------+--------+-----------------------------------------------------------+

How the OS Uses These Bits

Scenario                     Bits set
----------------------------------------------
Normal code page:            P=1, R/W=0, U/S=1, NX=0
Writable data page:          P=1, R/W=1, U/S=1, NX=1
Stack page:                  P=1, R/W=1, U/S=1, NX=1
Kernel code:                 P=1, R/W=0, U/S=0, NX=0
Copy-on-Write page:          P=1, R/W=0 (trap on write)
Demand-paged (not loaded):   P=0 (trap on any access)
Guard page (stack boundary): P=0 (trap = stack overflow)

ARM Comparison

ARM uses a different page table format, but the concepts are the same:

x86-64 Bit    ARM Equivalent      Notes
------------  ------------------  --------------------------------
P (Present)   Valid bit           Same concept
R/W           AP[2:1]             Access Permission bits
U/S           AP[1], PXN, UXN     More granular in ARM
NX            XN / UXN / PXN      Separate execute control for
                                  user and privileged modes
D (Dirty)     DBM (Dirty Bit      Hardware-managed on newer ARM
              Management)
A (Accessed)  AF (Access Flag)    Same concept

ARM page tables can be 4KB, 16KB, or 64KB granules (x86-64 is always 4KB base). ARM also supports 3-level or 4-level page tables depending on the virtual address size configuration.

Appendix C: ELF Format Quick Reference

ELF Header Fields

Every ELF file starts with a 64-byte header (for 64-bit) at offset 0.

+----------------+--------+------------------------------------------------+
| Field          | Size   | Description                                    |
+----------------+--------+------------------------------------------------+
| e_ident[0..3]  | 4      | Magic: 0x7f 'E' 'L' 'F'                        |
| e_ident[4]     | 1      | Class: 1 = 32-bit, 2 = 64-bit                  |
| e_ident[5]     | 1      | Data: 1 = little-endian, 2 = big-endian        |
| e_ident[6]     | 1      | Version: 1 (always)                            |
| e_ident[7]     | 1      | OS/ABI: 0 = SYSV, 3 = Linux                    |
| e_ident[8..15] | 8      | Padding (zeroes)                               |
+----------------+--------+------------------------------------------------+
| e_type         | 2      | ET_REL (1) = relocatable (.o)                  |
|                |        | ET_EXEC (2) = executable (fixed address)       |
|                |        | ET_DYN (3) = shared object / PIE executable    |
+----------------+--------+------------------------------------------------+
| e_machine      | 2      | EM_X86_64 (0x3E), EM_AARCH64 (0xB7), etc.      |
| e_version      | 4      | 1 (current)                                    |
| e_entry        | 8      | Entry point virtual address (_start)           |
| e_phoff        | 8      | Program header table offset in file            |
| e_shoff        | 8      | Section header table offset in file            |
| e_flags        | 4      | Processor-specific flags                       |
| e_ehsize       | 2      | ELF header size (64 for 64-bit)                |
| e_phentsize    | 2      | Size of one program header entry (56)          |
| e_phnum        | 2      | Number of program header entries               |
| e_shentsize    | 2      | Size of one section header entry (64)          |
| e_shnum        | 2      | Number of section header entries               |
| e_shstrndx     | 2      | Index of section name string table             |
+----------------+--------+------------------------------------------------+

Common Sections

+-------------+-----------------------------------------------------------+
| Section     | Contents                                                  |
+-------------+-----------------------------------------------------------+
| .text       | Executable machine code                                   |
| .data       | Initialized global/static variables                       |
| .bss        | Uninitialized globals (zero-filled at load, no file space)|
| .rodata     | Read-only data (string literals, constants)               |
| .symtab     | Symbol table (functions, globals) — for linking/debugging |
| .strtab     | String table for symbol names                             |
| .shstrtab   | String table for section names                            |
| .rel / .rela| Relocation entries (fixups for the linker)                |
| .plt        | Procedure Linkage Table (lazy binding stubs)              |
| .got        | Global Offset Table (resolved dynamic addresses)          |
| .got.plt    | GOT entries specifically for PLT                          |
| .dynamic    | Dynamic linking info (needed libraries, symbol tables)    |
| .interp     | Path to dynamic linker (/lib64/ld-linux-x86-64.so.2)      |
| .init/.fini | Constructor/destructor code                               |
| .debug_*    | DWARF debug information (line numbers, types, variables)  |
| .eh_frame   | Exception/stack unwinding tables                          |
| .note.*     | Build ID, ABI tags                                        |
| .comment    | Compiler version string                                   |
+-------------+-----------------------------------------------------------+

Common Segment Types (Program Headers)

Segments are the runtime view. The kernel reads these to load the program.

+----------------+----------------------------------------------------------+
| Type           | Purpose                                                  |
+----------------+----------------------------------------------------------+
| PT_LOAD        | Loadable segment. Mapped into memory. Usually two:       |
|                |   1) r-x: .text, .rodata (code + constants)              |
|                |   2) rw-: .data, .bss (writable data)                    |
+----------------+----------------------------------------------------------+
| PT_DYNAMIC     | Points to .dynamic section. Used by the dynamic linker   |
|                | to find shared libraries and resolve symbols.            |
+----------------+----------------------------------------------------------+
| PT_INTERP      | Path to the dynamic linker (e.g., /lib64/ld-linux...)    |
|                | Kernel reads this to know which loader to invoke.        |
+----------------+----------------------------------------------------------+
| PT_NOTE        | Auxiliary info: build ID, ABI version. The note data     |
|                | usually also falls inside a loaded PT_LOAD segment.      |
+----------------+----------------------------------------------------------+
| PT_GNU_STACK   | Stack executability. flags=RW means the stack is NX;     |
|                | flags=RWX makes it executable (rare, insecure). If the   |
|                | header is absent, the kernel falls back to a legacy      |
|                | executable stack.                                        |
+----------------+----------------------------------------------------------+
| PT_GNU_RELRO   | Read-only after relocation. The dynamic linker resolves  |
|                | GOT entries, then marks this region read-only.           |
+----------------+----------------------------------------------------------+
| PT_PHDR        | Points to the program header table itself.               |
+----------------+----------------------------------------------------------+

Quick Inspection Commands

readelf -h binary      # ELF header
readelf -S binary      # Section headers
readelf -l binary      # Program headers (segments)
readelf -s binary      # Symbol table
readelf -d binary      # Dynamic section
objdump -d binary      # Disassemble .text
hexdump -C binary | head  # Raw bytes (look for 7f 45 4c 46)

Appendix D: Signal Reference

Signal Table (x86-64 Linux)

+--------+---------+---------+----------------------------------------------+
| Number | Name    | Default | Common Cause                                 |
+--------+---------+---------+----------------------------------------------+
|   1    | SIGHUP  | Term    | Terminal closed, or controlling process died |
|   2    | SIGINT  | Term    | Ctrl+C from terminal                         |
|   3    | SIGQUIT | Core    | Ctrl+\ from terminal (quit + core dump)      |
|   4    | SIGILL  | Core    | Illegal instruction (corrupt code, bad jump) |
|   5    | SIGTRAP | Core    | Breakpoint hit (used by debuggers, int3)     |
|   6    | SIGABRT | Core    | abort() called (failed assert, double free)  |
|   7    | SIGBUS  | Core    | Bus error: misaligned access, bad mmap       |
|   8    | SIGFPE  | Core    | Arithmetic error: divide by zero, overflow   |
|   9    | SIGKILL | Term    | Unconditional kill (CANNOT be caught)        |
|  10    | SIGUSR1 | Term    | User-defined signal 1                        |
|  11    | SIGSEGV | Core    | Segmentation fault: invalid memory access    |
|  12    | SIGUSR2 | Term    | User-defined signal 2                        |
|  13    | SIGPIPE | Term    | Write to pipe/socket with no reader          |
|  14    | SIGALRM | Term    | Timer from alarm() expired                   |
|  15    | SIGTERM | Term    | Polite termination request (what kill sends) |
|  17    | SIGCHLD | Ignore  | Child process stopped or terminated          |
|  18    | SIGCONT | Cont    | Resume stopped process (sent by fg, kill -18)|
|  19    | SIGSTOP | Stop    | Unconditional stop (CANNOT be caught)        |
|  20    | SIGTSTP | Stop    | Ctrl+Z from terminal                         |
+--------+---------+---------+----------------------------------------------+

Default Actions

Term  = Terminate the process
Core  = Terminate + generate core dump (if enabled: ulimit -c unlimited)
Stop  = Suspend the process (can resume with SIGCONT)
Cont  = Resume a stopped process
Ignore = Do nothing by default

Uncatchable Signals

Only two signals cannot be caught, blocked, or ignored:

  • SIGKILL (9): Always terminates. The process gets no chance to clean up.
  • SIGSTOP (19): Always suspends. The process cannot prevent it.

Every other signal can be caught with signal() or sigaction().

Sending Signals

kill -SIGTERM 1234       # send SIGTERM to PID 1234
kill -9 1234             # send SIGKILL (cannot be caught)
kill -STOP 1234          # suspend process
kill -CONT 1234          # resume process
Ctrl+C                   # sends SIGINT to foreground process
Ctrl+Z                   # sends SIGTSTP to foreground process
Ctrl+\                   # sends SIGQUIT (core dump)

The Signals You Will See Most

If your program crashes, check the signal:

  • SIGSEGV (11): You accessed memory you should not have. See Chapter 18.
  • SIGABRT (6): Something called abort() — often a failed assertion or detected heap corruption.
  • SIGFPE (8): Integer division by zero, or a trapping overflow like INT_MIN / -1. (Floating-point division by zero yields infinity, not a signal.)
  • SIGPIPE (13): You wrote to a closed pipe. Common in shell pipelines and network code.
  • SIGBUS (7): Misaligned memory access, or mmap beyond file size.

Appendix E: Glossary

Address Space -- The range of virtual addresses a process can use. On x86-64 Linux, user space is typically 0 to 0x7FFFFFFFFFFF (128 TB). Each process has its own address space, isolated by the MMU.

ASLR (Address Space Layout Randomization) -- The kernel loads the stack, heap, shared libraries, and PIE executables at randomized addresses each run. Makes exploitation harder because attackers cannot predict where code or data will be.

brk / sbrk -- brk is the system call that moves the "program break" — the boundary of the heap; sbrk is the C library wrapper around it. malloc typically uses brk for small allocations and mmap for large ones.

BSS (Block Started by Symbol) -- The section of an ELF file for uninitialized global and static variables. Takes no space in the file — the kernel zero-fills it when loading.

Cache Line -- The unit of transfer between cache and main memory. Typically 64 bytes on x86-64. When you access one byte, the CPU loads the entire 64-byte line into cache.

Calling Convention -- The rules for how functions receive arguments and return values. On x86-64 Linux (System V ABI): first 6 integer args in rdi, rsi, rdx, rcx, r8, r9; return in rax.

Context Switch -- When the kernel suspends one process/thread and resumes another. Saves and restores registers, updates page table base (CR3), flushes parts of the TLB.

Copy-on-Write (CoW) -- After fork(), parent and child share the same physical pages marked read-only. On first write, the kernel copies the page so each process gets its own. Saves memory and makes fork fast.

Core Dump -- A file containing the memory image of a crashed process. Generated by signals like SIGSEGV and SIGABRT. Open with gdb program core to inspect the crash state.

Demand Paging -- Pages are not loaded from disk until they are actually accessed. The first access triggers a page fault; the kernel then loads the page from the ELF file or swap.

ELF (Executable and Linkable Format) -- The standard binary format on Linux. Contains headers, sections (compile-time view), and segments (runtime view). See Appendix C.

Frame (Page Frame) -- A physical memory page. The MMU maps virtual pages to physical frames. On x86-64, a standard frame is 4096 bytes.

Frame (Stack Frame) -- The region of the stack belonging to one function call. Contains local variables, saved registers, and the return address. Bounded by rbp (base) and rsp (top).

GOT (Global Offset Table) -- A table of addresses filled in at runtime by the dynamic linker. Used for position-independent access to global variables and functions in shared libraries.

Heap -- The region of memory used for dynamic allocation (malloc/free in C, Box/Vec in Rust). Grows upward (toward higher addresses) as the program break is raised.

MMU (Memory Management Unit) -- Hardware in the CPU that translates virtual addresses to physical addresses using page tables. Enforces permissions (read/write/execute) and traps invalid access.

mmap -- System call that maps files or anonymous memory into the process address space. Used for shared libraries, large allocations, file I/O, and shared memory between processes.

Page -- The unit of virtual memory. 4096 bytes (4 KB) on x86-64 by default. The MMU translates addresses at page granularity. Large pages (2 MB, 1 GB) are also available.

Page Fault -- A CPU exception triggered when accessing a virtual address that is not currently mapped to a physical frame. The kernel handles it by loading the page, allocating a frame, or killing the process (segfault).

Page Table -- A hierarchical data structure (4 levels on x86-64) that maps virtual pages to physical frames. One per process. See Appendix B for bit-level details.

PTE (Page Table Entry) -- A single entry in a page table. Contains the physical frame number and permission bits (present, read/write, user/kernel, no-execute). See Appendix B.

PIE (Position-Independent Executable) -- An executable compiled so it can be loaded at any address. Most modern Linux distributions build executables as PIE by default, enabling full ASLR.

PLT (Procedure Linkage Table) -- A set of small stubs that redirect calls to shared library functions. On the first call, the PLT invokes the dynamic linker to resolve the address (lazy binding). Subsequent calls go directly.

Process -- An instance of a running program. Has its own address space, page tables, file descriptors, and one or more threads. Created by fork(); execve() replaces the program image inside an existing process.

Register -- A small, fast storage location inside the CPU. x86-64 has 16 general-purpose registers (rax through r15), the instruction pointer (rip), flags (rflags), and SIMD registers (xmm0-15). See Appendix A.

Relocation -- The process of adjusting addresses in object files when the linker combines them into an executable. Necessary because the compiler does not know final addresses when compiling individual files.

Segfault (Segmentation Fault) -- A SIGSEGV signal sent to a process when it accesses memory it is not allowed to (null pointer, freed memory, read-only pages, unmapped addresses). The most common crash in C programs.

Signal -- An asynchronous notification sent to a process. Can be generated by the kernel (SIGSEGV on bad memory access), by another process (kill), or by the terminal (Ctrl+C = SIGINT). See Appendix D.

Stack -- A LIFO memory region used for function calls. Grows downward (toward lower addresses on x86-64). Each function call pushes a stack frame; each return pops it. Each thread has its own stack.

Symbol -- A named entity in an object file: a function, a global variable, or a label. The linker uses symbols to connect references across object files. Inspect with nm or readelf -s.

Syscall (System Call) -- The interface between user-space programs and the kernel. Triggered by the syscall instruction on x86-64. Examples: read, write, mmap, fork, exit.

TLB (Translation Lookaside Buffer) -- A hardware cache inside the CPU that stores recent virtual-to-physical address translations. Avoids walking the page table on every memory access. A TLB miss requires a page table walk.

Undefined Behavior (UB) -- Code whose behavior the language standard does not define. In C: signed overflow, null dereference, buffer overflow, use-after-free. The compiler may assume UB never happens, leading to surprising optimizations.

Virtual Memory -- The abstraction that gives each process its own address space. Virtual addresses are translated to physical addresses by the MMU. Enables isolation, sharing, demand paging, and memory-mapped I/O.