How Programs Really Run: From CPU to Process Memory
Type this right now
// save as where.c, compile: gcc -g -o where where.c
#include <stdio.h>
#include <stdlib.h>
int global = 99;
int main() {
int stack_var = 42;
int *heap_var = malloc(sizeof(int));
*heap_var = 7;
printf("Code: %p\n", (void *)main);
printf("Global: %p\n", (void *)&global);
printf("Stack: %p\n", (void *)&stack_var);
printf("Heap: %p\n", (void *)heap_var);
free(heap_var);
return 0;
}
Run it. You'll see four wildly different addresses. Those addresses tell a story — a story about how your operating system, your CPU, and your compiler conspire to make your program work.
By the end of this book, you'll know exactly why each of those addresses is where it is.
What this book is about
When you write int x = 42; in C or let x: i32 = 42; in Rust, a staggering amount of
machinery activates. The compiler translates your intent into machine instructions. The linker
stitches object files into a binary. The OS loader maps that binary into virtual memory. The CPU
fetches instructions one by one, reading and writing to registers and RAM.
Most programmers treat all of that as a black box. This book opens the box.
We won't hand-wave. We won't say "the OS handles it" and move on. We'll show you the actual mechanism — the CPU instruction, the kernel data structure, the page table entry — and then give you a tool to observe it yourself on a real, running system.
Who this is for
You should already know how to write code in C or Rust (or both). You don't need to be an expert. If you can write a function, allocate memory, and compile a program, you have enough background.
What you probably don't know yet:
- Why your stack variable lives at 0x7ffd... but your heap variable lives at 0x55a...
- What the CPU actually does with mov rax, [rbp-8]
- Why a segfault happens at the hardware level
- What strace is showing you and why it matters
- How Rust's borrow checker maps to the same physical reality as C's raw pointers
That's what we're here for.
C and Rust, side by side
Every concept in this book is demonstrated in both C and Rust. Not because one is better — because seeing the same low-level reality from two different languages makes both clearer.
C gives you no guardrails. You see the raw mechanism. Rust adds compile-time guarantees. You see how safety is enforced without runtime cost.
Same CPU. Same instructions. Same virtual memory. Different contracts with the programmer.
How to read this book
Every chapter follows the same structure:
- Something you can type and run in 30 seconds. Seeing is believing.
- The concept, explained with ASCII diagrams. No hand-waving.
- The mechanism. What the CPU/OS/compiler actually does.
- Code in C and Rust. Side by side.
- A tool to observe it live. GDB, strace, /proc, objdump, readelf.
- A hands-on task. You learn by doing, not by reading.
You can read front-to-back, or jump to any chapter that interests you. But Part I (The Machine) is worth reading first — everything else builds on it.
What you'll need
- A Linux system (native, WSL2, or a VM all work)
- GCC and Rust (rustc/cargo) installed
- GDB, strace, objdump, readelf (standard on most Linux distros)
- A text editor and a terminal
That's it. No special hardware. No expensive tools. Everything we use is free and open source.
The journey
Part I: The Machine — CPU, memory, instructions, privilege
Part II: The Illusion — How your program sees memory
Part III: The Binary — ELF files, compilation, linking, loading
Part IV: The Mechanism — Virtual memory, page tables, page faults
Part V: Allocation — malloc, Rust's allocator, data layout
Part VI: Threads and Safety — Shared memory, C vs Rust tradeoffs
Part VII: Observe and Build — Tools and experiments
By the end, int x = 42 won't be magic anymore. You'll know the register it passes through,
the cache line it occupies, the page table entry that maps it, and the virtual address the OS
assigned to it.
Let's start with the CPU.
The CPU in 20 Minutes
Type this right now
// save as step.c, compile: gcc -g -O0 -o step step.c
#include <stdio.h>
int main() {
int a = 10;
int b = 32;
int x = a + b;
printf("x = %d\n", x);
return 0;
}
Now run it under GDB:
$ gdb ./step
(gdb) break main
(gdb) run
(gdb) info registers
(gdb) si
(gdb) info registers
Watch the rip register change. That register is the CPU's bookmark — it just moved to the next instruction. You're watching the CPU work, one step at a time.
The only thing a CPU does
A CPU does exactly one thing, over and over, billions of times per second:
┌──────────────────────────┐
│ │
│ 1. FETCH instruction │◄──── Read bytes at address in rip
│ │ │
│ ▼ │
│ 2. DECODE instruction │◄──── Figure out what the bytes mean
│ │ │
│ ▼ │
│ 3. EXECUTE instruction │◄──── Do the thing (add, move, jump...)
│ │ │
│ ▼ │
│ 4. Advance rip │◄──── Point to the next instruction
│ │ │
│ └────────────────┘
└──────────────────────────┘
Fetch. Decode. Execute. Advance. That's it. Every program you've ever used — your web browser, your OS, a video game — is just this loop running at 3+ billion iterations per second.
Registers: the CPU's own storage
Before the CPU can add two numbers, those numbers need to be inside the CPU. RAM is far away (relatively speaking). So the CPU has its own tiny, blazing-fast storage: registers.
On x86-64, you have 16 general-purpose registers, each 64 bits (8 bytes) wide:
┌─────────────────────────────────────────────┐
│ x86-64 Registers │
├──────────┬───────────────────────────────────┤
│ rax │ Accumulator, return values │
│ rbx │ General purpose (callee-saved) │
│ rcx │ Counter, 4th argument │
│ rdx │ Data, 3rd argument │
│ rsi │ Source index, 2nd argument │
│ rdi │ Destination index, 1st argument │
│ rbp │ Base pointer (frame pointer) │
│ rsp │ Stack pointer ◄── top of stack │
│ r8-r15 │ Additional general-purpose regs │
├──────────┼───────────────────────────────────┤
│ rip │ Instruction pointer (CPU's │
│ │ bookmark, points to NEXT instr.) │
│ rflags │ Status flags (zero, carry, etc.) │
└──────────┴───────────────────────────────────┘
When GDB shows you info registers, you're seeing the CPU's entire working state.
Fun Fact: All 16 general-purpose registers together hold just 128 bytes. Your smallest source file is probably bigger. Yet these 128 bytes are where all the real work happens.
Two special registers you must know
rip — the instruction pointer. The CPU is reading a list of instructions, one after another. rip tells it where it currently is.
Address Instruction
─────────────────────────────
0x401000 mov rax, 10 ◄── rip points here
0x401007 mov rbx, 32
0x40100e add rax, rbx
After executing mov rax, 10, the CPU advances rip to 0x401007. Jump instructions (jmp, je, call) change rip to a different address — that's how if, for, and function calls work.
rsp — the stack pointer. Always points to the top of the current function's stack frame. Moves down on function calls, back up on returns.
Inside the CPU: the big picture
┌──────────────────────────────────────────────────┐
│ CPU │
│ │
│ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Control Unit │ │ Registers │ │
│ │ │ │ rax rbx rcx rdx │ │
│ │ Fetches and │ │ rsi rdi rbp rsp │ │
│ │ decodes │ │ r8-r15 │ │
│ │ instructions │ │ rip rflags │ │
│ └──────┬───────┘ └──────────┬──────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ ALU (Arithmetic Logic Unit) │ │
│ │ add, sub, mul, and, or, xor, cmp │ │
│ └──────────────────┬───────────────────┘ │
└──────────────────────┼───────────────────────────┘
┌──────┴──────┐
│ Memory Bus │
└──────┬──────┘
┌──────┴──────┐
│ RAM │
└─────────────┘
Control Unit fetches and decodes. ALU does math. Registers hold the data. Memory bus connects to RAM — fast, but much slower than registers.
How x = a + b really executes
C and Rust side by side:
int a = 10;
int b = 32;
int x = a + b;
let a: i32 = 10;
let b: i32 = 32;
let x: i32 = a + b;
Both compile to something like this (simplified, -O0):
mov DWORD PTR [rbp-4], 10 ; store 10 on the stack (a)
mov DWORD PTR [rbp-8], 32 ; store 32 on the stack (b)
mov eax, DWORD PTR [rbp-4] ; load a into eax
add eax, DWORD PTR [rbp-8] ; add b to eax
mov DWORD PTR [rbp-12], eax ; store result as x
What do you think happens? Why rbp-4, rbp-8, rbp-12? Why negative offsets?
Reveal: The stack grows downward. rbp is the base of the current frame. Local variables go at decreasing addresses below it.
Trace the CPU's state at each step:
Instruction │ eax │ [rbp-4] │ [rbp-8] │ [rbp-12]
─────────────────────────────┼───────┼─────────┼─────────┼─────────
mov DWORD PTR [rbp-4], 10 │ ??? │ 10 │ ??? │ ???
mov DWORD PTR [rbp-8], 32 │ ??? │ 10 │ 32 │ ???
mov eax, DWORD PTR [rbp-4] │ 10 │ 10 │ 32 │ ???
add eax, DWORD PTR [rbp-8] │ 42 │ 10 │ 32 │ ???
mov DWORD PTR [rbp-12], eax │ 42 │ 10 │ 32 │ 42
Five instructions. That's what x = a + b actually is.
Clock speed: how fast is fast?
3 GHz = 3,000,000,000 ticks/second
One tick = 0.33 nanoseconds ≈ time for light to travel 10 cm
A simple add takes 1 cycle. A memory load from RAM can take hundreds of cycles. This is why caches matter — next chapter.
Fun Fact: At 3 GHz, your CPU executes roughly 100 billion instructions during a 30-second YouTube ad. Each one went through fetch-decode-execute.
Seeing it with your own eyes: GDB
$ gcc -g -O0 -o step step.c
$ gdb ./step
(gdb) break main
(gdb) run
(gdb) info registers rip rsp rax
rip 0x401136 0x401136 <main+4>
rsp 0x7fffffffde10 0x7fffffffde10
(gdb) si
(gdb) info registers rip
rip 0x40113d 0x40113d <main+11>
rip moved from 0x401136 to 0x40113d — 7 bytes forward. The instruction was 7 bytes long. The CPU read it, executed it, and advanced rip by exactly that amount.
(gdb) si
(gdb) info registers rip
rip 0x401144 0x401144 <main+18>
You're watching fetch-decode-execute happen in real time.
The same works in Rust:
fn main() {
    let a: i32 = 10;
    let b: i32 = 32;
    let x: i32 = a + b;
    println!("x = {}", x);
}
$ rustc -g -o step_rs step.rs
$ gdb ./step_rs
(gdb) break step::main
(gdb) run
(gdb) si
(gdb) info registers rip rsp
Same CPU. Same registers. Same loop. The language is different; the CPU doesn't care.
Recap
┌─────────────────────────────────────────────────┐
│ The CPU does ONE thing: │
│ Fetch → Decode → Execute → Advance rip │
│ │
│ It works with: │
│ • Registers (tiny, fast, inside the CPU) │
│ • RAM (huge, slow, outside the CPU) │
│ │
│ Key registers: │
│ • rip = where am I in the instruction list? │
│ • rsp = where is the top of the stack? │
│ • rax = general workhorse, return values │
│ │
│ Clock speed (3 GHz) = 3 billion cycles/sec │
│ Simple instruction = ~1 cycle = ~0.3 ns │
└─────────────────────────────────────────────────┘
Task
- Compile step.c with gcc -g -O0 -o step step.c.
- Open it in GDB: gdb ./step.
- Set a breakpoint on main, run, and use si to step through 10 instructions.
- After each step, run info registers rip rax rbp rsp.
- Write down how rip changes. Is each increment the same? (No — instructions have different lengths.)
- Find the instruction where eax becomes 42. That's the add.
Bonus: Run objdump -d step | less and find the main function. Match each instruction address with what GDB showed you. They should be identical.
In the next chapter, we'll find out why loading data from memory is 100x to 1000x slower than working with registers — and how caches save us from that penalty.
The Memory Hierarchy: Why Speed Has Layers
Type this right now
// save as cache_test.c, compile: gcc -O2 -o cache_test cache_test.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define SIZE (64 * 1024 * 1024) // 64 Mi (67,108,864) ints = 256 MB
int main() {
int *arr = malloc(SIZE * sizeof(int));
for (int i = 0; i < SIZE; i++) arr[i] = i;
clock_t start;
long long seq_sum = 0, stride_sum = 0;
// Sequential access
start = clock();
for (int i = 0; i < SIZE; i++) seq_sum += arr[i];
double seq = (double)(clock() - start) / CLOCKS_PER_SEC;
// Strided access (every 16th int = one access per 64-byte cache line)
start = clock();
for (int i = 0; i < SIZE; i += 16) stride_sum += arr[i];
double stride = (double)(clock() - start) / CLOCKS_PER_SEC;
printf("Sequential: %.3f sec (sum=%lld)\n", seq, seq_sum);
printf("Stride-16: %.3f sec (sum=%lld)\n", stride, stride_sum);
printf("Stride does 16x fewer adds but is NOT 16x faster.\n");
free(arr);
}
Run it. The stride-16 loop does 16 times fewer additions but is nowhere near 16x faster. Why? Because memory access time dominates, and every 16th element is a new cache line.
The brutal reality: speed vs. size
┌───────────────┬──────────────┬─────────────┬────────────────────┐
│ Level │ Latency │ Size │ Slower than regs? │
├───────────────┼──────────────┼─────────────┼────────────────────┤
│ Registers │ ~0.3 ns │ ~128 B │ 1x │
│ L1 Cache │ ~1 ns │ 32-64 KB │ 3x │
│ L2 Cache │ ~4 ns │ 256 KB │ 13x │
│ L3 Cache │ ~12 ns │ 8-32 MB │ 40x │
│ RAM (DDR5) │ ~100 ns │ 8-128 GB │ 300x │
│ SSD (NVMe) │ ~100,000 ns │ 0.5-4 TB │ 300,000x │
│ HDD │~10,000,000 ns│ 1-20 TB │ 30,000,000x │
└───────────────┴──────────────┴─────────────┴────────────────────┘
If a register access were 1 second, RAM would be 5 minutes. An SSD would be 3.5 days. A hard drive seek would be almost an entire year.
The pyramid
▲
/|\ Registers: 128 B, 0.3 ns
/ | \
/ | \ L1: 32-64 KB, 1 ns
/ | \
/ | \ L2: 256 KB, 4 ns
/ | \
/ | \ L3: 8-32 MB, 12 ns
/ | \
/ | \ RAM: 8-128 GB, 100 ns
/ | \
/ | \ SSD/HDD: TBs, us-ms
/───────────────────────\
FASTER, SMALLER ▲ ▼ SLOWER, BIGGER
Physics forces this tradeoff. Faster storage requires transistors closer to the CPU, which limits capacity. The speed of light itself imposes a minimum latency for reaching distant storage.
Cache lines: the unit of transfer
The CPU never reads a single byte from RAM. It reads in cache lines of 64 bytes.
When you access array[0], the CPU fetches a 64-byte block containing array[0] through array[15] (assuming 4-byte ints):
You access array[0]. RAM sends back a 64-byte cache line:
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ [0]│ [1]│ [2]│ [3]│ [4]│ [5]│ [6]│ [7]│ [8]│ [9]│[10]│[11]│[12]│[13]│[14]│[15]│
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
◄──────────────────── 64 bytes ─────────────────────►
Now array[1] through array[15] are in L1 cache: ~1 ns instead of ~100 ns.
This is spatial locality: data near what you just accessed is already cached.
Two kinds of locality
Spatial locality — accessing nearby addresses in sequence:
for (int i = 0; i < N; i++)
sum += array[i]; // Each access is 4 bytes after the last
You pay the ~100 ns RAM penalty once per cache line and get 15 more accesses essentially free.
Temporal locality — accessing the same data again soon:
int count = 0;
for (int i = 0; i < N; i++)
count++; // 'count' stays in L1 the entire loop
What do you think happens? Your local variables are on the stack. Is stack memory special hardware?
Reveal: Same RAM as everything else. But the top of the stack is accessed constantly (temporal locality) in a contiguous region (spatial locality). It's always in L1 cache. That's why local variables are fast.
Why linked lists are slow
Each node is somewhere on the heap. Following each next pointer is a potential cache miss:
Linked list (pointer chasing):
Node A @ 0x1000 Node B @ 0x5F00 Node C @ 0x3200
┌─────┬──────┐ ┌─────┬──────┐ ┌─────┬──────┐
│ val │ next─┼─────►│ val │ next─┼──────►│ val │ NULL │
└─────┴──────┘ └─────┴──────┘ └─────┴──────┘
Each arrow = potential cache miss = ~100 ns
Array with 3 elements @ 0x1000:
┌─────┬─────┬─────┐
│ [0] │ [1] │ [2] │ All in one cache line = ~100 ns total
└─────┴─────┴─────┘
This is why Vec<T> in Rust and arrays in C demolish linked lists in almost every benchmark.
Fun Fact: Bjarne Stroustrup (creator of C++) showed that even for insertion in the middle — the textbook linked-list use case — std::vector was faster because the O(n) memmove is sequential and cache-friendly, while the O(1) pointer update causes a cache miss.
Cache-friendly data layout
A game with 10,000 entities. Array of Structs — the obvious way:
struct Entity { float x, y, z; float vx, vy, vz; int health; };
struct Entity entities[10000];
struct Entity { x: f32, y: f32, z: f32, vx: f32, vy: f32, vz: f32, health: i32 }
let entities: Vec<Entity> = Vec::with_capacity(10000);
Struct of Arrays — cache-friendly when processing one field at a time:
struct Entities { float x[10000], y[10000], z[10000];
float vx[10000], vy[10000], vz[10000]; int health[10000]; };
struct Entities {
    x: Vec<f32>, y: Vec<f32>, z: Vec<f32>,
    vx: Vec<f32>, vy: Vec<f32>, vz: Vec<f32>,
    health: Vec<i32>,
}
If your physics update only touches positions and velocities:
AoS cache line (64 bytes):
│ x₀ y₀ z₀ vx₀ vy₀ vz₀ hp₀ pad │ x₁ y₁ z₁ vx₁ vy₁ vz₁ hp₁ pad │
2 entities per line, health + padding wasted
SoA cache line (64 bytes):
│ x₀ x₁ x₂ x₃ x₄ x₅ x₆ x₇ x₈ x₉ x₁₀ x₁₁ x₁₂ x₁₃ x₁₄ x₁₅ │
16 x-values per line, every byte useful
SoA can be 2-4x faster for batch processing.
The stride experiment
This program shows the cache hierarchy directly:
// save as stride.c, compile: gcc -O2 -o stride stride.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ARRAY_SIZE (32 * 1024 * 1024)
int main() {
int *arr = malloc(ARRAY_SIZE * sizeof(int));
for (int i = 0; i < ARRAY_SIZE; i++) arr[i] = 1;
int strides[] = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024};
for (int s = 0; s < 11; s++) {
int stride = strides[s];
volatile long long sum = 0;
clock_t start = clock();
for (int iter = 0; iter < 10; iter++)
for (int i = 0; i < ARRAY_SIZE; i += stride) sum += arr[i];
double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
long long accesses = (long long)10 * (ARRAY_SIZE / stride);
printf("Stride %5d: %.2f ns/access\n",
stride, elapsed * 1e9 / accesses);
}
free(arr);
}
Typical results:
Stride 1: 0.68 ns/access ◄── sequential, everything cached
Stride 16: 2.85 ns/access ◄── one access per cache line
Stride 256: 8.43 ns/access
Stride 1024: 9.02 ns/access ◄── dominated by RAM latency
The same program in Rust produces the same results — same hardware, same cache hierarchy:
use std::time::Instant;

fn main() {
    let size: usize = 32 * 1024 * 1024;
    let arr: Vec<i32> = vec![1; size];
    let strides = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024];
    for &stride in &strides {
        let mut sum: i64 = 0;
        let start = Instant::now();
        for _ in 0..10 {
            let mut i = 0;
            while i < size {
                sum += arr[i] as i64;
                i += stride;
            }
        }
        let ns = start.elapsed().as_nanos() as f64 / (10 * size / stride) as f64;
        println!("Stride {:5}: {:.2} ns/access (sum={})", stride, ns, sum);
    }
}
Recap
┌────────────────────────────────────────────────────────────┐
│ Memory is a hierarchy: fast-small at top, slow-huge below │
│ CPU fetches data in 64-byte cache lines │
│ │
│ Spatial locality: access nearby addresses (arrays win) │
│ Temporal locality: re-use recent data (loops, stack win) │
│ │
│ Stack is fast: always in L1 cache │
│ Linked lists are slow: every node is a cache miss │
│ SoA can be 2-4x faster than AoS for batch processing │
└────────────────────────────────────────────────────────────┘
Task
- Compile and run stride.c (or the Rust version). Record ns/access for strides 1, 16, and 1024.
- Calculate: how many times slower is stride-1024 vs. stride-1?
- Run perf stat -e L1-dcache-load-misses ./stride and watch miss counts climb with stride.
- Challenge: Modify the program to use random access — allocate an array of random indices, access arr[indices[i]]. Compare ns/access to sequential. This is what pointer chasing looks like to the cache.
Next up: the actual instructions the CPU executes, and how your C and Rust code becomes assembly.
Instructions and the x86-64 ISA
Type this right now
// save as add.c
int add(int a, int b) { return a + b; }
$ gcc -S -O1 -o add.s add.c
$ cat add.s
You just turned C into assembly. It'll be about 5-10 lines of actual instructions. Everything else is metadata. Let's learn to read it.
What is an ISA?
An Instruction Set Architecture is the set of instructions a CPU understands. x86-64 (also called AMD64) is the ISA on virtually every desktop, laptop, and server from Intel and AMD.
It's not abstract. It's a specific, concrete list of operations encoded as bytes in memory. When you compile C or Rust, the compiler translates your code into these exact instructions.
The essential instructions
You don't need hundreds. These 16 cover 90% of compiler output:
┌──────────┬──────────────────────────────────────────────────────┐
│ Instr. │ What it does │
├──────────┼──────────────────────────────────────────────────────┤
│ mov a, b │ Copy value from b into a │
│ add a, b │ a = a + b │
│ sub a, b │ a = a - b │
│ lea a, b │ Load Effective Address: a = address of b │
│ │ (also used for quick math: lea rax, [rbx+rcx*4]) │
│ push val │ Subtract 8 from rsp, store val at [rsp] │
│ pop reg │ Load [rsp] into reg, add 8 to rsp │
│ call lbl │ Push rip (return address), jump to lbl │
│ ret │ Pop address from stack, jump to it │
│ cmp a, b │ Compute a - b, set flags (don't store result) │
│ jmp lbl │ Jump unconditionally (set rip = lbl) │
│ je lbl │ Jump if equal (zero flag set) │
│ jne lbl │ Jump if not equal │
│ jl lbl │ Jump if less than (signed) │
│ jg lbl │ Jump if greater than (signed) │
│ syscall │ Transfer control to the kernel │
│ nop │ Do nothing (used for alignment) │
└──────────┴──────────────────────────────────────────────────────┘
How a + b becomes assembly
int add(int a, int b) { return a + b; }
pub fn add(a: i32, b: i32) -> i32 { a + b }
Both produce (at -O1 / -C opt-level=1):
add:
lea eax, [rdi+rsi]
ret
Two instructions. The System V AMD64 calling convention puts arg1 in rdi, arg2 in rsi. lea eax, [rdi+rsi] computes the sum in one instruction. Return value goes in eax.
Fun Fact: lea stands for "Load Effective Address." Designed for address math, but compilers abuse it as a fast calculator — addition and multiplication in one instruction without touching the flags register.
How if/else becomes assembly
int max(int a, int b) { return (a > b) ? a : b; }
pub fn max(a: i32, b: i32) -> i32 { if a > b { a } else { b } }
At -O1:
max:
cmp edi, esi ; compare a and b
mov eax, esi ; assume b is the answer
cmovg eax, edi ; IF a > b, overwrite with a
ret
What do you think happens? The compiler didn't use jmp or je. Why not?
Reveal: Branches cause pipeline stalls when mispredicted. cmovg (conditional move) avoids the branch entirely — the CPU always executes it but conditionally writes the result.
At -O0 you'd see the explicit version: cmp sets flags, jle conditionally jumps to the else branch. That's how every if works at the hardware level.
How function calls work
call does two things in one instruction: push the return address, then jump.
Before CALL: After CALL:
rip = 0x401020 rip = 0x401200 (function start)
rsp = 0x7fff10 rsp = 0x7fff08 (moved down 8)
Stack: Stack:
┌──────────┐ 0x7fff10 ┌──────────┐ 0x7fff10
│ ... │ │ ... │
└──────────┘ ├──────────┤ 0x7fff08 ◄── rsp
│ 0x401025 │ return address
└──────────┘
ret does the reverse: pop the address, set rip to it.
The stack in assembly: prologue and epilogue
my_function:
push rbp ; save caller's base pointer
mov rbp, rsp ; set up our own base pointer
sub rsp, 32 ; room for local variables
; ... function body: locals at [rbp-4], [rbp-8], etc.
mov rsp, rbp ; deallocate locals
pop rbp ; restore caller's base pointer
ret
Higher addresses
┌─────────────────────┐
│ caller's frame │
├─────────────────────┤
│ return address │ ◄── pushed by 'call'
├─────────────────────┤
│ saved rbp │ ◄── rbp points here
├─────────────────────┤
│ local var 1 │ [rbp-4]
│ local var 2 │ [rbp-8]
├─────────────────────┤
│ │ ◄── rsp (top of stack)
└─────────────────────┘
Lower addresses
With optimization, the compiler often skips this entirely (frame pointer omission) — uses rsp directly, making code faster but debugging harder.
System calls: the ONLY door to the kernel
Your program (Ring 3) cannot access files, screens, or networks directly. The syscall instruction is the only way to ask the kernel.
Here's write(1, "hello\n", 6) in assembly:
mov rax, 1 ; syscall number: write
mov rdi, 1 ; fd: stdout
lea rsi, [msg] ; pointer to string
mov rdx, 6 ; byte count
syscall ; >>> kernel handles it, result in rax <<<
Convention: syscall number in rax, args in rdi, rsi, rdx, r10, r8, r9.
A complete "Hello, World" in x86-64 assembly
No C library. No runtime. Just raw syscalls.
; save as hello.s
; build: as -o hello.o hello.s && ld -o hello hello.o
.intel_syntax noprefix
.section .text
.global _start
_start:
mov rax, 1 ; syscall: write
mov rdi, 1 ; fd: stdout
lea rsi, [msg] ; buffer
mov rdx, 14 ; count
syscall
mov rax, 60 ; syscall: exit
mov rdi, 0 ; status: 0
syscall
.section .rodata
msg: .ascii "Hello, World!\n"
$ as -o hello.o hello.s && ld -o hello hello.o && ./hello
Hello, World!
A dozen-odd lines. Two syscalls. A complete program, one step above raw machine bytes.
Godbolt: the compiler explorer
godbolt.org — paste C or Rust on the left, see assembly on the right, live. Try int square(int x) { return x * x; } in C and pub fn square(x: i32) -> i32 { x * x } in Rust. Both produce mov eax, edi; imul eax, edi; ret. Same CPU language, different programmer languages.
Fun Fact: Godbolt color-codes which assembly lines map to which source lines.
Reading compiler output: a walkthrough
long sum_array(int *arr, int len) {
long total = 0;
for (int i = 0; i < len; i++) total += arr[i];
return total;
}
At -O1:
sum_array:
mov eax, 0 ; total = 0
mov ecx, 0 ; i = 0
jmp .L2
.L3:
movsx rdx, DWORD PTR [rdi+rcx*4] ; rdx = arr[i]
add rax, rdx ; total += arr[i]
add ecx, 1 ; i++
.L2:
cmp ecx, esi ; i < len?
jl .L3 ; if yes, loop body
ret
rdi = arr (1st arg) esi = len (2nd arg)
rax = total (return) ecx = i (counter)
[rdi+rcx*4] = arr + i*4 = arr[i] (CPU does array indexing in hardware)
The *4 in [rdi+rcx*4] computes the byte offset — the CPU has built-in support for scaled-index addressing.
Recap
┌──────────────────────────────────────────────────────────┐
│ x86-64 is the concrete ISA on your CPU. │
│ │
│ ~16 instructions cover 90% of compiler output: │
│ mov, add, sub, lea, push, pop, call, ret, │
│ cmp, jmp, je/jne/jl/jg, syscall, nop │
│ │
│ if/else → cmp + conditional jump (or cmov) │
│ function call → call (push rip, jump) / ret (pop rip) │
│ syscall → the ONLY way to talk to the kernel │
│ │
│ C and Rust produce the same assembly at the same │
│ optimization level. │
└──────────────────────────────────────────────────────────┘
Task
- Write a simple C function (try int factorial(int n) with a loop).
- Compile with gcc -S -O0 -o fact_O0.s fact.c and gcc -S -O2 -o fact_O2.s fact.c.
- In the -O0 version, identify: the prologue, where locals are stored, the cmp + jump for the loop, and the epilogue.
- In -O2, see how much the compiler eliminated. Can you still match assembly to source?
- Bonus: Paste the same function in Godbolt with both GCC and rustc at -O2. Compare.
Next: the wall between your code and the kernel — and why the CPU enforces it with hardware.
Privilege, Protection, and System Calls
Type this right now
$ cat > hello.c << 'EOF'
#include <stdio.h>
int main() { printf("Hello from C!\n"); return 0; }
EOF
$ gcc -o hello hello.c
$ strace ./hello 2>&1 | tail -20
You'll see a flood of output like execve(...), brk(...), mmap(...), write(1, "Hello from C!\n", 14). Those are system calls — every interaction between your program and the OS. Your simple printf triggered dozens of them.
The two worlds
Your CPU has (at least) two privilege levels, enforced by the hardware itself:
┌─────────────────────────────────────────────────────────┐
│ Ring 3: User Mode │
│ │
│ Your program lives here. │
│ CAN: execute instructions, read/write own memory │
│ CANNOT: access hardware, read other processes' memory, │
│ change page tables, halt the CPU │
├═════════════════════════════════════════════════════════╡
│ ▲▲▲ HARDWARE-ENFORCED WALL ▲▲▲ │
│ The ONLY way through: the syscall instruction. │
├═════════════════════════════════════════════════════════╡
│ Ring 0: Kernel Mode │
│ │
│ The OS kernel lives here. │
│ CAN: everything — all memory, all hardware, all regs │
└─────────────────────────────────────────────────────────┘
This isn't a software convention. The CPU has a register bit tracking the current privilege level. When it says "Ring 3," certain instructions are physically impossible. The CPU will fault if you try.
What do you think happens? What stops a program from changing that privilege bit to Ring 0?
Reveal: The instruction to change privilege level is itself privileged. You can't run it from Ring 3. Hardware catch-22, by design. The only way into Ring 0 is through controlled entry points the kernel sets up at boot time.
The x86 protection rings
x86 defines four rings, but modern OSes use only two:
┌───────────────────┐
│ Ring 0: Kernel │ Full access
├───────────────────┤
│ Ring 1 (unused) │
├───────────────────┤
│ Ring 2 (unused) │
├───────────────────┤
│ Ring 3: User │ Your program
└───────────────────┘
Why you can't read another process's memory
Two mechanisms work together:
1. Virtual memory. Each process has its own page table. Process A's address 0x4000 maps to a completely different physical location than Process B's 0x4000.
2. Privilege levels. Page tables live in kernel memory. Only Ring 0 can modify them. You can't remap your own pages because writing to cr3 (page table base register) is privileged.
Process A Process B
┌──────────┐ ┌──────────┐
│ 0x4000 ──┼──┐ │ 0x4000 ──┼──┐
└──────────┘ │ └──────────┘ │
▼ page table A ▼ page table B
┌──────────┐ ┌──────────┐
│ phys: │ │ phys: │
│ 0x1A000 │ │ 0x3F000 │
└──────────┘ └──────────┘
Same virtual address → different physical memory.
Page tables in kernel memory — Ring 3 can't touch them.
The CPU enforces this on every memory access. Hardware gate, not software check.
System calls: the controlled gateway
Your program can't read files, write to the screen, or allocate memory from the OS directly. It asks the kernel through system calls.
Your program (Ring 3) Kernel (Ring 0)
───────────────────── ────────────────
1. Set up arguments
rax = syscall number
rdi, rsi, rdx = args
2. Execute 'syscall' ──────► 3. CPU switches to Ring 0
Kernel validates args
Kernel does the work
5. Continue in Ring 3 ◄────── 4. Result in rax
CPU switches back to Ring 3
The syscall instruction atomically saves rip/rflags, switches to Ring 0, and jumps to the kernel's entry point. The kernel runs sysret to switch back.
Common system calls
┌───────────┬────────┬───────────────────────────────────────────┐
│ Syscall │ Number │ What it does │
├───────────┼────────┼───────────────────────────────────────────┤
│ read │ 0 │ Read bytes from a file descriptor │
│ write │ 1 │ Write bytes to a file descriptor │
│ open │ 2 │ Open a file, get a file descriptor │
│ close │ 3 │ Close a file descriptor │
│ mmap │ 9 │ Map a file or allocate memory pages │
│ mprotect │ 10 │ Change memory protection (r/w/x) │
│ brk │ 12 │ Expand/shrink the heap │
│ fork │ 57 │ Create a copy of the current process │
│ execve │ 59 │ Replace process with a new program │
│ exit │ 60 │ Terminate the process │
└───────────┴────────┴───────────────────────────────────────────┘
strace: watching syscalls in real time
C hello world
$ strace ./hello 2>&1 | grep -E "execve|write|exit|mmap|brk"
execve("./hello", ["./hello"], [...]) = 0
brk(NULL) = 0x55a8c3d2f000
mmap(NULL, 2228224, PROT_READ, ...) = 0x7f3a...
mmap(0x7f3a..., 1540096, PROT_READ|PROT_EXEC, ...) = 0x7f3a...
brk(0x55a8c3d50000) = 0x55a8c3d50000
write(1, "Hello from C!\n", 14) = 14
exit_group(0) = ?
Your one-line printf caused: execve (load program), mmap (load libc), brk (allocate buffers), write (the actual output), exit_group (terminate).
Rust hello world
fn main() { println!("Hello from Rust!"); }
$ rustc -o hello_rs hello.rs
$ strace ./hello_rs 2>&1 | grep -E "write|exit"
write(1, "Hello from Rust!\n", 17) = 17
exit_group(0) = ?
Same write syscall. Both languages eventually call write(1, ...) to put text on your screen.
Fun Fact: A statically linked C hello world can have as few as 3-4 syscalls total. A statically linked Rust hello world typically has 20-30 because Rust's runtime initializes signal handlers, thread-local storage, and a panic handler.
What happens when you break the rules
int main() {
int *p = (int *)0xDEADBEEF; // an address we don't own
return *p; // try to read it
}
Here's the chain of events:
1. CPU executes: mov eax, [0xDEADBEEF]
2. CPU consults page table → page not mapped
3. CPU generates PAGE FAULT exception (hardware, not software)
4. CPU switches to Ring 0, jumps to kernel's fault handler
5. Kernel: address 0xDEADBEEF is not valid for this process
6. Kernel sends SIGSEGV signal to the process
7. Default handler: terminate + core dump
8. You see: "Segmentation fault (core dumped)"
Verify with strace:
$ strace ./segfault 2>&1 | tail -3
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xdeadbeef} ---
+++ killed by SIGSEGV (core dumped) +++
SEGV_MAPERR = address not mapped. This is a hardware mechanism. The CPU caught it before the read completed.
Other protection violations
┌──────────────────┬──────────────────────────────────────────┐
│ Violation │ What happens │
├──────────────────┼──────────────────────────────────────────┤
│ Read unmapped │ Page fault → SIGSEGV (SEGV_MAPERR) │
│ memory │ │
│ Write to │ Page fault → SIGSEGV (SEGV_ACCERR) │
│ read-only page │ e.g., writing to string literals │
│ Execute non- │ Page fault → SIGSEGV (NX bit) │
│ executable page │ │
│ Privileged       │ CPU exception (#GP) → SIGSEGV            │
│ instruction      │ Ring 0 instr. from Ring 3; invalid       │
│                  │ opcodes raise #UD → SIGILL               │
└──────────────────┴──────────────────────────────────────────┘
All follow the same pattern: CPU detects violation in hardware, generates exception, kernel converts to signal, process dies.
In Rust, most of these are prevented at compile time. But via unsafe or FFI, the hardware mechanism is identical.
The complete picture
┌──────────────────────────────────────────────────┐
│ Your Program (Ring 3) │
│ ┌─────────────┐ ┌──────────────┐ │
│ │ Your code │──►│ C lib / Rust │ │
│ │ (main, etc.) │ │ runtime │ │
│ └─────────────┘ └──────┬───────┘ │
│ printf / println │
│ malloc / Vec::push │
│ │ │
│ ┌──────▼──────┐ │
│ │ syscall │ │
├════════════════════╪════════════════════════════╡
│ Kernel (Ring 0) │ │
│ ┌─────▼──────┐ │
│ │ Syscall │ │
│ │ handler │ │
│ └──────┬─────┘ │
│ ┌─────────┼─────────┐ │
│ ▼ ▼ ▼ │
│ Filesystem Memory Network │
├══════════╪═════════╪═════════╪══════════════════╡
│ HW Disk RAM NIC │
└──────────────────────────────────────────────────┘
Your code calls a library. The library executes syscall. CPU switches to Ring 0. Kernel does the work. CPU switches back. Your program continues.
Every. Single. Time. No shortcuts, no backdoors.
Recap
┌──────────────────────────────────────────────────────────┐
│ Ring 0 / Ring 3 = HARDWARE-enforced privilege levels │
│ │
│ Your program cannot: │
│ • Access another process's memory (page tables) │
│ • Access hardware directly (privileged instructions) │
│ • Modify its own page tables (privileged registers) │
│ │
│ syscall = the ONLY door to Ring 0 │
│ Segfaults = hardware events: page fault → SIGSEGV │
│ strace shows every syscall your program makes │
└──────────────────────────────────────────────────────────┘
Task
- Run `strace -c ./hello` (C) and `strace -c ./hello_rs` (Rust). Compare total syscall counts.
- Run `strace -e write ./hello` to filter for just `write` calls.
- Violations. Compile `int main() { return *(int *)0; }` and `int main() { char *s = "hello"; s[0] = 'H'; return 0; }`. Strace each. Note `SEGV_MAPERR` vs. `SEGV_ACCERR`.
- Bonus: Write assembly that executes `hlt` from user mode. What signal do you get?
Next: we leave the hardware behind and enter the illusion — how your program thinks memory is laid out.
Your Program's View of Memory
Type this right now
Open a terminal and run:
$ cat /proc/self/maps
You just asked the cat process to show you its own memory map. You'll see something like this:
55a3b2c00000-55a3b2c02000 r--p 00000000 08:01 1234567 /usr/bin/cat
55a3b2c02000-55a3b2c07000 r-xp 00002000 08:01 1234567 /usr/bin/cat
55a3b2c07000-55a3b2c0a000 r--p 00007000 08:01 1234567 /usr/bin/cat
55a3b2c0a000-55a3b2c0b000 r--p 00009000 08:01 1234567 /usr/bin/cat
55a3b2c0b000-55a3b2c0c000 rw-p 0000a000 08:01 1234567 /usr/bin/cat
55a3b3e00000-55a3b3e21000 rw-p 00000000 00:00 0 [heap]
7f8a12000000-7f8a12200000 r--p 00000000 08:01 2345678 /usr/lib/libc.so.6
...
7ffd4a100000-7ffd4a122000 rw-p 00000000 00:00 0 [stack]
That is a real, live view of a process's memory. Every process on your system has one. Including the ones you write.
The flat address space illusion
Your program believes it has a simple, flat stretch of memory from address 0 all the way up.
On a 64-bit system, addresses are 64 bits wide, but on x86-64 only the low 48 bits are actually used (CPUs with 5-level paging extend this to 57):
0x0000_0000_0000_0000 ← Bottom of address space
|
| User space (your program lives here)
|
0x0000_7FFF_FFFF_FFFF ← Top of user space
|
| (non-canonical gap — addresses here cause a fault)
|
0xFFFF_8000_0000_0000 ← Bottom of kernel space
|
| Kernel space (off-limits to your code)
|
0xFFFF_FFFF_FFFF_FFFF ← Top of address space
Your program gets the lower half. The kernel takes the upper half. Try to touch kernel space and the CPU itself — not the OS, the hardware — blocks you.
🧠 What do you think happens?
If you write `*(int *)0xFFFF800000000000 = 42;` in a C program, what happens? Does the compiler stop you? Does the OS stop you? Does the CPU stop you? Try it and observe the error message.
See the layout with your own code — C
This program prints the address of something from each major memory region:
// save as memview.c — compile: gcc -o memview memview.c
#include <stdio.h>
#include <stdlib.h>
int global_var = 42; // .data section
int main() {
int stack_var = 7; // stack
int *heap_var = malloc(64); // heap
printf("Code (main): %p\n", (void *)main);
printf("Global: %p\n", (void *)&global_var);
printf("Heap: %p\n", (void *)heap_var);
printf("Stack: %p\n", (void *)&stack_var);
free(heap_var);
return 0;
}
Run it:
$ gcc -o memview memview.c && ./memview
Code (main): 0x55a3b2c02149
Global: 0x55a3b2c04010
Heap: 0x55a3b3e002a0
Stack: 0x7ffd4a121a5c
Look at the addresses. Really look at them.
- Code (`0x55a...`): relatively low
- Global (`0x55a...`): right next to code
- Heap (`0x55a3b3...`): higher, but still in the `0x55...` range
- Stack (`0x7ffd...`): way up high, near the top of user space
The layout reveals itself through raw addresses.
Same thing in Rust
// save as memview.rs — compile: rustc -o memview_rs memview.rs
static GLOBAL_VAR: i32 = 42;

fn main() {
    let stack_var: i32 = 7;
    let heap_var = Box::new(64);
    println!("Code (main): {:p}", main as fn() as *const ());
    println!("Global: {:p}", &GLOBAL_VAR as *const i32);
    println!("Heap: {:p}", &*heap_var as *const i32);
    println!("Stack: {:p}", &stack_var as *const i32);
}
$ rustc -o memview_rs memview.rs && ./memview_rs
Code (main): 0x55f8a1005b30
Global: 0x55f8a1009000
Heap: 0x55f8a1a2b9d0
Stack: 0x7ffc3b2e1d44
Same pattern. Same regions. Same operating system underneath. Rust's ownership model is a compile-time concept — at runtime, it's the same address space as C.
Parsing /proc/self/maps
Every line in /proc/self/maps has this format:
address perms offset dev inode pathname
55a3b2c02000-... r-xp 00002000 08:01 12345 /usr/bin/cat
| Column | Meaning |
|---|---|
| `address` | Virtual address range (start-end) |
| `perms` | `r` = read, `w` = write, `x` = execute, `p` = private, `s` = shared |
| `offset` | Offset into the file (for file-backed mappings) |
| `dev` | Device (major:minor) |
| `inode` | Inode number of the file |
| `pathname` | File path, or `[heap]`, `[stack]`, `[vdso]`, etc. |
The permissions tell you what each region is:
- `r-xp` → code (read + execute, no write)
- `rw-p` → data, heap, stack (read + write, no execute)
- `r--p` → read-only data (constants, string literals)
💡 Fun Fact
The `[vdso]` entry stands for "virtual dynamic shared object." It's a tiny piece of kernel code mapped into every process's address space so that certain system calls (like `gettimeofday`) can run without actually entering the kernel. It's a performance trick — the kernel pretends to be a shared library.
The big picture
High addresses
0x7FFF_FFFF_FFFF ┌─────────────────────────┐
│ Stack │ ← grows downward
│ (local vars, frames) │
├─────────────────────────┤
│ │
│ (unmapped gap) │
│ │
├─────────────────────────┤
│ Heap │ ← grows upward
│ (malloc, Box::new) │
├─────────────────────────┤
│ BSS (zeroed globals) │
├─────────────────────────┤
│ Data (init globals) │
├─────────────────────────┤
│ Text (your code) │
0x0000_0000_0000 └─────────────────────────┘
Low addresses
Every program you've ever run has this shape. The next chapter dissects each region in detail.
🔧 Task
Write a C program and a Rust program that each print the address of:
- A function (code region)
- A global/static variable (data region)
- A local variable (stack)
- A heap-allocated value
Run both. Compare the addresses. Do they fall in the same general ranges? Now run `cat /proc/<PID>/maps` for each (use `getpid()` in C or `std::process::id()` in Rust to print the PID, then sleep so you can inspect the maps file before the process exits).

Bonus: Pipe the output of `/proc/self/maps` through your own program. In C:

FILE *f = fopen("/proc/self/maps", "r");
char line[256];
while (fgets(line, sizeof(line), f)) printf("%s", line);
fclose(f);

In Rust:

println!("{}", std::fs::read_to_string("/proc/self/maps").unwrap());
Anatomy of a Process Address Space
Type this right now
// save as regions.c — compile: gcc -g -o regions regions.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int initialized_global = 42; // .data
int uninitialized_global; // .bss
const char *string_lit = "I live in .rodata";
int main() {
int stack_var = 1;
int *heap_var = malloc(64);
printf("--- Memory Regions ---\n");
printf("Text (main): %p\n", (void *)main);
printf("Rodata (string): %p\n", (void *)string_lit);
printf("Data (init): %p\n", (void *)&initialized_global);
printf("BSS (uninit): %p\n", (void *)&uninitialized_global);
printf("Heap (malloc): %p\n", (void *)heap_var);
printf("Stack (local): %p\n", (void *)&stack_var);
printf("PID: %d (inspect /proc/%d/maps)\n", getpid(), getpid());
sleep(30); // time to inspect
free(heap_var);
return 0;
}
Compile and run it. While it sleeps, open another terminal and run `cat /proc/<PID>/maps`. You'll see every region we're about to discuss.
THE Diagram
This is the memory layout of a running process on x86-64 Linux. Commit it to memory.
0xFFFF_FFFF_FFFF_FFFF ┌─────────────────────────────────────────────┐
│ │
│ Kernel Space │
│ (mapped into every process, but you │
│ can't touch it — ring 0 only) │
│ │
0xFFFF_8000_0000_0000 ├─────────────────────────────────────────────┤
│ │
│ (non-canonical address gap) │
│ │
0x0000_7FFF_FFFF_FFFF ├─────────────────────────────────────────────┤
│ │
│ Stack [rw-p] │
│ grows ↓ downward │
│ (local vars, return addrs, saved regs) │
│ │
├ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┤
│ Guard page [---p] (unmapped) │
├─────────────────────────────────────────────┤
│ │
│ Memory-mapped region │
│ (shared libraries: libc.so, ld-linux.so) │
│ (mmap'd files, anonymous mmap) │
│ │
├─────────────────────────────────────────────┤
│ │
│ (gap) │
│ │
├─────────────────────────────────────────────┤
│ │
│ Heap [rw-p] │
│ grows ↑ upward │
│ (malloc, calloc, Box::new, Vec::new) │
│ │
├─────────────────────────────────────────────┤
│ BSS [rw-p] │
│ (uninitialized globals, zeroed at load) │
├─────────────────────────────────────────────┤
│ Data [rw-p] │
│ (initialized globals: int x = 42) │
├─────────────────────────────────────────────┤
│ Rodata [r--p] │
│ (string literals, const arrays) │
├─────────────────────────────────────────────┤
│ Text [r-xp] │
│ (your compiled code — machine instr.) │
0x0000_0000_0000_0000 └─────────────────────────────────────────────┘
Now let's walk through each region from bottom to top.
Text: your compiled code
The .text section holds your program's machine instructions — the compiled output of every
function you wrote.
| Property | Value |
|---|---|
| Permissions | r-xp (read, execute, no write) |
| Source | Loaded from the ELF binary |
| Lifetime | Entire process lifetime |
| Who manages | OS loader maps it from disk |
Read and execute, but not writable. This is enforced by the hardware (page table permissions, including the NX bit). If your code could rewrite itself, every buffer overflow would be an arbitrary code execution exploit. Modern toolchains and loaders therefore apply a W^X policy: a page is mapped either writable or executable, never both. The hardware provides the mechanism (per-page permissions); the OS chooses to use it, and JIT compilers deliberately relax it for their code buffers.
🧠 What do you think happens?
What if you cast a function pointer to `int *` and try to write to it?

int *p = (int *)main;
*p = 0x90909090;  // NOP sled?

Try it. The CPU will raise a fault before the write completes.
Data: initialized globals
int answer = 42; // C: goes in .data
static ANSWER: i32 = 42;  // Rust: goes in .data
This section holds global and static variables that have explicit initial values. The values are stored in the ELF binary itself — when you hexdump the binary, the bytes `2a 00 00 00` (42 in little-endian) are literally sitting in the file.
| Property | Value |
|---|---|
| Permissions | rw-p (read + write) |
| Source | Values loaded from ELF binary |
| Lifetime | Entire process lifetime |
| Who manages | OS loader |
BSS: uninitialized globals
int counter; // C: goes in .bss (implicitly zero)
static int buffer[4096]; // C: 16KB of zeros — in .bss
static mut COUNTER: i32 = 0;  // Rust: goes in .bss (explicitly zero)
BSS stands for "Block Started by Symbol" — an old assembler directive. What matters: the OS zeroes this memory at load time. The values are not stored in the binary.
| Property | Value |
|---|---|
| Permissions | rw-p (read + write) |
| Source | Zeroed by OS at load time — NOT stored on disk |
| Lifetime | Entire process lifetime |
| Who manages | OS loader |
💡 Fun Fact
If you declare
static int bigarray[1000000];in C, your binary does NOT grow by 4MB. The ELF file just records "I need 4,000,000 bytes of BSS." The OS allocates and zeroes them when the process starts. This is why BSS exists — it would be absurd to store millions of zeros on disk.
To see the savings yourself:
$ readelf -S regions | grep -E "\.data|\.bss"
[24] .data PROGBITS 0000000000004000 003000 000008 0 WA 0 0 8
[25] .bss NOBITS 0000000000004008 003008 000004 0 WA 0 0 4
Notice .bss is NOBITS. Zero bytes on disk. Full size in memory.
Heap: dynamic allocation
int *p = malloc(100); // C: heap allocation
let p = Box::new(42);   // Rust: heap allocation
let v = vec![1, 2, 3];  // Rust: heap allocation (via Vec)
The heap is where dynamic allocations live. It starts just above BSS and grows upward toward higher addresses.
| Property | Value |
|---|---|
| Permissions | rw-p (read + write) |
| Source | Allocated at runtime via brk or mmap system calls |
| Lifetime | Until explicitly freed (C) or dropped (Rust) |
| Who manages | The allocator (malloc/free), kernel provides pages |
The heap is managed in two layers:
Your code malloc(100) / Box::new(42)
│
▼
Allocator glibc malloc / jemalloc / etc.
(user space) Maintains free lists, splits/merges blocks
│
▼
Kernel brk() for small allocations
mmap() for large allocations (>128KB)
We'll dissect the allocator in Chapter 20. For now, know that malloc doesn't call the kernel
every time — it maintains its own pool.
Stack: function call frames
void foo() {
int x = 10; // lives on the stack
int arr[100]; // 400 bytes on the stack
}
fn foo() {
    let x: i32 = 10;        // lives on the stack
    let arr = [0i32; 100];  // 400 bytes on the stack
}
The stack starts near the top of user space and grows downward toward lower addresses.
| Property | Value |
|---|---|
| Permissions | rw-p (read + write, no execute) |
| Source | Allocated by the OS when the process starts |
| Lifetime | Until the function returns |
| Who manages | The CPU (rsp register), compiler (frame layout) |
| Size limit | Default 8MB (ulimit -s) |
Every function call pushes a frame onto the stack: return address, saved registers, local
variables. Every return pops it. The stack pointer (rsp) moves up and down — that's it. No
allocator, no free lists, no fragmentation. One register, one instruction to allocate, one
instruction to free.
That's why the stack is fast.
Memory-mapped regions
Between the heap and the stack, you'll find memory-mapped regions. These include:
- Shared libraries:
libc.so,ld-linux-x86-64.so,libpthread.so - Anonymous mappings: large
malloccalls (>128KB) usemmapinstead ofbrk - File mappings:
mmap()can map a file directly into your address space
7f8a12000000-7f8a12200000 r--p /usr/lib/x86_64-linux-gnu/libc.so.6
7f8a12200000-7f8a12395000 r-xp /usr/lib/x86_64-linux-gnu/libc.so.6
7f8a12395000-7f8a123ed000 r--p /usr/lib/x86_64-linux-gnu/libc.so.6
Notice libc has multiple entries — different sections (code, read-only data, writable data) are mapped with different permissions. Same library, different protection levels.
Kernel space: here be dragons
0xFFFF_8000_0000_0000 and above
The top half of the address space is reserved for the kernel. It's mapped into every process's
page table, but the page table entries are marked supervisor only. The CPU checks your current
privilege level (ring 3 for user code) against the page permissions (ring 0 required for kernel
pages). If you try to access kernel memory from user code, the CPU raises a page fault. The kernel
handles it by sending your process a SIGSEGV.
You interact with kernel space only through system calls — read, write, mmap, brk. Those
switch the CPU to ring 0, run kernel code, then switch back. That boundary is absolute.
Rust: same layout, different guarantees
Here's the key insight: Rust programs have the exact same memory layout as C programs.
static GLOBAL: i32 = 42;  // .data — same as C
static UNINIT: std::sync::atomic::AtomicI32 =
    std::sync::atomic::AtomicI32::new(0);  // .bss (zero-initialized)

fn main() {                    // .text — same as C
    let local = 10;            // stack — same as C
    let boxed = Box::new(20);  // heap — same as C
}
Ownership, borrowing, lifetimes — they exist only at compile time. The generated machine code uses
the same stack, the same heap, the same text/data/bss sections. rustc doesn't invent a new memory
model. It enforces rules about how you use the one that already exists.
💡 Fun Fact
You can link Rust and C code together into a single binary. They share the same address space, the same heap, the same stack. A Rust function can call a C function (via `extern "C"`) and the stack frames interleave seamlessly. There's no boundary at runtime — only at compile time.
🔧 Task
Write a program (in C or Rust — or both) that places data in every region:
- A function → text
- An initialized global → data
- An uninitialized global → BSS
- A string literal → rodata
- A `malloc` / `Box::new` → heap
- A local variable → stack

Print the address of each. Then, while the program is sleeping, run:

$ cat /proc/<PID>/maps

For each printed address, find the corresponding line in the maps output. Verify:

- The address falls within the range on that line
- The permissions match what you'd expect (code is `r-xp`, globals are `rw-p`, etc.)
- The pathname column tells you whether it's from your binary, a library, or anonymous

Bonus: Use `readelf -S ./regions` to list all sections. Find `.text`, `.data`, `.bss`, and `.rodata`. Compare their sizes with what you'd predict from your code.
The Stack: Disciplined and Fast
Type this right now
// save as stacktrace.c — compile: gcc -g -o stacktrace stacktrace.c
#include <stdio.h>
void c() {
int z = 3;
printf("c: &z = %p\n", (void *)&z);
}
void b() {
int y = 2;
printf("b: &y = %p\n", (void *)&y);
c();
}
void a() {
int x = 1;
printf("a: &x = %p\n", (void *)&x);
b();
}
int main() {
a();
return 0;
}
$ gcc -g -o stacktrace stacktrace.c && ./stacktrace
a: &x = 0x7ffd3a1b1c5c
b: &y = 0x7ffd3a1b1c3c
c: &z = 0x7ffd3a1b1c1c
Each address is lower than the last. The stack grows downward. Every function call you've ever made used this mechanism. Let's see exactly how.
What is a stack frame?
When you call a function, the CPU creates a stack frame — a block of memory on the stack that holds everything that function needs:
High addresses (top of stack at process start)
┌──────────────────────────┐
│ main's frame │
│ ┌────────────────────┐ │
│ │ (main's locals) │ │
│ │ saved rbp │ │
│ │ return address │ │ ← where to go when main returns
│ └────────────────────┘ │
├──────────────────────────┤
│ a's frame │
│ ┌────────────────────┐ │
│ │ x = 1 │ │ [rbp - 4]
│ │ saved rbp ─────────│──┘ points to main's rbp
│ │ return address │ addr in main after call a()
│ └────────────────────┘ │
├──────────────────────────┤
│ b's frame │
│ ┌────────────────────┐ │
│ │ y = 2 │ │ [rbp - 4]
│ │ saved rbp ─────────│──┘ points to a's rbp
│ │ return address │ addr in a after call b()
│ └────────────────────┘ │
├──────────────────────────┤
│ c's frame │
│ ┌────────────────────┐ │
│ │ z = 3 │ │ [rbp - 4]
│ │ saved rbp ─────────│──┘ points to b's rbp
│ │ return address │ addr in b after call c()
rsp ──────► │ └────────────────────┘ │
└──────────────────────────┘
Low addresses (stack grows this direction)
Two registers drive the whole thing:
- `rsp` (stack pointer): always points to the top of the stack (the lowest used address)
- `rbp` (base pointer): points to the base of the current frame — a stable reference point
How call and ret work
When the CPU executes call some_function:
call some_function
; is equivalent to:
push rip ; push address of next instruction onto stack
jmp some_function ; jump to the function
When the CPU executes ret:
ret
; is equivalent to:
pop rip ; pop address from stack into instruction pointer
; execution continues at that address
That's it. call = push return address + jump. ret = pop return address + jump back.
The function prologue and epilogue
Every compiled function begins with a prologue and ends with an epilogue. This is the bookkeeping that creates and destroys stack frames.
Prologue (at the start of every function):
push rbp ; save caller's base pointer
mov rbp, rsp ; set our base pointer to current stack top
sub rsp, N ; reserve N bytes for local variables
Epilogue (at the end of every function):
leave ; equivalent to: mov rsp, rbp; pop rbp
ret ; pop return address, jump to it
Let's trace through calling a():
Before call a():
rsp → [... main's stuff ...]
1. call a → push return address, jump to a
rsp → [ret_addr | ... main's stuff ...]
2. push rbp → save main's rbp
rsp → [main_rbp | ret_addr | ... main's stuff ...]
3. mov rbp, rsp → rbp now points here
rbp → [main_rbp | ret_addr | ... main's stuff ...]
4. sub rsp, 16 → reserve space for locals
rsp → [ | x = ? | padding | main_rbp | ret_addr | ...]
^^^^^^^^^^^^^^^^
a's local variables
🧠 What do you think happens?
If you never execute the epilogue — say, you `longjmp` out of a function or a signal handler interrupts you — what happens to the stack? The space is still "allocated" (`rsp` was never restored). Think about what this means for stack usage in signal handlers.
Local variables: just offsets from rbp
The compiler doesn't give your local variables names at runtime. They're just offsets:
void example() {
int a = 10; // [rbp - 4]
int b = 20; // [rbp - 8]
long c = 30; // [rbp - 16]
}
The assembly:
example:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 10 ; a
mov DWORD PTR [rbp-8], 20 ; b
mov QWORD PTR [rbp-16], 30 ; c
leave
ret
No names. No types. Just memory locations relative to rbp. The "variable" is a fiction that
exists only in your source code and your debugger's symbol table.
C: extra stack tricks
C gives you a few extra ways to use the stack:
alloca — allocate on the stack dynamically:
#include <alloca.h>
void risky() {
int n = 1000;
char *buf = alloca(n); // n bytes on the stack, freed on return
buf[0] = 'A';
}
This just moves rsp down by n bytes. No malloc, no free, no heap involvement. Fast, but
dangerous — if n is too large, you blow through the stack limit with no warning.
Variable-Length Arrays (VLAs):
void process(int n) {
int arr[n]; // VLA: n elements on the stack
arr[0] = 42;
}
Same idea as alloca — the compiler adjusts rsp at runtime. Same risks.
Rust: stack discipline
Rust uses the same stack mechanics but gives you fewer footguns:
#![allow(unused)] fn main() { fn example() { let a: i32 = 10; // stack — same as C let b = [0u8; 100]; // stack — fixed-size array, size known at compile time let c = Box::new(42); // heap — when you need dynamic or large allocation // no alloca, no VLAs } }
Rust doesn't have alloca or VLAs. If you need a runtime-sized buffer, you use Vec<T>, which
allocates on the heap. This is a deliberate choice — the Rust team decided that stack overflows
from unchecked dynamic stack allocation are too dangerous to allow in safe code.
If you really need stack allocation with a runtime size, you'd use a crate or unsafe code.
But the standard path is: fixed sizes on the stack, dynamic sizes on the heap.
Stack size and limits
Check your current stack size limit:
$ ulimit -s
8192
That's 8192 KB = 8 MB. This is a per-thread limit. Every thread gets its own stack.
What this means in practice:
8 MB = 8,388,608 bytes
int takes 4 bytes
Maximum ints on the stack ≈ 2,000,000
But stack frames have overhead (return address, saved rbp, alignment),
and you have many nested function calls, so the practical limit is lower.
You can change it:
$ ulimit -s 16384 # 16 MB stack
$ ulimit -s unlimited # no limit (dangerous)
💡 Fun Fact
Threads created with `pthread_create` get their own stack, and the default size varies by libc: glibc typically uses the `ulimit -s` value (often 8MB), musl defaults to 128KB, and macOS gives secondary threads 512KB. You can configure it with `pthread_attr_setstacksize`. Rust threads default to 2MB; `std::thread::Builder` exposes `stack_size()` for the same purpose.
Guard pages: the safety net
The OS doesn't just give you 8MB of stack and hope for the best. It places a guard page at the bottom of the stack — an unmapped page (usually 4KB) that exists solely to catch stack overflows:
┌───────────────────────┐
│ Stack │ 8 MB of rw-p pages
│ (used portion) │
│ │
│ (unused portion) │
├───────────────────────┤
rsp could │ Guard Page │ ---p (no permissions)
reach here → │ (unmapped, 4KB) │
├───────────────────────┤
│ Other mappings... │
└───────────────────────┘
When rsp drops into the guard page, the CPU triggers a page fault. The kernel sees that the
faulting address is in the guard region and sends SIGSEGV to the process. Game over.
Stack overflow: what really happens
// save as overflow.c — compile: gcc -o overflow overflow.c
#include <stdio.h>
void recurse(int depth) {
char buffer[4096]; // 4KB per frame, eats the stack fast
buffer[0] = 'A'; // touch it so compiler doesn't optimize it away
printf("depth: %d, &buffer = %p\n", depth, (void *)buffer);
recurse(depth + 1);
}
int main() {
recurse(0);
return 0;
}
$ ./overflow
depth: 0, &buffer = 0x7ffd3a1b0c50
depth: 1, &buffer = 0x7ffd3a1afbe0
depth: 2, &buffer = 0x7ffd3a1aeb70
...
depth: 2040, &buffer = 0x7ffd39924f50
depth: 2041, &buffer = 0x7ffd39923ee0
Segmentation fault (core dumped)
Each frame is ~4KB. 8MB / 4KB = ~2048 frames. The numbers check out.
The same thing in Rust:
fn recurse(depth: u32) {
    let buffer = [0u8; 4096];
    // black_box prevents the compiler from optimizing the buffer away
    std::hint::black_box(&buffer);
    println!("depth: {}, &buffer = {:p}", depth, &buffer);
    recurse(depth + 1);
}

fn main() {
    recurse(0);
}
Rust's behavior is the same — a stack overflow triggers a SIGSEGV. The Rust runtime catches it
and prints a friendlier message (thread 'main' has overflowed its stack), but the underlying
mechanism is identical: rsp hits the guard page, CPU faults, kernel delivers a signal.
Why the stack is fast
Let's compare stack allocation to heap allocation:
Stack allocation:
sub rsp, 32 ; one instruction — done
Stack deallocation:
add rsp, 32 ; one instruction — done
(or just: leave / ret)
Heap allocation:
call malloc ; enters allocator
→ search free lists ; find a suitable block
→ maybe call brk() ; ask kernel for more memory
→ update metadata ; bookkeeping
→ return pointer ; finally done
Heap deallocation:
call free ; enters allocator
→ validate pointer ; is this a real allocation?
→ coalesce blocks ; merge with adjacent free blocks
→ update free list ; bookkeeping
→ maybe call munmap ; return pages to kernel
The stack is also always "hot" in the CPU cache — the top of the stack was accessed by the previous instruction, so it's almost certainly in L1 cache. Heap allocations can be scattered across memory, causing cache misses.
💡 Fun Fact
The x86
pushinstruction does two things in one: it decrementsrspby 8 and writes the value to[rsp]. It's so common that the CPU has special hardware to accelerate it. On modern Intel CPUs,pushcan execute with the same throughput as a simplemov— the stack engine handles therspupdate for free.
Inspecting frames with GDB
You can see the actual stack frames in a debugger:
$ gcc -g -o stacktrace stacktrace.c
$ gdb ./stacktrace
(gdb) break c
(gdb) run
Breakpoint 1, c () at stacktrace.c:5
(gdb) backtrace
#0 c () at stacktrace.c:5
#1 b () at stacktrace.c:12
#2 a () at stacktrace.c:18
#3 main () at stacktrace.c:22
(gdb) info frame
Stack level 0, frame at 0x7fffffffde30:
rip = 0x555555555149 in c (stacktrace.c:5);
saved rip = 0x555555555185
called by frame at 0x7fffffffde50
...
(gdb) info registers rsp rbp
rsp 0x7fffffffde10
rbp 0x7fffffffde20
The backtrace command walks the chain of saved rbp values — each rbp points to the previous
frame's rbp, forming a linked list up the stack. That's how your debugger reconstructs the call
chain.
🔧 Task
Write a recursive function in C (and Rust) that prints its depth and the address of a local variable at each level. Run it and observe:
- Addresses decrease (stack grows down)
- Eventually it crashes (stack overflow)
- The number of frames matches
ulimit -s/ frame_sizeBefore running, calculate: with
ulimit -s 8192(8MB) and a 4096-byte local array per frame, how many levels of recursion should you get? Run it and compare.Try
ulimit -s 16384and run again. Do you get roughly twice as many frames?Check
dmesg | tailafter the crash — you should see something like:[12345.678] overflow[9999]: segfault at 7ffd39920ff0 ip ... sp ... error 6The kernel logged the fault address. Confirm it's near the bottom of the stack region shown in
/proc/<PID>/maps.
The Heap: Flexible and Dangerous
Type this right now
// save as heap_intro.c — compile: gcc -o heap_intro heap_intro.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h> // for strstr
int main() {
int *p = malloc(100);
printf("malloc(100) returned: %p\n", (void *)p);
printf("\nProcess maps (heap region):\n");
FILE *f = fopen("/proc/self/maps", "r");
char line[256];
while (fgets(line, sizeof(line), f)) {
if (strstr(line, "[heap]")) printf(" %s", line);
}
fclose(f);
free(p);
return 0;
}
$ gcc -o heap_intro heap_intro.c && ./heap_intro
malloc(100) returned: 0x55a8b7e002a0
Process maps (heap region):
55a8b7e00000-55a8b7e21000 rw-p 00000000 00:00 0 [heap]
You asked for 100 bytes. The kernel gave the allocator 135,168 bytes (0x21000 = 132KB). That discrepancy is the whole story of this chapter.
The two sides of dynamic allocation
C — explicit control, explicit danger:
int *p = malloc(100); // allocate 100 bytes
*p = 42; // use it
free(p); // you MUST remember to do this
Rust — same allocation, automatic cleanup:
let p = Box::new(42);  // allocate on heap
println!("{}", *p);    // use it
// p is dropped here — free() called automatically
Vec, String, Box — every Rust heap type calls the allocator underneath. The difference is
that Rust's ownership system guarantees free happens exactly once, at the right time.
The allocator as middleman
Your code never talks to the kernel directly for heap memory. There's a middleman:
┌─────────────────┐
│ Your Code │ malloc(100) / Box::new(42)
│ │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Allocator │ glibc malloc / jemalloc / mimalloc
│ (user space) │ Maintains free lists, bins, arenas
│ │ Splits large blocks, coalesces freed ones
└────────┬────────┘
│ Needs more memory?
▼
┌─────────────────┐
│ Kernel │ brk() — move the program break
│ │ mmap() — map new anonymous pages
└─────────────────┘
The allocator requests large chunks from the kernel, then carves them up to serve your small
malloc calls. This amortizes the cost of system calls — calling the kernel is expensive (hundreds
of cycles), but adjusting a pointer in a free list is cheap (a few cycles).
brk and sbrk: the old way
The simplest heap mechanism is brk/sbrk. The kernel maintains a pointer called the
program break — the end of the data segment. You can move it:
Before any allocation:
┌──────────────┐
│ Text │
│ Data │
│ BSS │
└──────┬───────┘
│ ← program break (brk)
│
▼ (unmapped)
After sbrk(4096):
┌──────────────┐
│ Text │
│ Data │
│ BSS │
├──────────────┤
│ Heap (4KB) │ ← new memory
└──────┬───────┘
│ ← program break moved up by 4096
▼ (unmapped)
#include <stdio.h>
#include <unistd.h>

int main(void) {
    void *old_brk = sbrk(0);   // get current break
    sbrk(4096);                // move break up by 4096 bytes
    void *new_brk = sbrk(0);   // verify
    printf("grew by: %ld\n", (long)((char *)new_brk - (char *)old_brk));
    return 0;
}
Simple. Contiguous. But limited — you can only grow or shrink from one end. You can't free memory in the middle and return it to the OS.
mmap anonymous: the modern way
For large allocations (glibc's threshold is typically 128KB), the allocator skips brk entirely
and asks the kernel for pages directly:
#include <sys/mman.h>
void *p = mmap(NULL, 1048576, // 1 MB
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS,
-1, 0);
// use p...
munmap(p, 1048576); // return to kernel
mmap returns memory from anywhere in the address space — it doesn't need to be contiguous with
the heap. And when you munmap, the pages go back to the kernel immediately. No fragmentation
problem at the kernel level.
💡 Fun Fact
When you `malloc(1000000)` in glibc (about 1MB), it doesn't use `brk`. It calls `mmap` internally. You can observe this with `strace`:

$ strace -e brk,mmap ./your_program
mmap(NULL, 1003520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f...

Small allocations use `brk`. Large ones use `mmap`. The crossover point is tunable (`mallopt(M_MMAP_THRESHOLD, ...)`).
Fragmentation: the heap's worst enemy
Here's where the heap gets ugly. Watch this allocation sequence:
1. Allocate A (32 bytes), B (64 bytes), C (32 bytes):
┌────┬──────────┬────┐
│ A │ B │ C │
│ 32 │ 64 │ 32 │
└────┴──────────┴────┘
2. Free B:
┌────┬──────────┬────┐
│ A │ (free) │ C │
│ 32 │ 64 │ 32 │
└────┴──────────┴────┘
3. Try to allocate D (96 bytes):
┌────┬──────────┬────┬────────────┐
│ A │ (free) │ C │ D │
│ 32 │ 64 │ 32 │ 96 │
└────┴──────────┴────┴────────────┘
^ ^^^^^^^^
│ 64 bytes wasted — big enough
│ for some things, but not for D
The 64-byte hole can never be used for anything that needs more than 64 bytes. Over time, your heap becomes Swiss cheese — lots of small holes, no large contiguous blocks. Total free memory might be 10MB, but the largest contiguous block is 64 bytes.
This is external fragmentation, and it's one of the hardest problems in allocator design.
There's also internal fragmentation: you ask for 100 bytes, the allocator gives you 128 (rounded up for alignment). Those 28 extra bytes are wasted.
Internal fragmentation:
┌──────────────────────┐
│ Your 100 bytes │ pad │ ← allocator gives you a 128-byte block
│ │ 28 │ 28 bytes wasted inside the block
└──────────────────────┘
C's four deadly heap bugs
The heap in C is a minefield. Here are the four classic ways to corrupt it, each with code and a diagram.
1. Use-after-free
int *p = malloc(sizeof(int));
*p = 42;
free(p); // p is now a dangling pointer
*p = 99; // BUG: writing to freed memory
After malloc: After free: After *p = 99:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ p → [42] │ allocated │ p → [??] │ freed │ p → [99] │ CORRUPTED
└──────────┘ └──────────┘ └──────────┘
Allocator may have Someone else may
recycled this block own this memory now
The pointer p still holds the old address, but that memory may now belong to a different
allocation. Writing to it silently corrupts someone else's data. This is the most common source
of security vulnerabilities in C programs.
2. Double-free
int *p = malloc(sizeof(int));
free(p);
free(p); // BUG: freeing the same block twice
Free list before: After free(p): After free(p) again:
HEAD → [blk_A] HEAD → [p] → [blk_A] HEAD → [p] → [p] → [blk_A]
^^^^^^^^^^^^
Circular! Allocator metadata
is now corrupted.
Double-free corrupts the allocator's internal free list. The next two malloc calls may return
the same pointer, causing two parts of your program to stomp on each other's data.
3. Buffer overflow (heap)
int *p = malloc(10 * sizeof(int)); // 40 bytes
for (int i = 0; i <= 10; i++) { // BUG: off-by-one, writes 11 elements
p[i] = i;
}
Heap memory layout:
┌──────────────────────────────┬──────────────────────┐
│ p's allocation (40 bytes) │ allocator metadata │
│ [0][1][2]...[9] │ [size|flags|next] │
└──────────────────────────────┴──────────────────────┘
^^^^^^^^^^^^^^^^
p[10] writes HERE
Corrupts the allocator's
bookkeeping for the next block
You wrote past the end of your allocation. The bytes that follow are typically allocator metadata
for the next block — size, flags, free-list pointers. Corrupting them means the next malloc or
free behaves unpredictably. The crash often happens far from the bug, making it nightmarish
to debug.
4. Memory leak
void process_data() {
int *data = malloc(1024 * 1024); // 1 MB
// ... use data ...
return; // BUG: forgot to call free(data)
}
After calling process_data() 1000 times:
Heap:
┌──────┬──────┬──────┬──────┬──────┬──────┬─── ─┬──────┐
│ 1 MB │ 1 MB │ 1 MB │ 1 MB │ 1 MB │ 1 MB │ ... │ 1 MB │
│ LOST │ LOST │ LOST │ LOST │ LOST │ LOST │ LOST │ LOST │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
← 1 GB of memory, no way to free it →
Process RSS grows without bound. Eventually: OOM killer.
No pointer to the memory means no way to free it. The memory is allocated but unreachable. In a long-running server, this is death by a thousand cuts.
Rust prevents all four
Here's the remarkable thing: Rust's ownership system makes all four bugs compile-time errors.
Use-after-free: impossible
let p = Box::new(42);
drop(p);                // explicitly free
println!("{}", *p);     // COMPILE ERROR: use of moved value `p`
error[E0382]: borrow of moved value: `p`
--> src/main.rs:4:20
|
2 | let p = Box::new(42);
| - move occurs because `p` has type `Box<i32>`
3 | drop(p);
| - value moved here
4 | println!("{}", *p);
| ^^ value borrowed here after move
After drop, the compiler knows p is gone. You can't use it. Period.
Double-free: impossible
let p = Box::new(42);
drop(p);
drop(p);   // COMPILE ERROR: use of moved value `p`
Same error. After the first drop, p is moved. You can't drop it again.
Buffer overflow: caught at runtime
let v = vec![0; 10];
println!("{}", v[10]);  // RUNTIME PANIC: index out of bounds
thread 'main' panicked at 'index out of bounds: the len is 10 but the index is 10'
Rust bounds-checks every array and vector access in safe code. You can opt out with
get_unchecked() in unsafe blocks, but the default is safety.
Memory leak: prevented by Drop
fn process_data() {
    let data = vec![0u8; 1024 * 1024];  // 1 MB on the heap
    // ... use data ...
}   // data is dropped here — Vec's Drop impl calls free()
When data goes out of scope, Rust calls its Drop implementation, which frees the heap memory.
You didn't write free. You didn't need to remember. The compiler inserted it for you.
🧠 What do you think happens?
Rust makes leaking memory safe — you can intentionally leak with `Box::leak()` or `mem::forget()`. Why would Rust allow this? Because leaking memory doesn't cause undefined behavior — it's wasteful, but it can't corrupt other data or violate memory safety. The safety guarantee is about soundness, not resource efficiency.
Side by side: the bug and the save
| Bug | C code | What happens | Rust equivalent | What happens |
|---|---|---|---|---|
| Use-after-free | free(p); *p = 1; | Silent corruption | drop(p); *p = 1; | Compile error |
| Double-free | free(p); free(p); | Allocator corruption | drop(p); drop(p); | Compile error |
| Buffer overflow | p[11] on size-10 | Silent write past end | v[11] on size-10 | Panic with message |
| Memory leak | Forget to free | Memory grows forever | Scope exit | Drop runs automatically |
Seeing the allocator in action
You can watch malloc call the kernel using strace:
$ strace -e brk,mmap,munmap ./heap_intro
brk(NULL) = 0x55a8b7e00000
brk(0x55a8b7e21000) = 0x55a8b7e21000
...
That first brk(NULL) queries the current program break. The second moves it up by 0x21000 bytes
(132KB). That's the allocator requesting its initial arena from the kernel — even though you only
asked for 100 bytes.
For a larger allocation:
void *big = malloc(256 * 1024); // 256 KB
$ strace -e brk,mmap ./big_alloc
mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f...
256KB > 128KB threshold, so malloc chose mmap instead of brk.
Heap metadata: what the allocator tracks
Every malloc block has hidden metadata. The allocator stores bookkeeping right next to your
data:
What malloc(100) actually looks like in memory:
┌─────────────────┬──────────────────────────────────┐
│ Chunk header │ Your 100 bytes (user data) │
│ (16 bytes) │ │
│ size | flags │ │
└─────────────────┴──────────────────────────────────┘
^ ^
│ └── pointer returned by malloc
└── actual start of the allocation
The pointer you get back is PAST the header.
This is why buffer overflows are so catastrophic — write past your allocation and you corrupt the next chunk's header. The allocator trusts its own metadata. If you corrupt it, all bets are off.
💡 Fun Fact
glibc's malloc stores the size of the current chunk AND uses the low bits of the size field as flags (since chunks are always aligned to 16 bytes, the bottom 4 bits are free for flags). One bit indicates "previous chunk is in use." This clever trick is why glibc malloc can coalesce adjacent free blocks efficiently — but it's also why heap corruption is so devastating.
Valgrind: your heap safety net in C
Since C can't catch heap bugs at compile time, you use runtime tools. Valgrind is the classic:
// save as uaf.c — compile: gcc -g -o uaf uaf.c
#include <stdlib.h>
int main() {
int *p = malloc(sizeof(int));
*p = 42;
free(p);
*p = 99; // use-after-free
return 0;
}
$ valgrind ./uaf
==12345== Invalid write of size 4
==12345== at 0x401156: main (uaf.c:7)
==12345== Address 0x4a47040 is 0 bytes inside a block of size 4 free'd
==12345== at 0x483CA3F: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12345== by 0x40114D: main (uaf.c:6)
Valgrind tells you exactly what happened, where, and when the block was freed. It's the closest C gets to Rust's compile-time guarantees — but it only catches bugs you actually trigger at runtime, and it makes your program 20-50x slower.
Rust's allocator: same mechanism, different interface
Rust uses the system allocator by default (glibc malloc on Linux). Box::new(42) calls malloc
internally. drop(box_value) calls free.
Rust C equivalent
─────────────────────────────────────────────
Box::new(42) malloc(4); *p = 42;
Vec::with_capacity(100) malloc(100 * elem_size)
String::from("hello") malloc(5); memcpy(...)
drop(x) free(x)
Same allocator. Same brk/mmap calls. Same heap metadata. The only difference is that Rust's
type system ensures you call free exactly once, at exactly the right time.
You can even swap in a different allocator in Rust:
use std::alloc::System;

#[global_allocator]
static GLOBAL: System = System;  // use the system allocator explicitly
// or use jemalloc, mimalloc, etc. via crates
🔧 Task
Write a C program that demonstrates use-after-free:
int *p = malloc(sizeof(int));
*p = 42;
free(p);
printf("After free: %d\n", *p);  // What prints?
*p = 99;

Compile and run it normally. Does it crash? (Probably not — that's what makes it dangerous.) Now run it under Valgrind: `valgrind ./uaf`. Observe the report.

Try the same pattern in Rust:

let p = Box::new(42);
drop(p);
println!("{}", *p);

Does it compile? Read the compiler error carefully — it tells you exactly what went wrong.
Use `strace -e brk,mmap` on a program that does:

- `malloc(100)` — observe `brk`
- `malloc(256 * 1024)` — observe `mmap`

Why the different system calls?
Challenge: Write a C program that allocates and frees memory in a pattern that creates fragmentation. Allocate 1000 blocks of 1KB, free the even-numbered ones, then try to allocate a single 500KB block. Does it fit in the holes? Use `/proc/self/maps` to observe the heap size before and after.
Global Data and Read-Only Memory
Type this right now
// save as globals.c — compile: gcc -o globals globals.c
#include <stdio.h>
#include <signal.h>
int initialized = 42; // .data
int uninitialized; // .bss
const int constant = 99; // .rodata
const char *greeting = "hello"; // pointer in .data, string in .rodata
int main() {
printf(".data (initialized): %p value=%d\n",
(void *)&initialized, initialized);
printf(".bss (uninitialized): %p value=%d\n",
(void *)&uninitialized, uninitialized);
printf(".rodata (constant): %p value=%d\n",
(void *)&constant, constant);
printf(".rodata (string): %p value=\"%s\"\n",
(void *)greeting, greeting);
return 0;
}
$ gcc -o globals globals.c && ./globals
.data (initialized): 0x55d3a7004010 value=42
.bss (uninitialized): 0x55d3a7004018 value=0
.rodata (constant): 0x55d3a7002004 value=99
.rodata (string): 0x55d3a7002008 value="hello"
Notice: .data and .bss addresses are close together (both writable data). .rodata addresses
are lower, near the code. The layout matches the diagram from Chapter 6.
.data: initialized globals
The .data section holds global and static variables with explicit non-zero initial values.
// C
int counter = 100; // .data — stored in the binary
static int module_state = 5; // .data — file-scoped, still in binary
// Rust
static COUNTER: i32 = 100;      // .data
static MODULE_STATE: i32 = 5;   // .data
These values are baked into the ELF binary. When you hexdump the binary, you'll find the bytes
64 00 00 00 (100 in little-endian) sitting right there in the file. The OS loader copies them
into memory at process start.
ELF binary on disk:
┌────────────────────────────────────────────┐
│ ... ELF header ... │ .text │ .rodata │ .data: [64 00 00 00] [05 00 00 00] │
└────────────────────────────────────────────┘
^^^^^^^^^^^^^^^^^^
counter = 100, module_state = 5
literally stored in the file
Permissions: rw- — readable and writable, because your program modifies globals at runtime.
.bss: the zero optimization
// C
int zeroed_global; // .bss — implicitly zero in C
static int big_buffer[100000]; // .bss — 400,000 bytes of zeros
// Rust
static ZEROED: i32 = 0;                          // .bss (compiler sees it's zero)
static BIG_BUFFER: [i32; 100000] = [0; 100000];  // .bss
Here's the key insight: the .bss section takes zero bytes on disk.
With .data, storing 400,000 bytes of zeros:
┌──────────────────────────────────────────────────┐
│ ELF header │ .text │ .data: [00 00 00 ... 400KB] │
└──────────────────────────────────────────────────┘
Binary size: ~400 KB larger
With .bss:
┌──────────────────────────────────────────┐
│ ELF header │ .text │ .bss header: "400000 bytes needed" │
└──────────────────────────────────────────┘
Binary size: just a few extra bytes for the header
The ELF file records "I need 400,000 bytes of BSS" but doesn't store the actual zeros. At load
time, the OS allocates the memory and zeroes it. This is why .bss exists as a separate section —
it's a size optimization that matters enormously for real programs.
💡 Fun Fact
The name "BSS" comes from an old IBM 704 assembler directive: "Block Started by Symbol." It's been used since the 1950s. Seventy years later, your Linux kernel still uses the same concept to avoid storing zeros on disk.
Verify it yourself:
$ readelf -S globals | grep -E "\.data|\.bss"
[24] .data PROGBITS 0000000000004000 003000 000010 00 WA 0 0 8
[25] .bss NOBITS 0000000000004010 003010 000004 00 WA 0 0 4
.data has type PROGBITS (actual bits in the program file). .bss has type NOBITS (nothing
on disk).
.rodata: string literals and constants
// C
const char *msg = "Hello, world!"; // the string is in .rodata
const int lookup[] = {1, 1, 2, 3, 5, 8, 13, 21}; // .rodata
// Rust
let msg: &str = "Hello, world!";                      // string literal → .rodata
const LOOKUP: [i32; 8] = [1, 1, 2, 3, 5, 8, 13, 21];  // may be in .rodata
The .rodata section holds data that should never be modified: string literals, constant arrays,
jump tables for switch statements. Its permissions are r-- — readable, not writable, not
executable.
This leads to one of C's most surprising crashes:
char *s = "hello"; // s points to .rodata
s[0] = 'H'; // SIGSEGV! Writing to read-only memory
Memory layout:
.rodata (r-- permissions):
┌─────────────────────────┐
│ 'h' 'e' 'l' 'l' 'o' \0 │ ← read-only page
└─────────────────────────┘
^
s points here
s[0] = 'H' → CPU tries to write to a read-only page
→ MMU raises a page fault
→ Kernel sends SIGSEGV
→ Program crashes
The string "hello" is a literal — it lives in .rodata, which is mapped read-only. The
pointer s is in .data (it's a writable global), but the string it points to is not writable.
Compare with:
char s[] = "hello"; // s is a char ARRAY on the stack — copy of the string
s[0] = 'H'; // Fine! The array is writable stack memory
🧠 What do you think happens?
In C, what's the difference between `char *s = "hello"` and `char s[] = "hello"`? The first is a pointer to read-only memory. The second is a stack array initialized with a copy of the string. One crashes when you modify it, the other works fine. The syntax looks almost identical, but the memory layout is completely different.
Rust: no surprises here
In Rust, string literals are &'static str — a reference to data in .rodata. The type system
makes the immutability obvious:
let s: &str = "hello";  // immutable reference to .rodata
// s is &str — there's no way to get a &mut str from a string literal
// The type TELLS you it's read-only
You can't accidentally modify a string literal in Rust because &str is an immutable reference.
The compiler won't let you write through it. The C trap simply doesn't exist.
Rust's global variable story
Rust has several ways to declare global data, each with different properties:
// 1. const — compile-time constant, inlined at every use site
const MAX_SIZE: usize = 1024;
// Not a memory location — the value 1024 is copied wherever MAX_SIZE appears

// 2. static — true global variable, has a fixed address in .data or .bss
static COUNTER: i32 = 0;
// Cannot be mutated without unsafe or interior mutability

// 3. static mut — mutable global, requires unsafe to access
static mut DANGER: i32 = 0;
unsafe { DANGER += 1; }  // You asked for it

// 4. LazyLock — initialized on first access (like lazy_static!)
use std::sync::LazyLock;
static CONFIG: LazyLock<String> = LazyLock::new(|| {
    std::fs::read_to_string("/etc/myapp.conf").unwrap()
});
// First access initializes it. Thread-safe. No unsafe.
| Kind | Section | Mutable? | Thread-safe? | When initialized |
|---|---|---|---|---|
| `const` | Inlined | No | N/A | Compile time |
| `static` | .data/.bss | No | Yes | Load time |
| `static mut` | .data/.bss | Yes (unsafe) | No | Load time |
| `LazyLock` | .bss + heap | Via interior mutability | Yes | First access |
💡 Fun Fact
`static mut` is so dangerous that the Rust community is actively discussing deprecating it. Every access requires `unsafe`, and it's a common source of data races. The recommended alternatives are `AtomicI32`, `Mutex<T>`, or `LazyLock<T>` — all of which provide safe mutation without `unsafe`.
Seeing sections with readelf
You can inspect exactly what sections your binary contains:
$ gcc -o globals globals.c
$ readelf -S globals
Section Headers:
[Nr] Name Type Address Off Size ES Flg
...
[16] .text PROGBITS 0000000000001060 001060 000185 00 AX
[18] .rodata PROGBITS 0000000000002000 002000 000040 00 A
[24] .data PROGBITS 0000000000004000 003000 000010 00 WA
[25] .bss NOBITS 0000000000004010 003010 000004 00 WA
The flags tell you everything:
- `A` = allocated (loaded into memory)
- `X` = executable
- `W` = writable
.text: AX (allocated + executable) — code
.rodata: A (allocated, not writable, not executable) — read-only data
.data: WA (writable + allocated) — read-write globals
.bss: WA (writable + allocated) — but NOBITS on disk
🔧 Task
Write a C program with:
- An initialized global (`int x = 42;`)
- An uninitialized global (`int y;`)
- A large uninitialized array (`int big[100000];`)
- A string literal (`"hello"`)
- A const array (`const int fib[] = {1,1,2,3,5,8};`)

Compile it: `gcc -o sections sections.c`

Run `readelf -S sections` and find `.data`, `.bss`, `.rodata`. Note the sizes.

Predict: `.bss` should be at least 400,000 bytes (100,000 ints * 4 bytes). Is it? Check that the binary file size did NOT grow by 400KB: `ls -la sections`.

Try modifying the string literal at runtime: `char *s = "hello"; s[0] = 'H';` Compile and run. Observe the `SIGSEGV`. Then try the same modification on a `char[]` array instead. It works. Explain why in terms of which section each lives in.

Do the same in Rust. Use `static`, `const`, and a string literal. Compile with `rustc -o rust_sections your_file.rs` and inspect with `readelf -S rust_sections`.
Undefined Behavior: C's Silent Killer
Type this right now
// save as ub_demo.c — compile TWICE with different optimization levels
#include <stdio.h>
#include <limits.h>
int main() {
int x = INT_MAX;
printf("x = %d\n", x);
printf("x + 1 = %d\n", x + 1);
if (x + 1 > x) {
printf("Overflow detected? x + 1 > x is TRUE\n");
} else {
printf("No overflow? x + 1 > x is FALSE\n");
}
return 0;
}
$ gcc -O0 -o ub_O0 ub_demo.c && ./ub_O0
x = 2147483647
x + 1 = -2147483648
No overflow? x + 1 > x is FALSE
$ gcc -O2 -o ub_O2 ub_demo.c && ./ub_O2
x = 2147483647
x + 1 = -2147483648
Overflow detected? x + 1 > x is TRUE
Read that again. The same code, compiled with different flags, produces opposite results.
The -O0 build wraps around and gives FALSE. The -O2 build says TRUE — the compiler
removed the comparison entirely because signed overflow is undefined behavior, so the compiler
assumes it cannot happen, and therefore x + 1 > x must always be true.
This isn't a bug in GCC. This is exactly what the C standard permits. Welcome to undefined behavior.
What undefined behavior actually means
The C standard defines three categories of problematic code:
- Implementation-defined: The behavior varies by compiler, but the compiler must document what it does (e.g., size of `int`, right-shifting signed numbers).
- Unspecified: The compiler can choose among several options, but doesn't have to tell you (e.g., order of evaluation of function arguments).
- Undefined: The standard imposes no requirements whatsoever.
That last one is what kills you. When your program has undefined behavior, the compiler is allowed to:
- Produce the result you expected
- Produce a different result
- Crash
- Delete your code
- Make demons fly out of your nose (the community joke)
- Optimize as if the UB can never happen
That last point is the real danger. Modern optimizing compilers actively exploit undefined behavior. They don't just ignore your bug — they use it to justify removing your safety checks.
Example 1: signed integer overflow
The C standard says signed integer overflow is undefined. The compiler exploits this:
int check_overflow(int x) {
if (x + 1 > x) // compiler: "signed overflow can't happen"
return 1; // "so x + 1 is always > x"
else // "this branch is dead code"
return 0; // "I'll remove it"
}
At -O2, GCC compiles this to:
check_overflow:
mov eax, 1 ; just return 1, always
ret
The entire if statement is gone. The compiler reasoned: "Signed overflow is UB. I'm allowed to
assume the programmer never does UB. Therefore x + 1 is always greater than x. Therefore
the function always returns 1."
The logic is airtight given the assumption. The assumption is wrong for INT_MAX. But the
standard says you must never reach INT_MAX + 1, so the compiler is technically correct.
🧠 What do you think happens?
What if you change `int` to `unsigned int` in the example above? Unsigned overflow IS defined in C — it wraps around modulo 2^n. The compiler can no longer optimize away the check. Try it.
Example 2: null pointer "optimization"
void process(int *ptr) {
int value = *ptr; // dereference ptr — if ptr is NULL, this is UB
if (ptr == NULL) { // null check AFTER dereference
printf("ptr is NULL!\n");
return;
}
printf("value = %d\n", value);
}
A human reads this and thinks "the null check is there for safety." The compiler reads it differently:
Compiler's reasoning:
1. Line 2 dereferences ptr
2. If ptr were NULL, that would be UB
3. I'm allowed to assume UB doesn't happen
4. Therefore ptr is NOT NULL at line 2
5. Therefore ptr is NOT NULL at line 3
6. Therefore the NULL check always fails
7. I'll remove it
At -O2:
process:
mov eax, DWORD PTR [rdi] ; dereference ptr (no null check)
; the if (ptr == NULL) block is GONE
mov esi, eax
lea rdi, .LC0 ; "value = %d\n"
jmp printf
The null check was deleted. If ptr is NULL, the program crashes with no diagnostic. The "safety"
code you carefully wrote is not in the binary.
This is not a contrived example. The Linux kernel had exactly this bug in a network driver (CVE-2009-1897). A null check was removed by GCC because a dereference appeared earlier in the function.
Example 3: use-after-free, the optimizer's playground
#include <stdlib.h>
#include <stdio.h>
int main() {
int *p = malloc(sizeof(int));
*p = 42;
free(p);
// UB: accessing freed memory
// The compiler may reuse p's register for something else,
// or the allocator may recycle the memory.
printf("value = %d\n", *p);
return 0;
}
This might print 42, or 0, or 1735289204, or crash, depending on:
- Optimization level
- Allocator implementation
- Whether another thread allocated between `free` and `*p`
- Phase of the moon (only slightly joking)
The insidious part: it might work perfectly in testing and crash only in production. UB doesn't guarantee a crash. It guarantees nothing.
Example 4: uninitialized variables
int foo() {
int x; // uninitialized — reading it is UB
return x; // what value?
}
You might expect "whatever was on the stack." But the compiler is allowed to assume you never read uninitialized memory. This means:
void bar() {
int x;
if (x == 0) {
printf("zero\n");
}
if (x != 0) {
printf("nonzero\n");
}
}
The compiler can print both, neither, or one of these messages. It's not required to be consistent
— x doesn't have to have the same value in both checks. GCC and Clang have both been observed
making different choices at different optimization levels.
💡 Fun Fact
The phrase "nasal demons" comes from a 1992 comp.std.c Usenet post. Someone argued that undefined behavior could cause "demons to fly out of your nose." It became a running joke, but it captures a real truth: the standard truly places NO constraints on what happens. The joke persists because the reality is hard to believe.
UB on Godbolt: see it live
You can see these optimizations yourself at godbolt.org. Paste the signed
overflow example, select x86-64 GCC with -O2, and watch the assembly. The comparison vanishes.
Try these experiments:
- Change `-O2` to `-O0` — the comparison reappears
- Change `int` to `unsigned int` — the comparison stays even at `-O2`
- Add `-fwrapv` (tells GCC to treat signed overflow as wrapping) — the comparison stays
-fwrapv is GCC's escape hatch: it makes signed overflow defined (as two's complement wrapping).
Some projects opt out of this entire class of UB — the Linux kernel, for example, builds with
-fno-strict-overflow, which has the same effect.
Rust's answer: no undefined behavior in safe code
Rust makes a bold guarantee: safe Rust has no undefined behavior.
This isn't a suggestion or a best practice. It's a hard property of the language. The compiler rejects programs that could exhibit UB, or inserts runtime checks where compile-time prevention isn't possible.
Signed integer overflow:
fn main() {
    let x: i32 = i32::MAX;
    let y = x + 1;      // In debug: PANIC at runtime
    println!("{}", y);
}
$ cargo run
thread 'main' panicked at 'attempt to add with overflow'
$ cargo run --release
# In release mode: wraps to -2147483648 (defined behavior, not UB)
Rust made a choice: in debug mode, overflow panics (catches bugs). In release mode, overflow wraps (for performance). Either way, the behavior is defined. The compiler cannot exploit it for optimizations that break your code.
Null pointers:
// Rust doesn't have null pointers.
// Option<&T> is the equivalent, and you MUST check it:
fn process(ptr: Option<&i32>) {
    match ptr {
        Some(value) => println!("value = {}", value),
        None => println!("ptr is None!"),
    }
}
// The compiler enforces the check. You can't dereference without matching.
Uninitialized variables:
fn main() {
    let x: i32;
    println!("{}", x);  // COMPILE ERROR: use of possibly-uninitialized variable
}
error[E0381]: used binding `x` isn't initialized
--> src/main.rs:3:20
|
2 | let x: i32;
| - binding declared here but left uninitialized
3 | println!("{}", x);
| ^ `x` used here but it isn't initialized
No guessing. No "whatever was on the stack." The compiler rejects it outright.
The unsafe boundary
Rust does allow operations that could cause UB — but only inside unsafe blocks:
fn main() {
    let x: i32;
    // This is safe Rust — compiler prevents UB
    // println!("{}", x); // won't compile

    unsafe {
        // In unsafe, you can do things that might cause UB
        let ptr: *const i32 = 0x1234 as *const i32;
        // let val = *ptr; // UB if address is invalid
    }
}
The unsafe keyword is:
- Opt-in: You must explicitly ask for it
- Explicit: It marks exactly which code has elevated risk
- Auditable: You can grep for
unsafein any codebase - Contained: The responsibility for correctness is localized
The idea is that 95% of your code is safe Rust (no UB possible), and 5% is unsafe (UB is your
problem). You audit the 5%. In C, you audit 100%.
🧠 What do you think happens?
If you have 10,000 lines of Rust and UB occurs, where do you look? You grep for `unsafe` — maybe 50 lines. In C, if you have 10,000 lines and UB occurs, where do you look? Everywhere. That's the practical value of the safe/unsafe boundary.
UB in unsafe Rust: still possible
unsafe Rust can still invoke UB. The most common causes:
unsafe {
    // 1. Dereferencing a raw pointer to invalid memory
    let ptr: *const i32 = std::ptr::null();
    let _val = *ptr;           // UB: null dereference

    // 2. Creating an invalid reference
    let _ref: &i32 = &*ptr;    // UB: references must never be null

    // 3. Data races
    //    Two threads writing to the same memory without synchronization

    // 4. Breaking aliasing rules
    //    Having &T and &mut T to the same data simultaneously

    // 5. Calling a function with wrong ABI or invalid arguments
}
Unsafe Rust is roughly as dangerous as C for the code inside the unsafe block. The difference is
that the blast radius is contained — safe code can rely on invariants that unsafe code must
uphold.
The complete comparison
| Undefined Behavior in C | What happens | Rust equivalent | What happens |
|---|---|---|---|
| Signed overflow: `INT_MAX + 1` | Compiler removes your checks | `i32::MAX + 1` | Debug: panic. Release: wraps (defined) |
| Null deref: `*NULL` | Compiler removes null checks | No null pointers | `Option<&T>` forces you to check |
| Use-after-free: `free(p); *p` | Silent corruption | `drop(p); *p` | Compile error |
| Uninitialized read: `int x; return x;` | Anything — optimizer goes wild | `let x: i32; x` | Compile error |
| Buffer overflow: `a[11]` on 10 | Silent corruption | `a[11]` on 10 | Panic: index out of bounds |
| Double-free: `free(p); free(p)` | Allocator corruption | `drop(p); drop(p)` | Compile error |
| Data race: two threads, no sync | Torn reads, corruption | `&mut T` aliasing rules | Compile error |
| Invalid enum value | Optimizer makes wrong branch assumptions | Invalid enum via `unsafe` only | Safe code can't create one |
Compiler flags that help
If you must write C, these flags make UB less likely to bite you:
# Compile-time warnings:
gcc -Wall -Wextra -Wpedantic -Werror
# Runtime sanitizers (debug builds):
gcc -fsanitize=undefined # UBSan: catches UB at runtime
gcc -fsanitize=address # ASan: catches memory bugs
gcc -fsanitize=thread # TSan: catches data races
# UB-safe overflow handling:
gcc -fwrapv # Signed overflow wraps (like unsigned)
gcc -ftrapv # Signed overflow traps (abort)
UBSan in action:
$ gcc -fsanitize=undefined -o ub_demo ub_demo.c && ./ub_demo
ub_demo.c:7:22: runtime error: signed integer overflow:
2147483647 + 1 cannot be represented in type 'int'
UBSan caught it. But remember: sanitizers only catch UB that actually executes. Code paths you don't test can still harbor UB. Rust's approach — preventing UB at compile time — catches it whether you test that path or not.
💡 Fun Fact
The LLVM optimizer (used by both Clang and rustc) has a concept called "poison values." When you compute something with UB (like signed overflow), the result is "poison" — a special marker that infects everything it touches. If a poison value reaches a branch condition, both branches become valid. If it reaches a store, the stored value is undefined. This is the formal mechanism by which UB propagates through your program.
🔧 Task
1. Compile the signed overflow example from the start of this chapter at -O0 and -O2:
   gcc -O0 -o ub_O0 ub_demo.c && ./ub_O0
   gcc -O2 -o ub_O2 ub_demo.c && ./ub_O2
   Observe the different output. The compiler is not broken — it's exploiting UB.
2. Add -fwrapv to the -O2 build:
   gcc -O2 -fwrapv -o ub_wrap ub_demo.c && ./ub_wrap
   Does the output match -O0 now? Why?
3. Compile with UBSan:
   gcc -fsanitize=undefined -o ub_san ub_demo.c && ./ub_san
   Read the sanitizer's report carefully.
4. Write the equivalent in Rust:
   fn main() { let x: i32 = i32::MAX; let y = x + 1; println!("{}", y); }
   Run in debug mode (cargo run) and release mode (cargo run --release). Compare. Neither is undefined behavior — one panics, the other wraps. Both are specified outcomes.
5. Challenge: Find the null-pointer example on Godbolt. Compile with -O0 and -O2. At -O2, confirm the null check is deleted from the assembly. Then add -fno-delete-null-pointer-checks and verify it comes back.
ELF: Dissecting Your Executable
Type this right now
// save as hello.c, compile: gcc -o hello hello.c
#include <stdio.h>
int main() {
printf("Hello, ELF!\n");
return 0;
}
xxd hello | head -4
00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............
00000010: 0300 3e00 0100 0000 6010 0000 0000 0000 ..>.....`.......
00000020: 4000 0000 0000 0000 9839 0000 0000 0000 @........9......
00000030: 0000 0000 4000 3800 0d00 4000 1f00 1e00 ....@.8...@.....
That's the first 64 bytes of your compiled program. Every single byte has meaning. By the end of this chapter, you'll be able to read them like a sentence.
The magic bytes
Look at the very first four bytes: 7f 45 4c 46.
- 0x7f — a non-printable byte, chosen specifically so ELF files can't be confused with text
- 0x45 = E
- 0x4c = L
- 0x46 = F
Every ELF file on your system — every binary, every .so, every .o — starts with \x7fELF.
# Try it yourself — check /bin/ls
xxd /bin/ls | head -1
The kernel checks these four bytes first. If they don't match, execve() fails immediately.
💡 Fun Fact: The 0x7f byte was chosen by the original Unix System V designers because it's the ASCII DEL character — the highest value in 7-bit ASCII, and non-printable. It makes it nearly impossible to accidentally create a text file that "looks like" an ELF binary.
The ELF header: 64 bytes that describe everything
The ELF header is a fixed-size structure sitting at offset 0. On a 64-bit system, it's exactly 64 bytes. Here's the layout:
Offset Size Field What it means
────── ──── ────────────────── ──────────────────────────────────────
0x00 4 e_ident[EI_MAG] Magic: 7f 45 4c 46 (\x7fELF)
0x04 1 e_ident[EI_CLASS] Class: 1=32-bit, 2=64-bit
0x05 1 e_ident[EI_DATA] Endianness: 1=little, 2=big
0x06 1 e_ident[EI_VERSION] ELF version (always 1)
0x07    1    e_ident[EI_OSABI]   OS/ABI: 0=UNIX System V
0x08    1    e_ident[EI_ABIVERSION]  ABI version (usually 0)
0x09    7    e_ident[EI_PAD]     Padding (zeros)
0x10 2 e_type Type: 1=relocatable 2=exec 3=shared
0x12 2 e_machine Machine: 0x3e=x86-64, 0xb7=aarch64
0x14 4 e_version ELF version (again, always 1)
0x18 8 e_entry Entry point address
0x20 8 e_phoff Program header table offset
0x28 8 e_shoff Section header table offset
0x30 4 e_flags Processor-specific flags
0x34 2 e_ehsize ELF header size (64 for 64-bit)
0x36 2 e_phentsize Program header entry size
0x38 2 e_phnum Number of program headers
0x3a 2 e_shentsize Section header entry size
0x3c 2 e_shnum Number of section headers
0x3e 2 e_shstrndx Section header string table index
That's it. 64 bytes. And from them, the kernel knows everything it needs to begin loading your program.
Reading it the easy way: readelf -h
readelf -h hello
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Position-Independent Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x1060
Start of program headers: 64 (bytes into file)
Start of section headers: 14744 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 13
Size of section headers: 64 (bytes)
Number of section headers: 31
Section header string table index: 30
Every field maps directly to the hex bytes you saw in xxd. The entry point 0x1060 is the
virtual address where execution begins — not main, but _start, the C runtime startup code that
calls main.
🧠 What do you think happens? Why is the Type DYN (shared object) instead of EXEC? Modern GCC produces Position-Independent Executables by default. The binary can be loaded at any base address. This enables ASLR — we'll cover this in Chapter 14.
The big picture: ELF file layout
┌──────────────────────────────┐ offset 0
│ ELF Header │ 64 bytes — the "table of contents"
│ (magic, type, entry point) │
├──────────────────────────────┤ offset 64 (0x40)
│ Program Header Table │ tells the LOADER how to map segments
│ (array of segment entries) │
├──────────────────────────────┤
│ │
│ .text section │ your compiled machine code
│ │
├──────────────────────────────┤
│ .rodata section │ read-only data ("Hello, ELF!\n")
├──────────────────────────────┤
│ .data section │ initialized global variables
├──────────────────────────────┤
│ .bss section │ uninitialized globals (zero bytes on disk)
├──────────────────────────────┤
│ .symtab section │ symbol table
├──────────────────────────────┤
│ .strtab section │ string table (symbol names)
├──────────────────────────────┤
│ .debug_* sections (if -g) │ DWARF debug info
├──────────────────────────────┤
│ ... other sections ... │
├──────────────────────────────┤
│ Section Header Table │ tells the LINKER about each section
│ (array of section entries) │
└──────────────────────────────┘
The ELF header points to the program header table (for the loader) and the section header table (for the linker and debugger). Everything else sits between them.
Program headers: what the loader sees
readelf -l hello
This shows the segments — the chunks the kernel maps into memory. Each segment has a type, permissions, a file offset, and a virtual address. We'll dissect these in the next chapter.
Section headers: what the linker sees
readelf -S hello
This shows the sections — the fine-grained pieces the compiler and linker work with. You'll
see .text, .data, .rodata, .bss, .symtab, and many more.
C vs Rust: why the Rust binary is bigger
Let's compile the same program in Rust:
// save as hello.rs, compile: rustc hello.rs
fn main() {
    println!("Hello, ELF!");
}
$ ls -la hello # C binary
-rwxr-xr-x 1 user user 15960 Feb 19 10:00 hello
$ ls -la hello_rs # Rust binary (renamed for comparison)
-rwxr-xr-x 1 user user 4374432 Feb 19 10:01 hello_rs
The Rust binary is ~270x larger. Why?
readelf -S hello | wc -l # C: ~31 sections
readelf -S hello_rs | wc -l # Rust: ~43 sections
Three reasons:
1. Static linking of the standard library. Rust statically links libstd by default. The C binary dynamically links libc. All that code for println!, formatting, panic handling, and the Rust runtime gets baked in.

2. More debug info. Rust emits richer debug sections (.debug_info, .debug_abbrev, .debug_line, .debug_str, etc.) even without explicit -g.

3. Monomorphized generics. Rust generates specialized code for each concrete type used with generics. println! alone pulls in a substantial amount of formatting machinery.
Strip it down
$ strip hello_rs -o hello_rs_stripped
$ ls -la hello_rs hello_rs_stripped
-rwxr-xr-x 1 user user 4374432 Feb 19 10:01 hello_rs
-rwxr-xr-x 1 user user 311496 Feb 19 10:02 hello_rs_stripped
strip removes symbol tables and debug info — sections the runtime doesn't need. The binary
still runs, but you can no longer debug it with meaningful function names.
$ strip hello -o hello_stripped
$ ls -la hello hello_stripped
-rwxr-xr-x 1 user user 15960 Feb 19 10:00 hello
-rwxr-xr-x 1 user user 14408 Feb 19 10:02 hello_stripped
The C binary barely shrinks — it was already small because it delegates everything to the shared
libc.so.
💡 Fun Fact: Production Rust binaries are typically compiled with cargo build --release, which enables optimizations and can be further reduced with strip, lto = true in Cargo.toml, and opt-level = "z" for size optimization. A release + stripped Rust hello world can get down to ~300 KB.
Finding your function: the symbol table
readelf -s hello | grep main
34: 0000000000001149 35 FUNC GLOBAL DEFAULT 16 main
There it is. main lives at virtual address 0x1149, is 35 bytes long, is a function (FUNC),
has global visibility, and sits in section index 16 (which is .text).
Now for Rust:
readelf -s hello_rs | grep 'main'
2156: 0000000000008280 47 FUNC GLOBAL DEFAULT 14 main
5731: 00000000000082b0 103 FUNC LOCAL DEFAULT 14 hello_rs::main
Rust has two entries: the C-compatible main that the C runtime calls, and hello_rs::main
which is your actual Rust function. The first one is a thin wrapper that calls the second.
The entry point is NOT main
Remember the entry point from readelf -h? It was 0x1060, but main is at 0x1149. What's
at 0x1060?
readelf -s hello | grep ' _start'
1: 0000000000001060 0 FUNC GLOBAL DEFAULT 16 _start
_start is the real entry point. It's provided by the C runtime (crt1.o). It sets up the
stack, initializes the C library, and then calls main. When main returns, _start calls
exit().
Kernel jumps here
│
v
_start (from crt1.o)
│
├── __libc_start_main()
│ │
│ ├── Initialize libc
│ ├── Call constructors
│ ├── Call main() ◄── YOUR CODE
│ ├── Call destructors
│ └── Call exit()
│
└── (never reached)
Rust's entry point
Rust has its own startup sequence, but it still begins with _start and ends up calling the
system's C runtime:
_start → __libc_start_main → main (Rust shim) → std::rt::lang_start
│
└── your fn main()
Same hardware. Same ELF. Same kernel. Different runtime path to reach your code.
🔧 Task: Read the ELF header by hand
1. Compile a C program: gcc -o hello hello.c
2. Hex dump the first 64 bytes: xxd hello | head -4
3. Using the field table from this chapter, identify each field manually:
   - Bytes 0-3: Magic number. What are they?
   - Byte 4: Class. Is it 32-bit or 64-bit?
   - Byte 5: Endianness. Little or big?
   - Bytes 16-17: ELF type. What type is it?
   - Bytes 18-19: Machine. What architecture?
   - Bytes 24-31: Entry point. What address? (Remember: little-endian!)
4. Verify your answers with readelf -h hello.
5. Repeat with a Rust binary. Do the fields differ?
This is how forensic analysts and reverse engineers read binaries — byte by byte.
Sections vs Segments: Two Views of One File
Type this right now
# Compile a simple C program
gcc -o hello hello.c
# Two different views of the same binary:
echo "=== SECTIONS (compiler/linker view) ==="
readelf -S hello | head -40
echo ""
echo "=== SEGMENTS (loader/kernel view) ==="
readelf -l hello
Run both commands. You'll see two completely different lists describing the same file. By the end of this chapter, you'll understand why both exist and how they relate.
The key insight
An ELF file has two parallel indexing systems:
- Sections — the compiler's and linker's view. Fine-grained. Named. Used during compilation and linking.
- Segments — the loader's and kernel's view. Coarse-grained. Permission-based. Used when mapping the binary into memory.
Compile time Run time
─────────── ────────
┌─────────────────────┐ ┌──────────────────┐
│ Section Header │ │ Program Header │
│ Table │ │ Table │
│ │ │ │
│ .text │ │ LOAD (r-x) │
│ .rodata │──────────► │ (code + rodata) │
│ .data │ │ │
│ .bss │──────────► │ LOAD (rw-) │
│ .symtab │ │ (data + bss) │
│ .strtab │ │ │
│ .debug_* │ │ (not loaded) │
└─────────────────────┘ └──────────────────┘
Many small pieces Few big chunks
The linker needs sections so it can merge .text from file A with .text from file B. The
kernel doesn't care about any of that — it just needs to know which bytes go to which addresses
with which permissions.
Sections: the full catalog
Here are the sections you'll encounter most often:
Section Contents Permissions
───────── ────────────────────────────── ───────────
.text Machine code (your functions) r-x
.rodata Read-only data (string literals) r--
.data Initialized global variables rw-
.bss Uninitialized globals (zeroed) rw-
.symtab Symbol table (function names) --- (not loaded)
.strtab String table (section names) --- (not loaded)
.dynsym Dynamic symbol table r--
.dynstr Dynamic string table r--
.plt Procedure Linkage Table r-x
.got Global Offset Table rw-
.debug_* DWARF debug information --- (not loaded)
.rel.text Relocations for .text --- (not loaded)
.init Startup code r-x
.fini Cleanup code r-x
Not all sections get loaded into memory. .symtab, .strtab, and .debug_* exist only in the
file — the kernel ignores them entirely. They're for the linker, debugger, and tools like nm.
🧠 What do you think happens? If .bss holds uninitialized globals that are all zero, how many bytes does it occupy in the file? Answer: zero. The section header records its size, but no actual bytes are stored. The kernel allocates and zeroes the memory at load time. This is why .bss saves disk space.
Segments: what the kernel actually maps
readelf -l hello
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000040 0x0000000000000040 0x0000000000000040 0x0002d8 0x0002d8 R 0x8
INTERP 0x000318 0x0000000000000318 0x0000000000000318 0x00001c 0x00001c R 0x1
LOAD 0x000000 0x0000000000000000 0x0000000000000000 0x000628 0x000628 R 0x1000
LOAD 0x001000 0x0000000000001000 0x0000000000001000 0x000185 0x000185 R E 0x1000
LOAD 0x002000 0x0000000000002000 0x0000000000002000 0x000114 0x000114 R 0x1000
LOAD 0x002db8 0x0000000000003db8 0x0000000000003db8 0x000258 0x000260 RW 0x1000
DYNAMIC 0x002dc8 0x0000000000003dc8 0x0000000000003dc8 0x0001f0 0x0001f0 RW 0x8
NOTE 0x000338 0x0000000000000338 0x0000000000000338 0x000030 0x000030 R 0x8
NOTE 0x000368 0x0000000000000368 0x0000000000000368 0x000044 0x000044 R 0x4
GNU_EH_FRAME 0x00200c 0x000000000000200c 0x000000000000200c 0x00003c 0x00003c R 0x4
GNU_STACK 0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW 0x10
GNU_RELRO 0x002db8 0x0000000000003db8 0x0000000000003db8 0x000248 0x000248 R 0x1
The key segment types:
| Type | Purpose |
|---|---|
| LOAD | Map these bytes into memory. This is the main event. |
| INTERP | Path to the dynamic linker (/lib64/ld-linux-x86-64.so.2) |
| DYNAMIC | Dynamic linking information (needed libraries, symbol tables) |
| NOTE | Metadata (build ID, ABI tag) |
| GNU_STACK | Stack permissions (notably: no execute) |
| GNU_RELRO | Mark GOT as read-only after relocation (security) |
The mapping: sections merge into segments
This is where the two views connect. readelf -l also shows you which sections fall into each
segment:
Section to Segment mapping:
Segment Sections...
00
01 .interp
02 .interp .note.gnu.property .note.gnu.build-id .note.ABI-tag
.gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
03 .init .plt .plt.got .plt.sec .text .fini
04 .rodata .eh_frame_hdr .eh_frame
05 .init_array .fini_array .dynamic .got .data .bss
06 .dynamic
07 .note.gnu.property
08 .note.gnu.build-id .note.ABI-tag
09 .eh_frame_hdr
10 (empty — GNU_STACK has no sections)
11 .init_array .fini_array .dynamic .got
Now the picture becomes clear:
SECTIONS (file) SEGMENTS (memory)
───────────────── ─────────────────
┌─────────────┐
│ .init │───┐
├─────────────┤ │
│ .plt │───┤
├─────────────┤ ├──► LOAD segment 03 (R E)
│ .text │───┤ Executable code
├─────────────┤ │
│ .fini │───┘
├─────────────┤
│ .rodata │───┬──► LOAD segment 04 (R)
├─────────────┤ │ Read-only data
│ .eh_frame │───┘
├─────────────┤
│ .data │───┐
├─────────────┤ ├──► LOAD segment 05 (RW)
│ .bss │───┘ Read-write data
├─────────────┤
│ .symtab │ NOT LOADED
├─────────────┤ (debug/linker use only)
│ .strtab │ NOT LOADED
├─────────────┤
│ .debug_* │ NOT LOADED
└─────────────┘
Multiple sections with the same permissions merge into one segment. The kernel doesn't need
to know the difference between .init and .text — both are executable code, so they get one
mmap() call with PROT_READ | PROT_EXEC.
Why two views?
The split exists because compilation and execution have different needs.
The linker needs to:
- Merge
.textfrom dozens of object files into one.text - Merge
.datafrom dozens of object files into one.data - Apply relocations section by section
- Keep debug info separate from code
The kernel needs to:
- Map as few memory regions as possible (fewer page table entries)
- Set permissions per region (read, write, execute)
- Know the entry point
- Know where the dynamic linker is
Sections are the how (fine-grained building blocks). Segments are the what (what goes where in memory with what permissions).
💡 Fun Fact: A fully stripped, statically linked binary can have zero sections and still run. The kernel only reads program headers. You can literally delete the section header table with a hex editor and the binary works fine. Tools like readelf -S will complain, but the kernel won't even notice.
Seeing your code in .text
objdump -d hello | grep -A 15 '<main>'
0000000000001149 <main>:
1149: f3 0f 1e fa endbr64
114d: 55 push rbp
114e: 48 89 e5 mov rbp,rsp
1151: 48 8d 05 ac 0e 00 00 lea rax,[rip+0xeac] # 2004 <_IO_stdin_used+0x4>
1158: 48 89 c7 mov rdi,rax
115b: e8 f0 fe ff ff call 1050 <puts@plt>
1160: b8 00 00 00 00 mov eax,0x0
1165: 5d pop rbp
1166: c3 ret
Address 0x1149 — that's in the .text section, which is inside the LOAD segment with
R E permissions. The lea instruction at 0x1151 references address 0x2004 — that's the
"Hello, ELF!\n" string in .rodata, inside the read-only LOAD segment.
Everything maps. Addresses in the code point to addresses in specific sections, which live in specific segments, which get specific permissions.
Rust comparison
// save as hello.rs, compile: rustc hello.rs
fn main() {
    println!("Hello, ELF!");
}
readelf -l hello_rs | grep LOAD
LOAD 0x000000 0x0000000000000000 ... R 0x1000
LOAD 0x009000 0x0000000000009000 ... R E 0x1000
LOAD 0x05b000 0x000000000005b000 ... R 0x1000
LOAD 0x07e458 0x000000000007f458 ... RW 0x1000
Same pattern: read-only, executable, read-only data, read-write. The Rust binary just has more of everything because it includes the standard library.
🔧 Task: Map sections to segments by hand
1. Compile: gcc -o hello hello.c
2. Run readelf -S hello — note the address and flags of each section
3. Run readelf -l hello — note the address range and flags of each segment
4. For each section, determine which segment contains it:
   - Does the section's address fall within the segment's [VirtAddr, VirtAddr+MemSiz) range?
   - Do the permissions match? (A writable section should be in a writable segment)
5. Verify your mapping against the "Section to Segment mapping" output at the bottom of readelf -l
6. Find a section that is NOT in any segment. Why isn't it loaded?
This exercise makes the two-view model concrete. Once you can do this mapping by hand, the linker and loader will never be mysterious again.
Compilation and Linking: Source to Binary
Type this right now
cat > greet.c << 'EOF'
#include <stdio.h>
void greet(const char *name) { printf("Hello, %s!\n", name); }
EOF
cat > main.c << 'EOF'
void greet(const char *name);
int main() { greet("world"); return 0; }
EOF
gcc -c greet.c -o greet.o
gcc -c main.c -o main.o
gcc greet.o main.o -o hello
echo "=== main.o symbols ===" && nm main.o
echo "=== After linking ===" && nm hello | grep -E 'main|greet'
=== main.o symbols ===
U greet
0000000000000000 T main
=== After linking ===
0000000000001149 T greet
0000000000001172 T main
main.o has an undefined symbol greet (the U). After linking, it has a real address.
That's the entire purpose of the linker — resolving references between files.
The C compilation pipeline
hello.c Your source code
│
│ gcc -E Preprocessor (#include, #define, #ifdef)
v
hello.i Pure C, all macros expanded
│
│ gcc -S Compiler (cc1) — C to assembly
v
hello.s Human-readable assembly
│
│ gcc -c Assembler (as) — assembly to machine code
v
hello.o Object file (relocatable ELF)
│
│ gcc Linker (ld) — resolves symbols, assigns addresses
v
a.out Executable (final ELF)
You can stop at any stage. Let's see each output.
Preprocessing (gcc -E): A hello.c with #include <stdio.h> expands to ~700 lines. Every
header pasted in, every macro expanded. The output is pure C.
Compilation (gcc -S):
main:
endbr64
pushq %rbp
movq %rsp, %rbp
leaq .LC0(%rip), %rdi
call puts@PLT
movl $0, %eax
popq %rbp
ret
.LC0:
.string "Hello, ELF!"
Notice puts@PLT — the compiler doesn't know where puts lives. It emits a reference the
linker will resolve.
Assembly (gcc -c):
hello.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
The type is relocatable — machine code with provisional addresses starting at zero.
Linking (gcc):
hello: ELF 64-bit LSB pie executable, x86-64, dynamically linked,
interpreter /lib64/ld-linux-x86-64.so.2, ...
The linker resolves all symbols, assigns final addresses, produces the executable.
The Rust pipeline
hello.rs Your source code
│
│ rustc Parser → AST → HIR (desugared, type-checked)
v → MIR (borrow-checked, monomorphized)
LLVM IR LLVM's intermediate representation
│
│ LLVM Machine code generation
v
hello.o Object file (same format as C!)
│
│ linker Same system linker (ld / lld)
v
hello Executable (final ELF)
After LLVM, Rust and C produce the same kind of object file. They use the same linker.
💡 Fun Fact: You can mix C and Rust in one binary. Compile C to .o, Rust to .o, link them together. The linker doesn't care what language produced the object file — it only sees symbols, sections, and relocations.
Object files: code with holes
objdump -d main.o
0000000000000000 <main>:
0: f3 0f 1e fa endbr64
4: 55 push rbp
5: 48 89 e5 mov rbp,rsp
8: 48 8d 05 00 00 00 00 lea rax,[rip+0x0] # address TBD
f: 48 89 c7 mov rdi,rax
12: e8 00 00 00 00 call 17 <main+0x17> # address TBD
17: b8 00 00 00 00 mov eax,0x0
1c: 5d pop rbp
1d: c3 ret
See the zeros at offsets 0x8 and 0x12? Those are holes. The lea needs the string
address. The call needs greet's address. Neither is known yet.
Relocation entries tell the linker where to fill in:
readelf -r main.o
Relocation section '.rela.text' at offset 0x1e8 contains 2 entries:
Offset Info Type Sym. Value Sym. Name + Addend
00000000000b 000500000002 R_X86_64_PC32 0000000000000000 .rodata - 4
000000000013 000600000004 R_X86_64_PLT32 0000000000000000 greet - 4
Two relocations. Two holes. The linker fills them.
Symbol types: the nm command
Symbol Meaning
────── ────────────────────────────────────
T Text (code) — defined in this file
D Data — initialized global variable
B BSS — uninitialized global variable
R Read-only data
U Undefined — referenced but not here
W Weak — can be overridden
main.o defines main (T) but needs greet (U). greet.o defines greet (T) but needs
printf (U). The linker connects all the U's to the T's.
🧠 What do you think happens? What if a symbol is U in every object file and never T? The linker emits: undefined reference to 'greet'. The most common linker error. No file provided a definition.
Linking: three jobs
1. Symbol resolution — match undefined references to definitions:
main.o greet.o
┌────────────────┐ ┌────────────────┐
│ T main │ │ T greet │
│ U greet ───────┼────────►│ │
│ │ │ U printf ──────┼──► libc.so
└────────────────┘ └────────────────┘
2. Relocation — fill in placeholder addresses:
Before (main.o): call 0x00000000 # placeholder
After (hello): call 0x00001149 # actual address of greet
3. Section merging — combine sections from all object files:
main.o .text ──┐
├──► final .text
greet.o .text ─┘
Static vs dynamic linking
Static linking: Dynamic linking (default):
┌──────────────────────┐ ┌──────────────┐
│ hello_static │ │ hello │
│ main() │ │ main() │
│ greet() │ │ greet() │
│ printf() │ │ printf ──────┼──► libc.so.6
│ ... all of libc ... │ └───────────────┘
└──────────────────────┘ ~16 KB, needs .so
~880 KB, no dependencies
gcc -static main.o greet.o -o hello_static
ldd hello_static # "not a dynamic executable"
ldd hello # lists libc.so.6, ld-linux, vdso
Static: everything baked in. Dynamic: smaller binary, resolved at runtime.
Rust's linking story
Rust statically links its own standard library but dynamically links the system C library — a hybrid approach.
ldd hello_rs # shows libc.so.6, libgcc_s.so.1, ld-linux
💡 Fun Fact: Fully static Rust: rustup target add x86_64-unknown-linux-musl && rustc --target x86_64-unknown-linux-musl hello.rs. Zero dynamic dependencies. Runs on any Linux.
PLT/GOT: dynamic linking at a glance
When your binary calls printf dynamically, it uses two structures:
- PLT (Procedure Linkage Table) — code trampolines
- GOT (Global Offset Table) — writable address slots
Your code PLT GOT
───────── ───── ────
call puts@PLT ──────► puts@PLT: ┌──────────────┐
jmp [GOT[puts]]───────►│ address of │
... │ puts in libc │
└──────────────┘
On the first call, the GOT entry points to a resolver that finds the real puts, patches the
GOT, and jumps there. Subsequent calls go directly through the patched GOT. Details in Chapter 14.
🔧 Task: Watch symbols resolve across files
1. Create math.c:
   int add(int a, int b) { return a + b; }
   int multiply(int a, int b) { return a * b; }
2. Create app.c:
   #include <stdio.h>
   int add(int a, int b);
   int multiply(int a, int b);
   int main() { printf("%d %d\n", add(3, 4), multiply(5, 6)); return 0; }
3. Compile separately: gcc -c math.c and gcc -c app.c
4. Run nm math.o — add and multiply should be T (defined)
5. Run nm app.o — add and multiply should be U (undefined)
6. Link: gcc math.o app.o -o app
7. Run nm app | grep -E 'add|multiply' — both now T with real addresses
8. Run ./app — output: 7 30
9. Bonus: Delete math.o, try linking with just app.o. The error tells you exactly which symbols are missing.
Loading: Binary Becomes Process
Type this right now
// save as hello.c, compile: gcc -o hello hello.c
#include <stdio.h>
int main() {
printf("main is at: %p\n", (void *)main);
return 0;
}
for i in 1 2 3 4 5; do ./hello; done
main is at: 0x55a3f1c00149
main is at: 0x564e28a00149
main is at: 0x55f84dc00149
main is at: 0x558b7e200149
main is at: 0x563420200149
Every run, main is at a different address. The binary on disk hasn't changed. The loader
is moving things around. On purpose.
One syscall to rule them all: execve()
strace -f ./hello 2>&1 | head -15
execve("./hello", ["./hello"], 0x7ffd5c6b4e50 /* 55 vars */) = 0
brk(NULL) = 0x55a4b2e33000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2c8a3f1000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 2125824, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f2c8a000000
mmap(0x7f2c8a028000, 1531904, PROT_READ|PROT_EXEC, ...) = 0x7f2c8a028000
mmap(0x7f2c8a1f4000, 24576, PROT_READ|PROT_WRITE, ...) = 0x7f2c8a1f4000
...
That first line — execve() — is where a file on disk becomes a running process.
What the kernel does
execve("./hello", ...)
│
v
1. Read first bytes → check magic: 7f 45 4c 46 = ELF? Yes.
│
v
2. Read ELF header → entry point, program header offset
│
v
3. For each LOAD segment:
└── mmap(vaddr, filesz, perms, MAP_PRIVATE|MAP_FIXED, fd, offset)
(if memsz > filesz, zero-fill the extra — that's .bss)
│
v
4. INTERP segment found? Load the dynamic linker too.
│
v
5. Set up stack: argc, argv[], envp[], auxv[]
│
v
6. Set instruction pointer → jump to entry point
(dynamic binary: ld-linux entry; static binary: _start)
The kernel doesn't "run" your program. It prepares the address space and jumps to the entry point. User-space code takes over from there.
ELF segments become virtual memory
ELF file on disk Virtual address space
──────────────── ─────────────────────
┌─────────────────┐
│ ELF header │ (not mapped)
├─────────────────┤
│ LOAD (R) │ ────mmap()────────► Read-only (headers, notes)
├─────────────────┤
│ LOAD (R-X) │ ────mmap()────────► Executable code (.text)
├─────────────────┤
│ LOAD (R) │ ────mmap()────────► Read-only data (.rodata)
├─────────────────┤
│ LOAD (RW) │ ────mmap()────────► Read-write data (.data+.bss)
├─────────────────┤
│ .symtab │ NOT MAPPED [heap] ────────►
│ .debug_* │ ...
└─────────────────┘ [stack] ◄────────
Each LOAD segment becomes one mmap() call. ELF permissions (R, RW, RX) become mmap flags
(PROT_READ, PROT_WRITE, PROT_EXEC).
🧠 What do you think happens? The last LOAD segment — the one containing .bss — has memsz > filesz. The extra bytes don't exist in the file. The kernel allocates them in memory and fills them with zeros. Uninitialized globals become zero without wasting disk space.
The dynamic linker
readelf -l hello | grep INTERP -A 1
INTERP 0x000318 0x0000000000000318 ... 0x00001c R 0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
This tells the kernel: "Load this program first." The kernel loads the dynamic linker into memory and jumps to its entry point. The dynamic linker then:
- Reads the DYNAMIC segment of your binary
- Opens and mmaps every shared library (libc.so.6, etc.)
- Resolves symbols between binary and libraries
- Jumps to your binary's _start
Kernel → ld-linux → _start → __libc_start_main → main()
│
YOUR CODE
PLT and GOT: lazy binding
When your binary calls puts from libc, it uses:
- PLT (Procedure Linkage Table) — small code stubs
- GOT (Global Offset Table) — writable address slots
First call (slow path):
call puts@PLT
│
v
puts@PLT:
jmp [GOT[puts]] ──► GOT[puts] = address of resolver (not puts yet!)
push index │
jmp PLT[0] v
_dl_runtime_resolve:
1. Look up "puts" in libc
2. Patch GOT[puts] = real puts address
3. Jump to puts
Second call (fast path):
call puts@PLT
│
v
puts@PLT:
jmp [GOT[puts]] ──► GOT[puts] = 0x7f...puts (real address!)
│
v
puts() in libc — direct, no resolver
First call resolves the symbol. Every subsequent call is a single indirect jump.
💡 Fun Fact: This is "lazy binding" — symbols resolved on first use. Force eager binding with LD_BIND_NOW=1 ./hello to resolve everything before main runs.
ASLR: why the address changes
Run 1: base = 0x55a3f1c00000 → main = 0x55a3f1c00149
Run 2: base = 0x564e28a00000 → main = 0x564e28a00149
Run 3: base = 0x55f84dc00000 → main = 0x55f84dc00149
^^^^^^^^^^^
Random!
^^^
Always 149
The last three hex digits are always 149 — the offset of main within the binary. What
changes is the base address. This is Address Space Layout Randomization (ASLR).
The kernel randomizes the base address of the executable, every shared library, the stack, the heap, and the mmap region.
Without ASLR: With ASLR:
┌──────────┐ 0x400000 ┌──────────┐ 0x55a3f1c00000
│ code │ │ code │
├──────────┤ 0x600000 ├──────────┤ 0x55a3f1e00000
│ data │ │ data │
│ ... │ │ ... │
│ stack │ 0x7ffffffde000 │ stack │ 0x7ffd5c680000
└──────────┘ └──────────┘
Same every run. Different every run.
Attacker knows all. Attacker must guess.
Without ASLR, an attacker who knows your binary can predict every address. With 28-bit randomization, there are ~268 million possible base locations.
PIE: Position Independent Executable
ASLR only works if the code doesn't depend on fixed addresses. PIE uses rip-relative
addressing:
# PIE (works at any base):
lea rax, [rip+0xeac] # relative to current instruction
# Non-PIE (fixed base only):
mov rax, 0x402004 # hardcoded absolute address
readelf -h hello | grep Type # DYN = PIE
gcc -no-pie -o hello_nopie hello.c
readelf -h hello_nopie | grep Type # EXEC = fixed address, no code ASLR
The Rust version
// save as hello_addr.rs, compile: rustc hello_addr.rs
fn main() {
    let stack_var = 42;
    println!("main:  {:p}", main as fn() as *const ());
    println!("stack: {:p}", &stack_var as *const i32);
}
for i in 1 2 3 4 5; do ./hello_addr; done
Both code and stack randomized. Same ASLR, same kernel, regardless of language.
Disabling ASLR: proof it's real
setarch $(uname -m) -R ./hello
setarch $(uname -m) -R ./hello
setarch $(uname -m) -R ./hello
main is at: 0x555555555149
main is at: 0x555555555149
main is at: 0x555555555149
Same address every time. With ASLR off, the kernel places PIE binaries at a fixed 0x5555_5555... base — the same addresses you see in GDB, which disables ASLR by default.
Check the system-wide setting: cat /proc/sys/kernel/randomize_va_space — 0=off, 1=partial,
2=full (default).
The complete picture
hello (ELF on disk)
│
│ execve()
v
┌─ Kernel ────────────────────────────────────┐
│ Read ELF header + program headers │
│ mmap LOAD segments (with ASLR offset) │
│ Load dynamic linker (from INTERP) │
│ Set up stack (argc, argv, envp, auxv) │
│ Jump to ld-linux entry point │
└──────────────────────────────────────────────┘
│
v
┌─ Dynamic linker (ld-linux) ─────────────────┐
│ Load shared libraries (libc.so, etc.) │
│ Set up PLT/GOT, process relocations │
│ Jump to binary's _start │
└──────────────────────────────────────────────┘
│
v
┌─ C Runtime (_start) ───────────────────────-┐
│ __libc_start_main → main(argc, argv, envp) │
│ exit(return_value) │
└──────────────────────────────────────────────┘
│
v
YOUR CODE RUNS
From execve() to main(), every step is an ELF field read, an mmap() call, or a symbol
resolution. Nothing magic.
🔧 Task: Observe ASLR yourself
- Compile:
#include <stdio.h>
int global = 42;
int main() {
    int local = 7;
    printf("main:   %p\n", (void *)main);
    printf("global: %p\n", (void *)&global);
    printf("local:  %p\n", (void *)&local);
    return 0;
}
- Run it five times. All three addresses change each run.
- Notice: the offset between main and global stays constant (same binary). The offset between local (stack) and main (code) varies — randomized independently.
- Disable ASLR:
`setarch $(uname -m) -R ./a.out` — same addresses every time.
- Bonus: `strace ./a.out 2>&1 | grep mmap` — count the `mmap` calls. Each one creates a memory region. That's the loader in action.
Virtual Memory: The Grand Illusion
Type this right now
// save as vmaddr.c — compile: gcc -o vmaddr vmaddr.c
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
int main() {
int x = 42;
pid_t pid = fork();
if (pid == 0) {
// Child process
printf("[child] &x = %p, x = %d\n", (void *)&x, x);
x = 99;
printf("[child] &x = %p, x = %d (after modification)\n", (void *)&x, x);
} else {
wait(NULL);
printf("[parent] &x = %p, x = %d (unchanged!)\n", (void *)&x, x);
}
return 0;
}
$ gcc -o vmaddr vmaddr.c && ./vmaddr
[child] &x = 0x7ffd3a4b1c2c, x = 42
[child] &x = 0x7ffd3a4b1c2c, x = 99 (after modification)
[parent] &x = 0x7ffd3a4b1c2c, x = 42 (unchanged!)
Same address. Different values. That should break your brain a little. Both processes see 0x7ffd3a4b1c2c,
but they're looking at different physical memory. Welcome to virtual memory.
The problem
Your system is running hundreds of processes right now. Each one believes it owns a vast, private stretch of memory — on x86-64, up to 128 TB of user-space addresses. But you probably have 16 GB of RAM. Maybe 32 if you're fancy.
Process A thinks: "I have 128 TB to myself"
Process B thinks: "I have 128 TB to myself"
Process C thinks: "I have 128 TB to myself"
...
Process Z thinks: "I have 128 TB to myself"
Physical RAM: 16 GB total. That's it.
How is this possible? The same way a magician makes one card look like fifty. Indirection.
The solution: one layer of translation
Every address your program uses is a virtual address. It is never placed directly on the memory bus. Instead, hardware translates it to a physical address before the memory access happens.
Your C code: int *p = (int *)0x4000;
│
▼
┌─────────────────────┐
│ Virtual Address │
│ 0x4000 │
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ MMU │ ◄── Hardware inside the CPU
│ (translates via │
│ page tables) │
└────────┬────────────┘
│
▼
┌─────────────────────┐
│ Physical Address │
│ 0x7A2000 │ ◄── Actual RAM location
└─────────────────────┘
Your program never sees physical addresses. It never needs to. The translation happens on every single memory access — every load, every store, every instruction fetch.
Every process gets its own map
This is the key insight. Process A and Process B can both use virtual address 0x4000. The MMU consults
different page tables for each process, so 0x4000 lands at different physical locations.
Process A Process B
───────── ─────────
Virtual: 0x4000 Virtual: 0x4000
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ A's Page │ │ B's Page │
│ Table │ │ Table │
│ │ │ │
│ 0x4000 ──────┼──┐ │ 0x4000 ──────┼──┐
└──────────────┘ │ └──────────────┘ │
│ │
▼ ▼
┌────────────────────────────────────────────────────────────┐
│ Physical RAM │
│ │
│ Frame 0x1A2 ◄─── A's data Frame 0x5F7 ◄─── B's │
│ ┌──────────┐ ┌──────────┐ │
│ │ x = 42 │ │ x = 99 │ │
│ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────┘
Same virtual address. Different physical frames. Complete isolation.
Three gifts from virtual memory
1. Isolation. Process A cannot read or write Process B's memory. There is no virtual address in A's page table that maps to B's physical frames. The hardware enforces this — not the OS, not a runtime check, the MMU itself refuses the translation.
2. Convenience. Every process can use the same virtual address layout: code near the bottom, heap growing up, stack at the top. The linker doesn't need to know where in physical RAM the program will land. It just targets the standard virtual layout.
3. Overcommit. The OS can promise more memory than physically exists. malloc(1 GB) succeeds
even with 2 GB of RAM — because no physical RAM is allocated until you actually touch each page.
We'll see how in Chapter 17.
🧠 What do you think happens?
If 50 processes each `malloc(1 GB)`, is that 50 GB of RAM consumed? What if none of them ever write to the memory? What if they all write to every byte simultaneously?
The MMU: translation in hardware
The Memory Management Unit lives inside the CPU die. It is not a separate chip. It is not software. It is transistors that execute the page-table walk on every memory access.
┌──────────────────────────────────────────────┐
│ CPU │
│ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Core 0 │ │ Core 1 │ ... │
│ │ │ │ │ │
│ │ ┌─────┐ │ │ ┌─────┐ │ │
│ │ │ TLB │ │ │ │ TLB │ │ ◄── Cache │
│ │ └──┬──┘ │ │ └──┬──┘ │ of recent │
│ │ │ │ │ │ │ translations│
│ │ ┌──┴──┐ │ │ ┌──┴──┐ │ │
│ │ │ MMU │ │ │ │ MMU │ │ ◄── Walks │
│ │ └─────┘ │ │ └─────┘ │ page │
│ └──────────┘ └──────────┘ tables │
└──────────────────────────────────────────────┘
Who sets up the mapping? The operating system kernel. It writes entries into the page tables in
physical memory. It sets the CR3 register to point to the root of each process's page table.
Who enforces the mapping? The hardware. Every memory access goes through the MMU. If the page table says "no access," the CPU raises a page fault exception — even the kernel can't bypass the MMU without disabling it entirely (which no modern OS does).
Memory-mapped files
The same translation mechanism can point virtual pages at file contents on disk instead of
anonymous RAM. This is mmap().
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main() {
int fd = open("/etc/hostname", O_RDONLY);
char *data = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
printf("Hostname: %s\n", data); // Reading the file by just reading memory!
munmap(data, 4096);
close(fd);
return 0;
}
No read() call. You access the file like it's a normal array in memory. When you touch a page,
the kernel loads the file's contents into a physical frame and maps it in. This is how the kernel
loads your program's .text section — it memory-maps the ELF binary.
💡 Fun Fact: Shared libraries (`.so` files) are memory-mapped once and shared between every process that uses them. If 100 processes use `libc.so`, there is only ONE copy of its `.text` section in physical RAM, mapped into 100 different virtual address spaces.
Copy-on-write: the fork() trick
When you call fork(), the kernel does NOT copy the parent's entire memory. That would be
absurdly expensive. Instead:
- Clone the page tables — child gets the same virtual-to-physical mapping as parent
- Mark ALL pages read-only in both parent and child
- Both processes continue running, sharing the exact same physical frames
When either process writes to a page:
- The CPU raises a page fault (the page is marked read-only)
- The kernel sees it's a copy-on-write page
- The kernel copies just that one page to a new physical frame
- It updates the writer's page table to point to the new copy and marks it writable
- The other process still points to the original frame
After fork() — before any writes:
Parent page table Physical RAM Child page table
┌────────────┐ ┌──────────────┐ ┌────────────┐
│ 0x4000 ────┼────RO───►│ Frame 0x1A2 │◄───RO────┼──── 0x4000 │
│ 0x5000 ────┼────RO───►│ Frame 0x1A3 │◄───RO────┼──── 0x5000 │
│ 0x6000 ────┼────RO───►│ Frame 0x1A4 │◄───RO────┼──── 0x6000 │
└────────────┘ └──────────────┘ └────────────┘
Child writes to 0x5000:
Parent page table Physical RAM Child page table
┌────────────┐ ┌──────────────┐ ┌────────────┐
│ 0x4000 ────┼────RO───►│ Frame 0x1A2 │◄───RO────┼──── 0x4000 │
│ 0x5000 ────┼────RO───►│ Frame 0x1A3 │ │ 0x5000 ────┼──┐
│ 0x6000 ────┼────RO───►│ Frame 0x1A4 │◄───RO────┼──── 0x6000 │ │
└────────────┘ │ Frame 0x2B7 │◄───RW────────────────────┘
└──────────────┘
(copied page)
Only the page that was written gets duplicated. If a child process calls exec() immediately
after fork() (which is common), most pages are never written — so almost nothing gets copied.
Rust's perspective
Rust doesn't have fork() in its standard library — partly because fork is fundamentally
unsafe in multithreaded programs. But the virtual memory system works identically underneath.
// Rust uses the same virtual address space layout
fn main() {
    let stack_var = 42;
    let heap_var = Box::new(99);
    println!("Stack: {:p}", &stack_var);                 // high address
    println!("Heap:  {:p}", &*heap_var);                 // lower address
    println!("Code:  {:p}", main as fn() as *const ());  // low address
    // Same /proc/self/maps underneath
    let maps = std::fs::read_to_string("/proc/self/maps").unwrap();
    for line in maps.lines().take(5) {
        println!("{}", line);
    }
}
Every concept in this chapter — page tables, the MMU, isolation between processes — applies to Rust programs identically. Rust's safety guarantees operate on top of virtual memory, not instead of it.
🔧 Task: Observe copy-on-write in action
Write this program in C. Before running, predict what you'll see:
// save as cow.c — compile: gcc -o cow cow.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
int main() {
int *data = malloc(sizeof(int));
*data = 42;
printf("Before fork: &data[0] = %p, value = %d\n", (void *)data, *data);
pid_t pid = fork();
if (pid == 0) {
// Child
printf("[child] &data[0] = %p, value = %d\n", (void *)data, *data);
*data = 99; // This triggers copy-on-write!
printf("[child] &data[0] = %p, value = %d (modified)\n", (void *)data, *data);
free(data);
_exit(0);
} else {
wait(NULL);
printf("[parent] &data[0] = %p, value = %d (still original!)\n",
(void *)data, *data);
free(data);
}
return 0;
}
What to observe:
- The address printed by parent and child is identical — same virtual address.
- The child modifies `*data` to 99, but the parent still sees 42.
- The addresses never change. Only the physical backing changes, invisibly.
Now try adding this before and after the child's modification:
char cmd[64];
snprintf(cmd, sizeof(cmd), "grep -A1 'heap' /proc/%d/smaps", getpid());
system(cmd);
Watch the Private_Dirty field increase after the write — that's the copy-on-write page
becoming a private copy.
Pages and Page Tables
Type this right now
// save as pagemath.c — compile: gcc -o pagemath pagemath.c
#include <stdio.h>
#include <stdint.h>
int main() {
uint64_t addr = 0x00007FFE12345678ULL;
uint64_t offset = addr & 0xFFF; // Bits [11:0]
uint64_t pt_idx = (addr >> 12) & 0x1FF; // Bits [20:12]
uint64_t pd_idx = (addr >> 21) & 0x1FF; // Bits [29:21]
uint64_t pdpt_idx = (addr >> 30) & 0x1FF; // Bits [38:30]
uint64_t pml4_idx = (addr >> 39) & 0x1FF; // Bits [47:39]
printf("Virtual address: 0x%016lx\n", addr);
printf("PML4 index: %3lu (0x%03lx)\n", pml4_idx, pml4_idx);
printf("PDPT index: %3lu (0x%03lx)\n", pdpt_idx, pdpt_idx);
printf("PD index: %3lu (0x%03lx)\n", pd_idx, pd_idx);
printf("PT index: %3lu (0x%03lx)\n", pt_idx, pt_idx);
printf("Page offset: %3lu (0x%03lx)\n", offset, offset);
return 0;
}
$ gcc -o pagemath pagemath.c && ./pagemath
Virtual address: 0x00007ffe12345678
PML4 index: 255 (0x0ff)
PDPT index: 504 (0x1f8)
PD index: 145 (0x091)
PT index: 325 (0x145)
Page offset: 1656 (0x678)
You just decomposed a virtual address into the exact indices the CPU uses to walk the page table. Every memory access your program makes involves this decomposition — in hardware, in nanoseconds.
What is a page?
Memory is managed in fixed-size chunks called pages. On x86-64, the standard page size is 4 KB (4096 bytes, or 0x1000 in hex).
One page = 4096 bytes = 4 KB
┌─────────────────────────────────┐ Byte 0
│ │
│ 4096 bytes │
│ of contiguous memory │
│ │
└─────────────────────────────────┘ Byte 4095
Why 4 KB? It's a hardware compromise:
- Too small (e.g., 64 bytes): you'd need billions of page table entries. The tables themselves would consume all your RAM.
- Too big (e.g., 16 MB): you'd waste memory. Allocating 1 byte would reserve 16 MB. ("Internal fragmentation.")
- 4 KB: a reasonable middle ground, chosen in the 1970s and hardened into silicon ever since.
Virtual memory and physical memory are both divided into pages. Virtual pages are just called "pages." Physical pages are called frames. The page table maps pages to frames.
How a virtual address is split
A 48-bit virtual address isn't treated as one big number. It's split into fields, each serving as an index into a different level of the page table.
63 48 47 39 38 30 29 21 20 12 11 0
┌──────────┬────────┬────────┬────────┬────────┬──────────┐
│ Sign │ PML4 │ PDPT │ PD │ PT │ Offset │
│ extension│ Index │ Index │ Index │ Index │ (12 bits)│
│ (16 bits)│(9 bits)│(9 bits)│(9 bits)│(9 bits)│ │
└──────────┴────────┴────────┴────────┴────────┴──────────┘
│ │ │ │ │ │
│ │ │ │ │ └─► Byte within
│ │ │ │ │ the 4KB page
│ │ │ │ │
│ │ │ │ └─► Which entry in the
│ │ │ │ Page Table (PT)
│ │ │ │
│ │ │ └─► Which entry in the
│ │ │ Page Directory (PD)
│ │ │
│ │ └─► Which entry in the
│ │ Page Directory Pointer Table (PDPT)
│ │
│ └─► Which entry in the
│ Page Map Level 4 (PML4)
│
└─► Must be copies of bit 47 (canonical address requirement)
9 bits = 512 possible values. So each table level has 512 entries. 12 bits of offset = 2^12 = 4096 = one page.
💡 Fun Fact: AMD designed x86-64 with 48 bits of virtual address (256 TB). Recent processors support 5-level paging with 57 bits (128 PB). The structure is identical — they just add a PML5 level with another 9-bit index.
The page table entry (PTE)
Each entry in a page table is 8 bytes (64 bits). It contains the physical frame address plus control flags:
63 62 52 51 12 11 9 8 7 6 5 4 3 2 1 0
┌──┬──────────┬───────────────────────────┬─────┬─┬─┬─┬─┬─┬───┬─┬─┐
│NX│ Available │ Physical Frame Address │ Avl │G│ │D│A│ │U/S│R│P│
│ │ (for OS) │ (40 bits) │ │ │ │ │ │ │ │W│ │
└──┴──────────┴───────────────────────────┴─────┴─┴─┴─┴─┴─┴───┴─┴─┘
│ │ │ │ │ │
│ │ │ │ │ └─ Present: page in RAM?
│ │ │ │ └─── Read/Write: writable?
│ │ │ └───── User/Supervisor: user-
│ │ │ accessible?
│ │ └───── Accessed: been read?
│ │ Dirty: been written?
│ │
│ └─ Physical frame number (left-shifted by 12
│ to get the physical address of the frame)
│
└─ NX (No Execute): if set, CPU will refuse to execute
code from this page — defense against code injection
The Present bit is crucial. If it's 0, the page is not in RAM. Accessing it triggers a page fault. The kernel then decides what to do — load from swap, demand-allocate, or kill the process with SIGSEGV.
Why a single-level page table is impossible
Let's do the math for one flat table.
48-bit virtual address space
÷ 4 KB page size (12-bit offset)
= 2^36 virtual pages
× 8 bytes per PTE
= 2^39 bytes = 512 GB per process
That's the size of the table alone. For ONE process.
With 100 processes, you'd need 50 TB of RAM just for tables.
Clearly, this doesn't work. The solution: multi-level tables — only allocate the table pages that actually contain mappings.
The four-level page table walk
Here's the complete walk. The CPU's CR3 register holds the physical address of the PML4 table
for the currently running process.
CR3 register
┌──────────────────┐
│ PML4 phys addr │
└────────┬─────────┘
│
▼
┌──────────────────────────────────────────────┐
│ PML4 Table (512 entries × 8 bytes = 4 KB) │
│ │
│ [0] [1] [2] ... [pml4_idx] ... [511] │
│ │ │
└────────────────────────┼─────────────────────┘
│ bits [47:39] of VA
▼
┌──────────────────────────────────────────────┐
│ PDPT Table (512 entries × 8 bytes = 4 KB) │
│ │
│ [0] [1] [2] ... [pdpt_idx] ... [511] │
│ │ │
└────────────────────────┼─────────────────────┘
│ bits [38:30] of VA
▼
┌──────────────────────────────────────────────┐
│ Page Directory (512 entries × 8 bytes = 4 KB)│
│ │
│ [0] [1] [2] ... [pd_idx] ... [511] │
│ │ │
└────────────────────────┼─────────────────────┘
│ bits [29:21] of VA
▼
┌──────────────────────────────────────────────┐
│ Page Table (512 entries × 8 bytes = 4 KB) │
│ │
│ [0] [1] [2] ... [pt_idx] ... [511] │
│ │ │
└────────────────────────┼─────────────────────┘
│ bits [20:12] of VA
▼
┌──────────────────────────────────────────────┐
│ Physical Frame (4096 bytes) │
│ │
│ byte[0] ... byte[offset] ... byte[4095] │
│ │ │
└────────────────────┼─────────────────────────┘
│ bits [11:0] of VA
▼
Target Byte
Four memory reads to translate one virtual address. Each level is itself a 4 KB page in physical memory, containing 512 eight-byte entries.
🧠 What do you think happens?
At the Page Directory level, the CPU reads entry `pd_idx` and finds the Present bit is 0. What happens? Does the CPU try the next entry? Does it guess? Or does it immediately trap to the kernel?
Why this saves memory
A process that uses only a small range of addresses needs very few table pages:
A process using addresses 0x400000 - 0x410000 (64 KB):
PML4: 1 page (always exists, pointed to by CR3)
PDPT: 1 page (only entry [0] is populated)
PD: 1 page (only entry [2] is populated)
PT: 1 page (only entries [0]-[15] are populated)
Total: 4 pages × 4 KB = 16 KB of page tables for a 64 KB mapping
A flat table would need: 512 GB
Savings: 99.9999997%
Levels that aren't needed simply don't exist. Their parent's entry has Present=0, and that's that.
The TLB: making it fast
Four memory accesses per translation would be crippling. At ~100 ns per RAM access, that's 400 ns per instruction — you'd run at about 2.5 million instructions per second. A modern CPU does 4+ billion.
The fix: the Translation Lookaside Buffer (TLB). It's a small, fast cache of recent virtual-to-physical translations.
Virtual address 0x7FFE12345678
│
▼
┌─────────────────────────────────────┐
│ TLB │
│ ┌──────────────┬─────────────────┐ │
│ │ Virtual Page │ Physical Frame │ │
│ ├──────────────┼─────────────────┤ │
│ │ 0x7FFE12345 │ 0x3A201 ──────────── HIT! (~1 cycle)
│ │ 0x55A3B2C02 │ 0x1F800 │ │
│ │ 0x7F8A12040 │ 0x22100 │ │
│ │ ... │ ... │ │
│ └──────────────┴─────────────────┘ │
│ │
│ ~64-2048 entries (varies by CPU) │
└─────────────────────────────────────┘
│
│ MISS → full 4-level walk (~100 ns)
▼
TLB hit: ~1 cycle. Translation is essentially free. TLB miss: full 4-level page table walk. Hundreds of cycles.
Most programs have excellent TLB hit rates (>99%) because of locality — they access the same pages repeatedly.
Context switches flush the TLB
When the OS switches from Process A to Process B, it loads B's page table base into CR3.
This invalidates the entire TLB — because A's translations are wrong for B.
Running Process A: TLB full of A's translations (fast!)
│
│ Context switch → load B's CR3
▼
Running Process B: TLB is EMPTY (cold start, slow!)
│
│ First ~1000 memory accesses → all TLB misses
▼
Running Process B: TLB warming up... (getting faster)
This is one reason context switches are expensive. The process that resumes starts with a cold TLB and suffers many misses until it warms up. Modern CPUs mitigate this with PCID (Process Context Identifiers), which tag TLB entries with an address-space ID so a switch doesn't force a full flush — but the TLB is still small, so the effect never disappears entirely.
Huge pages: fewer translations, fewer misses
Standard 4 KB pages mean a 1 GB dataset spans 262,144 pages. That's a lot of TLB entries. The TLB might only hold 2,048 — so you're constantly evicting and reloading translations.
Huge pages use larger page sizes:
Page Size Offset Bits Levels Walked TLB entries for 1 GB
───────── ─────────── ───────────── ────────────────────
4 KB 12 4 262,144
2 MB 21 3 (skip PT) 512
1 GB 30 2 (skip PT+PD) 1
With 2 MB pages, the PD entry points directly to a 2 MB physical frame — the PT level is skipped. With 1 GB pages, even the PD level is skipped.
// Allocating huge pages in C (Linux)
#include <sys/mman.h>
void *p = mmap(NULL, 2 * 1024 * 1024, // 2 MB
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
-1, 0);
💡 Fun Fact: Database servers like PostgreSQL and Redis often use huge pages. A database with a 32 GB buffer pool would need 8 million TLB entries with 4 KB pages. With 2 MB huge pages, that drops to 16,384. The performance difference is measurable — 5-10% on memory-heavy workloads.
Rust: same hardware, same rules
fn main() {
    let addr: u64 = 0x00007FFE12345678;
    let offset   = addr & 0xFFF;
    let pt_idx   = (addr >> 12) & 0x1FF;
    let pd_idx   = (addr >> 21) & 0x1FF;
    let pdpt_idx = (addr >> 30) & 0x1FF;
    let pml4_idx = (addr >> 39) & 0x1FF;
    println!("Virtual address: {:#018x}", addr);
    println!("PML4 index:  {:3} ({:#05x})", pml4_idx, pml4_idx);
    println!("PDPT index:  {:3} ({:#05x})", pdpt_idx, pdpt_idx);
    println!("PD index:    {:3} ({:#05x})", pd_idx, pd_idx);
    println!("PT index:    {:3} ({:#05x})", pt_idx, pt_idx);
    println!("Page offset: {:3} ({:#05x})", offset, offset);
}
The page table structure doesn't care whether the process was written in C, Rust, Python, or assembly. It's hardware. Every byte your Rust program touches goes through the same four-level walk (or TLB hit).
🔧 Task: Walk a page table by hand
Take virtual address 0x00007FFE12345678. Work through the walk on paper:
-
Split the address into its component fields:
- Bits [47:39] → PML4 index = ?
- Bits [38:30] → PDPT index = ?
- Bits [29:21] → PD index = ?
- Bits [20:12] → PT index = ?
- Bits [11:0] → Page offset = ?
-
Convert to binary first if it helps:
0x00007FFE12345678
= 0000 0000 0000 0000 0111 1111 1111 1110 0001 0010 0011 0100 0101 0110 0111 1000

Split:
[47:39] = 0_1111_1111    = 0xFF  = 255
[38:30] = 1_1111_1000    = 0x1F8 = 504
[29:21] = 0_1001_0001    = 0x091 = 145
[20:12] = 1_0100_0101    = 0x145 = 325
[11:0]  = 0110_0111_1000 = 0x678 = 1656
Verify your answers by running the C program from the top of this chapter.
-
Try a different address: your own stack variable. Print its address, then decompose it. Does the PML4 index match what you'd expect for a stack address (high in the user-space range)?
-
Check the real page tables (Linux, needs root):
# Inspect a process's virtual-to-physical mappings
$ sudo cat /proc/<pid>/pagemap | xxd | head
The `pagemap` file lets you look up the physical frame for any virtual page. See `Documentation/admin-guide/mm/pagemap.rst` in the kernel source for the format.
Page Faults: When Things Get Interesting
Type this right now
// save as lazy.c — compile: gcc -o lazy lazy.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h> // getpid()
int main() {
// Ask for 100 MB. Does the OS actually give us 100 MB of RAM?
size_t size = 100 * 1024 * 1024;
char *p = malloc(size);
printf("malloc returned: %p\n", (void *)p);
printf("Now check: ps -o pid,vsz,rss -p %d\n", getpid());
printf("Press Enter BEFORE touching memory...\n");
getchar();
// Touch every page (write one byte per 4 KB page)
for (size_t i = 0; i < size; i += 4096) {
p[i] = 'A';
}
printf("Memory touched. Press Enter to check again...\n");
getchar();
free(p);
return 0;
}
Run it in one terminal. In another terminal, check RSS before and after:
$ ps -o pid,vsz,rss -p $(pgrep lazy)
PID VSZ RSS
1234 203456 2340 ← Before: VSZ is large, RSS is tiny!
# (press Enter in the first terminal)
$ ps -o pid,vsz,rss -p $(pgrep lazy)
PID VSZ RSS
1234 203456 104820 ← After: RSS jumped ~100 MB
VSZ (virtual size) was always large — the address space was mapped. RSS (resident set size) was tiny until you touched the pages. Physical RAM was allocated one page at a time, on demand, via page faults.
A page fault is NOT an error
This is the most misunderstood concept in systems programming. A page fault is a CPU exception that says: "I tried to translate this virtual address, and the page table says I can't proceed. Kernel, please help."
The kernel then decides what to do:
CPU executes: mov [0x7F4000], eax
│
▼
MMU walks page table → PTE has Present=0 (or wrong permissions)
│
▼
CPU raises Page Fault Exception (#PF)
│
▼
Kernel's page fault handler runs
│
├──► Minor fault? → Map a physical frame, resume
│
├──► Major fault? → Load from disk, map frame, resume
│
└──► Invalid? → Send SIGSEGV → process dies
Three types of page faults
1. Minor fault — demand paging
You called malloc(100 MB). The kernel said "sure" and set up virtual address mappings but
left every PTE with Present=0. No physical RAM was allocated.
When you first touch a page:
Your code: p[0] = 'A';
│
▼
Virtual address 0x7F4000 → MMU walks table → Present=0
│
▼
Page Fault (minor) → kernel allocates a physical frame
│ from the free page pool, zeros it,
│ updates the PTE: Present=1, Frame=0x1A200
│
▼
CPU retries the instruction → MMU walks table → Present=1 → success!
p[0] is now 'A' in frame 0x1A200
This happens once per page, then never again (for that page). The process never notices — the
retry is automatic. This is why malloc(1 GB) succeeds on a 4 GB system — physical RAM is only
committed when pages are actually accessed.
🧠 What do you think happens?
You call `malloc(1 TB)` on a machine with 16 GB of RAM. `malloc` returns a valid pointer. You then try to touch every page. At some point, the kernel runs out of physical frames. What happens? (Hint: look up the "OOM killer.")
2. Minor fault — copy-on-write
After fork(), parent and child share pages marked read-only. When either writes:
Child writes to shared page at 0x5000
│
▼
MMU: page is present but marked read-only → Page Fault
│
▼
Kernel: "Ah, this is a copy-on-write page."
│
├── Allocate new physical frame
├── Copy contents from original frame
├── Update child's PTE: point to new frame, mark writable
└── Resume child's instruction
│
▼
Child's write succeeds. Parent's page is unaffected.
Still a minor fault — no disk I/O. Just a memory copy and a PTE update.
3. Major fault — loading from disk
The page was once in RAM but got swapped out to disk (because the system was low on memory). The PTE has Present=0 but contains a swap entry telling the kernel where on disk the data lives.
Access to page at 0x8000 → Present=0, swap entry = disk sector 42501
│
▼
Kernel: "This page was swapped out."
│
├── Allocate a free physical frame
├── Read 4 KB from swap disk into the frame ◄── SLOW! ~5-10 ms
├── Update PTE: Present=1, Frame=new_frame
└── Resume instruction
│
▼
Access succeeds. But it cost milliseconds, not nanoseconds.
Speed comparison:
TLB hit: ~1 ns
Minor page fault: ~1-10 μs (1,000× slower)
Major page fault: ~5-10 ms (5,000,000× slower than TLB hit)
(1,000× slower than minor fault)
Major faults are the reason your system feels sluggish when it starts swapping. A program that would normally take 1 second can take hours if most of its accesses trigger major faults.
4. Invalid fault — you messed up
The virtual address has no mapping at all. No PTE. No swap entry. Nothing.
Access to 0xDEADBEEF → no mapping exists
│
▼
Kernel: "This address is not valid for this process."
│
▼
Kernel sends SIGSEGV to the process
│
▼
Default handler: print "Segmentation fault", dump core, exit
This is the one that kills your program. We'll cover it in detail in Chapter 18.
Watching page faults happen
Linux tracks page faults per process. You can see them:
$ /usr/bin/time -v ./lazy 2>&1 | grep -i fault
Minor (reclaiming a frame) page faults: 25612
Major (requiring I/O) page faults: 0
25,612 minor faults for 100 MB makes sense: 100 MB / 4 KB = 25,600 pages (plus a few for the program itself, stack, libraries).
You can also watch in real time with perf:
$ perf stat -e page-faults,minor-faults,major-faults ./lazy
25,614 page-faults
25,614 minor-faults
0 major-faults
mmap: the Swiss army knife
mmap() is the system call that creates virtual address mappings. Everything runs through it:
malloc (large allocs) → calls mmap(MAP_ANONYMOUS | MAP_PRIVATE)
Loading shared libs → kernel calls mmap(MAP_PRIVATE, fd)
Reading files → you call mmap(MAP_PRIVATE, fd)
Shared memory → mmap(MAP_SHARED | MAP_ANONYMOUS)
Copy-on-write fork → kernel manipulates existing mappings
Here's mmap used to read a file:
// save as mmapread.c — compile: gcc -o mmapread mmapread.c
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
int main() {
int fd = open("/etc/passwd", O_RDONLY);
struct stat st;
fstat(fd, &st);
// Map the entire file into our address space
char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd); // We can close fd — the mapping keeps the file accessible
// Access the file like a normal array
printf("First 80 bytes:\n%.80s\n", data);
munmap(data, st.st_size);
return 0;
}
And in Rust:
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn main() {
    let file = File::open("/etc/passwd").unwrap();
    let len = file.metadata().unwrap().len() as usize;

    // Using the memmap2 crate (add to Cargo.toml: memmap2 = "0.9")
    // let mmap = unsafe { memmap2::Mmap::map(&file).unwrap() };
    // println!("{}", std::str::from_utf8(&mmap[..80]).unwrap());

    // Or with raw mmap:
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ,
            libc::MAP_PRIVATE,
            file.as_raw_fd(),
            0,
        )
    };
    let data = unsafe { std::slice::from_raw_parts(ptr as *const u8, len) };
    println!("First 80 bytes:\n{}", std::str::from_utf8(&data[..80]).unwrap());
    unsafe { libc::munmap(ptr, len); }
}
💡 Fun Fact: When the kernel loads your ELF binary at `exec()` time, it doesn't read the whole file into RAM. It `mmap`s the segments. Your `.text` section is demand-paged — functions that are never called are never loaded from disk.
The page fault flow in full
Here's the complete picture of what happens when the CPU can't translate an address:
CPU executes instruction that accesses virtual address VA
│
▼
MMU checks TLB ─── Hit? ──► Translate, access physical memory. Done.
│
No (TLB miss)
│
▼
MMU walks 4-level page table
│
├── Present=1 and permissions OK? ──► Load into TLB, access memory. Done.
│
└── Present=0 or permission violation?
│
▼
CPU pushes fault address to CR2 register
CPU raises #PF exception (interrupt 14)
CPU switches to kernel mode
│
▼
Kernel page fault handler (arch/x86/mm/fault.c)
│
├── Is VA in a valid VMA? (vm_area_struct)
│ │
│ No ──► Send SIGSEGV (invalid access)
│ │
│ Yes
│ ▼
├── Was it a write to a read-only COW page?
│ │
│ Yes ──► Allocate frame, copy page, update PTE, resume
│ │
│ No
│ ▼
├── Is there a swap entry?
│ │
│ Yes ──► Read from swap (major fault), map frame, resume
│ │
│ No
│ ▼
├── Is it a demand-zero page (anonymous)?
│ │
│ Yes ──► Allocate zeroed frame, map it, resume (minor fault)
│ │
│ No
│ ▼
└── Is it a file-backed mapping?
│
Yes ──► Read from file (major fault), map frame, resume
│
No ──► Send SIGSEGV
🔧 Task: Watch demand paging in /proc/self/status
// save as demand.c — compile: gcc -o demand demand.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
void show_rss() {
char path[64];
snprintf(path, sizeof(path), "/proc/%d/status", getpid());
FILE *f = fopen(path, "r");
char line[256];
while (fgets(line, sizeof(line), f)) {
if (strncmp(line, "VmRSS", 5) == 0 || strncmp(line, "VmSize", 6) == 0) {
printf(" %s", line);
}
}
fclose(f);
}
int main() {
printf("Before malloc:\n");
show_rss();
char *p = malloc(100 * 1024 * 1024); // 100 MB
printf("\nAfter malloc (before touching):\n");
show_rss();
// Touch every page
for (size_t i = 0; i < 100 * 1024 * 1024; i += 4096) {
p[i] = 1;
}
printf("\nAfter touching every page:\n");
show_rss();
free(p);
printf("\nAfter free:\n");
show_rss();
return 0;
}
$ ./demand
Before malloc:
VmSize: 2580 kB
VmRSS: 1024 kB
After malloc (before touching):
VmSize: 105060 kB ← Virtual size jumped 100 MB
VmRSS: 1040 kB ← Physical memory: basically unchanged!
After touching every page:
VmSize: 105060 kB ← Virtual size: same
VmRSS: 103420 kB ← RSS jumped ~100 MB. NOW the RAM is used.
After free:
VmSize: 2580 kB ← Virtual mapping released
VmRSS: 1040 kB ← Physical RAM returned to the OS
Key insight: VmSize reflects the virtual address space. VmRSS reflects physical RAM.
malloc only affects VmSize. Actually touching the memory triggers page faults, which
allocate physical frames and increase VmRSS.
Now run it again under perf stat:
$ perf stat -e minor-faults,major-faults ./demand
25,630 minor-faults ← ~25,600 pages = 100 MB / 4 KB
0 major-faults
Every single one of those 25,600 minor faults was the kernel giving you one physical frame. No disk I/O. Just pure demand paging.
Segmentation Faults: The Complete Guide
Type this right now
// save as crash.c — compile: gcc -g -O0 -o crash crash.c
#include <stdio.h>
int main() {
int *p = NULL;
printf("About to dereference NULL...\n");
*p = 42;
printf("This line never runs.\n");
return 0;
}
$ ./crash
About to dereference NULL...
Segmentation fault (core dumped)
$ dmesg | tail -1
crash[12345]: segfault at 0 ip 00005555555551a2 sp 00007fffffffde10 error 6 in crash[555555555000+1000]
That kernel log line tells you everything: the fault happened at address 0 (NULL), the
instruction pointer was 0x5555555551a2, and error code 6 means "user-mode write to
a non-present page."
Now you know where it crashed, what it was doing, and why.
What ACTUALLY happens
A segfault is not magic. It's a precise chain of hardware and software events:
1. Your code: *p = 42; (where p = NULL = 0x0)
│
▼
2. CPU: "Load effective address 0x0, store value 42"
│
▼
3. MMU: Walk page table for address 0x0
│
▼
4. MMU: PTE for page 0 has Present=0 (page 0 is intentionally unmapped)
│
▼
5. CPU: Store fault address (0x0) in CR2 register
CPU: Raise #PF exception (interrupt vector 14)
CPU: Switch to Ring 0 (kernel mode)
CPU: Jump to kernel's page fault handler
│
▼
6. Kernel: Check if address 0x0 belongs to any VMA for this process
│
▼
7. Kernel: No valid VMA → this is an invalid access
│
▼
8. Kernel: Send SIGSEGV (signal 11) to the process
│
▼
9. Process: Default SIGSEGV handler runs
→ Print "Segmentation fault"
→ Generate core dump (if ulimit allows)
→ Terminate with exit code 139 (128 + 11)
The CPU doesn't know what a segfault is. It only knows page faults. The kernel decides whether a page fault is recoverable (minor/major fault) or fatal (segfault).
Cause #1: NULL pointer dereference
The most common segfault. Page zero is intentionally unmapped on every modern OS. This turns NULL dereference from silent corruption into an immediate crash.
In C
// null.c — compile: gcc -g -O0 -o null null.c
#include <stdio.h>
#include <stdlib.h>
struct Node {
int value;
struct Node *next;
};
int main() {
struct Node *head = NULL;
// Forgot to allocate! Accessing through NULL pointer:
head->value = 42; // CRASH: writing to address 0x0
return 0;
}
Memory at dereference:
head ──────► 0x0000000000000000 (NULL)
│
▼
┌─────────────────────┐
│ Page 0 (4 KB) │
│ │
│ NOT MAPPED │ ◄── Present=0 in page table
│ Access here = │
│ immediate #PF │
└─────────────────────┘
In Rust
fn main() {
    let head: Option<Box<i32>> = None;
    // This won't compile — Rust forces you to handle None:
    // let val = *head; // ERROR: cannot dereference Option
    // You must explicitly handle it:
    match head {
        Some(val) => println!("Value: {}", val),
        None => println!("No value!"),
    }
}
Rust's Option<T> makes NULL impossible in safe code. There is no null pointer — there's
Some(value) or None, and the compiler forces you to handle both. You literally cannot
compile code that dereferences without checking.
💡 Fun Fact: The first 64 KB of address space (pages 0-15) are unmapped on Linux. This catches not just NULL dereference but also NULL + small_offset, like ((struct s*)NULL)->field. This range is controlled by /proc/sys/vm/mmap_min_addr.
Cause #2: Stack overflow
Every thread has a fixed-size stack (default: 8 MB on Linux). Below the stack is a guard page — an unmapped page that triggers a fault if the stack grows too far.
In C
// stackoverflow.c — compile: gcc -g -O0 -o stackoverflow stackoverflow.c
#include <stdio.h>
void recurse(int depth) {
char buffer[4096]; // 4 KB per frame — eat stack fast
buffer[0] = 'A';
printf("Depth: %d, &buffer = %p\n", depth, (void *)buffer);
recurse(depth + 1);
}
int main() {
recurse(0);
return 0;
}
Stack layout during deep recursion:
0x7FFFFFFFE000 ┌────────────────────┐ ◄── Stack top
│ main() frame │
├────────────────────┤
│ recurse(0) frame │
│ buffer[4096] │
├────────────────────┤
│ recurse(1) frame │
│ buffer[4096] │
├────────────────────┤
│ ... │
│ ~2000 frames │
│ ... │
├────────────────────┤
│ recurse(2047) │
0x7FFFFFF7E000 ├────────────────────┤ ◄── Stack bottom (8 MB limit)
│ GUARD PAGE │ ◄── Unmapped! Touching = SIGSEGV
├────────────────────┤
│ (unmapped) │
└────────────────────┘
In Rust
fn recurse(depth: u64) {
    let buffer = [0u8; 4096];
    println!("Depth: {}, addr: {:p}", depth, &buffer);
    recurse(depth + 1);
}

fn main() {
    recurse(0);
}
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
Aborted (core dumped)
Rust detects the stack overflow and prints a clear message instead of just "Segmentation fault." But the underlying mechanism is identical — the guard page triggers a fault, and the runtime catches it.
Cause #3: Writing to read-only memory
String literals in C live in the .rodata section, which is mapped read-only (r--p).
In C
// rodata.c — compile: gcc -g -O0 -o rodata rodata.c
#include <stdio.h>
int main() {
char *s = "Hello"; // s points into .rodata (read-only)
s[0] = 'h'; // CRASH: writing to read-only page
return 0;
}
Memory layout:
.rodata section (mapped with permissions: r--p)
┌───────────────────────────────┐
│ 'H' 'e' 'l' 'l' 'o' '\0' │
│ ▲ │
│ │ │
│ s points here │
└───────────────────────────────┘
PTE flags: Present=1, Read/Write=0 (read-only)
CPU tries to WRITE → MMU: "Present=1 but Write=0" → #PF
Kernel: "Valid mapping but wrong permissions" → SIGSEGV
In Rust
fn main() {
    let s = "Hello"; // &str — immutable by definition
    // s[0] = 'h'; // Won't compile: &str is not mutable

    // There's no way to accidentally write to a string literal in safe Rust.
    // Even with a mutable String, the original literal stays safe:
    let mut owned = String::from("Hello");
    owned.replace_range(0..1, "h"); // This modifies a HEAP copy
    println!("{}", owned); // "hello"
}
Rust's type system distinguishes &str (immutable reference to string data) from String
(owned, heap-allocated, mutable). You can't accidentally modify a string literal.
Cause #4: Use-after-free
In C
// uaf.c — compile: gcc -g -O0 -o uaf uaf.c
#include <stdio.h>
#include <stdlib.h>
int main() {
int *p = malloc(sizeof(int));
*p = 42;
printf("Before free: *p = %d\n", *p);
free(p);
// p still holds the old address, but the memory may be:
// - returned to the allocator's free list
// - unmapped entirely (for large allocations)
// - reused by a later malloc
*p = 99; // UNDEFINED BEHAVIOR
// Might segfault (if page was unmapped)
// Might silently corrupt other data (if reused)
// Might appear to "work" (if page still mapped but unused)
printf("After free: *p = %d\n", *p);
return 0;
}
Before free(p):
p ──────► ┌──────────────┐ 0x55a000 (heap)
│ 42 │ ◄── valid, allocated
└──────────────┘
After free(p):
p ──────► ┌──────────────┐ 0x55a000 (heap)
│ free list │ ◄── returned to allocator
│ metadata │ (or possibly unmapped)
└──────────────┘
Writing *p = 99 here either:
- Overwrites free-list metadata → heap corruption
- Hits an unmapped page → SIGSEGV
- Appears to work → ticking time bomb
The scariest part: use-after-free might not crash immediately. It might corrupt the heap silently and crash minutes later in a completely unrelated function. This is why use-after-free is the #1 source of security vulnerabilities in C/C++ code.
In Rust
fn main() {
    let p = Box::new(42);
    drop(p); // Explicitly free
    // println!("{}", p); // COMPILE ERROR: value used after move
    // The compiler will not let this happen. Period.
}
error[E0382]: borrow of moved value: `p`
--> src/main.rs:4:20
|
2 | let p = Box::new(42);
| - move occurs because `p` has type `Box<i32>`
3 | drop(p);
| - value moved here
4 | println!("{}", p);
| ^ value borrowed here after move
The borrow checker tracks ownership. After drop(p), the variable p is consumed. Any
attempt to use it is a compile-time error. Not a runtime check. Not a sanitizer. The program
never compiles.
Cause #5: Buffer overflow
In C
// overflow.c — compile: gcc -g -O0 -o overflow overflow.c
#include <stdio.h>
int main() {
int arr[10];
// Write way past the end of the array
arr[1000000] = 42; // 4 MB past the end — likely in unmapped space
return 0;
}
Stack layout:
0x7FFFFFFFDE00 ┌────────────────────┐
│ arr[0] ... arr[9] │ 40 bytes (10 × 4)
0x7FFFFFFFDE28 ├────────────────────┤
│ (other stack data) │
├────────────────────┤
│ ... │
│ │
0x7FFFFFF9DE00 │ arr[1000000] │ ◄── 4 MB below arr
│ THIS IS UNMAPPED │ ◄── SIGSEGV
└────────────────────┘
Small overflows (e.g., arr[10] or arr[20]) might NOT segfault — they silently overwrite
adjacent stack data. This is how stack buffer overflows lead to arbitrary code execution.
Only when you go far enough to land in an unmapped page does the hardware catch it.
In Rust
fn main() {
    let arr = [0i32; 10];
    let idx = 1_000_000;
    println!("{}", arr[idx]); // Panic at runtime (bounds check)
}
thread 'main' panicked at 'index out of bounds: the len is 10 but the index is 1000000'
Rust inserts bounds checks on every array/slice access. The program panics with a clear message instead of corrupting memory. The panic is not a segfault — it's a controlled unwinding or abort.
🧠 What do you think happens?
In C,
arr[11] = 42;whenarrhas 10 elements. Does it always segfault? Usually? Rarely? Why is the answer "it depends"?
Cause #6: Wild / uninitialized pointer
In C
// wild.c — compile: gcc -g -O0 -o wild wild.c
int main() {
int *p; // Uninitialized — contains whatever was on the stack
*p = 42; // Dereferences a garbage address
return 0;
}
p contains random stack data:
p ──────► 0x??????????? (whatever bytes were on the stack)
│
▼
Could be:
• 0x0000000000000000 → NULL deref → SIGSEGV
• 0x00007FFF12340000 → might be mapped → silent corruption!
• 0x0000DEADBEEF0000 → unmapped → SIGSEGV
• 0xFFFF800000000000 → kernel space → SIGSEGV
It's random. The behavior changes between runs, compilers,
and optimization levels. This is undefined behavior.
In Rust
fn main() {
    let p: *mut i32;
    // unsafe { *p = 42; } // COMPILE ERROR: use of possibly uninitialized `p`
    // Rust requires all variables to be initialized before use.
    // Even raw pointers must be given a value.
}
error[E0381]: used binding `p` isn't initialized
--> src/main.rs:3:16
|
2 | let p: *mut i32;
| - binding declared here but left uninitialized
3 | unsafe { *p = 42; }
| ^ `p` used here but it isn't initialized
Rust's compiler tracks initialization. You cannot use a variable — of any type, including raw pointers — until it has been assigned a value.
Summary: six causes at a glance
Cause C behavior Rust behavior
───────────────────── ──────────────────── ─────────────────────────
1. NULL deref SIGSEGV Option<T> — compile error
2. Stack overflow SIGSEGV (guard page) Detected, clear message
3. Write to rodata SIGSEGV &str is immutable — compile error
4. Use-after-free SIGSEGV or corruption Borrow checker — compile error
5. Buffer overflow SIGSEGV or corruption Bounds check — panic
6. Wild pointer SIGSEGV or corruption Must initialize — compile error
Rust eliminates four of these at compile time, detects one at runtime with a clear message, and handles the last (stack overflow) with a runtime check. In safe Rust, segfaults from your code are essentially impossible.
Debugging segfaults
1. dmesg — what the kernel saw
$ dmesg | tail -3
[12345.678] crash[9876]: segfault at 0 ip 00005555555551a2
sp 00007fffffffde10 error 6 in crash[555555555000+1000]
Fields:
- at 0: the faulting virtual address (0 = NULL)
- ip 00005555555551a2: instruction pointer — what instruction caused the fault
- error 6: error code bits (6 = user-mode write to non-present page)
Error code bits:
Bit 0: 0=non-present page, 1=protection violation
Bit 1: 0=read, 1=write
Bit 2: 0=kernel mode, 1=user mode
Bit 3: 1=reserved bit violation
Bit 4: 1=instruction fetch (NX violation)
Error 6 = 0b110 = user-mode(1) + write(1) + non-present(0)
Error 4 = 0b100 = user-mode(1) + read(0) + non-present(0)
Error 7 = 0b111 = user-mode(1) + write(1) + protection(1)
2. Core dumps
$ ulimit -c unlimited # Enable core dumps
$ ./crash
Segmentation fault (core dumped)
$ gdb ./crash core
(gdb) bt # Backtrace — where exactly it crashed
#0 0x00005555555551a2 in main () at crash.c:5
(gdb) info registers # CPU state at crash time
(gdb) print p # The guilty pointer
$1 = (int *) 0x0
3. addr2line
$ addr2line -e crash 0x00005555555551a2
/home/user/crash.c:5
Converts an instruction address to a source file and line number. Requires -g (debug info)
during compilation.
4. AddressSanitizer (the nuclear option)
$ gcc -g -fsanitize=address -o crash crash.c
$ ./crash
=================================================================
==9876==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000
#0 0x555555555192 in main /home/user/crash.c:5
ASan catches things the kernel can't — like small buffer overflows that stay within mapped pages. It adds ~2x memory overhead and ~2x slowdown, but it catches almost everything.
🔧 Task: Trigger and debug all six types
Create six C programs, one for each cause. For each one:
- Compile with gcc -g -O0
- Run it, confirm the segfault
- Check dmesg | tail -1 for the kernel's report
- Run under GDB:
  $ gdb ./program
  (gdb) run
  (gdb) bt
  (gdb) info registers
  (gdb) print <the pointer variable>
- Note the faulting address and error code
// 1_null.c
int main() { *(int *)0 = 42; return 0; }
// 2_stack.c
void f() { f(); }
int main() { f(); return 0; }
// 3_rodata.c
int main() { char *s = "hello"; s[0] = 'H'; return 0; }
// 4_uaf.c
#include <stdlib.h>
int main() { int *p = malloc(4); free(p); *p = 42; return 0; }
// 5_overflow.c
int main() { int a[10]; a[1000000] = 42; return 0; }
// 6_wild.c
int main() { int *p; *p = 42; return 0; }
Bonus: Compile each with -fsanitize=address and compare the output. ASan gives far more
detail than a raw segfault.
Bonus 2: Write the Rust equivalent of each. See which ones the compiler refuses to compile and read the error messages carefully — they're telling you exactly what C doesn't.
Signals and Process Lifecycle
Type this right now
// save as catcher.c — compile: gcc -o catcher catcher.c
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
void handler(int sig) {
printf("\nCaught signal %d (SIGINT)! Not dying today.\n", sig);
}
int main() {
signal(SIGINT, handler);
printf("Try pressing Ctrl+C... (PID: %d)\n", getpid());
for (int i = 0; i < 30; i++) {
printf("Running... %d\n", i);
sleep(1);
}
printf("Finished normally.\n");
return 0;
}
$ ./catcher
Try pressing Ctrl+C... (PID: 12345)
Running... 0
Running... 1
^C
Caught signal 2 (SIGINT)! Not dying today.
Running... 2
Running... 3
^C
Caught signal 2 (SIGINT)! Not dying today.
Running... 4
You pressed Ctrl+C twice. Normally that kills the process. But your signal handler intercepted it. The process kept running. You just took control of how your program responds to external events.
Now try kill -9 12345 from another terminal. That sends SIGKILL. No handler can catch it.
The process dies immediately.
What is a signal?
A signal is an asynchronous notification delivered to a process. It's the kernel's way of saying "something happened that you might care about."
┌──────────────────────────────────────────────────────┐
│ Sources of Signals │
│ │
│ Keyboard: Ctrl+C → SIGINT Ctrl+\ → SIGQUIT │
│ Hardware: Bad address → SIGSEGV Bad math → SIGFPE│
│ Kernel: Child exited → SIGCHLD Timer → SIGALRM│
│ Other process: kill(pid, SIGTERM) │
│ Your code: raise(SIGABRT), abort() │
└──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Kernel sets signal pending │
│ on the target process │
└────────────────────────┬─────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Next time process returns to user space: │
│ → kernel checks pending signals │
│ → delivers the signal by running the handler │
└──────────────────────────────────────────────────────┘
Signals are not interrupts. They don't execute immediately when sent. The kernel marks a signal as pending, and the process sees it the next time it transitions from kernel mode back to user mode (after a syscall, or after being scheduled).
The important signals
Signal Number Default Action Meaning
───────── ────── ────────────── ─────────────────────────────
SIGINT 2 Terminate Ctrl+C — polite interrupt
SIGQUIT 3 Core dump Ctrl+\ — quit with dump
SIGABRT 6 Core dump abort() called
SIGFPE 8 Core dump Arithmetic error (div by 0)
SIGKILL 9 Terminate Unconditional kill (CAN'T CATCH)
SIGSEGV 11 Core dump Invalid memory access
SIGPIPE 13 Terminate Write to pipe with no reader
SIGTERM 15 Terminate Polite "please exit"
SIGCHLD 17 Ignore Child process changed state
SIGSTOP 19 Stop process Pause (CAN'T CATCH)
SIGCONT 18 Continue Resume stopped process
SIGUSR1 10 Terminate User-defined
SIGUSR2 12 Terminate User-defined
SIGBUS 7 Core dump Bus error (misaligned access)
Two signals cannot be caught or ignored: SIGKILL (9) and SIGSTOP (19). These are the kernel's absolute authority — no process can resist them.
💡 Fun Fact: kill -9 is called "kill dash nine" and has become part of programmer folklore. There's even a haiku: "No, your process is / not important. SIGKILL / does not negotiate."
Signal delivery mechanics
Here's what happens when signal delivery is triggered:
Process in user space, executing normally
│
│ Syscall (read, write, etc.) or timer interrupt
▼
Process enters kernel mode
│
│ Kernel does its work (I/O, scheduling, etc.)
│
│ Before returning to user space, kernel checks:
│ "Are there any pending signals for this process?"
│
├── No pending signals → return to user space normally
│
└── Yes, signal S is pending:
│
├── Is there a custom handler for S?
│ │
│ Yes → Modify user-space stack:
│ │ push signal frame (saved registers)
│ │ set instruction pointer to handler function
│ │ return to user space → handler runs
│ │ handler returns → sigreturn syscall
│ │ kernel restores original registers
│ │ process resumes where it was interrupted
│ │
│ No → Execute default action:
│ Terminate, core dump, stop, or ignore
│
└── Is S blocked (signal mask)?
Yes → remains pending, delivered later
The kernel modifies the process's user-space stack to run the handler. This is why signal handlers must be careful — they're running on the interrupted code's stack.
Writing a proper signal handler
signal() is the simple API, but sigaction() is what you should actually use:
// save as proper_handler.c — compile: gcc -o proper_handler proper_handler.c
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <stdlib.h>
volatile sig_atomic_t got_signal = 0;
void handler(int sig) {
// RULE: only call async-signal-safe functions here!
// printf is NOT safe. write() IS safe.
got_signal = sig;
}
int main() {
struct sigaction sa;
sa.sa_handler = handler;
sigemptyset(&sa.sa_mask);
sa.sa_flags = 0;
sigaction(SIGINT, &sa, NULL);
sigaction(SIGTERM, &sa, NULL);
printf("PID: %d — send me SIGINT or SIGTERM\n", getpid());
while (!got_signal) {
printf("Working...\n");
sleep(1);
}
printf("Received signal %d. Cleaning up...\n", got_signal);
// Do your cleanup here: close files, flush buffers, etc.
return 0;
}
Critical rule: Inside a signal handler, you can only call async-signal-safe functions.
printf(), malloc(), and most of the standard library are NOT safe. Use write() if you
must output something. The safe pattern is: set a flag in the handler, check it in your main
loop.
Rust and signals
Rust doesn't have built-in signal handling in the standard library. Panic is Rust's primary error mechanism for things like bounds check failures:
fn main() {
    let v = vec![1, 2, 3];
    println!("{}", v[10]); // Panics — not a signal, a Rust panic
}
thread 'main' panicked at 'index out of bounds: the len is 3 but the index is 10'
note: run with `RUST_BACKTRACE=1` for a backtrace
For actual Unix signal handling, use the signal-hook crate:
// Cargo.toml: signal-hook = "0.3"
use signal_hook::consts::SIGINT;
use signal_hook::iterator::Signals;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut signals = Signals::new(&[SIGINT])?;
    println!("Press Ctrl+C...");
    for sig in signals.forever() {
        match sig {
            SIGINT => {
                println!("Caught SIGINT! Exiting gracefully.");
                break;
            }
            _ => unreachable!(),
        }
    }
    Ok(())
}
SIGSEGV from unsafe code still kills a Rust process the same way it kills a C process. The
Rust runtime does not catch segfaults.
Process lifecycle
Every process on your system follows this lifecycle:
Parent process
│
│ fork()
├──────────────────────────┐
│ │
│ Parent continues │ Child is a COPY of parent
│ │
│ │ exec() — optional
│ │ Replace child's memory with
│ │ a new program (e.g., /bin/ls)
│ │
│ │ ... child runs ...
│ │
│ │ exit(status)
│ │ Child terminates
│ │
│ ▼
│ ┌──────────────┐
│ │ ZOMBIE │ Child's entry stays in
│ │ (defunct) │ process table until parent
│ └──────┬───────┘ calls wait()
│ │
│ wait(&status) │
│ Parent collects ◄───────┘
│ child's exit status
│
▼
Child's process table entry is finally freed
fork(): clone the process
pid_t pid = fork();
// After this line, TWO processes are running
if (pid == 0) {
// Child process — fork returned 0
printf("I'm the child, PID %d\n", getpid());
} else {
// Parent process — fork returned child's PID
printf("I'm the parent, child is %d\n", pid);
}
exec(): replace with a new program
// In the child:
execvp("ls", (char *[]){"ls", "-la", NULL});
// If exec succeeds, this line NEVER runs — the entire address space is replaced
perror("exec failed");
wait(): collect the child's exit status
int status;
pid_t child = wait(&status);
if (WIFEXITED(status)) {
printf("Child %d exited with code %d\n", child, WEXITSTATUS(status));
}
Zombie processes
If a child exits but the parent never calls wait(), the child becomes a zombie. It has
no memory, no open files, no running code — but its process table entry remains so the parent
can eventually collect the exit status.
$ ps aux | grep Z
USER PID ... STAT COMMAND
user 5678 ... Z [child] <defunct>
Zombies consume almost no resources (just one entry in the process table), but if a parent spawns thousands of children without waiting, you can exhaust the PID space.
Solution: Call wait() or waitpid() for every child. Or set SIGCHLD to SIG_IGN — this
tells the kernel to automatically reap children:
signal(SIGCHLD, SIG_IGN); // Auto-reap children. No zombies.
🧠 What do you think happens?
If a parent process exits while a child is still running, what happens to the child? Who becomes its new parent? (Hint: it's PID 1.)
Core dumps
When a process crashes with SIGSEGV, SIGABRT, or SIGQUIT (among others), the kernel can write the process's entire memory image to a file: the core dump.
$ ulimit -c unlimited # Enable core dumps
$ ./crash
Segmentation fault (core dumped)
$ file core
core: ELF 64-bit LSB core file, x86-64
$ gdb ./crash core
(gdb) bt # Full backtrace at crash time
(gdb) info registers # All register values
(gdb) x/10x $rsp # Stack contents
(gdb) print *pointer_var # Examine variables
A core dump contains: all mapped memory regions (stack, heap, data), register values for every thread, signal information, and the memory map. It's a complete snapshot of the dying process.
🔧 Task: Build a Ctrl+C counter
Write a C program that:
- Installs a SIGINT handler using sigaction()
- Counts how many times Ctrl+C is pressed
- After 3 presses, prints "OK fine, exiting." and terminates
- Uses only a volatile sig_atomic_t for the counter (signal safety)
- Uses write() in the handler, not printf()
// save as counter.c — compile: gcc -o counter counter.c
#include <signal.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
volatile sig_atomic_t count = 0;
void handler(int sig) {
count++;
const char *msg = "Caught SIGINT!\n";
write(STDOUT_FILENO, msg, 15);
}
int main() {
struct sigaction sa = { .sa_handler = handler };
sigemptyset(&sa.sa_mask);
sigaction(SIGINT, &sa, NULL);
printf("Press Ctrl+C three times to quit (PID: %d)\n", getpid());
while (count < 3) {
pause(); // Sleep until a signal arrives
}
printf("OK fine, exiting after %d SIGINTs.\n", count);
return 0;
}
Bonus: Send other signals from another terminal and observe the behavior:
$ kill -SIGTERM $(pgrep counter) # Not caught — default action kills
$ kill -SIGUSR1 $(pgrep counter) # Not caught — default action kills
$ kill -SIGSTOP $(pgrep counter) # Pauses the process (can't catch)
$ kill -SIGCONT $(pgrep counter) # Resumes it
The Allocator: What malloc Really Does
Type this right now
$ ltrace -e malloc,free ./your_program 2>&1 | head -20
Don't have a program handy? Use this:
// save as allocs.c — compile: gcc -o allocs allocs.c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
int main() {
char *a = malloc(32);
char *b = malloc(128);
char *c = malloc(256 * 1024); // 256 KB — big allocation!
printf("a = %p (32 bytes)\n", (void *)a);
printf("b = %p (128 bytes)\n", (void *)b);
printf("c = %p (256 KB)\n", (void *)c);
free(a);
free(b);
free(c);
return 0;
}
$ ltrace -e malloc,free ./allocs
malloc(32) = 0x55a000
malloc(128) = 0x55a030
malloc(262144) = 0x7f4a00000000
free(0x55a000)
free(0x55a030)
free(0x7f4a00000000)
Notice something? a and b are close together (in the heap/brk region). But c is in a
completely different address range. That's because the allocator uses two different strategies
depending on the allocation size.
Now watch the system calls:
$ strace -e brk,mmap,munmap ./allocs 2>&1 | grep -E "^(brk|mmap|munmap)"
brk(NULL) = 0x55a000
brk(0x57b000) = 0x57b000 ← extend heap
mmap(NULL, 266240, ..., MAP_PRIVATE|MAP_ANONYMOUS, ...) = 0x7f4a00000000 ← big alloc
munmap(0x7f4a00000000, 266240) = 0 ← big free
malloc called brk for the small ones. mmap for the big one. Two paths to the kernel.
malloc is NOT a system call
This surprises people. malloc() is a library function — it runs entirely in user space.
It maintains its own data structures, its own free lists, its own bookkeeping. It only calls
the kernel when it needs more memory from the OS.
Your code
─────────
ptr = malloc(32);
│
▼
┌──────────────────────────────────────────────┐
│ C Library (glibc, musl, etc.) │
│ │
│ malloc() implementation: │
│ 1. Check free list — is there a free chunk │
│ that fits? │
│ YES → carve it out, return pointer │
│ NO → ask the kernel for more memory │
│ (brk or mmap) │
│ │
│ free() implementation: │
│ 1. Mark chunk as free │
│ 2. Add to free list │
│ 3. Maybe coalesce with neighbors │
│ 4. Maybe return memory to kernel │
└──────────────────────────────────────────────┘
│
│ Only when needed:
▼
┌──────────────────────────────────────────────┐
│ Kernel │
│ brk() — extend the heap (contiguous) │
│ mmap() — map pages anywhere │
└──────────────────────────────────────────────┘
Two ways to get memory from the kernel
brk / sbrk: the heap
The classic heap is a contiguous region that grows upward. brk() moves the "program break"
— the boundary between allocated and unallocated heap space.
Before malloc:
0x555555570000 ┌──────────────────┐
│ .data, .bss │
0x555555580000 ├──────────────────┤ ◄── program break (brk)
│ │
│ (unallocated) │
│ │
└──────────────────┘
After malloc(32) + malloc(128):
0x555555570000 ┌──────────────────┐
│ .data, .bss │
0x555555580000 ├──────────────────┤ ◄── old brk
│ [chunk: 32 bytes]│
│ [chunk: 128 bytes│
0x55555559B000 ├──────────────────┤ ◄── new brk (moved up)
│ (unallocated) │
└──────────────────┘
brk is fast (just moving a pointer in kernel data structures) but only works for contiguous
growth.
mmap anonymous: pages anywhere
For large allocations (typically >128 KB), the allocator uses mmap with MAP_ANONYMOUS to
get pages at an arbitrary virtual address:
mmap region (somewhere in the address space):
0x7f4a00000000 ┌──────────────────────┐
│ │
│ 256 KB │ ◄── mmap'd directly
│ (65 pages) │
│ │
0x7f4a00040000 └──────────────────────┘
Advantages of mmap for large allocations:
- Can be returned to the OS immediately via munmap(). With brk, you can only shrink the heap from the top — a free block in the middle stays allocated.
- No fragmentation of the brk region — large blocks don't interfere with small ones.
Inside a heap chunk
When you malloc(32), the allocator actually allocates more than 32 bytes. Every chunk has a
header containing metadata:
What malloc returns What's actually in memory
────────────────── ────────────────────────────
ptr ──────────────────────► ┌─────────────────────────────┐
│ prev_size (8 bytes) │ ◄── only used if
│ │ previous chunk is free
├─────────────────────────────┤
│ size + flags (8 bytes) │ ◄── chunk header
│ bits: [size | A | M | P] │
ptr points HERE ──────────► ├─────────────────────────────┤
│ │
│ User data (32 bytes) │ ◄── what you requested
│ │
├─────────────────────────────┤
│ (alignment padding) │
└─────────────────────────────┘
Total chunk size: 48 bytes for a 32-byte request
Overhead: 16 bytes (prev_size + size header)
The flags in the size field:
- P (PREV_INUSE): is the previous chunk allocated?
- M (IS_MMAPPED): was this chunk allocated via mmap?
- A (NON_MAIN_ARENA): does this belong to a non-main arena? (threading)
The malloc pointer you get back is 16 bytes past the start of the actual chunk. When you
call free(ptr), the allocator subtracts 16 to find the header.
The free list
When you free() a chunk, the allocator doesn't give the memory back to the kernel (usually).
It adds the chunk to a free list — a linked list of available chunks, threaded through the
chunks themselves.
Heap after several malloc + free operations:
┌──────────────────────────────────────────────────────────┐
│ │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ALLOC │ │ FREE │ │ALLOC │ │ FREE │ │
│ │ 64 B │ │ 128 B │ │ 32 B │ │ 256 B │ │
│ │ │ │ ┌──┐ │ │ │ │ ┌──┐ │ │
│ │ │ │ │FD├──┼──┼────────┼──┼─►│FD├──┼──► ... │
│ │ │ │ │BK│◄─┼──┼────────┼──┼──│BK│ │ │
│ │ │ │ └──┘ │ │ │ │ └──┘ │ │
│ └────────┘ └────────┘ └────────┘ └────────┘ │
│ │
└──────────────────────────────────────────────────────────┘
FD = forward pointer (next free chunk)
BK = backward pointer (previous free chunk)
The free list is a doubly-linked list INSIDE the free chunks.
No extra memory needed — the user data area stores the pointers.
malloc(32): the full sequence
1. malloc(32) called
2. Round up to minimum chunk size: 32 + 16 (header) = 48 bytes
Align to 16 bytes: 48 bytes
3. Check fastbin for 48-byte chunks
├── Found? → unlink from fastbin, return user pointer
└── Not found? → continue
4. Check small bin for 48-byte chunks
├── Found? → unlink from bin, return user pointer
└── Not found? → continue
5. Check unsorted bin — scan for any chunk that fits
├── Exact match? → return it
├── Larger chunk? → split it, return the piece, put remainder in bin
└── Nothing? → continue
6. Check larger bins for a bigger chunk to split
├── Found? → split, return piece, put remainder back
└── Not found? → continue
7. Ask the kernel for more memory
brk() to extend the heap by at least 128 KB
Carve a 48-byte chunk from the new space
Return user pointer
free(ptr): what happens
1. free(ptr) called
2. Subtract 16 bytes to find chunk header
Read the size field → this chunk is 48 bytes
3. Is previous chunk free? (check PREV_INUSE bit)
├── Yes → coalesce: merge with previous chunk, update size
└── No → skip
4. Is next chunk free? (check next chunk's PREV_INUSE)
├── Yes → coalesce: merge with next chunk, update size
└── No → set next chunk's PREV_INUSE = 0
5. Add the (possibly coalesced) chunk to the appropriate free list
Small chunk (< 512 bytes) → fastbin or smallbin
Larger chunk → unsorted bin
6. Was this chunk mmap'd? (IS_MMAPPED flag)
├── Yes → munmap() it immediately — returns pages to kernel
└── No → it stays in the heap free list
Coalescing is critical. Without it, you'd end up with many small free chunks that can't satisfy larger requests, even though the total free space is large. This is external fragmentation.
Why double-free is catastrophic
int *p = malloc(32);
free(p);
free(p); // DOUBLE FREE — disaster
After first free(p):
Free list: HEAD → [chunk at p] → ...
│
└─ FD points to next free chunk
After second free(p):
Free list: HEAD → [chunk at p] → [chunk at p] → ...
│ │
└──────────────────┘
(circular! chunk points to itself)
Now malloc(32) returns p.
Then malloc(32) returns p AGAIN.
Two different parts of your program think they own the same memory.
They overwrite each other's data. Heap corruption. Potential code execution.
This is why double-free is a security vulnerability, not just a bug.
🧠 What do you think happens?
You malloc(32), write to it, free() it, then immediately malloc(32) again. Do you get the same pointer back? Why or why not? (Try it.)
Thread safety: arenas
In a multithreaded program, multiple threads call malloc() simultaneously. A global lock
would be a bottleneck. The solution: arenas.
┌────────────────────────────────────────────────────┐
│ Process │
│ │
│ Thread 1 Thread 2 Thread 3 │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Arena 0 │ │Arena 1 │ │Arena 2 │ │
│ │(main) │ │ │ │ │ │
│ │ lock │ │ lock │ │ lock │ │
│ │ heap │ │ heap │ │ heap │ │
│ │ bins │ │ bins │ │ bins │ │
│ └────────┘ └────────┘ └────────┘ │
│ │
│ Each arena has its own lock and free lists. │
│ Threads pick an arena (usually round-robin). │
│ Contention is reduced: threads rarely fight │
│ for the same lock. │
└────────────────────────────────────────────────────┘
glibc creates up to 8 * num_cores arenas. Each thread is assigned to an arena and sticks
with it (usually). The main arena uses brk(); secondary arenas use mmap().
Rust: same allocator underneath
Rust's default allocator is the system allocator — it literally calls malloc and free
from your platform's C library.
fn main() {
    // Box::new calls the global allocator (malloc underneath)
    let x = Box::new(42);
    println!("x = {} at {:p}", x, &*x);
    // x dropped here → allocator calls free()

    // Vec grows by calling realloc or malloc+copy
    let mut v = Vec::new();
    for i in 0..10 {
        v.push(i);
        println!("len={}, cap={}, ptr={:p}", v.len(), v.capacity(), v.as_ptr());
    }
}
len=1, cap=4, ptr=0x55a020 ← initial allocation
len=2, cap=4, ptr=0x55a020
len=3, cap=4, ptr=0x55a020
len=4, cap=4, ptr=0x55a020
len=5, cap=8, ptr=0x55a040 ← doubled! Allocated new buffer, freed old
len=6, cap=8, ptr=0x55a040
len=7, cap=8, ptr=0x55a040
len=8, cap=8, ptr=0x55a040
len=9, cap=16, ptr=0x55a060 ← doubled again!
len=10, cap=16, ptr=0x55a060
You can swap in a different allocator with #[global_allocator]:
#![allow(unused)]
fn main() {
    use std::alloc::System;

    #[global_allocator]
    static GLOBAL: System = System; // Explicitly use system allocator (the default)

    // Or use jemalloc, mimalloc, etc. for better multithreaded performance
}
💡 Fun Fact: Firefox switched from the system allocator to jemalloc and saw measurable performance improvements. The allocator you choose matters — it affects fragmentation, multithreaded scaling, and memory overhead. For most programs, the default is fine. For high-performance servers, it's worth benchmarking alternatives.
🔧 Task: Watch the allocator in action
Step 1: ltrace — see malloc/free calls:
$ ltrace -e malloc,calloc,realloc,free ./allocs 2>&1 | head -20
Step 2: strace — see when it talks to the kernel:
$ strace -e brk,mmap,munmap ./allocs 2>&1 | head -20
Step 3: Write a program that does many small allocations, then many large ones, and observe the difference:
// save as alloc_patterns.c — compile: gcc -o alloc_patterns alloc_patterns.c
#include <stdio.h>
#include <stdlib.h>
int main() {
printf("=== Small allocations (brk region) ===\n");
void *ptrs[10];
for (int i = 0; i < 10; i++) {
ptrs[i] = malloc(64);
printf(" malloc(64) = %p\n", ptrs[i]);
}
printf("\n=== Large allocation (mmap region) ===\n");
void *big = malloc(1024 * 1024); // 1 MB
printf(" malloc(1MB) = %p\n", big);
// Free in reverse order — watch coalescing
printf("\n=== Freeing ===\n");
free(big);
for (int i = 9; i >= 0; i--) {
free(ptrs[i]);
}
printf("Done.\n");
return 0;
}
Run with strace -e brk,mmap,munmap and observe: small allocations trigger one brk() call
(to extend the heap). The large allocation triggers mmap(). free(big) triggers munmap().
The small free()s trigger nothing — the memory stays in the process's free list.
Rust's Memory Story
Type this right now
// save as ownership.rs — compile: rustc ownership.rs
fn main() {
    let s1 = String::from("hello");
    let s2 = s1; // s1 is MOVED to s2
    // println!("{}", s1); // Uncomment this — the compiler will refuse.
    println!("{}", s2); // Only s2 is valid now.
}
$ rustc ownership.rs
$ ./ownership
hello
Now uncomment the s1 line:
error[E0382]: borrow of moved value: `s1`
--> ownership.rs:5:20
|
2 | let s1 = String::from("hello");
| -- move occurs because `s1` has type `String`
3 | let s2 = s1;
| -- value moved here
4 |
5 | println!("{}", s1);
| ^^ value borrowed here after move
That's the borrow checker. It just prevented a use-after-free at compile time. In C, you'd get undefined behavior. In Rust, you get a compiler error with an explanation of exactly what went wrong.
Ownership: one owner, one lifetime
Every value in Rust has exactly one owner. When the owner goes out of scope, the value is dropped (freed). No garbage collector. No manual free. The compiler inserts drop calls at exactly the right places.
fn main() {
    {
        let s = String::from("hello"); // s owns the String
        println!("{}", s);
    } // s goes out of scope → String::drop() called → heap memory freed
    // s does not exist here. No dangling pointer possible.
}
Compare with C:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main() {
{
char *s = malloc(6);
strcpy(s, "hello");
printf("%s\n", s);
// Forgot free(s)? → memory leak
// Remembered free(s), then used s later? → use-after-free
free(s);
}
// s still exists as a dangling pointer. C doesn't care.
return 0;
}
Rust ownership model:
┌─────────────┐ owns ┌──────────────────┐
│ Variable s │─────────────►│ Heap: "hello" │
│ (on stack) │ │ (5 bytes + null) │
└─────────────┘ └──────────────────┘
│ │
│ s goes out of scope │
▼ ▼
s is gone memory is freed (drop)
No dangling ref No leak
Move semantics
When you assign a String to another variable, ownership moves. The original variable
becomes invalid. This prevents double-free.
fn main() {
    let s1 = String::from("hello");
    let s2 = s1; // MOVE — s1 is invalidated

    // In memory:
    //   s1's stack data (ptr, len, cap) was COPIED to s2
    //   But s1 is now considered uninitialized by the compiler
    //   The heap data was NOT copied — same pointer, one owner

    println!("{}", s2);
}
Before move:
┌──── s1 ────┐ ┌──────────────────┐
│ ptr ───────────────────────►│ 'h' 'e' 'l' 'l' │
│ len: 5 │ │ 'o' │
│ cap: 5 │ └──────────────────┘
└────────────┘
After move (let s2 = s1):
┌──── s1 ────┐ ┌──────────────────┐
│ INVALID │ │ 'h' 'e' 'l' 'l' │
│ (compiler │ ┌──────────►│ 'o' │
│ forbids) │ │ └──────────────────┘
└────────────┘ │
│
┌──── s2 ────┐ │
│ ptr ────────────┘
│ len: 5 │
│ cap: 5 │
└────────────┘
For types that are cheap to copy (i32, f64, bool, char), Rust uses Copy instead
of move. The value is duplicated, and both variables remain valid:
#![allow(unused)]
fn main() {
    let x = 42;
    let y = x; // COPY — both x and y are valid
    println!("{} {}", x, y); // Fine! Integers implement Copy.
}
💡 Fun Fact: The distinction between Copy and Move is a zero-cost abstraction. At the machine code level, both are a memcpy of the stack data. The difference is purely in what the compiler permits afterward. Move adds no runtime cost — it's just a compile-time rule.
Borrowing: references without ownership
Sometimes you want to use a value without taking ownership. That's borrowing.
fn print_length(s: &String) { // borrows s — does not own it
    println!("Length: {}", s.len());
} // s (the reference) goes out of scope, but the String is NOT dropped

fn main() {
    let s = String::from("hello");
    print_length(&s); // lend s to the function
    println!("{}", s); // s is still valid!
}
Two kinds of references:
&T — shared reference (read-only)
• Multiple &T can exist simultaneously
• Cannot modify the data
• Like a "read lock"
&mut T — exclusive reference (read-write)
• Only ONE &mut T can exist at a time
• No &T can coexist with &mut T
• Like a "write lock"
fn main() {
    let mut s = String::from("hello");

    let r1 = &s; // OK — shared reference
    let r2 = &s; // OK — multiple shared refs allowed
    println!("{} {}", r1, r2);

    let r3 = &mut s; // OK — r1 and r2 are no longer used after this point
    r3.push_str(" world");
    println!("{}", r3);
}
The compiler enforces these rules at compile time. No runtime overhead. No data races possible in safe code.
🧠 What do you think happens?
#![allow(unused)]
fn main() {
    let mut v = vec![1, 2, 3];
    let first = &v[0]; // Borrow an element
    v.push(4);         // Modify the vector
    println!("{}", first);
}

Does this compile? Why or why not? (Hint: what does push do if the vector needs to grow?)
Lifetimes: how long references are valid
The compiler tracks the lifetime of every reference — how long the borrowed data is valid.
#![allow(unused)]
fn main() {
    fn longest<'a>(s1: &'a str, s2: &'a str) -> &'a str {
        if s1.len() > s2.len() { s1 } else { s2 }
    }
}
The 'a says: "the returned reference lives as long as the shorter of s1 and s2."
#![allow(unused)]
fn main() {
    fn broken() -> &str {
        let s = String::from("hello");
        &s // ERROR: returning reference to local variable
    } // s is dropped here — reference would dangle
}
error[E0106]: missing lifetime specifier
error[E0515]: cannot return reference to local variable `s`
In C, this compiles silently and causes a use-after-free:
char *broken() {
char s[] = "hello"; // Stack-allocated
return s; // Returns pointer to stack frame that's about to be freed
} // s is gone. Caller has a dangling pointer.
Smart pointers: Box, Vec, String
Box<T>: single heap allocation
fn main() {
    let x = Box::new(42); // Allocates 4 bytes on the heap
    println!("x = {} at {:p}", x, &*x);
} // x dropped → heap memory freed
Stack Heap
┌──────────┐ ┌──────┐
│ x: *ptr ─┼───────────►│ 42 │
│ (8 bytes)│ │(4 B) │
└──────────┘ └──────┘
Box<T> is exactly one pointer wide. It's Rust's equivalent of malloc + free, but with
automatic cleanup.
Vec<T>: growable array
fn main() {
    let mut v: Vec<i32> = Vec::new();
    println!("Empty: len={}, cap={}", v.len(), v.capacity()); // 0, 0

    v.push(1); // Allocates
    v.push(2);
    v.push(3);
    v.push(4);
    println!("After 4: len={}, cap={}", v.len(), v.capacity()); // 4, 4

    v.push(5); // Capacity exceeded → reallocate (double)
    println!("After 5: len={}, cap={}", v.len(), v.capacity()); // 5, 8
}
Vec<i32> on the stack: Heap buffer:
┌───────────────────┐ ┌───┬───┬───┬───┬───┬───┬───┬───┐
│ ptr ──────────────┼──────►│ 1 │ 2 │ 3 │ 4 │ 5 │ │ │ │
│ len: 5 │ └───┴───┴───┴───┴───┴───┴───┴───┘
│ cap: 8 │ used (5) unused (3)
└───────────────────┘
24 bytes on stack
When Vec grows past capacity, it allocates a new buffer (typically 2x), copies elements,
and frees the old buffer. This is exactly what C's realloc does, but Rust's borrow checker
ensures no references to the old buffer survive the reallocation.
String: it's a Vec<u8>
fn main() {
    let s = String::from("hello");
    // String is literally: struct String { vec: Vec<u8> }
    println!("len={}, cap={}, size_of={}",
             s.len(), s.capacity(), std::mem::size_of::<String>());
    // len=5, cap=5, size_of=24 (same as Vec: ptr + len + cap)

    let slice: &str = &s; // &str is just (pointer, length) — no allocation
    println!("size_of &str = {}", std::mem::size_of::<&str>()); // 16
}
&str is a fat pointer: a pointer to UTF-8 bytes plus a length. It doesn't own anything.
It can point into a String, a string literal (in .rodata), or a slice of any UTF-8 bytes.
Reference counting: Rc and Arc
When you need multiple owners, Rust provides reference-counted pointers:
use std::rc::Rc;

fn main() {
    let a = Rc::new(String::from("shared data"));
    let b = Rc::clone(&a); // Increments reference count
    let c = Rc::clone(&a); // Increments again

    println!("References: {}", Rc::strong_count(&a)); // 3
    println!("a = {}", a);
    println!("b = {}", b);
    println!("c = {}", c);
} // c dropped (count→2), b dropped (count→1), a dropped (count→0) → data freed
Stack Heap
┌─────┐ ┌────────────────────────┐
│ a ─┼──────────────────────►│ refcount: 3 │
│ │ ┌───►│ String: "shared data" │
│ b ─┼──────────────────┘ ┌─►│ │
│ │ │ └────────────────────────┘
│ c ─┼────────────────────┘
└─────┘
Rc<T> is single-threaded only. For thread-safe reference counting, use Arc<T> (Atomic Rc):
use std::sync::Arc;
use std::thread;

fn main() {
    let data = Arc::new(vec![1, 2, 3]);

    let handles: Vec<_> = (0..3).map(|i| {
        let data = Arc::clone(&data);
        thread::spawn(move || {
            println!("Thread {}: {:?}", i, data);
        })
    }).collect();

    for h in handles {
        h.join().unwrap();
    }
}
Arc uses atomic operations for the reference count, making it safe to share across threads.
The cost: atomic increments/decrements are slower than normal increments (~5-20 ns vs ~1 ns).
The global allocator
Under the hood, Box::new, Vec::push, and String::from all call the global allocator.
By default, this is your system's malloc/free.
// You can verify this: Rust's alloc calls end up as malloc
use std::alloc::{GlobalAlloc, Layout, System};

fn main() {
    unsafe {
        let layout = Layout::new::<[u8; 64]>();
        let ptr = System.alloc(layout); // This calls malloc(64)
        println!("Allocated at: {:p}", ptr);
        System.dealloc(ptr, layout); // This calls free(ptr)
    }
}
You can replace the allocator entirely:
#![allow(unused)]
fn main() {
    // Using jemalloc (add jemallocator = "0.5" to Cargo.toml)
    // use jemallocator::Jemalloc;
    // #[global_allocator]
    // static GLOBAL: Jemalloc = Jemalloc;
}
No-alloc: embedded Rust
For embedded systems (microcontrollers, kernels), you can't use a heap at all. Rust supports
this with #![no_std]:
#![allow(unused)]
#![no_std]
#![no_main]

// No Vec, no String, no Box — nothing that allocates
// Use fixed-size arrays, stack allocation, and the heapless crate
// use heapless::Vec; // Fixed-capacity Vec, stored on the stack
// let mut v: Vec<i32, 16> = Vec::new(); // Max 16 elements, no heap

use core::panic::PanicInfo;

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}
This connects directly to embedded programming — when you're writing firmware for an STM32 or an ESP32, you have no OS, no heap, and every byte is precious. Rust's ownership model still works perfectly: stack allocation, static references, and compile-time guarantees.
💡 Fun Fact: The Linux kernel is starting to accept Rust code. Kernel code uses no heap allocator in the traditional sense — memory is managed through slab allocators and page allocators. Rust's #![no_std] + custom allocator support makes this possible.
🔧 Task: Trigger every borrow checker error
Write a Rust program (or multiple small programs) that intentionally triggers each of these compile errors. Read each error message carefully — they're some of the best error messages in any compiler.
// 1. Use after move
fn ex1() {
    let s = String::from("hello");
    let t = s;
    println!("{}", s); // E0382: borrow of moved value
}

// 2. Multiple mutable references
fn ex2() {
    let mut s = String::from("hello");
    let r1 = &mut s;
    let r2 = &mut s; // E0499: cannot borrow `s` as mutable more than once
    println!("{} {}", r1, r2);
}

// 3. Mutable + immutable reference
fn ex3() {
    let mut s = String::from("hello");
    let r1 = &s;
    let r2 = &mut s; // E0502: cannot borrow as mutable because also borrowed as immutable
    println!("{} {}", r1, r2);
}

// 4. Dangling reference
fn ex4() -> &'static str {
    let s = String::from("hello");
    &s // E0515: cannot return reference to local variable
}

// 5. Move out of borrowed content
fn ex5() {
    let v = vec![String::from("hello")];
    let s = v[0]; // E0507: cannot move out of index of `Vec<String>`
}

fn main() {
    // Uncomment each one at a time, try to compile, read the error.
    // ex1();
    // ex2();
    // ex3();
    // println!("{}", ex4());
    // ex5();
}
For each error:
- Read the error code (e.g., E0382)
- Run rustc --explain E0382 for a detailed explanation
- Fix the error using the compiler's suggestion
- Understand why the rule exists — what bug would it cause in C?
Data Structure Layout in Memory
Type this right now
// save as layout.c — compile: gcc -o layout layout.c
#include <stdio.h>
#include <stddef.h>
struct Bad {
char a; // 1 byte
int b; // 4 bytes
char c; // 1 byte
};
struct Good {
int b; // 4 bytes
char a; // 1 byte
char c; // 1 byte
};
int main() {
printf("struct Bad: sizeof = %zu\n", sizeof(struct Bad));
printf(" offset of a: %zu\n", offsetof(struct Bad, a));
printf(" offset of b: %zu\n", offsetof(struct Bad, b));
printf(" offset of c: %zu\n", offsetof(struct Bad, c));
printf("\nstruct Good: sizeof = %zu\n", sizeof(struct Good));
printf(" offset of b: %zu\n", offsetof(struct Good, b));
printf(" offset of a: %zu\n", offsetof(struct Good, a));
printf(" offset of c: %zu\n", offsetof(struct Good, c));
return 0;
}
$ gcc -o layout layout.c && ./layout
struct Bad: sizeof = 12
offset of a: 0
offset of b: 4
offset of c: 8
struct Good: sizeof = 8
offset of b: 0
offset of a: 4
offset of c: 5
Same three fields. Different order. 4 bytes smaller. If you have a million of these structs, that's 4 MB wasted on invisible padding bytes. The compiler doesn't reorder C struct fields — it lays them out exactly as you declared them. It's on you.
Why alignment matters
Modern CPUs don't read arbitrary bytes from memory. They read in aligned chunks. A 4-byte
int must start at an address divisible by 4. An 8-byte double must start at an address
divisible by 8.
Memory addresses:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
│ │ │ │ │
└── 4-byte ─┘ └── 4-byte ─┘
int at address 0: ✓ aligned (0 % 4 == 0)
int at address 4: ✓ aligned (4 % 4 == 0)
int at address 1: ✗ misaligned! (1 % 4 ≠ 0)
What happens on misaligned access?
- x86-64: works, but slower (may need two cache line reads instead of one)
- ARM (older): hardware exception — your program crashes
- RISC-V: implementation-defined — may work, may trap
The C compiler adds padding bytes between fields to ensure every field is properly aligned. These padding bytes contain garbage and waste space.
struct Bad: the layout problem
struct Bad {
char a; // 1 byte, alignment 1
int b; // 4 bytes, alignment 4
char c; // 1 byte, alignment 1
};
Byte: 0 1 2 3 4 5 6 7 8 9 10 11
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ a │ pad │ pad │ pad │ b │ b │ b │ b │ c │ pad │ pad │ pad │
│ │ │ │ │ │ │ │ │ │ │ │ │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
▲ ▲ ▲
│ │ │
a at offset 0 b at offset 4 c at offset 8
(align 1: OK) (align 4: 4%4=0 ✓) (align 1: OK)
Total: 12 bytes. But actual data is only 6 bytes.
Waste: 6 bytes of padding (50%!)
Why padding after c? The struct's alignment is max(1,4,1) = 4.
sizeof must be a multiple of 4 so arrays of structs stay aligned.
8 + 1 = 9, round up to 12.
struct Good: reordered fields
struct Good {
int b; // 4 bytes, alignment 4
char a; // 1 byte, alignment 1
char c; // 1 byte, alignment 1
};
Byte: 0 1 2 3 4 5 6 7
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ b │ b │ b │ b │ a │ c │ pad │ pad │
│ │ │ │ │ │ │ │ │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
▲ ▲ ▲
│ │ │
b at offset 0 a c
(align 4: 0%4=0 ✓)
Total: 8 bytes. Same data, 4 bytes smaller.
The trick: put larger-aligned fields first.
The golden rule in C: sort fields from largest alignment to smallest. double first, then
int/long, then short, then char. This minimizes internal padding.
🧠 What do you think happens?
struct Mystery {
    char a;
    double b;
    char c;
    int d;
};

What's sizeof(struct Mystery)? Work it out by hand before compiling. (Hint: double has alignment 8. The struct's alignment is also 8.)
Rust: the compiler reorders for you
Rust's default struct layout (repr(Rust)) allows the compiler to reorder fields for
optimal packing:
struct Bad {
    a: u8,  // 1 byte
    b: u32, // 4 bytes
    c: u8,  // 1 byte
}

fn main() {
    println!("Size of Bad: {}", std::mem::size_of::<Bad>());
    println!("Align of Bad: {}", std::mem::align_of::<Bad>());
}
$ rustc layout.rs && ./layout
Size of Bad: 8
Align of Bad: 4
Even though the fields are declared in the "bad" order, Rust produces an 8-byte struct. The
compiler silently reordered b before a and c.
C layout (struct Bad): Rust layout (same fields):
┌───┬───────┬───┬───────┐ ┌───────────┬───┬───┬─────┐
│ a │padding│ b │c+pad │ │ b │ a │ c │ pad │
└───┴───────┴───┴───────┘ └───────────┴───┴───┴─────┘
12 bytes 8 bytes
Same fields. Same alignment rules. Smaller struct.
#[repr(C)]: when you need C-compatible layout
Sometimes you need the fields in a specific order:
- FFI (Foreign Function Interface) — calling C from Rust or vice versa
- Hardware register mappings — the bytes must match the hardware's expectation
- Network protocols — the bytes go on the wire in a specific order
- Memory-mapped I/O — addresses are fixed
#[repr(C)]
struct CCompatible {
    a: u8,
    b: u32,
    c: u8,
}

fn main() {
    println!("repr(C) size: {}", std::mem::size_of::<CCompatible>()); // 12
    println!("repr(Rust) size: {}", std::mem::size_of::<Bad>()); // 8
}
#[repr(C)] tells the compiler: "Lay out fields in declaration order, with C-style padding
rules. Do not reorder." Now the Rust struct has the same layout as the C struct, byte for byte.
💡 Fun Fact: The Linux kernel's structures are defined in C with precise layouts that match hardware expectations. Any Rust code in the kernel that interacts with these structures must use #[repr(C)] to guarantee layout compatibility. Getting this wrong means reading the wrong field at the wrong offset — silent data corruption.
Rust enum layout
Rust enums are tagged unions: a discriminant (tag) that identifies the variant, plus the data for the active variant.
enum Message {
    Quit,                    // No data
    Move { x: i32, y: i32 }, // 8 bytes of data
    Write(String),           // 24 bytes of data (ptr + len + cap)
}

fn main() {
    println!("Size of Message: {}", std::mem::size_of::<Message>());
    println!("Size of String: {}", std::mem::size_of::<String>());
}
Size of Message: 32
Size of String: 24
Enum layout (conceptual):
┌──────────────┬──────────────────────────────────┐
│ Discriminant │ Payload │
│ (tag) │ (large enough for biggest variant)│
├──────────────┼──────────────────────────────────┤
│ 0 (Quit) │ (unused — 24 bytes of nothing) │
│ 1 (Move) │ x: i32, y: i32, (16B padding) │
│ 2 (Write) │ String (ptr, len, cap) = 24B │
└──────────────┴──────────────────────────────────┘
Total = discriminant (padded to the payload's alignment) + size(largest variant)
      = 8 + 24 = 32 bytes
The size of the enum is the size of the discriminant plus the size of the largest variant.
Every variant uses the same amount of space, even Quit which has no data. This is the cost
of a tagged union.
Niche optimization: zero-cost Option
Here's where Rust gets clever. Option<&T> is the same size as &T:
fn main() {
    println!("Size of &i32: {}", std::mem::size_of::<&i32>());
    println!("Size of Option<&i32>: {}", std::mem::size_of::<Option<&i32>>());
    println!();
    println!("Size of Box<i32>: {}", std::mem::size_of::<Box<i32>>());
    println!("Size of Option<Box<i32>>: {}", std::mem::size_of::<Option<Box<i32>>>());
}
Size of &i32: 8
Size of Option<&i32>: 8 ← SAME SIZE! No extra discriminant byte.
Size of Box<i32>: 8
Size of Option<Box<i32>>: 8 ← Also the same!
How? Niche optimization. A reference (&T) can never be null. So the compiler uses the
null bit pattern (all zeros) to represent None. No extra tag needed.
Option<&i32> layout:
Some(&val): ┌────────────────────────────────────┐
│ 0x00007FFF12340000 (valid pointer) │
└────────────────────────────────────┘
None: ┌────────────────────────────────────┐
│ 0x0000000000000000 (null = None) │
└────────────────────────────────────┘
Same 8 bytes. The "impossible" value (null) serves as the discriminant.
This works for any type that has an "impossible" bit pattern:
use std::num::NonZeroU32;

fn main() {
    println!("Size of u32: {}", std::mem::size_of::<u32>());
    println!("Size of Option<NonZeroU32>: {}", std::mem::size_of::<Option<NonZeroU32>>());
    // Both are 4 bytes! Value 0 represents None.
}
In C, you'd represent an "optional pointer" as NULL — but nothing stops you from
accidentally dereferencing it. In Rust, Option<&T> has the same representation as a C
pointer (8 bytes, with null for "absent"), but the compiler forces you to check for None
before accessing the value.
Examining layout: the tools
C
#include <stdio.h>
#include <stddef.h>
#include <stdalign.h>
struct Example {
char a;
double b;
int c;
char d;
};
int main() {
printf("sizeof: %zu\n", sizeof(struct Example));
printf("alignof: %zu\n", alignof(struct Example));
printf("offsetof a: %zu\n", offsetof(struct Example, a));
printf("offsetof b: %zu\n", offsetof(struct Example, b));
printf("offsetof c: %zu\n", offsetof(struct Example, c));
printf("offsetof d: %zu\n", offsetof(struct Example, d));
return 0;
}
sizeof: 24
alignof: 8
offsetof a: 0
offsetof b: 8
offsetof c: 16
offsetof d: 20
Byte: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
┌──┬─────────────────────┬──────────────────────┬───────────┬──┬────────┐
│a │       padding       │          b           │     c     │d │padding │
│  │      (7 bytes!)     │      (8 bytes)       │ (4 bytes) │  │ (3 B)  │
└──┴─────────────────────┴──────────────────────┴───────────┴──┴────────┘
Rust
use std::mem;

struct Example {
    a: u8,
    b: f64,
    c: u32,
    d: u8,
}

fn main() {
    println!("size_of: {}", mem::size_of::<Example>());
    println!("align_of: {}", mem::align_of::<Example>());

    // To see field offsets, we need a trick — create an instance:
    let e = Example { a: 0, b: 0.0, c: 0, d: 0 };
    let base = &e as *const _ as usize;
    println!("offset of a: {}", &e.a as *const _ as usize - base);
    println!("offset of b: {}", &e.b as *const _ as usize - base);
    println!("offset of c: {}", &e.c as *const _ as usize - base);
    println!("offset of d: {}", &e.d as *const _ as usize - base);
}
size_of: 16 ← Smaller! Rust reordered fields.
align_of: 8
offset of b: 0 ← b (8 bytes, align 8) placed first
offset of c: 8 ← c (4 bytes, align 4) placed second
offset of a: 12 ← a and d packed together
offset of d: 13
Rust layout (compiler reordered):
Byte: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
┌──────────────────────┬───────────┬──┬──┬──────┐
│ b │ c │a │d │ pad │
│ (8 bytes) │ (4 bytes) │ │ │(2 B) │
└──────────────────────┴───────────┴──┴──┴──────┘
C layout: 24 bytes. Rust layout: 16 bytes. Same fields.
Packed structs: removing all padding
Sometimes you want no padding at all — typically for wire protocols or file formats:
// C: using __attribute__((packed))
struct __attribute__((packed)) Packed {
char a;
int b;
char c;
};
// sizeof = 6. No padding. But misaligned access to b!
#![allow(unused)]
fn main() {
    // Rust: using repr(packed)
    #[repr(packed)]
    struct Packed {
        a: u8,
        b: u32,
        c: u8,
    }
    // size_of = 6. No padding. Accessing b requires care.
}
Packed structs are dangerous: accessing b at an odd offset may cause a hardware exception
on some architectures, or slow misaligned access on x86. Rust makes you use unsafe to take
references to misaligned fields, or copy them to a local variable first.
🧠 What do you think happens?
You have a #[repr(packed)] struct in Rust and try to take &packed_struct.b where b is a u32 at offset 1. Does it compile? Does it crash? What does the compiler warn you about?
🔧 Task: Compare C and Rust layouts
Step 1: Create this struct in both C and Rust:
// C version
struct Record {
char type_flag; // 1 byte
double value; // 8 bytes
short count; // 2 bytes
char active; // 1 byte
int id; // 4 bytes
};
#![allow(unused)]
fn main() {
    // Rust version
    struct Record {
        type_flag: u8, // 1 byte
        value: f64,    // 8 bytes
        count: i16,    // 2 bytes
        active: u8,    // 1 byte
        id: i32,       // 4 bytes
    }
}
Step 2: Print sizeof / size_of and all field offsets in both languages.
Step 3: Draw the byte-level layout diagram for each, marking padding bytes.
Step 4: Reorder the C struct fields to minimize padding. Verify the new size matches (or approaches) Rust's automatically optimized layout.
Step 5: Add #[repr(C)] to the Rust struct and confirm the size matches your C struct.
Expected results:
C (original order): sizeof = 32
C (reordered): sizeof = 24 (or less)
Rust (default): size_of = 24
Rust (repr(C)): size_of = 32 (matches C original)
The compiler is a better struct packer than most humans — but only if you let it (Rust default) or think about it (C manual ordering).
Threads and Shared Memory
Type This First
Save this as race.c and compile with gcc -pthread -o race race.c:
#include <stdio.h>
#include <pthread.h>
int counter = 0; // shared global
void *increment(void *arg) {
for (int i = 0; i < 1000000; i++) {
counter++; // looks innocent...
}
return NULL;
}
int main(void) {
pthread_t t1, t2;
pthread_create(&t1, NULL, increment, NULL);
pthread_create(&t2, NULL, increment, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
printf("Expected: 2000000\n");
printf("Got: %d\n", counter);
return 0;
}
Run it five times. You will get a different wrong answer each time.
Threads vs Processes
A process is an isolated world. Its own address space, its own page tables, its own file descriptors. When you fork(), the child gets a copy of everything.
A thread is different. Threads live inside a process. They share the same address space. Same heap. Same globals. Same code. Same file descriptors.
What each thread gets of its own: a stack and a set of registers. That's it.
+--------------------------------------------------+
| Process (PID 42) |
| |
| +----------+ +----------+ +----------+ |
| | Thread 1 | | Thread 2 | | Thread 3 | |
| | Stack | | Stack | | Stack | |
| | regs | | regs | | regs | |
| +----+-----+ +----+-----+ +----+-----+ |
| | | | |
| v v v |
| +------------------------------------------------+
| | Shared Address Space |
| | |
| | .text (code) -- same code, all threads |
| | .data (globals) -- same globals, all threads |
| | heap -- same heap, all threads |
| | mmap region -- same mappings |
| +------------------------------------------------+
+--------------------------------------------------+
This sharing is what makes threads fast. No copying memory. No new page tables. Context-switching between threads in the same process is cheap.
It is also what makes threads dangerous.
The Data Race Problem
Look at counter++ in the program above. In C, that is one statement. But in machine code, it is three operations:
Thread A Thread B
-------- --------
1. READ counter (gets 5)
1. READ counter (gets 5)
2. ADD 1 (now 6)
2. ADD 1 (now 6)
3. WRITE counter (writes 6)
3. WRITE counter (writes 6)
Result: counter = 6, not 7. One increment was LOST.
This is called a lost update. Two threads read the same value, compute the same result, and one write overwrites the other.
What do you think happens?
If you run the race program with 10 threads instead of 2, does the error get bigger or smaller? Why?
The counter++ operation is a read-modify-write. It is NOT atomic. The CPU does not execute it as one indivisible step. The OS can switch threads between any of those three micro-steps.
C Solution: Mutexes (Discipline-Based)
In C, you protect shared data with a mutex (mutual exclusion lock):
#include <stdio.h>
#include <pthread.h>
int counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
void *increment(void *arg) {
for (int i = 0; i < 1000000; i++) {
pthread_mutex_lock(&lock);
counter++;
pthread_mutex_unlock(&lock);
}
return NULL;
}
int main(void) {
pthread_t t1, t2;
pthread_create(&t1, NULL, increment, NULL);
pthread_create(&t2, NULL, increment, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
printf("Expected: 2000000\n");
printf("Got: %d\n", counter);
return 0;
}
Now the count is always exactly 2,000,000.
But notice the problem: the mutex and the counter are separate things. Nothing connects them. The compiler does not know that counter requires lock. You could forget to lock. You could lock the wrong mutex. You could access counter from a new function and never realize it needs protection.
The relationship between the lock and the data it protects exists only in the programmer's head.
Rust Solution: Mutex<T> Wraps the Data
In Rust, the mutex contains the data. You cannot touch the data without locking:
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let counter = Arc::new(Mutex::new(0));
    let mut handles = vec![];
    for _ in 0..2 {
        let counter = Arc::clone(&counter);
        let handle = thread::spawn(move || {
            for _ in 0..1_000_000 {
                let mut num = counter.lock().unwrap();
                *num += 1;
                // lock released here when `num` goes out of scope
            }
        });
        handles.push(handle);
    }
    for handle in handles {
        handle.join().unwrap();
    }
    println!("Expected: 2000000");
    println!("Got: {}", *counter.lock().unwrap());
}
Mutex::new(0) wraps the integer. The only way to read or write the integer is to call .lock(), which returns a guard. When the guard is dropped, the lock is released.
You literally cannot access the data without locking. The type system enforces it.
Send and Sync: The Compiler Checks Your Threads
Rust has two marker traits that the compiler checks automatically:
- Send: the type is safe to move to another thread.
- Sync: the type is safe to share between threads by reference (that is, &T is Send).
You never implement these by hand (usually). The compiler derives them from the types you use.
+-------------------+------+------+
| Type | Send | Sync |
+-------------------+------+------+
| i32, String, Vec | Y | Y |
| Mutex<T> | Y | Y | <-- designed for sharing
| Arc<T> | Y | Y | <-- atomic ref count
| Rc<T> | N | N | <-- NOT thread-safe
| Cell<T> | Y | N | <-- interior mutability, not Sync
| *mut T | N | N | <-- raw pointers
+-------------------+------+------+
Rc<T> uses a non-atomic reference count. If two threads increment it simultaneously, you get the same lost-update problem we saw with counter. So Rust marks Rc<T> as !Send. The compiler will refuse to let you move it to another thread.
Arc<T> uses atomic reference counting. It is safe to share. The compiler allows it.
The Compiler Catches Data Races
Try this in Rust — sharing an Rc across threads:
use std::rc::Rc;
use std::thread;

fn main() {
    let data = Rc::new(vec![1, 2, 3]);
    let data_clone = Rc::clone(&data);
    thread::spawn(move || {
        println!("{:?}", data_clone);
    });
}
This does NOT compile:
error[E0277]: `Rc<Vec<i32>>` cannot be sent between threads safely
--> src/main.rs:8:5
|
8 | thread::spawn(move || {
| ^^^^^^^^^^^^^ `Rc<Vec<i32>>` cannot be sent between threads safely
|
= help: the trait `Send` is not implemented for `Rc<Vec<i32>>`
In C, the equivalent code compiles without any warning. It crashes at runtime. Maybe. Or worse — it corrupts memory silently and you only find out in production.
Fun Fact
The Rust compiler's thread-safety checks have no runtime cost. Send and Sync are "zero-sized" marker traits — they exist only at compile time. Your binary contains no trace of them.
Atomic Types: Lock-Free Concurrency
Sometimes a full mutex is overkill. For simple counters and flags, CPUs provide atomic instructions that complete as one indivisible step.
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let counter = Arc::new(AtomicU32::new(0));
    let mut handles = vec![];
    for _ in 0..2 {
        let counter = Arc::clone(&counter);
        let handle = thread::spawn(move || {
            for _ in 0..1_000_000 {
                counter.fetch_add(1, Ordering::Relaxed);
            }
        });
        handles.push(handle);
    }
    for handle in handles {
        handle.join().unwrap();
    }
    println!("Got: {}", counter.load(Ordering::Relaxed));
}
fetch_add typically compiles to a single lock-prefixed instruction (lock xadd or lock add) on x86-64. No mutex, no blocking, no deadlock possible.
C11 has atomics too (<stdatomic.h>), but again — nothing stops you from mixing atomic and non-atomic access to the same variable.
The Spectrum of Concurrency Safety
C Rust
| |
v v
No guardrails ------> Compiler-enforced
Data races compile Data races don't compile
Mutex is separate Mutex wraps data
Human discipline Type system discipline
Bugs found at runtime Bugs found at compile time
(or never found) (before your code ships)
This does not mean Rust concurrency is easy. Deadlocks are still possible. Logic bugs are still possible. But an entire class of bugs — data races — is eliminated at compile time.
Task
- Compile and run the race.c program at the top of this chapter. Run it 10 times. Record the different results.
- Try the Rc example in Rust. Read the compiler error carefully.
- Replace Rc with Arc and vec![1,2,3] with Mutex::new(0). Make the threaded counter work.
- Modify the C version to use 10 threads. How wrong does the count get?
- Try the atomic Rust version. Verify the count is always exactly 2,000,000.
- Bonus: Use time to compare the mutex version vs the atomic version. Which is faster? Why?
Where C Shines, Where Rust Shines
Type This First
Save this as add.c:
// add.c
int add(int a, int b) {
return a + b;
}
And this as main.rs:
extern "C" {
    fn add(a: i32, b: i32) -> i32;
}

fn main() {
    unsafe {
        println!("C says: {}", add(3, 4));
    }
}
Compile and link them together:
$ gcc -c -o add.o add.c
$ rustc main.rs -l static=add -L . --edition 2021
... wait, that fails: rustc looks for libadd.a, and we only have add.o. Build the archive:
$ ar rcs libadd.a add.o
$ rustc main.rs -l static=add -L . --edition 2021
$ ./main
C says: 7
C and Rust, cooperating. Same process. Same address space. Same binary.
This Is Not a Flame War
Both C and Rust compile to native machine code. Both give you direct memory access. Both have zero runtime overhead — no garbage collector, no virtual machine.
The question is not "which is better." The question is: which tradeoffs fit your project?
The Honest Comparison
+--------------------+---------------------------+----------------------------------+
| Dimension | C | Rust |
+--------------------+---------------------------+----------------------------------+
| Kernel / OS dev | THE standard | Growing (Linux modules, Redox) |
| | (Linux, Windows, macOS) | |
+--------------------+---------------------------+----------------------------------+
| ABI stability | C ABI is THE universal | No stable ABI; FFI goes |
| | interface between langs | through the C ABI |
+--------------------+---------------------------+----------------------------------+
| Legacy codebases | 50+ years of code | Excellent C interop |
| | | (bindgen, extern "C") |
+--------------------+---------------------------+----------------------------------+
| Compile speed | Fast | Slower (borrow checker, |
| | | monomorphization, LLVM) |
+--------------------+---------------------------+----------------------------------+
| Runtime overhead | Zero | Zero (same as C, no GC) |
+--------------------+---------------------------+----------------------------------+
| Memory safety | Programmer discipline | Compiler-enforced |
+--------------------+---------------------------+----------------------------------+
| Concurrency safety | Discipline + sanitizers | Compiler-enforced (Send/Sync) |
+--------------------+---------------------------+----------------------------------+
| Tooling | make, cmake, varied | cargo (build+test+doc+publish |
| | editors, gdb | unified), clippy, rustfmt |
+--------------------+---------------------------+----------------------------------+
| Embedded (common) | Everywhere, mature | Great and growing (Embassy, |
| | | probe-rs, RTIC) |
+--------------------+---------------------------+----------------------------------+
| Embedded (exotic | Often the only option | Needs LLVM target support; |
| / 8-bit) | | no AVR-8 stability yet |
+--------------------+---------------------------+----------------------------------+
| Error handling | errno, return codes (-1) | Result<T,E>, Option<T>, |
| | no enforcement | ? operator, exhaustive matching |
+--------------------+---------------------------+----------------------------------+
| Package management | manual / conan / vcpkg | cargo + crates.io built-in |
+--------------------+---------------------------+----------------------------------+
| Learning curve | Small language, large | Steeper (ownership, lifetimes), |
| | footgun surface | but compiler teaches you |
+--------------------+---------------------------+----------------------------------+
Where C Shines
Operating system kernels. Linux is 30+ million lines of C. Windows kernel is C. macOS kernel is C. These are not being rewritten. When you write a Linux driver, you write C (or now, optionally Rust for new modules).
ABI stability. When Python calls a shared library, it uses the C ABI. When Java uses JNI, it uses the C ABI. When Rust calls foreign code, it uses the C ABI. C is the lingua franca of systems interfaces.
Existing codebases. SQLite: ~150,000 lines of carefully audited C. OpenSSL, zlib, libpng, curl — the infrastructure of the internet is C. You don't rewrite what works.
Exotic hardware. Writing firmware for an 8-bit PIC microcontroller? A DSP with a custom architecture? C has a compiler for it. Rust needs LLVM to support the target.
Team expertise. If your team has 20 years of C experience and deep knowledge of its pitfalls, that expertise is real and valuable.
Where Rust Shines
When correctness matters. Safety-critical systems. Financial software. Aerospace. Medical devices. The cost of a bug is not just a crash — it is lives or millions of dollars.
Concurrent code. The Send/Sync system catches data races at compile time. Chapter 23 showed you this. In C, concurrent bugs hide for years.
New projects. No legacy to maintain? No existing C codebase to integrate with? Rust gives you the same performance with a dramatically smaller bug surface.
When bugs are expensive. Google reported that ~70% of Chromium security bugs are memory safety issues. Microsoft reported the same for Windows. Each CVE costs investigation, patching, disclosure, and reputation. Rust eliminates the entire class.
Long-running services. A web server that runs for months. A database. Memory leaks that build up over days? Use-after-free that triggers once per million requests? Rust catches these before you deploy.
The Sharp Knife Metaphor
C gives you a sharp knife with no guard. Rust gives you the same sharp knife with a guard you can remove (unsafe) when needed. The blade is equally sharp.
Both produce the same machine code. Both give you the same control. The difference is in what the compiler checks before you run.
C programmer's workflow:
Write code -> Compile -> Run -> Test -> Find bug in prod -> Debug
Rust programmer's workflow:
Write code -> Compile (fight borrow checker) -> It compiles!
-> Run -> Fewer bugs in prod
The borrow checker fight is real. It can be frustrating. But every error the borrow checker throws is a bug you did not ship.
FFI: They Interoperate, Not Compete
Calling C from Rust
// Declare the C function signature
extern "C" {
    fn strlen(s: *const u8) -> usize;
}

fn main() {
    let s = b"hello\0";
    let len = unsafe { strlen(s.as_ptr()) };
    println!("Length: {}", len); // 5
}
The unsafe block is required because Rust cannot verify the C function's memory safety. You are telling the compiler: "I have checked this myself."
Calling Rust from C
// lib.rs
#[no_mangle]
pub extern "C" fn rust_add(a: i32, b: i32) -> i32 {
    a + b
}
// main.c
#include <stdio.h>
extern int rust_add(int a, int b);
int main(void) {
printf("Rust says: %d\n", rust_add(10, 20));
return 0;
}
$ rustc --crate-type=staticlib lib.rs -o librust_add.a
$ gcc main.c -L. -lrust_add -lpthread -ldl -o main
$ ./main
Rust says: 30
Same ABI. Same calling convention. Same registers. The CPU does not know which language produced the instructions.
Same Function, Same Assembly
Here is add in C and Rust:
int add(int a, int b) { return a + b; }
pub fn add(a: i32, b: i32) -> i32 {
    a + b
}
Compile both with optimizations and look at the assembly:
; Both produce exactly this (x86-64, -O2):
add:
lea eax, [rdi+rsi]
ret
Same instruction. Same registers. Same binary. The language is a compile-time concept. At runtime, there is only machine code.
Fun Fact
You can verify this yourself on godbolt.org. Type the C version in one pane and the Rust version in another. With optimizations enabled, the assembly is often instruction-for-instruction identical.
When to Choose What
Choose C when: Choose Rust when:
+-------------------------------+ +-------------------------------+
| Extending a C codebase | | Starting a new project |
| Targeting exotic hardware | | Correctness is critical |
| Maximum ABI compatibility | | Heavy concurrency |
| OS kernel work (tradition) | | Bugs are very expensive |
| Team deeply knows C | | Long-running services |
| Interfacing with C-only libs | | Want unified tooling (cargo) |
+-------------------------------+ +-------------------------------+
Choose BOTH when:
+-------------------------------+
| Wrapping C libs in safe Rust |
| Adding Rust to a C project |
| Performance-critical + safe |
+-------------------------------+
What do you think happens?
If you write a function in C and the same function in Rust, and both compile to the same assembly — what is the "cost" of Rust's safety? Where does the safety checking actually happen?
The answer: the cost is entirely at compile time. Zero runtime cost. The safety checks are erased before the binary is produced. This is Rust's core promise.
Task
- Write a function int square(int x) in C and pub fn square(x: i32) -> i32 in Rust.
- Compile both with -O2 / --release and compare the assembly (use objdump -d or godbolt.org).
- Use extern "C" to call your C square from Rust. Print the result.
- Use #[no_mangle] pub extern "C" to call your Rust square from C. Print the result.
- Bonus: Write a C function with a deliberate buffer overflow. Wrap it in Rust with a safe API that checks bounds before calling the C function. This is the pattern real-world Rust/C interop uses.
The Bug Hall of Fame
Type This First
This is the essence of Heartbleed in 15 lines. Save as heartbleed_demo.c:
#include <stdio.h>
#include <string.h>
char buffer[64] = "x";                  // non-zero initializer keeps buffer in .data
char secret[] = "MY_SECRET_KEY_12345";  // likely laid out right after buffer (not guaranteed)
void heartbeat(int claimed_length) {
// Copies 'claimed_length' bytes — but doesn't check
// if the actual data is that long!
char response[128];
memcpy(response, buffer, claimed_length);
printf("Response (%d bytes): ", claimed_length);
for (int i = 0; i < claimed_length; i++)
printf("%c", response[i] >= 32 ? response[i] : '.');
printf("\n");
}
int main(void) {
strcpy(buffer, "hi"); // actual payload: 2 bytes
heartbeat(100); // but we claim 100 bytes!
return 0;
}
Compile and run. You will see bytes far beyond the 2-byte payload. Depending on how your compiler and linker lay out the globals, the output may even include MY_SECRET_KEY_12345. This is exactly how Heartbleed worked.
Why This Chapter Exists
These are not theoretical bugs. These are real vulnerabilities that affected billions of devices, cost billions of dollars, and compromised millions of passwords.
Every single one is a memory safety bug. Every single one would not compile in safe Rust.
This is not about blaming C. C was designed in 1972 for a world where programs were small and programmers were few. The question for today is: should the compiler catch these mistakes, or should we rely on human discipline at scale?
Heartbleed (CVE-2014-0160)
What happened. OpenSSL's TLS heartbeat extension had a buffer over-read. A client sends a heartbeat message with a payload and a claimed length. The server echoes back claimed_length bytes — without checking that the actual payload is that long.
The root cause in C:
// Simplified from the actual OpenSSL code:
memcpy(response, payload, claimed_length);
// ^^^^^^^^^^^^^^
// Never verified: claimed_length <= actual_payload_size
Impact. Any server running affected OpenSSL would leak up to 64KB of process memory per request. That memory contained passwords, private keys, session tokens. An estimated 17% of the internet's secure servers were vulnerable.
How Rust prevents it. In Rust, slices carry their length. &[u8] knows how many bytes it contains. You cannot memcpy past the end — the runtime panics (bounds check) or the compiler prevents it entirely.
Client sends: Server does:
payload = "hi" memcpy(buf, payload, 500)
claimed_len = 500 ^^^^^^^^^^^^^^^^^^^^^^^^
Copies 500 bytes starting at "hi"
Next 498 bytes: whatever's in memory
Private keys, passwords, sessions...
sudo Baron Samedit (CVE-2021-3156)
What happened. A heap-based buffer overflow in sudo — the program that grants root access. Present for almost 10 years before discovery.
The root cause. When parsing command-line arguments in sudoedit mode, a backslash at the end of a string caused a write past the end of a heap buffer.
// Simplified pattern:
while (*from) {
if (from[0] == '\\' && from[1] != '\0')
from++; // skip backslash
*to++ = *from++; // write to heap buffer
}
// If the string ends with '\', from[1] reads past the null terminator
// and the loop keeps writing past the buffer boundary
Impact. Any local user could gain root access on nearly every Unix-like system. CVSSv3 score: 7.8.
How Rust prevents it. Ownership and bounds checking. A Vec<u8> in Rust will panic on out-of-bounds write, or better — the iterator-based approach would never produce an out-of-bounds index.
WannaCry / EternalBlue (CVE-2017-0144)
What happened. The EternalBlue exploit targeted a buffer overflow in Windows' SMBv1 implementation. The WannaCry ransomware used it to spread across networks.
The root cause. A buffer overflow in the Windows SMB driver (srv.sys). A specially crafted SMB transaction caused a pool buffer overflow in kernel memory.
Impact. Over 200,000 computers in 150 countries. Hospitals shut down. Factories stopped. Estimated damage: $4 billion or more.
How Rust prevents it. Safe Rust does not allow buffer overflows. Period. Slice access is bounds-checked. Vec access is bounds-checked. You would need unsafe to bypass this, and unsafe blocks are visible and auditable.
iOS Jailbreaks: Use-After-Free in WebKit
What happened. Many iOS jailbreaks (and zero-day exploits) have been based on use-after-free bugs in Safari's WebKit rendering engine.
The pattern:
Widget *w = create_widget();
destroy_widget(w); // frees the memory
// ... more code ...
w->render(); // USE AFTER FREE
// The memory at w might now contain attacker-controlled data
Impact. Full device compromise. In the hands of nation-states, these exploits were used for surveillance. The Pegasus spyware used WebKit vulnerabilities.
How Rust prevents it. Ownership. When you free (drop) a value, the compiler will not let you use it again. The borrow checker ensures that no references outlive the data they point to.
let w = create_widget();
drop(w);            // frees the widget; ownership ends here
// w.render();      // COMPILE ERROR: use of moved value `w`
The Chromium Data
Google published the numbers for Chromium (Chrome's open-source base):
+---------------------------------------------+
| Chromium Security Bugs by Category |
| |
| Memory safety: ~70% <-- THIS |
| Logic errors: ~15% |
| Other: ~15% |
+---------------------------------------------+
Seventy percent. Not 7%. Seven-zero. The dominant source of security vulnerabilities in one of the most-used programs on Earth is memory safety.
Google's response: new Chromium components are increasingly written in Rust and memory-safe C++.
The Android Data
Android's security team reported a similar pattern:
Year Memory Safety Bugs (%) Rust Code (%)
2019 76% 0%
2020 72% ~1%
2021 65% ~5%
2022 55% ~10%
2023 40% ~15%
As the percentage of new code written in Rust increased, the percentage of memory safety bugs decreased — even though the total codebase grew. The existing C/C++ code was not rewritten; new code was simply written in Rust.
Fun Fact
As of 2024, there have been ZERO memory safety vulnerabilities discovered in Android's Rust code. Not "few." Zero.
Linux Kernel: Rust for New Drivers
In 2022, Rust support was merged into the Linux kernel. The philosophy:
- Existing C code stays as C. It works. It is well-tested.
- New drivers and modules can be written in Rust.
- The goal is preventing new bugs, not rewriting old code.
Linus Torvalds approved this not because C is bad, but because humans make mistakes, and the kernel cannot afford those mistakes.
The Bug Pattern Table
+---------------------+------------------+-----------------------------+
| Bug Type | Frequency in | Rust Prevention |
| | C Codebases | |
+---------------------+------------------+-----------------------------+
| Buffer overflow | Very common | Bounds checking on all |
| (read or write) | | slice/array access |
+---------------------+------------------+-----------------------------+
| Use-after-free | Common | Ownership: compiler tracks |
| | | when values are dropped |
+---------------------+------------------+-----------------------------+
| Double free | Common | Ownership: only one owner, |
| | | dropped exactly once |
+---------------------+------------------+-----------------------------+
| Null pointer deref | Very common | No null: Option<T> forces |
| | | explicit handling |
+---------------------+------------------+-----------------------------+
| Data race | Common in | Send/Sync: compiler refuses |
| | concurrent code | unsafe sharing |
+---------------------+------------------+-----------------------------+
| Uninitialized read | Common | All variables must be |
| | | initialized before use |
+---------------------+------------------+-----------------------------+
| Format string | Occasional | No format strings; macros |
| | | are type-checked |
+---------------------+------------------+-----------------------------+
| Integer overflow    | Common           | Panics in debug; wraps in   |
|                     |                  | release (checks are opt-in) |
+---------------------+------------------+-----------------------------+
Not Blaming C
C is one of the most important languages ever created. It powers operating systems, databases, embedded systems, and the infrastructure of the internet. Billions of devices run C code.
The bugs in this chapter are not C's fault. They are human mistakes. The question is not whether humans make mistakes — they do, always, inevitably — but whether the toolchain should catch those mistakes before they reach production.
What do you think happens?
If a large company rewrites all their C code in Rust, are they now bug-free? What kinds of bugs does Rust NOT prevent? (Hint: think about logic errors, deadlocks, and incorrect algorithms.)
Rust prevents memory safety bugs. It does not prevent wrong business logic, algorithmic errors, deadlocks, resource leaks from forgetting to close files, or plain old bad design.
But eliminating 70% of security vulnerabilities is a powerful argument.
Timeline of Awareness
2014 Heartbleed shocks the world
2017 WannaCry causes $4B+ damage
2019 Microsoft: 70% of CVEs are memory safety
2020 Google: 70% of Chromium bugs are memory safety
2021 Baron Samedit: sudo exploitable for 10 years
2022 Rust merged into Linux kernel
2023 Android: zero memory-safety bugs in Rust code
2024 White House recommends memory-safe languages
The industry is moving. Not because C is bad, but because the stakes are too high for human discipline alone.
Task
- Look up CVE-2014-0160 (Heartbleed) on any CVE database. Read the description.
- Find the actual OpenSSL patch that fixed it. The key change is a bounds check — identify the exact line.
- Write a 10-line Rust equivalent of the heartbeat function. Try to make it read out of bounds. Observe the panic.
- Look up CVE-2021-3156. Can you identify the buffer overflow pattern?
- Bonus: Search for "memory safety vulnerabilities" + your favorite open-source project. How many of the CVEs are buffer overflows or use-after-free?
The Toolbox
Type This First
Run this on any program from any previous chapter (or just use /bin/ls):
$ readelf -h /bin/ls
$ nm /bin/ls 2>/dev/null || echo "stripped"
$ file /bin/ls
$ size /bin/ls
$ strace -c ls /tmp 2>&1 | tail -20
Five tools, five different views of the same binary.
/proc: The Kernel Tells You Everything
Every running process has a directory under /proc/[pid]/. Not real files — the kernel generates them on demand.
/proc/[pid]/maps Memory layout (virtual address ranges)
/proc/[pid]/smaps Detailed per-mapping info (RSS, shared, private)
/proc/[pid]/status Process summary (state, memory, threads)
/proc/[pid]/exe Symlink to the actual executable
/proc/[pid]/fd/ Open file descriptors
$ sleep 1000 &
$ cat /proc/$!/maps
555555554000-555555556000 r--p 00000000 08:01 131074 /usr/bin/sleep
555555556000-555555558000 r-xp 00002000 08:01 131074 /usr/bin/sleep
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0 [stack]
Fun Fact
pmap is basically a pretty-printer for /proc/[pid]/maps.
Process Inspection: strace, ltrace, pmap
strace traces system calls — every interaction between your program and the kernel:
$ strace -e trace=write echo "hello"
write(1, "hello\n", 6) = 6
ltrace traces library calls (malloc, free, printf):
$ ltrace -e malloc+free ls /tmp
malloc(132) = 0x55a1234
free(0x55a1234)
pmap shows the memory map with sizes and permissions: pmap -x <pid>.
Binary Analysis Tools
readelf -h a.out ELF header (type, arch, entry point)
readelf -S a.out Section headers (.text, .data, .bss...)
readelf -l a.out Program headers (segments)
readelf -s a.out Symbol table
readelf -d a.out Dynamic section (.so dependencies)
objdump -d a.out Disassembly
objdump -d -M intel a.out Intel syntax (more readable)
nm a.out Symbol list (T=text, D=data, B=bss, U=undefined)
size a.out Section sizes (.text, .data, .bss)
strings a.out Embedded string literals
file a.out File type and architecture
Debugging: GDB Essentials
$ gcc -g -o prog prog.c && gdb ./prog
break main Breakpoint at function
break prog.c:42 Breakpoint at line
run / run arg1 Start execution
next (n) Step over
step (s) Step into
continue (c) Continue to next breakpoint
print x / print/x ptr Print variable (decimal / hex)
bt Backtrace (call stack)
info registers All register values
info proc mappings Memory map
x/16xb 0x7fff... Examine 16 bytes in hex
x/4xg $rsp 4 quad-words at stack pointer
watch counter Break when variable changes
What do you think happens?
If you set a watchpoint with watch counter and two threads modify it, will GDB catch both? (Hint: hardware watchpoints are per-CPU.)
Memory Debugging
Valgrind (10-50x slowdown, very thorough):
$ valgrind --leak-check=full ./prog
==12345== Invalid read of size 4
==12345== at 0x1091A2: main (prog.c:10)
Catches: leaks, use-after-free, buffer over-reads, uninitialized reads.
AddressSanitizer (2-3x slowdown, compile-time instrumentation):
$ gcc -g -fsanitize=address -o prog prog.c && ./prog
==12345==ERROR: AddressSanitizer: heap-buffer-overflow
Catches: buffer overflows, use-after-free, double-free.
UBSan (minimal overhead):
$ gcc -g -fsanitize=undefined -o prog prog.c && ./prog
prog.c:5: runtime error: signed integer overflow
Catches: signed overflow, null deref, misaligned access.
Rust-Specific Tools
cargo-geiger counts unsafe blocks in your code and dependencies:
$ cargo geiger
Functions Expressions Impls Traits Methods
2/5 14/60 0/0 0/0 1/3
Miri interprets your code and detects UB, even in unsafe:
$ cargo +nightly miri run
error: Undefined Behavior: dereferencing null pointer
cargo-bloat shows what makes your binary big:
$ cargo bloat --release
5.3% 12.1% 3.2KiB std::io::Write::write_fmt
godbolt.org — type C or Rust, see assembly instantly. Color-coded source-to-asm mapping. The single best tool for understanding compiler output.
Quick Reference: Problem to Tool
+-----------------------------------+---------------------------+
| Problem | Tool |
+-----------------------------------+---------------------------+
| "What's in this binary?" | file, readelf -h |
| "What sections does it have?" | readelf -S, size |
| "What symbols are exported?" | nm, readelf -s |
| "What does the assembly look like"| objdump -d, godbolt.org |
| "What strings are embedded?" | strings |
| "What syscalls does it make?" | strace |
| "What libraries does it call?" | ltrace, ldd |
| "Where is its memory?" | /proc/pid/maps, pmap |
| "Why does it crash?" | gdb, bt, info registers |
| "Does it leak memory?" | valgrind --leak-check |
| "Does it overflow buffers?" | -fsanitize=address |
| "Does it have undefined behavior?"| -fsanitize=undefined, miri|
| "How much unsafe in my Rust?" | cargo-geiger |
| "Why is my Rust binary big?" | cargo-bloat |
+-----------------------------------+---------------------------+
Task
- Pick any program you compiled in a previous chapter.
- Run it through at least THREE tools from this chapter.
- For each tool, write down one thing you learned that you didn't know before.
- Bonus: Compile a buggy C program from Chapter 25 with -fsanitize=address. Does it catch the bug?
- Bonus: Run strace on ls, then on ls | cat. Does the write count change? Why?
Experiments
Ten guided labs. Each has C and Rust versions, clear steps, expected output. Do them. Reading about memory is not the same as seeing it.
Experiment 1: Print the Memory Layout
Verify the address space from Chapter 6.
// layout.c
#include <stdio.h>
#include <stdlib.h>
int global_init = 42;
int global_uninit;
int main(void) {
int stack_var = 99;
int *heap_var = malloc(sizeof(int));
printf("Code (main): %p\n", (void *)main);
printf("Data (init): %p\n", (void *)&global_init);
printf("BSS (uninit): %p\n", (void *)&global_uninit);
printf("Heap (malloc): %p\n", (void *)heap_var);
printf("Stack (local): %p\n", (void *)&stack_var);
free(heap_var);
system("cat /proc/self/maps");
return 0;
}
static GLOBAL: i32 = 42;

fn main() {
    let stack_var: i32 = 99;
    let heap_var = Box::new(0i32);
    println!("Code: {:p}", main as *const ());
    println!("Data: {:p}", &GLOBAL);
    println!("Heap: {:p}", &*heap_var);
    println!("Stack: {:p}", &stack_var);
}
Verify: Code < Data < BSS < Heap < ... gap ... < Stack.
Experiment 2: Stack Buffer Overflow
// overflow.c — compile: gcc -fno-stack-protector -g -o overflow overflow.c
#include <stdio.h>
#include <string.h>
void vulnerable(void) {
char buffer[8];
memset(buffer, 'A', 32); // 32 bytes into 8-byte buffer
printf("After overflow\n");
}
int main(void) { vulnerable(); return 0; }
$ ./overflow
Segmentation fault
$ gdb ./overflow -ex run -ex bt
#0 0x4141414141414141 in ?? () <-- return addr overwritten
Rust equivalent panics cleanly at the boundary:
fn main() {
    let mut buf = [0u8; 8];
    for i in 0..32 {
        buf[i] = b'A'; // panics at i = 8: index out of bounds
    }
}
Experiment 3: Fork and Copy-on-Write
// cow.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
int main(void) {
int *data = malloc(4096);
*data = 42;
printf("Before fork: %p = %d\n", (void*)data, *data);
pid_t pid = fork();
if (pid == 0) {
printf("Child before write: %p = %d\n", (void*)data, *data);
*data = 99; // triggers copy-on-write
printf("Child after write: %p = %d\n", (void*)data, *data);
free(data); _exit(0);
}
wait(NULL);
printf("Parent after child: %p = %d\n", (void*)data, *data);
free(data);
}
What do you think happens?
Both parent and child print the same virtual address, yet the values differ. How? Each process has its own page tables: the child's write triggered copy-on-write, so after it the same virtual address maps to a different physical frame in each process.
Experiment 4: mmap a File
// mmap_file.c
#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>
int main(void) {
int fd = open("test.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);
write(fd, "Hello, mmap!\n", 13);
char *m = mmap(NULL, 13, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
close(fd);
printf("Via mmap: %s", m);
memcpy(m, "ZZZZZ", 5);
msync(m, 13, MS_SYNC);
munmap(m, 13);
system("cat test.txt"); // prints "ZZZZZ, mmap!\n"
}
The file and the memory are the same thing. Writing to the pointer writes to disk.
Experiment 5: A 50-Line Bump Allocator
// bump.c — simplest possible malloc
#include <stdio.h>
#define HEAP_SIZE 1024
static char heap[HEAP_SIZE];
static size_t offset = 0;
void *bump_alloc(size_t size) {
size_t aligned = (size + 7) & ~7;
if (offset + aligned > HEAP_SIZE) return NULL;
void *ptr = &heap[offset];
offset += aligned;
return ptr;
}
int main(void) {
int *a = bump_alloc(sizeof(int)); *a = 42;
int *b = bump_alloc(sizeof(int)); *b = 99;
printf("a=%d at %p, b=%d at %p\n", *a, (void*)a, *b, (void*)b);
printf("Used: %zu / %d bytes\n", offset, HEAP_SIZE);
}
// bump.rs
struct Bump {
    heap: [u8; 1024],
    offset: usize,
}

impl Bump {
    fn new() -> Self {
        Bump { heap: [0; 1024], offset: 0 }
    }
    fn alloc(&mut self, size: usize) -> Option<&mut [u8]> {
        let aligned = (size + 7) & !7;
        if self.offset + aligned > 1024 {
            return None;
        }
        let start = self.offset;
        self.offset += aligned;
        Some(&mut self.heap[start..start + size])
    }
}

fn main() {
    let mut a = Bump::new();
    let x = a.alloc(4).unwrap();
    x.copy_from_slice(&42i32.to_ne_bytes());
    println!("Allocated: {:?}", x); // x's borrow of a ends here,
    println!("Used: {}/1024", a.offset); // so reading a.offset is allowed
}
Rust returns Option — no null pointers, no forgetting to check.
Experiment 6: Trigger All 6 Segfault Types
// segfaults.c — compile: gcc -g -fno-stack-protector -o segfaults segfaults.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void null_deref(void) { int *p = NULL; *p = 42; }
void stack_blow(void) { stack_blow(); }
void use_after_free(void){ int *p = malloc(4); free(p); *p = 42; }
void write_rodata(void) { char *s = "hello"; s[0] = 'H'; }
void exec_stack(void) { char c[]={0xc3}; ((void(*)(void))c)(); } // 0xc3 = ret; faults: stack is NX
void unmapped(void) { *(int*)0xDEADBEEF = 42; }
int main(int argc, char **argv) {
if (argc != 2) { printf("Usage: %s <1-6>\n", argv[0]); return 1; }
switch(argv[1][0]) {
case '1': null_deref(); break; case '2': stack_blow(); break;
case '3': use_after_free(); break; case '4': write_rodata(); break;
case '5': exec_stack(); break; case '6': unmapped(); break;
}
}
Debug each with GDB: gdb ./segfaults -ex "run 1" -ex bt -ex "info registers rip".
For each case, note: which address faulted, what bt shows, which permission was violated (R/W/X).
Experiment 7: Compare ELF — C vs Rust "Hello World"
$ echo '#include <stdio.h>
int main(){ puts("hello"); }' > hello.c && gcc -o hello_c hello.c
$ echo 'fn main(){ println!("hello"); }' > hello.rs && rustc -o hello_rust hello.rs
$ ls -la hello_c hello_rust
$ size hello_c hello_rust
$ readelf -S hello_c | wc -l
$ readelf -S hello_rust | wc -l
Rust binary: 1-4 MB. C binary: ~16 KB. The difference: panic handling, unwinding tables, println! formatting. Try rustc -O then strip:
$ rustc -O -o hello_opt hello.rs && strip hello_opt
$ ls -la hello_c hello_rust hello_opt
Fun Fact
Most of a Rust binary's size is not your code. It is the standard library support for panics and formatting. Set `panic = "abort"` in Cargo.toml and the binary shrinks dramatically.
Experiment 8: Manual Linking
// math.c
int add(int a, int b) { return a + b; }
int mul(int a, int b) { return a * b; }
// main.c
#include <stdio.h>
extern int add(int, int);
extern int mul(int, int);
int main(void) { printf("3+4=%d, 3*4=%d\n", add(3,4), mul(3,4)); }
$ gcc -c math.c && gcc -c main.c
$ nm math.o # T add, T mul (defined)
$ nm main.o # U add, U mul (undefined — need linking)
$ gcc -o prog main.o math.o
$ nm prog | grep -E 'add|mul|main'
What do you think happens?
Link `main.o` without `math.o`. The linker error shows exactly how symbol resolution works.
Experiment 9: Cache Performance
// cache.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define N (16*1024*1024)
int main(void) {
int *a = malloc(N * sizeof(int));
for (int i = 0; i < N; i++) a[i] = i;
clock_t t;
volatile int sum = 0;
t = clock();
for (int i = 0; i < N; i++) sum += a[i];
printf("Sequential: %.1f ms\n", 1000.0*(clock()-t)/CLOCKS_PER_SEC);
srand(42);
for (int i = N-1; i > 0; i--) {
int j = rand() % (i+1);
int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
}
sum = 0; t = clock();
int idx = 0;
for (int i = 0; i < N; i++) { sum += a[idx]; idx = abs(a[idx]) % N; }
printf("Random: %.1f ms\n", 1000.0*(clock()-t)/CLOCKS_PER_SEC);
free(a);
}
$ gcc -O2 -o cache cache.c && ./cache
Sequential: 7.0 ms
Random: 85.0 ms <-- 10x+ slower, same data, same operations
The difference is cache misses. Use perf stat -e cache-misses,cache-references ./cache to see the numbers.
Experiment 10: Handwritten ELF (No Compiler)
A valid executable in 165 bytes: a 64-byte ELF header, a 56-byte program header, and 45 bytes of code and data. Just write(1, "Hi\n", 3) and exit(0).
#!/usr/bin/env python3
# tiny_elf.py
import struct, os
code = bytes([
0x48,0xc7,0xc0,0x01,0x00,0x00,0x00, # mov rax, 1 (write)
0x48,0xc7,0xc7,0x01,0x00,0x00,0x00, # mov rdi, 1 (stdout)
0x48,0x8d,0x35,0x15,0x00,0x00,0x00, # lea rsi, [rip+21] -> "Hi\n"
0x48,0xc7,0xc2,0x03,0x00,0x00,0x00, # mov rdx, 3
0x0f,0x05, # syscall
0x48,0xc7,0xc0,0x3c,0x00,0x00,0x00, # mov rax, 60 (exit)
0x48,0x31,0xff, # xor rdi, rdi
0x0f,0x05, # syscall
0x48,0x69,0x0a, # "Hi\n"
])
LOAD = 0x400000; EH = 64; PH = 56
ENTRY = LOAD + EH + PH; FSIZE = EH + PH + len(code)
ehdr = struct.pack('<4sBBBBBxxxxxxx', b'\x7fELF', 2, 1, 1, 0, 0)
ehdr += struct.pack('<HHIQQQIHHHHHH', 2,0x3E,1,ENTRY,EH,0,0,EH,PH,1,0,0,0)
phdr = struct.pack('<IIQQQQQQ', 1, 5, 0, LOAD, LOAD, FSIZE, FSIZE, 0x1000)
with open('tiny','wb') as f: f.write(ehdr + phdr + code)
os.chmod('tiny', 0o755)
print(f"Created 'tiny' ({FSIZE} bytes). Run: ./tiny")
$ python3 tiny_elf.py && ./tiny
Created 'tiny' (165 bytes). Run: ./tiny
Hi
$ file tiny
tiny: ELF 64-bit LSB executable, x86-64, statically linked, no section header
No compiler, no libc, no linker. Pure bytes that the kernel understands.
Task
Complete at least 5 of these 10 experiments. For each:
- Run the code exactly as written.
- Modify one thing and predict the result before running.
- Write down what surprised you.
The goal is not to memorize. It is to build intuition. When you have seen the stack grow downward, you never forget which way it grows.
Appendix A: x86-64 Register Reference
General-Purpose Registers
+--------+--------+------+-------+----------------------------------+
| 64-bit | 32-bit | 16b | 8-bit | Primary Purpose |
+--------+--------+------+-------+----------------------------------+
| rax | eax | ax | al/ah | Return value, syscall number |
| rbx | ebx | bx | bl/bh | Callee-saved (preserved) |
| rcx | ecx | cx | cl/ch | 4th integer argument |
| rdx | edx | dx | dl/dh | 3rd integer argument |
| rsi | esi | si | sil | 2nd integer argument |
| rdi | edi | di | dil | 1st integer argument |
| rbp | ebp | bp | bpl | Frame pointer (callee-saved) |
| rsp | esp | sp | spl | Stack pointer |
| r8 | r8d | r8w | r8b | 5th integer argument |
| r9 | r9d | r9w | r9b | 6th integer argument |
| r10 | r10d | r10w | r10b | Caller-saved temporary |
| r11 | r11d | r11w | r11b | Caller-saved temporary |
| r12 | r12d | r12w | r12b | Callee-saved |
| r13 | r13d | r13w | r13b | Callee-saved |
| r14 | r14d | r14w | r14b | Callee-saved |
| r15 | r15d | r15w | r15b | Callee-saved |
+--------+--------+------+-------+----------------------------------+
Writing to a 32-bit sub-register (e.g., eax) zero-extends into the full 64-bit register. Writing to a 16-bit or 8-bit sub-register does NOT zero-extend — the upper bits are preserved.
Special Registers
+--------+----------------------------------------------------------+
| rip | Instruction pointer — address of the NEXT instruction |
| | You cannot write to it directly (use jmp/call/ret) |
+--------+----------------------------------------------------------+
| rflags | Status flags, set by arithmetic/comparison instructions |
| | CF (bit 0) — Carry flag (unsigned overflow) |
| | ZF (bit 6) — Zero flag (result was zero) |
| | SF (bit 7) — Sign flag (result was negative) |
| | OF (bit 11) — Overflow flag (signed overflow) |
+--------+----------------------------------------------------------+
System V AMD64 ABI Calling Convention
This is the calling convention used on Linux, macOS, and most Unix-like systems.
Integer / Pointer Arguments
Argument #: 1st 2nd 3rd 4th 5th 6th 7th+
Register: rdi rsi rdx rcx r8 r9 stack
Arguments beyond the 6th are passed on the stack, pushed right-to-left.
Floating-Point Arguments
Argument #: 1st 2nd 3rd 4th 5th 6th 7th 8th 9th+
Register: xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 stack
Return Values
Integer / pointer: rax (and rdx for 128-bit returns)
Floating-point: xmm0 (and xmm1 for complex returns)
Callee-Saved Registers (Must Be Preserved Across Calls)
rbx, rbp, r12, r13, r14, r15
If a function uses any of these, it must save and restore them.
Caller-Saved Registers (May Be Destroyed By Calls)
rax, rcx, rdx, rsi, rdi, r8, r9, r10, r11
If you need values in these registers to survive a call, save them yourself.
Linux Syscall Convention
Syscall number: rax
Arguments: rdi rsi rdx r10 r8 r9
Return value: rax (negative = -errno)
Instruction: syscall
Clobbered: rcx, r11 (overwritten by the kernel)
Note: the 4th argument uses r10, not rcx — this differs from the normal calling convention because syscall clobbers rcx (it stores the return address there).
Common Syscall Numbers (x86-64 Linux)
0 read 1 write 2 open 3 close
9 mmap 11 munmap 12 brk 57 fork
59 execve 60 exit 62 kill 231 exit_group
Full list: /usr/include/asm/unistd_64.h or ausyscall --dump.
Quick Reference Diagram
Argument passing (System V AMD64):
my_function(a, b, c, d, e, f, g, h)
| | | | | | | |
v v v v v v v v
rdi rsi rdx rcx r8 r9 [stack] [stack]
Return: rax
Appendix B: Page Table Entry Bitfields
x86-64 Page Table Entry (PTE)
Each page table entry is 64 bits (8 bytes). Not all bits are used — the CPU ignores some, and the OS can repurpose them.
63 62..52 51..12 11..9 8 7 6 5 4 3 2 1 0
+---+-------+----------------------+-----+---+---+---+---+---+---+---+---+---+
|NX | Avail | Physical Page Number | AVL | G |PAT| D | A |PCD|PWT|U/S|R/W| P |
+---+-------+----------------------+-----+---+---+---+---+---+---+---+---+---+
Bit-by-Bit Reference
+------+--------+-----------------------------------------------------------+
| Bit | Name | Meaning |
+------+--------+-----------------------------------------------------------+
| 0 | P | Present. 1 = page is in physical memory. |
| | | 0 = not present; access triggers page fault (#PF). |
+------+--------+-----------------------------------------------------------+
| 1 | R/W | Read/Write. 1 = writable. 0 = read-only. |
| | | Write to read-only page triggers page fault. |
+------+--------+-----------------------------------------------------------+
| 2 | U/S | User/Supervisor. 1 = user-mode (ring 3) can access. |
| | | 0 = kernel-only. User access triggers page fault. |
+------+--------+-----------------------------------------------------------+
| 3 | PWT | Page-level Write-Through. 1 = write-through caching. |
| | | 0 = write-back caching (normal). |
+------+--------+-----------------------------------------------------------+
| 4 | PCD | Page-level Cache Disable. 1 = caching disabled. |
| | | Used for memory-mapped I/O (device registers). |
+------+--------+-----------------------------------------------------------+
| 5 | A | Accessed. Set by CPU when page is read or written. |
| | | OS clears it periodically to track working set. |
+------+--------+-----------------------------------------------------------+
| 6 | D | Dirty. Set by CPU when page is written to. |
| | | OS uses this to know which pages need writing to disk. |
+------+--------+-----------------------------------------------------------+
| 7 | PS/PAT | Page Size (in PDE): 1 = large page (2MB or 1GB). |
| | | PAT (in PTE): Page Attribute Table index bit. |
+------+--------+-----------------------------------------------------------+
| 8 | G | Global. 1 = don't flush from TLB on CR3 switch. |
| | | Used for kernel pages shared across all processes. |
+------+--------+-----------------------------------------------------------+
| 9-11 | AVL | Available for OS use. Linux uses these for swap info, |
| | | soft-dirty tracking, and other bookkeeping. |
+------+--------+-----------------------------------------------------------+
|12-51 | PPN | Physical Page Number. The upper bits of the physical |
| | | address (shift left by 12 to get byte address). |
+------+--------+-----------------------------------------------------------+
|52-62 | Avail | Available / reserved. Some used by OS or hardware. |
+------+--------+-----------------------------------------------------------+
| 63 | NX | No-Execute. 1 = code execution forbidden on this page. |
| | | Attempting to execute triggers page fault. |
| | | Critical for W^X security (stack, heap are NX). |
+------+--------+-----------------------------------------------------------+
How the OS Uses These Bits
Scenario Bits set
----------------------------------------------
Normal code page: P=1, R/W=0, U/S=1, NX=0
Writable data page: P=1, R/W=1, U/S=1, NX=1
Stack page: P=1, R/W=1, U/S=1, NX=1
Kernel code: P=1, R/W=0, U/S=0, NX=0
Copy-on-Write page: P=1, R/W=0 (trap on write)
Demand-paged (not loaded): P=0 (trap on any access)
Guard page (stack boundary): P=0 (trap = stack overflow)
ARM Comparison
ARM uses a different page table format, but the concepts are the same:
x86-64 Bit ARM Equivalent Notes
----------- ---------------- --------------------------------
P (Present) Valid bit Same concept
R/W AP[2:1] Access Permission bits
U/S AP[1], PXN, UXN More granular in ARM
NX XN / UXN / PXN Separate execute control for
user and privileged modes
D (Dirty) DBM (Dirty Bit Hardware-managed on newer ARM
Management)
A (Accessed) AF (Access Flag) Same concept
ARM page tables can use 4KB, 16KB, or 64KB granules (x86-64's base page is always 4KB). ARM also supports 3-level or 4-level page tables depending on the configured virtual address size.
Appendix C: ELF Format Quick Reference
ELF Header Fields
Every ELF file starts with a 64-byte header (for 64-bit) at offset 0.
+----------------+--------+------------------------------------------------+
| Field | Size | Description |
+----------------+--------+------------------------------------------------+
| e_ident[0..3] | 4 | Magic: 0x7f 'E' 'L' 'F' |
| e_ident[4] | 1 | Class: 1 = 32-bit, 2 = 64-bit |
| e_ident[5] | 1 | Data: 1 = little-endian, 2 = big-endian |
| e_ident[6] | 1 | Version: 1 (always) |
| e_ident[7] | 1 | OS/ABI: 0 = SYSV, 3 = Linux |
| e_ident[8]     | 1      | ABI version (usually 0)                        |
| e_ident[9..15] | 7      | Padding (zeroes)                               |
+----------------+--------+------------------------------------------------+
| e_type | 2 | ET_REL (1) = relocatable (.o) |
| | | ET_EXEC (2) = executable (fixed address) |
| | | ET_DYN (3) = shared object / PIE executable |
+----------------+--------+------------------------------------------------+
| e_machine | 2 | EM_X86_64 (0x3E), EM_AARCH64 (0xB7), etc. |
| e_version | 4 | 1 (current) |
| e_entry | 8 | Entry point virtual address (_start) |
| e_phoff | 8 | Program header table offset in file |
| e_shoff | 8 | Section header table offset in file |
| e_flags | 4 | Processor-specific flags |
| e_ehsize | 2 | ELF header size (64 for 64-bit) |
| e_phentsize | 2 | Size of one program header entry (56) |
| e_phnum | 2 | Number of program header entries |
| e_shentsize | 2 | Size of one section header entry (64) |
| e_shnum | 2 | Number of section header entries |
| e_shstrndx | 2 | Index of section name string table |
+----------------+--------+------------------------------------------------+
Common Sections
+-------------+-----------------------------------------------------------+
| Section | Contents |
+-------------+-----------------------------------------------------------+
| .text | Executable machine code |
| .data | Initialized global/static variables |
| .bss | Uninitialized globals (zero-filled at load, no file space)|
| .rodata | Read-only data (string literals, constants) |
| .symtab | Symbol table (functions, globals) — for linking/debugging |
| .strtab | String table for symbol names |
| .shstrtab | String table for section names |
| .rel / .rela| Relocation entries (fixups for the linker) |
| .plt | Procedure Linkage Table (lazy binding stubs) |
| .got | Global Offset Table (resolved dynamic addresses) |
| .got.plt | GOT entries specifically for PLT |
| .dynamic | Dynamic linking info (needed libraries, symbol tables) |
| .interp | Path to dynamic linker (/lib64/ld-linux-x86-64.so.2) |
| .init/.fini | Constructor/destructor code |
| .debug_* | DWARF debug information (line numbers, types, variables) |
| .eh_frame | Exception/stack unwinding tables |
| .note.* | Build ID, ABI tags |
| .comment | Compiler version string |
+-------------+-----------------------------------------------------------+
Common Segment Types (Program Headers)
Segments are the runtime view. The kernel reads these to load the program.
+----------------+----------------------------------------------------------+
| Type | Purpose |
+----------------+----------------------------------------------------------+
| PT_LOAD | Loadable segment. Mapped into memory. Usually two: |
| | 1) r-x: .text, .rodata (code + constants) |
| | 2) rw-: .data, .bss (writable data) |
+----------------+----------------------------------------------------------+
| PT_DYNAMIC | Points to .dynamic section. Used by the dynamic linker |
| | to find shared libraries and resolve symbols. |
+----------------+----------------------------------------------------------+
| PT_INTERP | Path to the dynamic linker (e.g., /lib64/ld-linux...) |
| | Kernel reads this to know which loader to invoke. |
+----------------+----------------------------------------------------------+
| PT_NOTE | Auxiliary info: build ID, ABI version. |
| | Not loaded into process memory. |
+----------------+----------------------------------------------------------+
| PT_GNU_STACK | Stack executability. If absent or flags=RW, stack is NX. |
| | If flags=RWX, stack is executable (rare, insecure). |
+----------------+----------------------------------------------------------+
| PT_GNU_RELRO | Read-only after relocation. The dynamic linker resolves |
| | GOT entries, then marks this region read-only. |
+----------------+----------------------------------------------------------+
| PT_PHDR | Points to the program header table itself. |
+----------------+----------------------------------------------------------+
Quick Inspection Commands
readelf -h binary # ELF header
readelf -S binary # Section headers
readelf -l binary # Program headers (segments)
readelf -s binary # Symbol table
readelf -d binary # Dynamic section
objdump -d binary # Disassemble .text
hexdump -C binary | head # Raw bytes (look for 7f 45 4c 46)
Appendix D: Signal Reference
Signal Table (x86-64 Linux)
+--------+---------+---------+----------------------------------------------+
| Number | Name | Default | Common Cause |
+--------+---------+---------+----------------------------------------------+
| 1 | SIGHUP | Term | Terminal closed, or controlling process died |
| 2 | SIGINT | Term | Ctrl+C from terminal |
| 3 | SIGQUIT | Core | Ctrl+\ from terminal (quit + core dump) |
| 4 | SIGILL | Core | Illegal instruction (corrupt code, bad jump) |
| 5 | SIGTRAP | Core | Breakpoint hit (used by debuggers, int3) |
| 6 | SIGABRT | Core | abort() called (failed assert, double free) |
| 7 | SIGBUS | Core | Bus error: misaligned access, bad mmap |
| 8 | SIGFPE | Core | Arithmetic error: divide by zero, overflow |
| 9 | SIGKILL | Term | Unconditional kill (CANNOT be caught) |
| 10 | SIGUSR1 | Term | User-defined signal 1 |
| 11 | SIGSEGV | Core | Segmentation fault: invalid memory access |
| 12 | SIGUSR2 | Term | User-defined signal 2 |
| 13 | SIGPIPE | Term | Write to pipe/socket with no reader |
| 14 | SIGALRM | Term | Timer from alarm() expired |
| 15 | SIGTERM | Term | Polite termination request (what kill sends) |
| 17 | SIGCHLD | Ignore | Child process stopped or terminated |
| 18 | SIGCONT | Cont | Resume stopped process (sent by fg, kill -18)|
| 19 | SIGSTOP | Stop | Unconditional stop (CANNOT be caught) |
| 20 | SIGTSTP | Stop | Ctrl+Z from terminal |
+--------+---------+---------+----------------------------------------------+
Default Actions
Term = Terminate the process
Core = Terminate + generate core dump (if enabled: ulimit -c unlimited)
Stop = Suspend the process (can resume with SIGCONT)
Cont = Resume a stopped process
Ignore = Do nothing by default
Uncatchable Signals
Only two signals cannot be caught, blocked, or ignored:
- SIGKILL (9): Always terminates. The process gets no chance to clean up.
- SIGSTOP (19): Always suspends. The process cannot prevent it.
Every other signal can be caught with signal() or sigaction().
Sending Signals
kill -SIGTERM 1234 # send SIGTERM to PID 1234
kill -9 1234 # send SIGKILL (cannot be caught)
kill -STOP 1234 # suspend process
kill -CONT 1234 # resume process
Ctrl+C # sends SIGINT to foreground process
Ctrl+Z # sends SIGTSTP to foreground process
Ctrl+\ # sends SIGQUIT (core dump)
The Signals You Will See Most
If your program crashes, check the signal:
- SIGSEGV (11): You accessed memory you should not have. See Chapter 18.
- SIGABRT (6): Something called `abort()` — often a failed assertion or detected heap corruption.
- SIGFPE (8): Division by zero or an integer overflow trap (e.g. INT_MIN / -1).
- SIGPIPE (13): You wrote to a closed pipe. Common in shell pipelines and network code.
- SIGBUS (7): Misaligned memory access, or mmap beyond file size.
Appendix E: Glossary
Address Space -- The range of virtual addresses a process can use. On x86-64 Linux, user space is typically 0 to 0x7FFFFFFFFFFF (128 TB). Each process has its own address space, isolated by the MMU.
ASLR (Address Space Layout Randomization) -- The kernel loads the stack, heap, shared libraries, and PIE executables at randomized addresses each run. Makes exploitation harder because attackers cannot predict where code or data will be.
brk / sbrk -- System calls that move the "program break" — the boundary of the heap. malloc uses brk for small allocations and mmap for large ones.
BSS (Block Started by Symbol) -- The section of an ELF file for uninitialized global and static variables. Takes no space in the file — the kernel zero-fills it when loading.
Cache Line -- The unit of transfer between cache and main memory. Typically 64 bytes on x86-64. When you access one byte, the CPU loads the entire 64-byte line into cache.
Calling Convention -- The rules for how functions receive arguments and return values. On x86-64 Linux (System V ABI): first 6 integer args in rdi, rsi, rdx, rcx, r8, r9; return in rax.
Context Switch -- When the kernel suspends one process/thread and resumes another. Saves and restores registers, updates page table base (CR3), flushes parts of the TLB.
Copy-on-Write (CoW) -- After fork(), parent and child share the same physical pages marked read-only. On first write, the kernel copies the page so each process gets its own. Saves memory and makes fork fast.
Core Dump -- A file containing the memory image of a crashed process. Generated by signals like SIGSEGV and SIGABRT. Open with gdb program core to inspect the crash state.
Demand Paging -- Pages are not loaded from disk until they are actually accessed. The first access triggers a page fault; the kernel then loads the page from the ELF file or swap.
ELF (Executable and Linkable Format) -- The standard binary format on Linux. Contains headers, sections (compile-time view), and segments (runtime view). See Appendix C.
Frame (Page Frame) -- A physical memory page. The MMU maps virtual pages to physical frames. On x86-64, a standard frame is 4096 bytes.
Frame (Stack Frame) -- The region of the stack belonging to one function call. Contains local variables, saved registers, and the return address. Bounded by rbp (base) and rsp (top).
GOT (Global Offset Table) -- A table of addresses filled in at runtime by the dynamic linker. Used for position-independent access to global variables and functions in shared libraries.
Heap -- The region of memory used for dynamic allocation (malloc/free in C, Box/Vec in Rust). Grows upward from the program break toward higher addresses.
MMU (Memory Management Unit) -- Hardware in the CPU that translates virtual addresses to physical addresses using page tables. Enforces permissions (read/write/execute) and traps invalid access.
mmap -- System call that maps files or anonymous memory into the process address space. Used for shared libraries, large allocations, file I/O, and shared memory between processes.
Page -- The unit of virtual memory. 4096 bytes (4 KB) on x86-64 by default. The MMU translates addresses at page granularity. Large pages (2 MB, 1 GB) are also available.
Page Fault -- A CPU exception triggered when accessing a virtual address that is not currently mapped to a physical frame. The kernel handles it by loading the page, allocating a frame, or killing the process (segfault).
Page Table -- A hierarchical data structure (4 levels on x86-64) that maps virtual pages to physical frames. One per process. See Appendix B for bit-level details.
PTE (Page Table Entry) -- A single entry in a page table. Contains the physical frame number and permission bits (present, read/write, user/kernel, no-execute). See Appendix B.
PIE (Position-Independent Executable) -- An executable compiled so it can be loaded at any address. All modern Linux executables are PIE by default, enabling full ASLR.
PLT (Procedure Linkage Table) -- A set of small stubs that redirect calls to shared library functions. On the first call, the PLT invokes the dynamic linker to resolve the address (lazy binding). Subsequent calls go directly.
Process -- An instance of a running program. Has its own address space, page tables, file descriptors, and one or more threads. Created by fork() or execve().
Register -- A small, fast storage location inside the CPU. x86-64 has 16 general-purpose registers (rax through r15), the instruction pointer (rip), flags (rflags), and SIMD registers (xmm0-15). See Appendix A.
Relocation -- The process of adjusting addresses in object files when the linker combines them into an executable. Necessary because the compiler does not know final addresses when compiling individual files.
Segfault (Segmentation Fault) -- A SIGSEGV signal sent to a process when it accesses memory it is not allowed to (null pointer, freed memory, read-only pages, unmapped addresses). The most common crash in C programs.
Signal -- An asynchronous notification sent to a process. Can be generated by the kernel (SIGSEGV on bad memory access), by another process (kill), or by the terminal (Ctrl+C = SIGINT). See Appendix D.
Stack -- A LIFO memory region used for function calls. Grows downward (toward lower addresses on x86-64). Each function call pushes a stack frame; each return pops it. Each thread has its own stack.
Symbol -- A named entity in an object file: a function, a global variable, or a label. The linker uses symbols to connect references across object files. Inspect with nm or readelf -s.
Syscall (System Call) -- The interface between user-space programs and the kernel. Triggered by the syscall instruction on x86-64. Examples: read, write, mmap, fork, exit.
TLB (Translation Lookaside Buffer) -- A hardware cache inside the CPU that stores recent virtual-to-physical address translations. Avoids walking the page table on every memory access. A TLB miss requires a page table walk.
Undefined Behavior (UB) -- Code whose behavior the language standard does not define. In C: signed overflow, null dereference, buffer overflow, use-after-free. The compiler may assume UB never happens, leading to surprising optimizations.
Virtual Memory -- The abstraction that gives each process its own address space. Virtual addresses are translated to physical addresses by the MMU. Enables isolation, sharing, demand paging, and memory-mapped I/O.