The Stack: Disciplined and Fast
Type this right now
// save as stacktrace.c — compile: gcc -g -o stacktrace stacktrace.c
#include <stdio.h>
void c() {
    int z = 3;
    printf("c: &z = %p\n", (void *)&z);
}

void b() {
    int y = 2;
    printf("b: &y = %p\n", (void *)&y);
    c();
}

void a() {
    int x = 1;
    printf("a: &x = %p\n", (void *)&x);
    b();
}

int main() {
    a();
    return 0;
}
$ gcc -g -o stacktrace stacktrace.c && ./stacktrace
a: &x = 0x7ffd3a1b1c5c
b: &y = 0x7ffd3a1b1c3c
c: &z = 0x7ffd3a1b1c1c
Each address is lower than the last. The stack grows downward. Every function call you've ever made used this mechanism. Let's see exactly how.
What is a stack frame?
When you call a function, the compiled code creates a stack frame — a block of memory on the stack that holds everything that function needs:
High addresses (top of stack at process start)
┌──────────────────────────┐
│ main's frame │
│ ┌────────────────────┐ │
│ │ (main's locals) │ │
│ │ saved rbp │ │
│ │ return address │ │ ← where to go when main returns
│ └────────────────────┘ │
├──────────────────────────┤
│ a's frame │
│ ┌────────────────────┐ │
│ │ x = 1 │ │ [rbp - 4]
│ │ saved rbp ─────────│──┘ points to main's rbp
│ │ return address │ addr in main after call a()
│ └────────────────────┘ │
├──────────────────────────┤
│ b's frame │
│ ┌────────────────────┐ │
│ │ y = 2 │ │ [rbp - 4]
│ │ saved rbp ─────────│──┘ points to a's rbp
│ │ return address │ addr in a after call b()
│ └────────────────────┘ │
├──────────────────────────┤
│ c's frame │
│ ┌────────────────────┐ │
│ │ z = 3 │ │ [rbp - 4]
│ │ saved rbp ─────────│──┘ points to b's rbp
│ │ return address │ addr in b after call c()
rsp ──────► │ └────────────────────┘ │
└──────────────────────────┘
Low addresses (stack grows this direction)
Two registers drive the whole thing:
- rsp (stack pointer): always points to the top of the stack (the lowest used address)
- rbp (base pointer): points to the base of the current frame — a stable reference point
How call and ret work
When the CPU executes call some_function:
call some_function
; is equivalent to:
push rip ; push address of next instruction onto stack
jmp some_function ; jump to the function
When the CPU executes ret:
ret
; is equivalent to:
pop rip ; pop address from stack into instruction pointer
; execution continues at that address
That's it. call = push return address + jump. ret = pop return address + jump back.
The function prologue and epilogue
Every compiled function begins with a prologue and ends with an epilogue. This is the bookkeeping that creates and destroys stack frames.
Prologue (at the start of every function):
push rbp ; save caller's base pointer
mov rbp, rsp ; set our base pointer to current stack top
sub rsp, N ; reserve N bytes for local variables
Epilogue (at the end of every function):
leave ; equivalent to: mov rsp, rbp; pop rbp
ret ; pop return address, jump to it
Let's trace through calling a():
Before call a():
rsp → [... main's stuff ...]
1. call a → push return address, jump to a
rsp → [ret_addr | ... main's stuff ...]
2. push rbp → save main's rbp
rsp → [main_rbp | ret_addr | ... main's stuff ...]
3. mov rbp, rsp → rbp now points here
rbp → [main_rbp | ret_addr | ... main's stuff ...]
4. sub rsp, 16 → reserve space for locals
rsp → [ | x = ? | padding | main_rbp | ret_addr | ...]
^^^^^^^^^^^^^^^^
a's local variables
🧠 What do you think happens?
If you never execute the epilogue — say, you longjmp out of a function, or a signal handler interrupts you — what happens to the stack? The space is still "allocated" (rsp was never restored). Think about what this means for stack usage in signal handlers.
Local variables: just offsets from rbp
The compiler doesn't give your local variables names at runtime. They're just offsets:
void example() {
    int a = 10;  // [rbp - 4]
    int b = 20;  // [rbp - 8]
    long c = 30; // [rbp - 16]
}
The assembly:
example:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 10 ; a
mov DWORD PTR [rbp-8], 20 ; b
mov QWORD PTR [rbp-16], 30 ; c
leave
ret
No names. No types. Just memory locations relative to rbp. The "variable" is a fiction that
exists only in your source code and your debugger's symbol table.
C: extra stack tricks
C gives you a few extra ways to use the stack:
alloca — allocate on the stack dynamically:
#include <alloca.h>

void risky() {
    int n = 1000;
    char *buf = alloca(n); // n bytes on the stack, freed on return
    buf[0] = 'A';
}
This just moves rsp down by n bytes. No malloc, no free, no heap involvement. Fast, but
dangerous — if n is too large, you blow through the stack limit with no warning.
Variable-Length Arrays (VLAs):
void process(int n) {
    int arr[n]; // VLA: n elements on the stack
    arr[0] = 42;
}
Same idea as alloca — the compiler adjusts rsp at runtime. Same risks.
Rust: stack discipline
Rust uses the same stack mechanics but gives you fewer footguns:
fn example() {
    let a: i32 = 10;      // stack — same as C
    let b = [0u8; 100];   // stack — fixed-size array, size known at compile time
    let c = Box::new(42); // heap — when you need dynamic or large allocation
    // no alloca, no VLAs
}
Rust doesn't have alloca or VLAs. If you need a runtime-sized buffer, you use Vec<T>, which
allocates on the heap. This is a deliberate choice — the Rust team decided that stack overflows
from unchecked dynamic stack allocation are too dangerous to allow in safe code.
If you really need stack allocation with a runtime size, you'd use a crate or unsafe code.
But the standard path is: fixed sizes on the stack, dynamic sizes on the heap.
Stack size and limits
Check your current stack size limit:
$ ulimit -s
8192
That's 8192 KB = 8 MB, and it applies to the main thread's stack. Every thread gets its own stack, sized when the thread is created.
What this means in practice:
- 8 MB = 8,388,608 bytes
- An int takes 4 bytes
- Maximum ints on the stack ≈ 2,000,000

But stack frames have overhead (return address, saved rbp, alignment), and you have many nested function calls, so the practical limit is lower.
You can change it:
$ ulimit -s 16384 # 16 MB stack
$ ulimit -s unlimited # no limit (dangerous)
💡 Fun Fact
Threads created with pthread_create get their own stack, and the default size varies by platform: glibc typically uses the ulimit -s value, while macOS gives non-main threads 512 KB. You can configure it with pthread_attr_setstacksize. Rust's standard library spawns threads with a 2 MB stack by default, and std::thread::Builder exposes stack_size() to change it.
Guard pages: the safety net
The OS doesn't just give you 8MB of stack and hope for the best. It places a guard page at the bottom of the stack — an unmapped region (at least one 4KB page; recent Linux kernels reserve a larger gap for the main stack) that exists solely to catch stack overflows:
┌───────────────────────┐
│ Stack │ 8 MB of rw-p pages
│ (used portion) │
│ │
│ (unused portion) │
├───────────────────────┤
rsp could │ Guard Page │ ---p (no permissions)
reach here → │ (unmapped, 4KB) │
├───────────────────────┤
│ Other mappings... │
└───────────────────────┘
When rsp drops into the guard page, the CPU triggers a page fault. The kernel sees that the
faulting address is in the guard region and sends SIGSEGV to the process. Game over.
Stack overflow: what really happens
// save as overflow.c — compile: gcc -o overflow overflow.c
#include <stdio.h>
void recurse(int depth) {
    char buffer[4096]; // 4KB per frame, eats the stack fast
    buffer[0] = 'A';   // touch it so compiler doesn't optimize it away
    printf("depth: %d, &buffer = %p\n", depth, (void *)buffer);
    recurse(depth + 1);
}

int main() {
    recurse(0);
    return 0;
}
$ ./overflow
depth: 0, &buffer = 0x7ffd3a1b0c50
depth: 1, &buffer = 0x7ffd3a1afbe0
depth: 2, &buffer = 0x7ffd3a1aeb70
...
depth: 2040, &buffer = 0x7ffd39924f50
depth: 2041, &buffer = 0x7ffd39923ee0
Segmentation fault (core dumped)
Each frame is ~4KB. 8MB / 4KB = ~2048 frames. The numbers check out.
The same thing in Rust:
fn recurse(depth: u32) {
    let buffer = [0u8; 4096];
    // Use black_box to prevent optimization
    std::hint::black_box(&buffer);
    println!("depth: {}, &buffer = {:p}", depth, &buffer);
    recurse(depth + 1);
}

fn main() {
    recurse(0);
}
Rust's behavior is the same — a stack overflow triggers a SIGSEGV. The Rust runtime catches it
and prints a friendlier message (thread 'main' has overflowed its stack), but the underlying
mechanism is identical: rsp hits the guard page, CPU faults, kernel delivers a signal.
Why the stack is fast
Let's compare stack allocation to heap allocation:
Stack allocation:
sub rsp, 32 ; one instruction — done
Stack deallocation:
add rsp, 32 ; one instruction — done
(or just: leave / ret)
Heap allocation:
call malloc ; enters allocator
→ search free lists ; find a suitable block
→ maybe call brk() ; ask kernel for more memory
→ update metadata ; bookkeeping
→ return pointer ; finally done
Heap deallocation:
call free ; enters allocator
→ validate pointer ; is this a real allocation?
→ coalesce blocks ; merge with adjacent free blocks
→ update free list ; bookkeeping
→ maybe call munmap ; return pages to kernel
The stack is also always "hot" in the CPU cache — the top of the stack was accessed by the previous instruction, so it's almost certainly in L1 cache. Heap allocations can be scattered across memory, causing cache misses.
💡 Fun Fact
The x86 push instruction does two things in one: it decrements rsp by 8 and writes the value to [rsp]. It's so common that the CPU has special hardware to accelerate it. On modern Intel CPUs, push can execute with the same throughput as a simple mov — the stack engine handles the rsp update for free.
Inspecting frames with GDB
You can see the actual stack frames in a debugger:
$ gcc -g -o stacktrace stacktrace.c
$ gdb ./stacktrace
(gdb) break c
(gdb) run
Breakpoint 1, c () at stacktrace.c:5
(gdb) backtrace
#0 c () at stacktrace.c:5
#1 b () at stacktrace.c:12
#2 a () at stacktrace.c:18
#3 main () at stacktrace.c:22
(gdb) info frame
Stack level 0, frame at 0x7fffffffde30:
rip = 0x555555555149 in c (stacktrace.c:5);
saved rip = 0x555555555185
called by frame at 0x7fffffffde50
...
(gdb) info registers rsp rbp
rsp 0x7fffffffde10
rbp 0x7fffffffde20
The backtrace command walks the chain of saved rbp values — each rbp points to the previous
frame's rbp, forming a linked list up the stack. That's how your debugger reconstructs the call
chain.
🔧 Task
Write a recursive function in C (and Rust) that prints its depth and the address of a local variable at each level. Run it and observe:
- Addresses decrease (stack grows down)
- Eventually it crashes (stack overflow)
- The number of frames matches ulimit -s / frame_size

Before running, calculate: with ulimit -s 8192 (8 MB) and a 4096-byte local array per frame, how many levels of recursion should you get? Run it and compare.

Try ulimit -s 16384 and run again. Do you get roughly twice as many frames?

Check dmesg | tail after the crash — you should see something like:

[12345.678] overflow[9999]: segfault at 7ffd39920ff0 ip ... sp ... error 6

The kernel logged the fault address. Confirm it's near the bottom of the stack region shown in /proc/<PID>/maps.