The Stack: Disciplined and Fast

Type this right now

// save as stacktrace.c — compile: gcc -g -o stacktrace stacktrace.c
#include <stdio.h>

void c() {
    int z = 3;
    printf("c: &z = %p\n", (void *)&z);
}

void b() {
    int y = 2;
    printf("b: &y = %p\n", (void *)&y);
    c();
}

void a() {
    int x = 1;
    printf("a: &x = %p\n", (void *)&x);
    b();
}

int main() {
    a();
    return 0;
}
$ gcc -g -o stacktrace stacktrace.c && ./stacktrace
a: &x = 0x7ffd3a1b1c5c
b: &y = 0x7ffd3a1b1c3c
c: &z = 0x7ffd3a1b1c1c

Each address is lower than the last. The stack grows downward. Every function call you've ever made used this mechanism. Let's see exactly how.


What is a stack frame?

When you call a function, the call instruction and the function's prologue together create a stack frame, a block of memory on the stack that holds everything that function needs:

High addresses (top of stack at process start)
                 ┌──────────────────────────┐
                 │   main's frame           │
                 │  ┌────────────────────┐  │
                 │  │ (main's locals)    │  │
                 │  │ saved rbp          │  │
                 │  │ return address     │  │ ← where to go when main returns
                 │  └────────────────────┘  │
                 ├──────────────────────────┤
                 │   a's frame              │
                 │  ┌────────────────────┐  │
                 │  │ x = 1              │  │ [rbp - 4]
                 │  │ saved rbp ─────────│──┘ points to main's rbp
                 │  │ return address     │    addr in main after call a()
                 │  └────────────────────┘  │
                 ├──────────────────────────┤
                 │   b's frame              │
                 │  ┌────────────────────┐  │
                 │  │ y = 2              │  │ [rbp - 4]
                 │  │ saved rbp ─────────│──┘ points to a's rbp
                 │  │ return address     │    addr in a after call b()
                 │  └────────────────────┘  │
                 ├──────────────────────────┤
                 │   c's frame              │
                 │  ┌────────────────────┐  │
                 │  │ z = 3              │  │ [rbp - 4]
                 │  │ saved rbp ─────────│──┘ points to b's rbp
                 │  │ return address     │    addr in b after call c()
  rsp ──────►   │  └────────────────────┘  │
                 └──────────────────────────┘
Low addresses (stack grows this direction)

Two registers drive the whole thing:

  • rsp (stack pointer): always points to the top of the stack (the lowest used address)
  • rbp (base pointer): points to the base of the current frame — a stable reference point

How call and ret work

When the CPU executes call some_function:

call some_function
; conceptually equivalent to (rip can't actually be pushed directly):
push rip          ; push address of next instruction onto stack
jmp some_function ; jump to the function

When the CPU executes ret:

ret
; conceptually equivalent to:
pop rip           ; pop address from stack into instruction pointer
; execution continues at that address

That's it. call = push return address + jump. ret = pop return address + jump back.


The function prologue and epilogue

Every compiled function begins with a prologue and ends with an epilogue. This is the bookkeeping that creates and destroys stack frames.

Prologue (at the start of every function):

push rbp          ; save caller's base pointer
mov  rbp, rsp     ; set our base pointer to current stack top
sub  rsp, N       ; reserve N bytes for local variables

Epilogue (at the end of every function):

leave             ; equivalent to: mov rsp, rbp; pop rbp
ret               ; pop return address, jump to it

Let's trace through calling a():

Before call a():
  rsp → [... main's stuff ...]

1. call a          → push return address, jump to a
   rsp → [ret_addr | ... main's stuff ...]

2. push rbp        → save main's rbp
   rsp → [main_rbp | ret_addr | ... main's stuff ...]

3. mov rbp, rsp    → rbp now points here
   rbp → [main_rbp | ret_addr | ... main's stuff ...]

4. sub rsp, 16     → reserve space for locals
   rsp → [        | x = ? | padding | main_rbp | ret_addr | ...]
          ^^^^^^^^^^^^^^^^
          a's local variables

🧠 What do you think happens?

If you never execute a function's epilogue, what happens to the stack? With longjmp, the answer is that setjmp saved rsp earlier, so the jump restores it and the skipped frames are abandoned wholesale; their epilogues simply never run. A signal handler is different: it builds its frame below the current rsp (unless you install an alternate stack with sigaltstack), so a deeply used stack leaves little room for handlers. Think about what this means for stack usage in signal handlers.


Local variables: just offsets from rbp

The compiler doesn't give your local variables names at runtime. They're just offsets:

void example() {
    int a = 10;    // [rbp - 4]
    int b = 20;    // [rbp - 8]
    long c = 30;   // [rbp - 16]
}

The assembly:

example:
    push rbp
    mov  rbp, rsp
    sub  rsp, 16
    mov  DWORD PTR [rbp-4],  10    ; a
    mov  DWORD PTR [rbp-8],  20    ; b
    mov  QWORD PTR [rbp-16], 30    ; c
    leave
    ret

No names. No types. Just memory locations relative to rbp. The "variable" is a fiction that exists only in your source code and your debugger's symbol table.


C: extra stack tricks

C gives you a few extra ways to use the stack:

alloca — allocate on the stack dynamically:

#include <alloca.h>

void risky() {
    int n = 1000;
    char *buf = alloca(n);  // n bytes on the stack, freed on return
    buf[0] = 'A';
}

This just moves rsp down by n bytes. No malloc, no free, no heap involvement. Fast, but dangerous — if n is too large, you blow through the stack limit with no warning.

Variable-Length Arrays (VLAs):

void process(int n) {
    int arr[n];   // VLA: n elements on the stack
    arr[0] = 42;
}

Same idea as alloca — the compiler adjusts rsp at runtime. Same risks.


Rust: stack discipline

Rust uses the same stack mechanics but gives you fewer footguns:

#![allow(unused)]
fn main() {
fn example() {
    let a: i32 = 10;           // stack — same as C
    let b = [0u8; 100];        // stack — fixed-size array, size known at compile time
    let c = Box::new(42);      // heap — when you need dynamic or large allocation
    // no alloca, no VLAs
}
}

Rust doesn't have alloca or VLAs. If you need a runtime-sized buffer, you use Vec<T>, which allocates on the heap. This is a deliberate choice — the Rust team decided that stack overflows from unchecked dynamic stack allocation are too dangerous to allow in safe code.

If you really need stack allocation with a runtime size, you'd use a crate or unsafe code. But the standard path is: fixed sizes on the stack, dynamic sizes on the heap.


Stack size and limits

Check your current stack size limit:

$ ulimit -s
8192

That's 8192 KB = 8 MB, the limit for the main thread's stack. Every thread gets its own stack, and threads you create can be given a different size.

What this means in practice:

8 MB = 8,388,608 bytes
int takes 4 bytes
Maximum ints on the stack ≈ 2,000,000

But stack frames have overhead (return address, saved rbp, alignment),
and you have many nested function calls, so the practical limit is lower.

You can change it:

$ ulimit -s 16384   # 16 MB stack
$ ulimit -s unlimited  # no limit (dangerous)

💡 Fun Fact

Threads created with pthread_create get their own stack, and the default size varies: glibc sizes new thread stacks from the ulimit -s value, while musl defaults to 128 KB. You can configure it explicitly with pthread_attr_setstacksize. Rust's std::thread::Builder exposes stack_size() for the same purpose.


Guard pages: the safety net

The OS doesn't just give you 8MB of stack and hope for the best. It reserves a guard region at the bottom of the stack: unmapped memory (historically a single 4KB page; modern Linux keeps a guard gap of 1MB by default) that exists solely to catch stack overflows:

                 ┌───────────────────────┐
                 │   Stack               │  8 MB of rw-p pages
                 │   (used portion)      │
                 │                       │
                 │   (unused portion)    │
                 ├───────────────────────┤
  rsp could      │   Guard Page          │  ---p (no permissions)
  reach here →   │   (unmapped, 4KB)     │
                 ├───────────────────────┤
                 │   Other mappings...   │
                 └───────────────────────┘

When rsp drops into the guard page, the CPU triggers a page fault. The kernel sees that the faulting address is in the guard region and sends SIGSEGV to the process. Game over.


Stack overflow: what really happens

// save as overflow.c — compile: gcc -o overflow overflow.c
#include <stdio.h>

void recurse(int depth) {
    char buffer[4096];  // 4KB per frame, eats the stack fast
    buffer[0] = 'A';   // touch it so compiler doesn't optimize it away
    printf("depth: %d, &buffer = %p\n", depth, (void *)buffer);
    recurse(depth + 1);
}

int main() {
    recurse(0);
    return 0;
}
$ ./overflow
depth: 0, &buffer = 0x7ffd3a1b0c50
depth: 1, &buffer = 0x7ffd3a1afbe0
depth: 2, &buffer = 0x7ffd3a1aeb70
...
depth: 2040, &buffer = 0x7ffd39924f50
depth: 2041, &buffer = 0x7ffd39923ee0
Segmentation fault (core dumped)

Each frame is ~4KB. 8MB / 4KB = ~2048 frames. The numbers check out.

The same thing in Rust:

fn recurse(depth: u32) {
    let buffer = [0u8; 4096];
    // Use black_box to prevent optimization
    std::hint::black_box(&buffer);
    println!("depth: {}, &buffer = {:p}", depth, &buffer);
    recurse(depth + 1);
}

fn main() {
    recurse(0);
}

Rust's behavior is the same — a stack overflow triggers a SIGSEGV. The Rust runtime catches it and prints a friendlier message (thread 'main' has overflowed its stack), but the underlying mechanism is identical: rsp hits the guard page, CPU faults, kernel delivers a signal.


Why the stack is fast

Let's compare stack allocation to heap allocation:

Stack allocation:
    sub rsp, 32        ; one instruction — done

Stack deallocation:
    add rsp, 32        ; one instruction — done
    (or just: leave / ret)

Heap allocation:
    call malloc         ; enters allocator
    → search free lists ; find a suitable block
    → maybe call brk()  ; ask kernel for more memory
    → update metadata   ; bookkeeping
    → return pointer    ; finally done

Heap deallocation:
    call free           ; enters allocator
    → validate pointer  ; is this a real allocation?
    → coalesce blocks   ; merge with adjacent free blocks
    → update free list  ; bookkeeping
    → maybe call munmap ; return pages to kernel

The stack is also always "hot" in the CPU cache: the top of the stack is touched by nearly every instruction sequence (calls, pushes, local variable accesses), so it's almost certainly in L1 cache. Heap allocations can be scattered across memory, causing cache misses.

💡 Fun Fact

The x86 push instruction does two things in one: it decrements rsp by 8 and writes the value to [rsp]. It's so common that the CPU has special hardware to accelerate it. On modern Intel CPUs, push can execute with the same throughput as a simple mov — the stack engine handles the rsp update for free.


Inspecting frames with GDB

You can see the actual stack frames in a debugger:

$ gcc -g -o stacktrace stacktrace.c
$ gdb ./stacktrace
(gdb) break c
(gdb) run
Breakpoint 1, c () at stacktrace.c:5
(gdb) backtrace
#0  c () at stacktrace.c:5
#1  b () at stacktrace.c:12
#2  a () at stacktrace.c:18
#3  main () at stacktrace.c:22

(gdb) info frame
Stack level 0, frame at 0x7fffffffde30:
 rip = 0x555555555149 in c (stacktrace.c:5);
 saved rip = 0x555555555185
 called by frame at 0x7fffffffde50
 ...

(gdb) info registers rsp rbp
rsp  0x7fffffffde10
rbp  0x7fffffffde20

The backtrace command walks the chain of saved rbp values — each rbp points to the previous frame's rbp, forming a linked list up the stack. That's how your debugger reconstructs the call chain.


🔧 Task

  1. Write a recursive function in C (and Rust) that prints its depth and the address of a local variable at each level. Run it and observe:

    • Addresses decrease (stack grows down)
    • Eventually it crashes (stack overflow)
    • The number of frames roughly matches (ulimit -s × 1024) / frame size
  2. Before running, calculate: with ulimit -s 8192 (8MB) and a 4096-byte local array per frame, how many levels of recursion should you get? Run it and compare.

  3. Try ulimit -s 16384 and run again. Do you get roughly twice as many frames?

  4. Check dmesg | tail after the crash — you should see something like:

    [12345.678] overflow[9999]: segfault at 7ffd39920ff0 ip ... sp ... error 6
    

    The kernel logged the fault address. Confirm it's near the bottom of the stack region shown in /proc/<PID>/maps.