Page Faults: When Things Get Interesting

Type this right now

// save as lazy.c — compile: gcc -o lazy lazy.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>   // getpid()

int main() {
    // Ask for 100 MB. Does the OS actually give us 100 MB of RAM?
    size_t size = 100 * 1024 * 1024;
    char *p = malloc(size);
    if (p == NULL) {
        perror("malloc");
        return 1;
    }

    printf("malloc returned: %p\n", (void *)p);
    printf("Now check: ps -o pid,vsz,rss -p %d\n", getpid());
    printf("Press Enter BEFORE touching memory...\n");
    getchar();

    // Touch every page (write one byte per 4 KB page)
    for (size_t i = 0; i < size; i += 4096) {
        p[i] = 'A';
    }

    printf("Memory touched. Press Enter to check again...\n");
    getchar();

    free(p);
    return 0;
}

Run it in one terminal. In another terminal, check RSS before and after:

$ ps -o pid,vsz,rss -p $(pgrep lazy)
  PID    VSZ   RSS
 1234 203456  2340      ← Before: VSZ is large, RSS is tiny!

# (press Enter in the first terminal)

$ ps -o pid,vsz,rss -p $(pgrep lazy)
  PID    VSZ   RSS
 1234 203456 104820     ← After: RSS jumped ~100 MB

VSZ (virtual size) was always large — the address space was mapped. RSS (resident set size) was tiny until you touched the pages. Physical RAM was allocated one page at a time, on demand, via page faults.


A page fault is NOT an error

This is one of the most misunderstood concepts in systems programming. A page fault is a CPU exception that says: "I tried to translate this virtual address, and the page table says I can't proceed. Kernel, please help."

The kernel then decides what to do:

    CPU executes: mov [0x7F4000], eax
         │
         ▼
    MMU walks page table → PTE has Present=0 (or wrong permissions)
         │
         ▼
    CPU raises Page Fault Exception (#PF)
         │
         ▼
    Kernel's page fault handler runs
         │
         ├──► Minor fault?  → Map a physical frame, resume
         │
         ├──► Major fault?  → Load from disk, map frame, resume
         │
         └──► Invalid?      → Send SIGSEGV → process dies

Three types of page faults (in four cases)

1. Minor fault — demand paging

You called malloc(100 MB). The kernel said "sure" and set up virtual address mappings but left every PTE with Present=0. No physical RAM was allocated.

When you first touch a page:

    Your code:  p[0] = 'A';
                   │
                   ▼
    Virtual address 0x7F4000 → MMU walks table → Present=0
                   │
                   ▼
    Page Fault (minor) → kernel allocates a physical frame
                   │     from the free page pool, zeros it,
                   │     updates the PTE: Present=1, Frame=0x1A200
                   │
                   ▼
    CPU retries the instruction → MMU walks table → Present=1 → success!
    p[0] is now 'A' in frame 0x1A200

This happens once per page, then never again (for that page). The process never notices — the retry is automatic. This is why malloc(1 GB) succeeds on a 4 GB system — physical RAM is only committed when pages are actually accessed.

🧠 What do you think happens?

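You call malloc(1 TB) on a machine with 16 GB of RAM. Assume the kernel's overcommit policy allows it (with the default heuristic, a request this far beyond RAM plus swap may simply be refused), so malloc returns a valid pointer. You then try to touch every page. At some point, the kernel runs out of physical frames. What happens? (Hint: look up the "OOM killer.")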

2. Minor fault — copy-on-write

After fork(), parent and child share pages marked read-only. When either writes:

    Child writes to shared page at 0x5000
         │
         ▼
    MMU: page is present but marked read-only → Page Fault
         │
         ▼
    Kernel: "Ah, this is a copy-on-write page."
         │
         ├── Allocate new physical frame
         ├── Copy contents from original frame
         ├── Update child's PTE: point to new frame, mark writable
         └── Resume child's instruction
         │
         ▼
    Child's write succeeds. Parent's page is unaffected.

Still a minor fault — no disk I/O. Just a memory copy and a PTE update.

3. Major fault — loading from disk

The page was once in RAM but got swapped out to disk (because the system was low on memory). The PTE has Present=0 but contains a swap entry telling the kernel where on disk the data lives.

    Access to page at 0x8000 → Present=0, swap entry = disk sector 42501
         │
         ▼
    Kernel: "This page was swapped out."
         │
         ├── Allocate a free physical frame
         ├── Read 4 KB from swap disk into the frame     ◄── SLOW! ~5-10 ms
         ├── Update PTE: Present=1, Frame=new_frame
         └── Resume instruction
         │
         ▼
    Access succeeds. But it cost milliseconds, not nanoseconds.

Speed comparison:

    TLB hit:           ~1 ns
    Minor page fault:  ~1-10 μs    (1,000× slower)
    Major page fault:  ~5-10 ms    (5,000,000× slower than TLB hit)
                                    (1,000× slower than minor fault)

Major faults are the reason your system feels sluggish when it starts swapping. A program that would normally take 1 second can take hours if most of its accesses trigger major faults.

4. Invalid fault — you messed up

The virtual address has no mapping at all. No PTE. No swap entry. Nothing.

    Access to 0xDEADBEEF → no mapping exists
         │
         ▼
    Kernel: "This address is not valid for this process."
         │
         ▼
    Kernel sends SIGSEGV to the process
         │
         ▼
    Default action: terminate the process and dump core (your shell prints "Segmentation fault")

This is the one that kills your program. We'll cover it in detail in Chapter 18.


Watching page faults happen

Linux tracks page faults per process. You can see them:

$ /usr/bin/time -v ./lazy < /dev/null 2>&1 | grep -i fault
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 25612

25,612 minor faults for 100 MB makes sense: 100 MB / 4 KB = 25,600 pages (plus a few for the program itself, stack, libraries).

You can also count them with perf stat, which prints the totals when the program exits:

$ perf stat -e page-faults,minor-faults,major-faults ./lazy
     25,614      page-faults
     25,614      minor-faults
          0      major-faults

mmap: the Swiss army knife

mmap() is the system call that creates virtual address mappings. Nearly everything runs through it (small malloc requests, which typically grow the heap with brk, are the main exception):

    malloc (large allocs)  → calls mmap(MAP_ANONYMOUS | MAP_PRIVATE)
    Loading shared libs    → dynamic linker calls mmap(MAP_PRIVATE, fd)
    Reading files          → you call mmap(MAP_PRIVATE, fd)
    Shared memory          → mmap(MAP_SHARED | MAP_ANONYMOUS)
    Copy-on-write fork     → kernel manipulates existing mappings

Here's mmap used to read a file:

// save as mmapread.c — compile: gcc -o mmapread mmapread.c
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

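int main(void) {
    int fd = open("/etc/passwd", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    struct stat st;
    if (fstat(fd, &st) < 0) {
        perror("fstat");
        close(fd);
        return 1;
    }

    // Map the entire file into our address space
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }
    close(fd);  // We can close fd: the mapping keeps the file accessible

    // Access the file like a normal array. It is not NUL-terminated,
    // so bound the print by the file size.
    int n = st.st_size < 80 ? (int)st.st_size : 80;
    printf("First %d bytes:\n%.*s\n", n, n, data);

    munmap(data, st.st_size);
    return 0;
}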

And in Rust:

// Needs the libc crate (add to Cargo.toml: libc = "0.2")
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn main() {
    let file = File::open("/etc/passwd").unwrap();
    let len = file.metadata().unwrap().len() as usize;

    // Using the memmap2 crate (add to Cargo.toml: memmap2 = "0.9")
    // let mmap = unsafe { memmap2::Mmap::map(&file).unwrap() };
    // println!("{}", String::from_utf8_lossy(&mmap[..mmap.len().min(80)]));

    // Or with raw mmap:
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ,
            libc::MAP_PRIVATE,
            file.as_raw_fd(),
            0,
        )
    };
    // mmap signals failure with MAP_FAILED, not a null pointer
    assert!(ptr != libc::MAP_FAILED, "mmap failed");

    let data = unsafe { std::slice::from_raw_parts(ptr as *const u8, len) };
    // Clamp to the file length so a short file does not panic
    let n = len.min(80);
    println!("First {} bytes:\n{}", n, String::from_utf8_lossy(&data[..n]));
    unsafe { libc::munmap(ptr, len); }
}

💡 Fun Fact: When the kernel loads your ELF binary at exec() time, it doesn't read the whole file into RAM. It mmaps the segments. Your .text section is demand-paged at page granularity: code pages that are never executed may never be loaded from disk (readahead aside).


The page fault flow in full

Here's the complete picture of what happens when the CPU can't translate an address:

    CPU executes instruction that accesses virtual address VA
         │
         ▼
    MMU checks TLB ─── Hit? ──► Translate, access physical memory. Done.
         │
         No (TLB miss)
         │
         ▼
    MMU walks 4-level page table
         │
         ├── Present=1 and permissions OK? ──► Load into TLB, access memory. Done.
         │
         └── Present=0 or permission violation?
              │
              ▼
         CPU stores the faulting address in the CR2 register
         CPU raises #PF exception (interrupt 14)
         CPU switches to kernel mode
              │
              ▼
         Kernel page fault handler (arch/x86/mm/fault.c)
              │
              ├── Is VA in a valid VMA? (vm_area_struct)
              │    │
              │    No ──► Send SIGSEGV (invalid access)
              │    │
              │    Yes
              │    ▼
              ├── Was it a write to a read-only COW page?
              │    │
              │    Yes ──► Allocate frame, copy page, update PTE, resume
              │    │
              │    No
              │    ▼
              ├── Is there a swap entry?
              │    │
              │    Yes ──► Read from swap (major fault), map frame, resume
              │    │
              │    No
              │    ▼
              ├── Is it a demand-zero page (anonymous)?
              │    │
              │    Yes ──► Allocate zeroed frame, map it, resume (minor fault)
              │    │
              │    No
              │    ▼
              └── Is it a file-backed mapping?
                   │
                   Yes ──► Already in page cache? Map it (minor fault). Otherwise read from file (major fault). Resume
                   │
                   No ──► Send SIGSEGV

🔧 Task: Watch demand paging in /proc/self/status

// save as demand.c — compile: gcc -o demand demand.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

void show_rss() {
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/status", getpid());
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        perror("fopen");
        return;
    }
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmRSS", 5) == 0 || strncmp(line, "VmSize", 6) == 0) {
            printf("  %s", line);
        }
    }
    fclose(f);
}

int main() {
    printf("Before malloc:\n");
    show_rss();

    char *p = malloc(100 * 1024 * 1024);  // 100 MB
    if (p == NULL) {
        perror("malloc");
        return 1;
    }
    printf("\nAfter malloc (before touching):\n");
    show_rss();

    // Touch every page
    for (size_t i = 0; i < 100 * 1024 * 1024; i += 4096) {
        p[i] = 1;
    }
    printf("\nAfter touching every page:\n");
    show_rss();

    free(p);
    printf("\nAfter free:\n");
    show_rss();

    return 0;
}

$ ./demand
Before malloc:
  VmSize:     2580 kB
  VmRSS:      1024 kB

After malloc (before touching):
  VmSize:   105060 kB     ← Virtual size jumped 100 MB
  VmRSS:      1040 kB     ← Physical memory: basically unchanged!

After touching every page:
  VmSize:   105060 kB     ← Virtual size: same
  VmRSS:   103420 kB     ← RSS jumped ~100 MB. NOW the RAM is used.

After free:
  VmSize:     2580 kB     ← Virtual mapping released
  VmRSS:      1040 kB     ← Physical RAM returned to the OS

Key insight: VmSize reflects the virtual address space. VmRSS reflects physical RAM. malloc only affects VmSize. Actually touching the memory triggers page faults, which allocate physical frames and increase VmRSS.

Now run it again under perf stat:

$ perf stat -e minor-faults,major-faults ./demand
     25,630      minor-faults     ← ~25,600 pages = 100 MB / 4 KB
          0      major-faults

Nearly every one of those ~25,630 minor faults was the kernel giving you one physical frame (the few dozen beyond 25,600 come from program startup). No disk I/O. Just pure demand paging.