Pages and Page Tables

Type this right now

// save as pagemath.c — compile: gcc -o pagemath pagemath.c
#include <stdio.h>
#include <stdint.h>

int main() {
    uint64_t addr = 0x00007FFE12345678ULL;

    uint64_t offset  = addr & 0xFFF;           // Bits [11:0]
    uint64_t pt_idx  = (addr >> 12) & 0x1FF;   // Bits [20:12]
    uint64_t pd_idx  = (addr >> 21) & 0x1FF;   // Bits [29:21]
    uint64_t pdpt_idx = (addr >> 30) & 0x1FF;  // Bits [38:30]
    uint64_t pml4_idx = (addr >> 39) & 0x1FF;  // Bits [47:39]

    printf("Virtual address:  0x%016lx\n", addr);
    printf("PML4  index:      %3lu  (0x%03lx)\n", pml4_idx, pml4_idx);
    printf("PDPT  index:      %3lu  (0x%03lx)\n", pdpt_idx, pdpt_idx);
    printf("PD    index:      %3lu  (0x%03lx)\n", pd_idx, pd_idx);
    printf("PT    index:      %3lu  (0x%03lx)\n", pt_idx, pt_idx);
    printf("Page  offset:     %3lu  (0x%03lx)\n", offset, offset);

    return 0;
}
$ gcc -o pagemath pagemath.c && ./pagemath
Virtual address:  0x00007ffe12345678
PML4  index:      255  (0x0ff)
PDPT  index:      504  (0x1f8)
PD    index:      145  (0x091)
PT    index:      325  (0x145)
Page  offset:     1656  (0x678)

You just decomposed a virtual address into the exact indices the CPU uses to walk the page table. Every memory access your program makes involves this decomposition — in hardware, in nanoseconds.


What is a page?

Memory is managed in fixed-size chunks called pages. On x86-64, the standard page size is 4 KB (4096 bytes, or 0x1000 in hex).

    One page = 4096 bytes = 4 KB

    ┌─────────────────────────────────┐  Byte 0
    │                                 │
    │         4096 bytes              │
    │    of contiguous memory         │
    │                                 │
    └─────────────────────────────────┘  Byte 4095

Why 4 KB? It's a hardware compromise:

  • Too small (e.g., 64 bytes): a 48-bit address space would need 2^42 (over four trillion) page table entries. The tables themselves would consume all your RAM.
  • Too big (e.g., 16 MB): you'd waste memory. Allocating 1 byte would reserve 16 MB. ("Internal fragmentation.")
  • 4 KB: a reasonable middle ground, chosen in the 1970s and hardened into silicon ever since.

Virtual memory and physical memory are both divided into pages. Virtual pages are just called "pages." Physical pages are called frames. The page table maps pages to frames.


How a virtual address is split

A 48-bit virtual address isn't treated as one big number. It's split into fields, each serving as an index into a different level of the page table.

    63        48 47    39 38    30 29    21 20    12 11       0
    ┌──────────┬────────┬────────┬────────┬────────┬──────────┐
    │  Sign    │ PML4   │ PDPT   │  PD    │  PT    │  Offset  │
    │ extension│ Index  │ Index  │ Index  │ Index  │ (12 bits)│
    │ (16 bits)│(9 bits)│(9 bits)│(9 bits)│(9 bits)│          │
    └──────────┴────────┴────────┴────────┴────────┴──────────┘
         │         │        │        │        │          │
         │         │        │        │        │          └─► Byte within
         │         │        │        │        │              the 4KB page
         │         │        │        │        │
         │         │        │        │        └─► Which entry in the
         │         │        │        │            Page Table (PT)
         │         │        │        │
         │         │        │        └─► Which entry in the
         │         │        │            Page Directory (PD)
         │         │        │
         │         │        └─► Which entry in the
         │         │            Page Directory Pointer Table (PDPT)
         │         │
         │         └─► Which entry in the
         │             Page Map Level 4 (PML4)
         │
         └─► Must be copies of bit 47 (canonical address requirement)

9 bits = 512 possible values. So each table level has 512 entries. 12 bits of offset = 2^12 = 4096 = one page.

💡 Fun Fact: AMD designed x86-64 (hence its other name, AMD64) with 48 bits of virtual address (256 TB). Recent processors support 5-level paging with 57 bits (128 PB). The structure is identical — they just add a PML5 level with another 9-bit index.


The page table entry (PTE)

Each entry in a page table is 8 bytes (64 bits). It contains the physical frame address plus control flags:

    63  62       52 51                       12 11  9 8 7 6 5 4 3 2 1 0
    ┌──┬──────────┬───────────────────────────┬─────┬─┬─┬─┬─┬─┬───┬─┬─┐
    │NX│ Available │  Physical Frame Address   │ Avl │G│ │D│A│ │U/S│R│P│
    │  │ (for OS) │       (40 bits)            │     │ │ │ │ │ │   │W│ │
    └──┴──────────┴───────────────────────────┴─────┴─┴─┴─┴─┴─┴───┴─┴─┘
     │                      │                              │   │  │  │
     │                      │                              │   │  │  └─ Present: page in RAM?
     │                      │                              │   │  └─── Read/Write: writable?
     │                      │                              │   └───── User/Supervisor: user-
     │                      │                              │          accessible?
     │                      │                              └───── Accessed: been read?
     │                      │                                     Dirty: been written?
     │                      │
     │                      └─ Physical frame number (left-shifted by 12
     │                         to get the physical address of the frame)
     │
     └─ NX (No Execute): if set, CPU will refuse to execute
        code from this page — defense against code injection

The Present bit is crucial. If it's 0, the page is not in RAM. Accessing it triggers a page fault. The kernel then decides what to do — load from swap, demand-allocate, or kill the process with SIGSEGV.


Why a single-level page table is impossible

Let's do the math for one flat table.

    48-bit virtual address space
    ÷ 4 KB page size (12-bit offset)
    = 2^36 virtual pages
    × 8 bytes per PTE
    = 2^39 bytes = 512 GB per process

    That's the size of the table alone. For ONE process.
    With 100 processes, you'd need 50 TB of RAM just for tables.

Clearly, this doesn't work. The solution: multi-level tables — only allocate the table pages that actually contain mappings.


The four-level page table walk

Here's the complete walk. The CPU's CR3 register holds the physical address of the PML4 table for the currently running process.

    CR3 register
    ┌──────────────────┐
    │ PML4 phys addr   │
    └────────┬─────────┘
             │
             ▼
    ┌──────────────────────────────────────────────┐
    │ PML4 Table (512 entries × 8 bytes = 4 KB)    │
    │                                              │
    │  [0]  [1]  [2] ... [pml4_idx] ... [511]     │
    │                        │                     │
    └────────────────────────┼─────────────────────┘
                             │  bits [47:39] of VA
                             ▼
    ┌──────────────────────────────────────────────┐
    │ PDPT Table (512 entries × 8 bytes = 4 KB)    │
    │                                              │
    │  [0]  [1]  [2] ... [pdpt_idx] ... [511]     │
    │                        │                     │
    └────────────────────────┼─────────────────────┘
                             │  bits [38:30] of VA
                             ▼
    ┌──────────────────────────────────────────────┐
    │ Page Directory (512 entries × 8 bytes = 4 KB)│
    │                                              │
    │  [0]  [1]  [2] ... [pd_idx]   ... [511]     │
    │                        │                     │
    └────────────────────────┼─────────────────────┘
                             │  bits [29:21] of VA
                             ▼
    ┌──────────────────────────────────────────────┐
    │ Page Table (512 entries × 8 bytes = 4 KB)    │
    │                                              │
    │  [0]  [1]  [2] ... [pt_idx]   ... [511]     │
    │                        │                     │
    └────────────────────────┼─────────────────────┘
                             │  bits [20:12] of VA
                             ▼
    ┌──────────────────────────────────────────────┐
    │ Physical Frame (4096 bytes)                  │
    │                                              │
    │  byte[0] ... byte[offset] ... byte[4095]    │
    │                    │                         │
    └────────────────────┼─────────────────────────┘
                         │  bits [11:0] of VA
                         ▼
                    Target Byte

Four memory reads to translate one virtual address. Each level is itself a 4 KB page in physical memory, containing 512 eight-byte entries.

🧠 What do you think happens?

At level 2 (PD), the CPU reads entry pd_idx and finds the Present bit is 0. What happens? Does the CPU try the next entry? Does it guess? Or does it immediately trap to the kernel?


Why this saves memory

A process that uses only a small range of addresses needs very few table pages:

    A process using addresses 0x400000 - 0x410000 (64 KB):

    PML4: 1 page  (always exists, pointed to by CR3)
    PDPT: 1 page  (only entry [0] is populated)
    PD:   1 page  (only entry [2] is populated)
    PT:   1 page  (only entries [0]-[15] are populated)

    Total: 4 pages × 4 KB = 16 KB of page tables for a 64 KB mapping

    A flat table would need: 512 GB
    Savings: 99.999997%

Levels that aren't needed simply don't exist. Their parent's entry has Present=0, and that's that.


The TLB: making it fast

Four memory accesses per translation would be crippling. At ~100 ns per RAM access, that's 400 ns just to translate a single address — you'd run at about 2.5 million memory accesses per second. A modern CPU executes 4+ billion instructions in that time.

The fix: the Translation Lookaside Buffer (TLB). It's a small, fast cache of recent virtual-to-physical translations.

    Virtual address 0x7FFE12345678
           │
           ▼
    ┌─────────────────────────────────────┐
    │              TLB                     │
    │  ┌──────────────┬─────────────────┐ │
    │  │ Virtual Page │ Physical Frame  │ │
    │  ├──────────────┼─────────────────┤ │
    │  │ 0x7FFE12345  │ 0x3A201  ──────────── HIT! (~1 cycle)
    │  │ 0x55A3B2C02  │ 0x1F800        │ │
    │  │ 0x7F8A12040  │ 0x22100        │ │
    │  │     ...      │    ...         │ │
    │  └──────────────┴─────────────────┘ │
    │                                     │
    │  ~64-2048 entries (varies by CPU)   │
    └─────────────────────────────────────┘
           │
           │ MISS → full 4-level walk (~100 ns)
           ▼

TLB hit: ~1 cycle. Translation is essentially free. TLB miss: full 4-level page table walk. Hundreds of cycles.

Most programs have excellent TLB hit rates (>99%) because of locality — they access the same pages repeatedly.


Context switches flush the TLB

When the OS switches from Process A to Process B, it loads B's page table base into CR3. That write invalidates the TLB (every entry except those marked global, such as the kernel's own mappings) — because A's translations are wrong for B.

    Running Process A:  TLB full of A's translations  (fast!)
           │
           │  Context switch → load B's CR3
           ▼
    Running Process B:  TLB is EMPTY  (cold start, slow!)
           │
           │  First ~1000 memory accesses → all TLB misses
           ▼
    Running Process B:  TLB warming up...              (getting faster)

This is one reason context switches are expensive. The process that resumes starts with a cold TLB and suffers many misses until it warms up. Modern CPUs have PCID (Process Context Identifiers) — tagging TLB entries with a process ID to avoid full flushes, but the TLB is still small.


Huge pages: fewer translations, fewer misses

Standard 4 KB pages mean a 1 GB dataset spans 262,144 pages. That's a lot of TLB entries. The TLB might only hold 2,048 — so you're constantly evicting and reloading translations.

Huge pages use larger page sizes:

    Page Size    Offset Bits    Levels Walked    TLB entries for 1 GB
    ─────────    ───────────    ─────────────    ────────────────────
    4 KB         12             4                262,144
    2 MB         21             3 (skip PT)      512
    1 GB         30             2 (skip PT+PD)   1

With 2 MB pages, the PD entry points directly to a 2 MB physical frame — the PT level is skipped. With 1 GB pages, even the PD level is skipped.

// Allocating huge pages in C (Linux)
#include <stdio.h>
#include <sys/mman.h>

void *p = mmap(NULL, 2 * 1024 * 1024,      // 2 MB
               PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
               -1, 0);
if (p == MAP_FAILED)       // fails unless huge pages have been reserved,
    perror("mmap");        // e.g. via /proc/sys/vm/nr_hugepages

💡 Fun Fact: Database servers like PostgreSQL and Redis often use huge pages. A database with a 32 GB buffer pool would need 8 million TLB entries with 4 KB pages. With 2 MB huge pages, that drops to 16,384. The performance difference is measurable — 5-10% on memory-heavy workloads.


Rust: same hardware, same rules

fn main() {
    let addr: u64 = 0x00007FFE12345678;

    let offset   = addr & 0xFFF;
    let pt_idx   = (addr >> 12) & 0x1FF;
    let pd_idx   = (addr >> 21) & 0x1FF;
    let pdpt_idx = (addr >> 30) & 0x1FF;
    let pml4_idx = (addr >> 39) & 0x1FF;

    println!("Virtual address:  {:#018x}", addr);
    println!("PML4  index:      {:3} ({:#05x})", pml4_idx, pml4_idx);
    println!("PDPT  index:      {:3} ({:#05x})", pdpt_idx, pdpt_idx);
    println!("PD    index:      {:3} ({:#05x})", pd_idx, pd_idx);
    println!("PT    index:      {:3} ({:#05x})", pt_idx, pt_idx);
    println!("Page  offset:     {:3} ({:#05x})", offset, offset);
}

The page table structure doesn't care whether the process was written in C, Rust, Python, or assembly. It's hardware. Every byte your Rust program touches goes through the same four-level walk (or TLB hit).


🔧 Task: Walk a page table by hand

Take virtual address 0x00007FFE12345678. Work through the walk on paper:

  1. Split the address into its component fields:

    • Bits [47:39] → PML4 index = ?
    • Bits [38:30] → PDPT index = ?
    • Bits [29:21] → PD index = ?
    • Bits [20:12] → PT index = ?
    • Bits [11:0] → Page offset = ?
  2. Convert to binary first if it helps:

    0x00007FFE12345678
    = 0000 0000 0000 0000 0111 1111 1111 1110
      0001 0010 0011 0100 0101 0110 0111 1000
    
    Split:
    [47:39] = 0_1111_1111 = 0xFF = 255
    [38:30] = 1_1111_1000 = 0x1F8 = 504
    [29:21] = 0_1001_0001 = 0x091 = 145
    [20:12] = 1_0100_0101 = 0x145 = 325
    [11:0]  = 0110_0111_1000 = 0x678 = 1656
    
  3. Verify your answers by running the C program from the top of this chapter.

  4. Try a different address: your own stack variable. Print its address, then decompose it. Does the PML4 index match what you'd expect for a stack address (high in the user-space range)?

  5. Check the real page tables (Linux, needs root):

    # Dump the first few pagemap entries for a process
    $ sudo cat /proc/<pid>/pagemap | xxd | head
    

    The pagemap file lets you look up the physical frame for any virtual page. See Documentation/admin-guide/mm/pagemap.rst in the kernel source for the format.