Privilege, Protection, and System Calls

Type this right now

$ cat > hello.c << 'EOF'
#include <stdio.h>
int main() { printf("Hello from C!\n"); return 0; }
EOF
$ gcc -o hello hello.c
$ strace ./hello 2>&1 | tail -20

You'll see a flood of output like execve(...), brk(...), mmap(...), write(1, "Hello from C!\n", 14). Those are system calls: every request your program makes of the OS. Running this one-line program triggered dozens of them, most issued by the dynamic loader and libc startup code before main even runs.


The two worlds

Your CPU has (at least) two privilege levels, enforced by the hardware itself:

    ┌─────────────────────────────────────────────────────────┐
    │                    Ring 3: User Mode                     │
    │                                                         │
    │  Your program lives here.                               │
    │  CAN: execute instructions, read/write own memory       │
    │  CANNOT: access hardware, read other processes' memory,  │
    │          change page tables, halt the CPU                │
    ├═════════════════════════════════════════════════════════╡
    │  ▲▲▲  HARDWARE-ENFORCED WALL  ▲▲▲                       │
    │  The ONLY way through: the syscall instruction.         │
    ├═════════════════════════════════════════════════════════╡
    │                    Ring 0: Kernel Mode                   │
    │                                                         │
    │  The OS kernel lives here.                              │
    │  CAN: everything — all memory, all hardware, all regs   │
    └─────────────────────────────────────────────────────────┘

This isn't a software convention. The CPU tracks the current privilege level (CPL) in the low two bits of the CS register. When it says "Ring 3," privileged instructions such as hlt or a write to cr3 simply won't execute. The CPU raises a fault instead.

What do you think happens? What stops a program from changing that privilege bit to Ring 0?

Reveal: Everything that could raise the privilege level is itself privileged. You can't run it from Ring 3. Hardware catch-22, by design. The only way into Ring 0 is through controlled entry points the kernel sets up at boot time: the system call entry point and the handlers for interrupts and CPU exceptions.


The x86 protection rings

x86 defines four rings, but modern OSes use only two:

                    ┌───────────────────┐
                    │   Ring 0: Kernel  │  Full access
                    ├───────────────────┤
                    │   Ring 1 (unused) │
                    ├───────────────────┤
                    │   Ring 2 (unused) │
                    ├───────────────────┤
                    │   Ring 3: User    │  Your program
                    └───────────────────┘

Why you can't read another process's memory

Two mechanisms work together:

1. Virtual memory. Each process has its own page table. Process A's address 0x4000 maps to a completely different physical location than Process B's 0x4000.

2. Privilege levels. Page tables live in kernel memory. Only Ring 0 can modify them. You can't remap your own pages because writing to cr3 (page table base register) is privileged.

    Process A                    Process B
    ┌──────────┐                 ┌──────────┐
    │ 0x4000 ──┼──┐              │ 0x4000 ──┼──┐
    └──────────┘  │              └──────────┘  │
                  ▼ page table A              ▼ page table B
              ┌──────────┐               ┌──────────┐
              │ phys:    │               │ phys:    │
              │ 0x1A000  │               │ 0x3F000  │
              └──────────┘               └──────────┘

    Same virtual address → different physical memory.
    Page tables in kernel memory — Ring 3 can't touch them.

The CPU enforces this on every memory access. Hardware gate, not software check.


System calls: the controlled gateway

Your program can't read files, write to the screen, or allocate memory from the OS directly. It asks the kernel through system calls.

    Your program (Ring 3)         Kernel (Ring 0)
    ─────────────────────         ────────────────
    1. Set up arguments
       rax = syscall number
       rdi, rsi, rdx = args

    2. Execute 'syscall'  ──────► 3. CPU switches to Ring 0
                                     Kernel validates args
                                     Kernel does the work

    5. Continue in Ring 3 ◄────── 4. Result in rax
                                     CPU switches back to Ring 3

The syscall instruction saves the return address (rip) in rcx and rflags in r11, switches to Ring 0, and jumps to the kernel's entry point, all as one atomic step. The kernel runs sysret (or iret) to switch back.


Common system calls (x86-64 Linux)

    ┌───────────┬────────┬───────────────────────────────────────────┐
    │  Syscall  │ Number │ What it does                              │
    ├───────────┼────────┼───────────────────────────────────────────┤
    │  read     │   0    │ Read bytes from a file descriptor         │
    │  write    │   1    │ Write bytes to a file descriptor          │
    │  open     │   2    │ Open a file, get a file descriptor        │
    │  close    │   3    │ Close a file descriptor                   │
    │  mmap     │   9    │ Map a file or allocate memory pages       │
    │  mprotect │  10    │ Change memory protection (r/w/x)          │
    │  brk      │  12    │ Expand/shrink the heap                    │
    │  fork     │  57    │ Create a copy of the current process      │
    │  execve   │  59    │ Replace process with a new program        │
    │  exit     │  60    │ Terminate the process                     │
    └───────────┴────────┴───────────────────────────────────────────┘

strace: watching syscalls in real time

C hello world

$ strace ./hello 2>&1 | grep -E "execve|write|exit|mmap|brk"
execve("./hello", ["./hello"], [...])        = 0
brk(NULL)                                    = 0x55a8c3d2f000
mmap(NULL, 2228224, PROT_READ, ...)          = 0x7f3a...
mmap(0x7f3a..., 1540096, PROT_READ|PROT_EXEC, ...) = 0x7f3a...
brk(0x55a8c3d50000)                          = 0x55a8c3d50000
write(1, "Hello from C!\n", 14)              = 14
exit_group(0)                                = ?

Running your one-line program took: execve (load the program), mmap (map libc into memory), brk (grow the heap for stdio's output buffer), write (the actual output), exit_group (terminate).

Rust hello world

$ cat > hello.rs << 'EOF'
fn main() { println!("Hello from Rust!"); }
EOF
$ rustc -o hello_rs hello.rs
$ strace ./hello_rs 2>&1 | grep -E "write|exit"
write(1, "Hello from Rust!\n", 17)           = 17
exit_group(0)                                = ?

Same write syscall. Both languages eventually call write(1, ...) to put text on your screen.

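Fun Fact: A C hello world that bypasses libc and issues raw syscalls can get by with as few as 3-4 syscalls total. A statically linked Rust hello world typically makes several times more, because Rust's runtime installs signal handlers and sets up thread-local storage and panic machinery before main runs. Exact counts depend on the toolchain version; strace -c shows yours.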


What happens when you break the rules

int main() {
    int *p = (int *)0xDEADBEEF;  // an address we don't own
    return *p;                    // try to read it
}

Here's the chain of events:

    1. CPU executes: mov eax, [0xDEADBEEF]
    2. CPU consults page table → page not mapped
    3. CPU generates PAGE FAULT exception (hardware, not software)
    4. CPU switches to Ring 0, jumps to kernel's fault handler
    5. Kernel: address 0xDEADBEEF is not valid for this process
    6. Kernel sends SIGSEGV signal to the process
    7. Default action for SIGSEGV: terminate + core dump
    8. You see: "Segmentation fault (core dumped)"

Verify with strace:

$ strace ./segfault 2>&1 | tail -3
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xdeadbeef} ---
+++ killed by SIGSEGV (core dumped) +++

SEGV_MAPERR = address not mapped. This is a hardware mechanism. The CPU caught it before the read completed.


Other protection violations

    ┌───────────────────┬──────────────────────────────────────────┐
    │  Violation        │ What happens                             │
    ├───────────────────┼──────────────────────────────────────────┤
    │  Read unmapped    │ Page fault → SIGSEGV (SEGV_MAPERR)       │
    │  memory           │                                          │
    │  Write to         │ Page fault → SIGSEGV (SEGV_ACCERR)       │
    │  read-only page   │ e.g., writing to string literals         │
    │  Execute non-     │ Page fault → SIGSEGV (NX bit)            │
    │  executable page  │                                          │
    │  Privileged       │ General protection fault → SIGSEGV       │
    │  instruction      │ Ring 0 instr. from Ring 3                │
    └───────────────────┴──────────────────────────────────────────┘

All follow the same pattern: the CPU detects the violation in hardware and raises an exception, the kernel converts it to a signal, and (unless the process handles that signal) the process dies.

In Rust, most of these are prevented at compile time. But via unsafe or FFI, the hardware mechanism is identical.


The complete picture

    ┌──────────────────────────────────────────────────┐
    │  Your Program (Ring 3)                            │
    │  ┌─────────────┐   ┌──────────────┐             │
    │  │ Your code    │──►│ C lib / Rust │             │
    │  │ (main, etc.) │   │ runtime      │             │
    │  └─────────────┘   └──────┬───────┘             │
    │                    printf / println               │
    │                    malloc / Vec::push             │
    │                           │                      │
    │                    ┌──────▼──────┐               │
    │                    │   syscall   │               │
    ├════════════════════╪════════════════════════════╡
    │  Kernel (Ring 0)   │                             │
    │              ┌─────▼──────┐                      │
    │              │  Syscall   │                      │
    │              │  handler   │                      │
    │              └──────┬─────┘                      │
    │           ┌─────────┼─────────┐                  │
    │           ▼         ▼         ▼                  │
    │      Filesystem  Memory    Network               │
    ├══════════╪═════════╪═════════╪══════════════════╡
    │  HW     Disk      RAM      NIC                   │
    └──────────────────────────────────────────────────┘

Your code calls a library. The library executes syscall. CPU switches to Ring 0. Kernel does the work. CPU switches back. Your program continues.

Every. Single. Time. No shortcuts, no backdoors.


Recap

    ┌──────────────────────────────────────────────────────────┐
    │  Ring 0 / Ring 3 = HARDWARE-enforced privilege levels     │
    │                                                          │
    │  Your program cannot:                                    │
    │  • Access another process's memory (page tables)         │
    │  • Access hardware directly (privileged instructions)    │
    │  • Modify its own page tables (privileged registers)     │
    │                                                          │
    │  syscall = the ONLY door to Ring 0                       │
    │  Segfaults = hardware events: page fault → SIGSEGV       │
    │  strace shows every syscall your program makes           │
    └──────────────────────────────────────────────────────────┘

Task

  1. Run strace -c ./hello (C) and strace -c ./hello_rs (Rust). Compare total syscall counts.
  2. Run strace -e write ./hello to filter for just write calls.
  3. Violations. Compile int main() { return *(int *)0; } and int main() { char *s = "hello"; s[0] = 'H'; return 0; } without optimization (gcc -O0), so the compiler keeps the faulting access as written. Run each under strace. Note SEGV_MAPERR vs. SEGV_ACCERR.
  4. Bonus: Write assembly that executes hlt from user mode. What signal do you get?

Next: we leave the hardware behind and enter the illusion — how your program thinks memory is laid out.