Instructions and the x86-64 ISA

Type this right now

// save as add.c
int add(int a, int b) { return a + b; }
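$ gcc -S -O1 -masm=intel -o add.s add.c
$ cat add.s

You just turned C into assembly. Only a handful of lines are actual instructions; everything else (the lines starting with a dot) is metadata for the assembler and linker. The -masm=intel flag selects Intel syntax, which the listings here use; without it, GCC prints AT&T syntax. Let's learn to read it.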


What is an ISA?

An Instruction Set Architecture is the set of instructions a CPU understands. x86-64 (also called AMD64) is the ISA on virtually every desktop, laptop, and server from Intel and AMD.

It's not abstract. It's a specific, concrete list of operations encoded as bytes in memory. When you compile C or Rust, the compiler translates your code into these exact instructions.


The essential instructions

You don't need hundreds. These 16 cover 90% of compiler output:

    ┌──────────┬──────────────────────────────────────────────────────┐
    │ Instr.   │ What it does                                        │
    ├──────────┼──────────────────────────────────────────────────────┤
    │ mov a, b │ Copy value from b into a                            │
    │ add a, b │ a = a + b                                           │
    │ sub a, b │ a = a - b                                           │
    │ lea a, b │ Load Effective Address: a = address of b            │
    │          │ (also used for quick math: lea rax, [rbx+rcx*4])    │
    │ push val │ Subtract 8 from rsp, store val at [rsp]             │
    │ pop  reg │ Load [rsp] into reg, add 8 to rsp                  │
    │ call lbl │ Push rip (return address), jump to lbl              │
    │ ret      │ Pop address from stack, jump to it                  │
    │ cmp a, b │ Compute a - b, set flags (don't store result)      │
    │ jmp lbl  │ Jump unconditionally (set rip = lbl)               │
    │ je  lbl  │ Jump if equal (zero flag set)                      │
    │ jne lbl  │ Jump if not equal                                   │
    │ jl  lbl  │ Jump if less than (signed)                          │
    │ jg  lbl  │ Jump if greater than (signed)                       │
    │ syscall  │ Transfer control to the kernel                      │
    │ nop      │ Do nothing (used for alignment)                     │
    └──────────┴──────────────────────────────────────────────────────┘

How a + b becomes assembly

int add(int a, int b) { return a + b; }
pub fn add(a: i32, b: i32) -> i32 { a + b }

Both produce (at -O1 / -C opt-level=1):

add:
    lea    eax, [rdi+rsi]
    ret

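Two instructions. The System V AMD64 calling convention puts the first integer argument in rdi and the second in rsi (a 32-bit int sits in the low half, edi or esi). lea eax, [rdi+rsi] computes the sum in one instruction, and writing to eax keeps only the low 32 bits. The return value goes in eax.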

Fun Fact: lea stands for "Load Effective Address." It was designed for address math, but compilers use it as a fast calculator: it can add two registers, scale one of them by 2, 4, or 8, and add a constant, all in one instruction and without touching the flags register.
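To make the "fast calculator" idea concrete, here is a minimal C sketch of the arithmetic a single lea can express: base + index*scale + displacement. The names lea_model and times5 are made up for illustration, and whether a compiler really picks lea for a given expression depends on the compiler and its flags.

```c
#include <stdint.h>

/* Hypothetical model of the arithmetic behind
 *   lea dst, [base + index*scale + disp]
 * scale must be 1, 2, 4, or 8. Nothing is read from memory. */
uint64_t lea_model(uint64_t base, uint64_t index,
                   unsigned scale, int32_t disp)
{
    return base + index * scale + (uint64_t)(int64_t)disp;
}

/* x * 5 written as x + x*4, the shape that fits one lea
 * (for example lea eax, [rdi+rdi*4]). Unsigned math is used
 * so wraparound is well defined in C. */
uint32_t times5(uint32_t x)
{
    return (uint32_t)lea_model(x, x, 4, 0);
}
```

Calling times5(7) returns 35, computed with one add and one scaled index, which is the pattern to look for when lea shows up in code that has nothing to do with pointers.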


How if/else becomes assembly

int max(int a, int b) { return (a > b) ? a : b; }
pub fn max(a: i32, b: i32) -> i32 { if a > b { a } else { b } }

At -O1:

max:
    cmp     edi, esi       ; compare a and b
    mov     eax, esi       ; assume b is the answer
    cmovg   eax, edi       ; IF a > b, overwrite with a
    ret

What do you think happens? The compiler didn't use jmp or je. Why not?

Reveal: Branches cause pipeline stalls when mispredicted. cmovg (conditional move) avoids the branch entirely — the CPU always executes it but conditionally writes the result.

At -O0 you'd see the explicit version: cmp sets flags, jle conditionally jumps to the else branch. That cmp-plus-conditional-jump pattern is the baseline form of every if at the hardware level; cmov is an optimization on top of it.
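The phrase "cmp sets flags" can be spelled out in C. The sketch below models the three flags that je, jl, and jg consult, then builds max the way the cmov listing does. All names (struct flags, cmp32, cond_g, max_model) are hypothetical, and real hardware sets more flags than these three.

```c
#include <stdbool.h>
#include <stdint.h>

/* The three flags consulted by je/jne/jl/jg. */
struct flags {
    bool zf; /* zero flag: result was 0 */
    bool sf; /* sign flag: top bit of the result */
    bool of; /* overflow flag: signed result did not fit */
};

/* Model of `cmp a, b`: compute a - b and keep only the flags. */
struct flags cmp32(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t r = ua - ub;               /* wraps like the hardware */
    struct flags f;
    f.zf = (r == 0);
    f.sf = (r >> 31) != 0;
    /* Signed overflow: operands had different signs and the
     * result's sign differs from a's. */
    f.of = (((ua ^ ub) & (ua ^ r)) >> 31) != 0;
    return f;
}

/* Conditions tested by the conditional jumps (and by cmovg). */
bool cond_e(struct flags f) { return f.zf; }                   /* je */
bool cond_l(struct flags f) { return f.sf != f.of; }           /* jl */
bool cond_g(struct flags f) { return !f.zf && f.sf == f.of; }  /* jg */

/* max() in the shape of the cmov listing: assume b,
 * then overwrite with a if "greater" holds. */
int32_t max_model(int32_t a, int32_t b)
{
    struct flags f = cmp32(a, b);
    int32_t result = b;          /* mov   eax, esi */
    if (cond_g(f))
        result = a;              /* cmovg eax, edi */
    return result;
}
```

The overflow flag is why jl is "SF differs from OF" rather than simply "result is negative": comparing INT32_MIN with 1 wraps to a positive result, and the overflow flag corrects for that.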


How function calls work

call does two things in one instruction: push the return address, then jump.

    Before CALL:              After CALL:
    rip = 0x401020            rip = 0x401200 (function start)
    rsp = 0x7fff10            rsp = 0x7fff08 (moved down 8)

    Stack:                    Stack:
    ┌──────────┐ 0x7fff10    ┌──────────┐ 0x7fff10
    │  ...     │             │  ...     │
    └──────────┘             ├──────────┤ 0x7fff08 ◄── rsp
                             │ 0x401025 │  return address
                             └──────────┘

ret does the reverse: pop the address, set rip to it.
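The diagram can be replayed in a few lines of C. This is a toy model under stated assumptions, not how a CPU is built: struct cpu, do_call, and do_ret are invented names, the stack is a small array, and the call instruction is assumed to be 5 bytes long (the common call rel32 encoding), which is why the return address is 0x401025.

```c
#include <stdint.h>

/* Hypothetical toy machine: rip, rsp, and a small stack.
 * Slots are 8 bytes and the stack grows toward lower addresses. */
#define STACK_TOP   0x7fff10u
#define STACK_SLOTS 16

struct cpu {
    uint64_t rip;
    uint64_t rsp;
    uint64_t stack[STACK_SLOTS]; /* stack[0] is the slot just below STACK_TOP */
};

/* Map a stack address (below STACK_TOP) to its array slot. */
static uint64_t *slot(struct cpu *c, uint64_t addr)
{
    return &c->stack[(STACK_TOP - addr) / 8 - 1];
}

void push64(struct cpu *c, uint64_t value)
{
    c->rsp -= 8;                 /* move down first */
    *slot(c, c->rsp) = value;    /* then store at [rsp] */
}

uint64_t pop64(struct cpu *c)
{
    uint64_t value = *slot(c, c->rsp);
    c->rsp += 8;
    return value;
}

/* call: push the address of the NEXT instruction, then jump.
 * insn_len is the size of the call instruction itself. */
void do_call(struct cpu *c, uint64_t target, uint64_t insn_len)
{
    push64(c, c->rip + insn_len);
    c->rip = target;
}

/* ret: pop the return address into rip. */
void do_ret(struct cpu *c)
{
    c->rip = pop64(c);
}
```

Starting from rip = 0x401020 and rsp = 0x7fff10, do_call(&c, 0x401200, 5) leaves rsp at 0x7fff08 with 0x401025 stored there, and do_ret puts both registers back where execution resumes after the call.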


The stack in assembly: prologue and epilogue

my_function:
    push   rbp            ; save caller's base pointer
    mov    rbp, rsp       ; set up our own base pointer
    sub    rsp, 32        ; room for local variables
    ; ... function body: locals at [rbp-4], [rbp-8], etc.
    mov    rsp, rbp       ; deallocate locals
    pop    rbp            ; restore caller's base pointer
    ret

    Higher addresses
    ┌─────────────────────┐
    │  caller's frame     │
    ├─────────────────────┤
    │  return address     │ ◄── pushed by 'call'
    ├─────────────────────┤
    │  saved rbp          │ ◄── rbp points here
    ├─────────────────────┤
    │  local var 1        │ [rbp-4]
    │  local var 2        │ [rbp-8]
    ├─────────────────────┤
    │                     │ ◄── rsp (top of stack)
    └─────────────────────┘
    Lower addresses

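With optimization, the compiler often skips the frame pointer entirely (frame pointer omission) and addresses locals relative to rsp. That frees rbp as a general-purpose register and saves a few instructions per call, at the cost of harder debugging and stack unwinding.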
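The prologue and epilogue are pure register bookkeeping, which a short C model can make visible. This is a sketch with invented names (struct regs, stack_mem, prologue, epilogue) and a tiny fake address space; it only tracks rsp and rbp.

```c
#include <stdint.h>

/* Hypothetical register pair and stack memory for the sketch. */
struct regs { uint64_t rsp, rbp; };

enum { STACK_WORDS = 64 };
uint64_t stack_mem[STACK_WORDS];   /* word index = address / 8 */

void push(struct regs *r, uint64_t v)
{
    r->rsp -= 8;
    stack_mem[r->rsp / 8] = v;
}

uint64_t pop(struct regs *r)
{
    uint64_t v = stack_mem[r->rsp / 8];
    r->rsp += 8;
    return v;
}

/* push rbp; mov rbp, rsp; sub rsp, local_bytes */
void prologue(struct regs *r, uint64_t local_bytes)
{
    push(r, r->rbp);         /* save caller's base pointer */
    r->rbp = r->rsp;         /* our frame starts here */
    r->rsp -= local_bytes;   /* room for locals below rbp */
}

/* mov rsp, rbp; pop rbp */
void epilogue(struct regs *r)
{
    r->rsp = r->rbp;         /* drop the locals */
    r->rbp = pop(r);         /* restore caller's base pointer */
}
```

Whatever values rsp and rbp hold before prologue, they hold the same values after epilogue. That symmetry is what lets ret find the return address exactly where call left it.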


System calls: the ONLY door to the kernel

Your program (Ring 3) cannot access files, screens, or networks directly. On x86-64 Linux, the syscall instruction is how it asks the kernel to do that work on its behalf.

Here's write(1, "hello\n", 6) in assembly:

mov    rax, 1          ; syscall number: write
mov    rdi, 1          ; fd: stdout
lea    rsi, [msg]      ; pointer to string
mov    rdx, 6          ; byte count
syscall                ; >>> kernel handles it, result in rax <<<

Convention (Linux x86-64): syscall number in rax, args in rdi, rsi, rdx, r10, r8, r9. The result comes back in rax, and the instruction itself overwrites rcx and r11.
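The same request can be made from C without the usual write() wrapper. This sketch assumes Linux with glibc, where syscall() from <unistd.h> loads the registers and executes the instruction for you; write_hello is a made-up name. SYS_write expands to 1 on x86-64 Linux.

```c
#define _GNU_SOURCE        /* exposes syscall() in <unistd.h> on glibc */
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall() */

/* Issue write(1, "hello\n", 6) through the generic syscall()
 * entry point. Returns the number of bytes written, or -1 on
 * error (with errno set by glibc). */
long write_hello(void)
{
    static const char msg[] = "hello\n";
    return syscall(SYS_write, 1, msg, sizeof msg - 1);
}
```

On success the return value is the byte count, 6, which is what the kernel leaves in rax after the syscall instruction returns.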


A complete "Hello, World" in x86-64 assembly

No C library. No runtime. Just raw syscalls.

# save as hello.s
# build: as -o hello.o hello.s && ld -o hello hello.o
        .intel_syntax noprefix  # GNU as defaults to AT&T syntax
        .section .text
        .global _start
_start:
        mov     rax, 1          # syscall: write
        mov     rdi, 1          # fd: stdout
        lea     rsi, [msg]      # buffer
        mov     rdx, 14         # count
        syscall
        mov     rax, 60         # syscall: exit
        mov     rdi, 0          # status: 0
        syscall

        .section .rodata
msg:    .ascii "Hello, World!\n"

$ as -o hello.o hello.s && ld -o hello hello.o && ./hello
Hello, World!

Fourteen lines. Two syscalls. A complete program at the lowest level before raw bytes.


Godbolt: the compiler explorer

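godbolt.org: paste C or Rust on the left, see assembly on the right, live. Try int square(int x) { return x * x; } in C and pub fn square(x: i32) -> i32 { x * x } in Rust, both with optimization on (-O2 for C, -C opt-level=2 for Rust). Both compile to a single multiply and a return, along the lines of mov eax, edi; imul eax, edi; ret (the order of the mov and the imul can differ between compilers). Same CPU language, different programmer languages.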

Fun Fact: Godbolt color-codes which assembly lines map to which source lines.


Reading compiler output: a walkthrough

long sum_array(int *arr, int len) {
    long total = 0;
    for (int i = 0; i < len; i++) total += arr[i];
    return total;
}

At -O1 (tidied up for readability; real output varies by compiler and version):

sum_array:
        mov     eax, 0          ; total = 0
        mov     ecx, 0          ; i = 0
        jmp     .L2
.L3:
        movsx   rdx, DWORD PTR [rdi+rcx*4]   ; rdx = arr[i]
        add     rax, rdx        ; total += arr[i]
        add     ecx, 1          ; i++
.L2:
        cmp     ecx, esi        ; i < len?
        jl      .L3             ; if yes, loop body
        ret

    rdi = arr (1st arg)     esi = len (2nd arg)
    rax = total (return)    ecx = i (counter)
    [rdi+rcx*4] = arr + i*4 = arr[i]  (CPU does array indexing in hardware)

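The *4 in [rdi+rcx*4] is sizeof(int): it turns the index i into a byte offset, and the CPU's scaled-index addressing mode does that multiply as part of the memory access. movsx then sign-extends the 32-bit int to 64 bits so it can be added to the long total.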
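Written out in C, the address arithmetic looks like this. The helper element_at is hypothetical and exists only to expose the byte offset that arr[i] normally hides; it assumes nothing beyond standard C, though the 4 in the listing corresponds to sizeof(int) being 4 on x86-64.

```c
#include <stddef.h>

/* arr[i] spelled out as the arithmetic in [rdi+rcx*4]:
 * base address plus i times sizeof(int). */
int element_at(const int *arr, size_t i)
{
    const char *base = (const char *)arr;        /* rdi */
    const char *addr = base + i * sizeof(int);   /* rdi + rcx*4 */
    return *(const int *)addr;                   /* DWORD PTR [...] */
}

/* Same loop as the walkthrough, using the helper. */
long sum_array(const int *arr, int len)
{
    long total = 0;
    for (int i = 0; i < len; i++)
        total += element_at(arr, (size_t)i);  /* int widened to long, like movsx */
    return total;
}
```

For int a[] = {1, -2, 3, 40}, element_at(a, 2) reads the same 4 bytes as a[2], and sum_array(a, 4) returns 42.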


Recap

    ┌──────────────────────────────────────────────────────────┐
    │  x86-64 is the concrete ISA on your CPU.                 │
    │                                                          │
    │  ~16 instructions cover 90% of compiler output:          │
    │  mov, add, sub, lea, push, pop, call, ret,              │
    │  cmp, jmp, je/jne/jl/jg, syscall, nop                   │
    │                                                          │
    │  if/else → cmp + conditional jump (or cmov)              │
    │  function call → call (push rip, jump) / ret (pop rip)   │
    │  syscall → the ONLY way to talk to the kernel            │
    │                                                          │
    │  For simple functions like these, C and Rust compile to  │
    │  near-identical assembly at the same optimization level. │
    └──────────────────────────────────────────────────────────┘

Task

  1. Write a simple C function (try int factorial(int n) with a loop).
  2. Compile with gcc -S -O0 -o fact_O0.s fact.c and gcc -S -O2 -o fact_O2.s fact.c.
  3. In the -O0 version, identify: the prologue, where locals are stored, the cmp+jump for the loop, and the epilogue.
  4. In -O2, see how much the compiler eliminated. Can you still match assembly to source?
  5. Bonus: Paste the same function in Godbolt with both GCC and rustc at -O2. Compare.
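If you want a starting point for step 1, a loop-based factorial might look like the sketch below. It is one possible version, not the only one; it does no overflow checking, and with a 32-bit int the result is only correct up to n = 12.

```c
/* fact.c: iterative factorial, a starting point for the task.
 * Returns 1 for n <= 1. Overflows a 32-bit int for n > 12. */
int factorial(int n)
{
    int result = 1;
    for (int i = 2; i <= n; i++)
        result *= i;
    return result;
}
```

The loop gives you a cmp and a conditional jump to find at -O0, and the locals result and i give you stack slots to look for relative to rbp.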

Next: the wall between your code and the kernel — and why the CPU enforces it with hardware.