Instructions and the x86-64 ISA
Type this right now
// save as add.c
int add(int a, int b) { return a + b; }
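$ gcc -S -O1 -masm=intel -o add.s add.c
$ cat add.s
You just turned C into assembly. Only two or three lines are actual instructions; the rest is assembler directives and metadata. (-masm=intel selects the Intel syntax used throughout this chapter; GCC defaults to AT&T syntax, which reverses the operand order.) Let's learn to read it.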
What is an ISA?
An Instruction Set Architecture is the set of instructions a CPU understands. x86-64 (also called AMD64) is the ISA on virtually every desktop, laptop, and server from Intel and AMD.
It's not abstract. It's a specific, concrete list of operations encoded as bytes in memory. When you compile C or Rust, the compiler translates your code into these exact instructions.
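To make "encoded as bytes" concrete, here is a minimal C sketch that stores the machine code of a two-instruction add function as a plain byte array. The four byte values are the standard encoding of lea eax, [rdi+rsi] followed by ret; the names add_code and format_hex are made up for illustration, and the bytes are only formatted as text, never executed.

```c
#include <stdio.h>
#include <stddef.h>

/* Machine code for:  lea eax, [rdi+rsi]  then  ret
 * 0x8D = lea opcode
 * 0x04 = ModRM byte (destination eax, a SIB byte follows)
 * 0x37 = SIB byte (base rdi, index rsi, scale 1)
 * 0xC3 = ret */
static const unsigned char add_code[] = { 0x8D, 0x04, 0x37, 0xC3 };

/* Hypothetical helper: write the bytes as space-separated hex into out.
 * Returns the number of characters written, not counting the terminator. */
static size_t format_hex(const unsigned char *code, size_t len,
                         char *out, size_t cap) {
    size_t used = 0;
    if (cap == 0) return 0;
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        int n = snprintf(out + used, cap - used, i ? " %02x" : "%02x", code[i]);
        if (n < 0 || (size_t)n >= cap - used) break;  /* out is full: stop */
        used += (size_t)n;
    }
    return used;
}
```

Formatting add_code this way gives the same four bytes a disassembler such as objdump -d would list next to the two instructions.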
The essential instructions
You don't need hundreds. These 16 cover 90% of compiler output:
┌──────────┬──────────────────────────────────────────────────────┐
│ Instr.   │ What it does                                         │
├──────────┼──────────────────────────────────────────────────────┤
│ mov a, b │ Copy value from b into a                             │
│ add a, b │ a = a + b                                            │
│ sub a, b │ a = a - b                                            │
│ lea a, b │ Load Effective Address: a = address of b             │
│          │ (also used for quick math: lea rax, [rbx+rcx*4])     │
│ push val │ Subtract 8 from rsp, store val at [rsp]              │
│ pop reg  │ Load [rsp] into reg, add 8 to rsp                    │
│ call lbl │ Push rip (return address), jump to lbl               │
│ ret      │ Pop address from stack, jump to it                   │
│ cmp a, b │ Compute a - b, set flags (don't store result)        │
│ jmp lbl  │ Jump unconditionally (set rip = lbl)                 │
│ je lbl   │ Jump if equal (zero flag set)                        │
│ jne lbl  │ Jump if not equal                                    │
│ jl lbl   │ Jump if less than (signed)                           │
│ jg lbl   │ Jump if greater than (signed)                        │
│ syscall  │ Transfer control to the kernel                       │
│ nop      │ Do nothing (used for alignment)                      │
└──────────┴──────────────────────────────────────────────────────┘
How a + b becomes assembly
int add(int a, int b) { return a + b; }
pub fn add(a: i32, b: i32) -> i32 { a + b }
Both produce (at -O1 / -C opt-level=1):
add:
lea eax, [rdi+rsi]
ret
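Two instructions. The System V AMD64 calling convention puts the first integer argument in rdi and the second in rsi (for 32-bit ints the values sit in the low halves, edi and esi). lea eax, [rdi+rsi] computes the sum in one instruction, and the return value goes in eax.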
Fun Fact:
lea stands for "Load Effective Address." It was designed for address math, but compilers use it as a fast calculator: an addition plus a multiplication by 1, 2, 4, or 8 in one instruction, without touching the flags register.
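A small sketch of the kind of arithmetic that fits lea's base + index*scale + displacement form. The function names are hypothetical. GCC and Clang will typically compile times5 to a single lea eax, [rdi+rdi*4] at -O1, but the exact output depends on the compiler and version, so treat that as an expectation to check rather than a guarantee.

```c
/* x*5 rewritten as x + x*4: matches [base + index*scale],
 * where the scale must be 1, 2, 4, or 8. */
int times5(int x) { return x + x * 4; }

/* a + b*8 + 16: base + index*scale + displacement,
 * the full shape of an x86-64 address expression. */
long scaled_sum(long a, long b) { return a + b * 8 + 16; }
```

Compile with gcc -S -O1 -masm=intel and look for lea in place of separate imul and add instructions.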
How if/else becomes assembly
int max(int a, int b) { return (a > b) ? a : b; }
pub fn max(a: i32, b: i32) -> i32 { if a > b { a } else { b } }
At -O1:
max:
cmp edi, esi ; compare a and b
mov eax, esi ; assume b is the answer
cmovg eax, edi ; IF a > b, overwrite with a
ret
What do you think happens? The compiler didn't use jmp or je. Why not?
Reveal: Branches cause pipeline stalls when mispredicted. cmovg (conditional move) avoids the branch entirely: the CPU always executes it but writes the result only when the condition holds.
At -O0 you'd see the explicit version: cmp sets flags, jle conditionally jumps to the else branch. That's how every if works at the hardware level.
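The idea behind a conditional move can be imitated in plain C. The sketch below is an illustration under my own naming (max_branchless and max_plain are hypothetical), not what the compiler literally emits: it turns the comparison into an all-ones or all-zeros mask and selects a value with bitwise operations, so there is no jump to mispredict.

```c
/* Branchless select: the comparison yields 1 or 0, negating it gives
 * -1 (all bits set) or 0, and the mask picks a or b. */
int max_branchless(int a, int b) {
    int mask = -(a > b);               /* -1 if a > b, else 0 */
    return (a & mask) | (b & ~mask);
}

/* The ordinary form. Optimizing compilers usually turn this
 * into cmp + cmov on their own. */
int max_plain(int a, int b) { return (a > b) ? a : b; }
```

In practice max_plain is the version to write; the mask trick only shows what "always execute, conditionally keep the result" means.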
How function calls work
call does two things in one instruction: push the return address, then jump.
Before CALL:                  After CALL:
rip = 0x401020                rip = 0x401200 (function start)
rsp = 0x7fff10                rsp = 0x7fff08 (moved down 8)
Stack:                        Stack:
┌──────────┐ 0x7fff10         ┌──────────┐ 0x7fff10
│ ...      │                  │ ...      │
└──────────┘                  ├──────────┤ 0x7fff08 ◄── rsp
                              │ 0x401025 │ return address
                              └──────────┘
ret does the reverse: pop the address, set rip to it.
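The push-then-jump behavior can be modeled in a few lines of C. This is a toy under stated assumptions: ToyCpu, do_call, and do_ret are invented names, the stack is a 64-byte array, and rsp is an offset into that array instead of a real address. It mirrors the numbers used above: a 5-byte call at 0x401020 leaves 0x401025 as the return address.

```c
#include <stdint.h>
#include <string.h>

/* Toy machine state: the two registers call/ret touch, plus a tiny stack. */
typedef struct {
    uint64_t rip;
    uint64_t rsp;            /* offset into stack[], grows downward */
    uint8_t  stack[64];
} ToyCpu;

static void push64(ToyCpu *cpu, uint64_t val) {
    cpu->rsp -= 8;                                /* move down 8 bytes first */
    memcpy(&cpu->stack[cpu->rsp], &val, 8);       /* then store at [rsp] */
}

static uint64_t pop64(ToyCpu *cpu) {
    uint64_t val;
    memcpy(&val, &cpu->stack[cpu->rsp], 8);       /* load from [rsp] */
    cpu->rsp += 8;                                /* then move back up */
    return val;
}

/* call: push the address of the NEXT instruction, then jump. */
static void do_call(ToyCpu *cpu, uint64_t target, uint64_t insn_len) {
    push64(cpu, cpu->rip + insn_len);
    cpu->rip = target;
}

/* ret: pop the saved address and jump to it. */
static void do_ret(ToyCpu *cpu) {
    cpu->rip = pop64(cpu);
}
```

Starting from rip = 0x401020 and rsp = 64, do_call(&cpu, 0x401200, 5) leaves rsp at 56, and do_ret restores rip to 0x401025 and rsp to 64.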
The stack in assembly: prologue and epilogue
my_function:
push rbp ; save caller's base pointer
mov rbp, rsp ; set up our own base pointer
sub rsp, 32 ; room for local variables
; ... function body: locals at [rbp-4], [rbp-8], etc.
mov rsp, rbp ; deallocate locals
pop rbp ; restore caller's base pointer
ret
Higher addresses
┌─────────────────────┐
│ caller's frame      │
├─────────────────────┤
│ return address      │  ◄── pushed by 'call'
├─────────────────────┤
│ saved rbp           │  ◄── rbp points here
├─────────────────────┤
│ local var 1         │  [rbp-4]
│ local var 2         │  [rbp-8]
├─────────────────────┤
│                     │  ◄── rsp (top of stack)
└─────────────────────┘
Lower addresses
With optimization, the compiler often skips this entirely (frame pointer omission) — uses rsp directly, making code faster but debugging harder.
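To see a prologue and epilogue in your own output, it helps to have a function whose locals must live in memory. The sketch below is one way to get that, assuming GCC or Clang on x86-64; the name my_function is borrowed from the listing above, and the exact offsets the compiler picks may differ from [rbp-4] and [rbp-8].

```c
/* volatile forces each local into a real stack slot, so the locals
 * stay visible in the assembly even when optimization is on. */
int my_function(int x) {
    volatile int first = x;        /* expect a slot such as [rbp-4] at -O0 */
    volatile int second = x * 2;   /* expect a neighbouring slot */
    return first + second;
}
```

Compile it once with -O0 and once with -O2 (adding -masm=intel for Intel syntax) and compare: the -O0 version should show push rbp / mov rbp, rsp, while the -O2 version usually addresses the slots relative to rsp.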
System calls: the ONLY door to the kernel
Your program (Ring 3) cannot access files, screens, or networks directly. The syscall instruction is the only way to ask the kernel.
Here's write(1, "hello\n", 6) in assembly:
mov rax, 1 ; syscall number: write
mov rdi, 1 ; fd: stdout
lea rsi, [msg] ; pointer to string
mov rdx, 6 ; byte count
syscall ; >>> kernel handles it, result in rax <<<
Convention: syscall number in rax, args in rdi, rsi, rdx, r10, r8, r9. The result comes back in rax, and the syscall instruction itself overwrites rcx and r11.
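The same call can be made from C without going through write() itself, using the generic syscall() wrapper that glibc provides on Linux. This is a Linux-only sketch; raw_write is a hypothetical name. The wrapper loads the number and arguments into the registers listed above and executes the syscall instruction for you.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* makes sure unistd.h declares syscall() */
#endif
#include <unistd.h>
#include <sys/syscall.h>

/* Invoke the write system call by number. SYS_write is 1 on x86-64 Linux.
 * Returns the byte count written, or -1 with errno set on failure. */
long raw_write(int fd, const void *buf, unsigned long len) {
    return syscall(SYS_write, fd, buf, len);
}
```

Calling raw_write(1, "hello\n", 6) prints hello and returns 6, the same value the kernel leaves in rax.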
A complete "Hello, World" in x86-64 assembly
No C library. No runtime. Just raw syscalls.
# save as hello.s
# build: as -o hello.o hello.s && ld -o hello hello.o
.intel_syntax noprefix       # GNU as defaults to AT&T syntax
.section .text
.global _start
_start:
    mov rax, 1               # syscall: write
    mov rdi, 1               # fd: stdout
    lea rsi, [rip + msg]     # buffer (RIP-relative address of msg)
    mov rdx, 14              # count
    syscall
    mov rax, 60              # syscall: exit
    mov rdi, 0               # status: 0
    syscall
.section .rodata
msg: .ascii "Hello, World!\n"
$ as -o hello.o hello.s && ld -o hello hello.o && ./hello
Hello, World!
Just over a dozen lines. Two syscalls. A complete program at the lowest level above raw machine-code bytes.
Godbolt: the compiler explorer
godbolt.org: paste C or Rust on the left, see assembly on the right, live. Try int square(int x) { return x * x; } in C and pub fn square(x: i32) -> i32 { x * x } in Rust. With optimization on, both come out as the same three instructions: a mov into eax, an imul, and a ret (the order of the first two varies by compiler and version). Same CPU language, different programmer languages.
Fun Fact: Godbolt color-codes which assembly lines map to which source lines.
Reading compiler output: a walkthrough
long sum_array(int *arr, int len) {
long total = 0;
for (int i = 0; i < len; i++) total += arr[i];
return total;
}
At -O1:
sum_array:
mov eax, 0 ; total = 0
mov ecx, 0 ; i = 0
jmp .L2
.L3:
movsx rdx, DWORD PTR [rdi+rcx*4] ; rdx = arr[i]
add rax, rdx ; total += arr[i]
add ecx, 1 ; i++
.L2:
cmp ecx, esi ; i < len?
jl .L3 ; if yes, loop body
ret
rdi = arr (1st arg) esi = len (2nd arg)
rax = total (return) ecx = i (counter)
[rdi+rcx*4] = arr + i*4 = arr[i] (CPU does array indexing in hardware)
The *4 in [rdi+rcx*4] computes the byte offset — the CPU has built-in support for scaled-index addressing.
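The address arithmetic that [rdi+rcx*4] performs can be written out by hand in C. The sketch below is illustrative (element_at is a hypothetical helper): it computes base + i*4 in bytes and reads the int there, which is exactly what arr[i] means, and then widens each element to long the way movsx does in the listing.

```c
#include <stddef.h>

/* arr[i] spelled out as byte arithmetic: base address + index * 4. */
int element_at(const int *arr, size_t i) {
    const char *base = (const char *)arr;              /* count in bytes */
    return *(const int *)(base + i * sizeof(int));     /* same as arr[i] */
}

long sum_array(const int *arr, int len) {
    long total = 0;
    for (int i = 0; i < len; i++)
        total += element_at(arr, (size_t)i);  /* int widened to long, sign kept */
    return total;
}
```

With {1, -2, 3, 40} the sum is 42; the negative element matters because it shows why the load sign-extends to 64 bits before the add.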
Recap
┌──────────────────────────────────────────────────────────┐
│  x86-64 is the concrete ISA on your CPU.                 │
│                                                          │
│  ~16 instructions cover 90% of compiler output:          │
│    mov, add, sub, lea, push, pop, call, ret,             │
│    cmp, jmp, je/jne/jl/jg, syscall, nop                  │
│                                                          │
│  if/else → cmp + conditional jump (or cmov)              │
│  function call → call (push rip, jump) / ret (pop rip)   │
│  syscall → the ONLY way to talk to the kernel            │
│                                                          │
│  For simple functions, C and Rust often compile to the   │
│  same assembly at the same optimization level.           │
└──────────────────────────────────────────────────────────┘
Task
- Write a simple C function (try int factorial(int n) with a loop).
- Compile with gcc -S -O0 -o fact_O0.s fact.c and gcc -S -O2 -o fact_O2.s fact.c (add -masm=intel to get the Intel syntax used in this chapter).
- In the -O0 version, identify: the prologue, where locals are stored, the cmp + jump for the loop, and the epilogue.
- In -O2, see how much the compiler eliminated. Can you still match assembly to source?
- Bonus: Paste the function into Godbolt with GCC at -O2, then a Rust port of it with rustc at -C opt-level=2. Compare.
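If you want a starting point for fact.c, here is one possible version. It is only a sketch for the exercise; any loop-based factorial will do, and int overflows beyond 12!, so keep the inputs small.

```c
/* Iterative factorial. The loop gives the -O0 output a clear
 * cmp + conditional jump to find. Valid for 0 <= n <= 12 with 32-bit int. */
int factorial(int n) {
    int result = 1;
    for (int i = 2; i <= n; i++)
        result *= i;
    return result;
}
```

At -O0, expect result and i to live in stack slots; at -O2 they will most likely stay in registers for the whole loop.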
Next: the wall between your code and the kernel — and why the CPU enforces it with hardware.