Compilation and Linking: Source to Binary

Type this right now

cat > greet.c << 'EOF'
#include <stdio.h>
void greet(const char *name) { printf("Hello, %s!\n", name); }
EOF

cat > main.c << 'EOF'
void greet(const char *name);
int main() { greet("world"); return 0; }
EOF

gcc -c greet.c -o greet.o
gcc -c main.c  -o main.o
gcc greet.o main.o -o hello

echo "=== main.o symbols ===" && nm main.o
echo "=== After linking ==="  && nm hello | grep -E 'main|greet'
=== main.o symbols ===
                 U greet
0000000000000000 T main
=== After linking ===
0000000000001149 T greet
0000000000001172 T main

main.o has an undefined symbol greet (the U). After linking, greet has a real address. That is the linker's core job: resolving references between files.


The C compilation pipeline

  hello.c          Your source code
     │
     │  gcc -E      Preprocessor (#include, #define, #ifdef)
     v
  hello.i          Pure C, all macros expanded
     │
     │  gcc -S      Compiler (cc1) — C to assembly
     v
  hello.s          Human-readable assembly
     │
     │  gcc -c      Assembler (as) — assembly to machine code
     v
  hello.o          Object file (relocatable ELF)
     │
     │  gcc          Linker (ld) — resolves symbols, assigns addresses
     v
  a.out            Executable (final ELF)

You can stop at any stage. Let's see each output.

Preprocessing (gcc -E): A hello.c with #include <stdio.h> expands to roughly 700 lines (the exact count depends on your libc version). Every header is pasted in and every macro is expanded. The output is pure C.
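To watch macro expansion on something smaller than stdio.h, a toy file like the one below can be run through gcc -E. This is only a sketch: the file name macros.c and the names SQUARE, SIDE_LIMIT, CLAMP, CLAMP_DISABLED, and clamped_area are invented for illustration and are not part of hello.c.

```c
/* macros.c: a hypothetical file for experimenting with `gcc -E macros.c`. */
#define SQUARE(x) ((x) * (x))
#define SIDE_LIMIT 100

/* Conditional compilation: pass -DCLAMP_DISABLED to pick the first branch. */
#ifdef CLAMP_DISABLED
#define CLAMP(v) (v)
#else
#define CLAMP(v) ((v) > SIDE_LIMIT ? SIDE_LIMIT : (v))
#endif

/* After preprocessing, this body contains no macro names,
   only the expressions they expand to. */
int clamped_area(int side) {
    int s = CLAMP(side);
    return SQUARE(s);
}
```

Running gcc -E macros.c prints the function with every macro replaced by its expansion. Adding -DCLAMP_DISABLED switches which #ifdef branch survives.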

Compilation (gcc -S):

main:
    endbr64
    pushq   %rbp
    movq    %rsp, %rbp
    leaq    .LC0(%rip), %rdi
    call    puts@PLT
    movl    $0, %eax
    popq    %rbp
    ret
.LC0:
    .string "Hello, ELF!"

Notice puts@PLT — the compiler doesn't know where puts lives. It emits a reference the linker will resolve.

Assembly (gcc -c):

file hello.o
hello.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped

The type is relocatable — machine code with provisional addresses starting at zero.

Linking (gcc):

file hello
hello: ELF 64-bit LSB pie executable, x86-64, dynamically linked,
       interpreter /lib64/ld-linux-x86-64.so.2, ...

The linker resolves all symbols, assigns final addresses, produces the executable.


The Rust pipeline

  hello.rs         Your source code
     │
     │  rustc        Parser → AST → HIR (desugared, type-checked)
     v                            → MIR (borrow-checked, monomorphized)
  LLVM IR          LLVM's intermediate representation
     │
     │  LLVM        Machine code generation
     v
  hello.o          Object file (same format as C!)
     │
     │  linker      Same system linker (ld / lld)
     v
  hello            Executable (final ELF)

After LLVM, Rust and C produce the same kind of object file. They use the same linker.

💡 Fun Fact: You can mix C and Rust in one binary. Compile C to .o, Rust to .o, link them together. The linker doesn't care what language produced the object file — it only sees symbols, sections, and relocations.


Object files: code with holes

objdump -d -M intel main.o
0000000000000000 <main>:
   0:   f3 0f 1e fa             endbr64
   4:   55                      push   rbp
   5:   48 89 e5                mov    rbp,rsp
   8:   48 8d 05 00 00 00 00    lea    rax,[rip+0x0]    # address TBD
   f:   48 89 c7                mov    rdi,rax
  12:   e8 00 00 00 00          call   17 <main+0x17>   # address TBD
  17:   b8 00 00 00 00          mov    eax,0x0
  1c:   5d                      pop    rbp
  1d:   c3                      ret

See the zeros in the instructions at offsets 0x8 and 0x12? Those are holes; the four zero bytes themselves start at 0xb and 0x13. The lea needs the string's address. The call needs greet's address. Neither is known yet.

Relocation entries tell the linker where to fill in:

readelf -r main.o
Relocation section '.rela.text' at offset 0x1e8 contains 2 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000000b  000500000002 R_X86_64_PC32     0000000000000000 .rodata - 4
000000000013  000600000004 R_X86_64_PLT32    0000000000000000 greet - 4

Two relocations. Two holes. The linker fills them.
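The arithmetic behind filling a hole can be sketched in a few lines of C. For a 32-bit PC-relative relocation, the x86-64 ABI defines the stored value as S + A - P: the symbol's final address, plus the addend, minus the address of the hole itself. The function names below are hypothetical, and the sketch assumes greet is defined inside the executable, so the PLT32 relocation reduces to the same formula.

```c
#include <stddef.h>
#include <stdint.h>

/* Value stored in a 32-bit PC-relative field: S + A - P, where
   S = final address of the symbol,
   A = addend from the relocation entry (here -4),
   P = final address of the field being patched. */
int32_t pc_relative_value(uint64_t sym_addr, int64_t addend, uint64_t field_addr) {
    return (int32_t)(sym_addr + (uint64_t)addend - field_addr);
}

/* Writes the value into the hole as four little-endian bytes. */
void patch_hole(uint8_t *text, size_t offset, int32_t value) {
    uint32_t bits = (uint32_t)value;
    for (size_t i = 0; i < 4; i++) {
        text[offset + i] = (uint8_t)(bits >> (8 * i));
    }
}
```

Using the addresses from the nm output at the start of the chapter (greet at 0x1149, main at 0x1172), the hole at main+0x13 sits at 0x1185, so the value is 0x1149 - 4 - 0x1185 = -0x40. The call bytes e8 00 00 00 00 would then become e8 c0 ff ff ff.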


Symbol types: the nm command

Symbol  Meaning
──────  ────────────────────────────────────
  T     Text (code) — defined in this file
  D     Data — initialized global variable
  B     BSS — uninitialized global variable
  R     Read-only data
  U     Undefined — referenced but not here
  W     Weak — can be overridden

main.o defines main (T) but needs greet (U). greet.o defines greet (T) but needs printf (U). The linker connects all the U's to the T's.
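To see most of these letters in one place, a small file like the following can be compiled with gcc -c and inspected with nm. It is a sketch, not part of the greet example: the file name kinds.c and every identifier in it are made up, the weak attribute is a GCC/Clang extension, and the letters in the comments are what nm is expected to print at the default optimization level.

```c
/* kinds.c: hypothetical file; try `gcc -c kinds.c && nm kinds.o`. */
#include <stdio.h>

int counter = 42;                /* expected D: initialized global */
int scratch[64];                 /* expected B (older GCC defaults may show C, "common") */
const char banner[] = "bumped";  /* expected R: read-only data */

/* expected W: weak definition that another object file may override */
__attribute__((weak)) void on_bump(void) {}

/* expected t: a lowercase letter marks a symbol local to this file */
static int step(void) { return 1; }

/* expected T for bump, and U for puts, which is defined in libc */
int bump(void) {
    on_bump();
    puts(banner);
    scratch[0] = counter;
    counter += step();
    return counter;
}
```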

🧠 What do you think happens? What if a symbol is U in every object file and never T? The linker emits: undefined reference to 'greet'. The most common linker error. No file provided a definition.
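The matching of U entries against T entries can be modeled as a table lookup. The sketch below is a deliberately simplified model with hypothetical names (struct symbol, find_definition, count_unresolved); real linkers use hash tables and also handle weak symbols, archives, and shared libraries.

```c
#include <stddef.h>
#include <string.h>

struct symbol {
    const char *file;  /* object file the entry came from */
    char kind;         /* 'T' = defined, 'U' = undefined */
    const char *name;
};

/* Returns the file that defines `name`, or NULL when no input defines it. */
const char *find_definition(const struct symbol *syms, size_t n, const char *name) {
    for (size_t i = 0; i < n; i++) {
        if (syms[i].kind == 'T' && strcmp(syms[i].name, name) == 0) {
            return syms[i].file;
        }
    }
    return NULL;
}

/* Counts the U entries that have no matching T anywhere in the table. */
size_t count_unresolved(const struct symbol *syms, size_t n) {
    size_t missing = 0;
    for (size_t i = 0; i < n; i++) {
        if (syms[i].kind == 'U' && find_definition(syms, n, syms[i].name) == NULL) {
            missing++;
        }
    }
    return missing;
}
```

Given only main.o's two entries (T main, U greet), count_unresolved returns 1, which is the situation that produces the undefined reference error. Adding greet.o's entries resolves greet but leaves printf open, because this toy table does not model libc.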


Linking: three jobs

1. Symbol resolution — match undefined references to definitions:

main.o                     greet.o
┌────────────────┐         ┌────────────────┐
│ T main         │         │ T greet        │
│ U greet ───────┼────────►│                │
│                │         │ U printf ──────┼──► libc.so
└────────────────┘         └────────────────┘

2. Relocation — fill in placeholder addresses:

Before (main.o):   call 0x00000000    # placeholder
After  (hello):    call 0x00001149    # actual address of greet

3. Section merging — combine sections from all object files:

main.o  .text ──┐
                ├──► final .text
greet.o .text ──┘
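Section merging is also where final addresses come from: each input section gets a start address inside the merged output, and every symbol in it moves by that amount. Here is a minimal sketch with hypothetical names (struct input_section, merge_sections). The numbers in the note after it are assumptions read off the earlier listings: main.o's .text is 0x1e bytes per the objdump output, greet.o's 0x29 bytes is inferred from the gap between the two nm addresses, and an alignment of 1 is assumed.

```c
#include <stddef.h>
#include <stdint.h>

struct input_section {
    const char *file;  /* object file that contributed the section */
    uint64_t size;     /* section size in bytes */
    uint64_t align;    /* required alignment, at least 1 */
};

/* Rounds addr up to the next multiple of align. */
static uint64_t align_up(uint64_t addr, uint64_t align) {
    return (addr + align - 1) / align * align;
}

/* Lays the sections out back to back starting at `base`.
   Stores each section's start address in starts[i] and
   returns the first address past the merged section. */
uint64_t merge_sections(const struct input_section *in, size_t n,
                        uint64_t base, uint64_t *starts) {
    uint64_t cursor = base;
    for (size_t i = 0; i < n; i++) {
        cursor = align_up(cursor, in[i].align);
        starts[i] = cursor;
        cursor += in[i].size;
    }
    return cursor;
}
```

Placing greet.o's .text at 0x1149 and main.o's right after it yields start addresses 0x1149 and 0x1172, which matches the order on the link command line (greet.o main.o) and the addresses nm reported for greet and main.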

Static vs dynamic linking

Static linking:                    Dynamic linking (default):

┌──────────────────────┐          ┌───────────────┐
│     hello_static     │          │    hello      │
│  main()              │          │  main()       │
│  greet()             │          │  greet()      │
│  printf()            │          │  printf ──────┼──► libc.so.6
│  ... all of libc ... │          └───────────────┘
└──────────────────────┘            ~16 KB, needs .so
  ~880 KB, no dependencies

gcc -static main.o greet.o -o hello_static
ldd hello_static    # "not a dynamic executable"
ldd hello           # lists libc.so.6, ld-linux, vdso

Static: everything baked in. Dynamic: smaller binary, resolved at runtime.


Rust's linking story

By default on Linux, Rust statically links its own standard library but dynamically links the system C library. It is a hybrid approach.

ldd hello_rs   # shows libc.so.6, libgcc_s.so.1, ld-linux

💡 Fun Fact: Fully static Rust: rustup target add x86_64-unknown-linux-musl && rustc --target x86_64-unknown-linux-musl hello.rs. Zero dynamic dependencies. Runs on any x86-64 Linux.


PLT/GOT: dynamic linking at a glance

When your binary calls printf dynamically, it uses two structures:

  • PLT (Procedure Linkage Table): code trampolines
  • GOT (Global Offset Table): writable address slots

Your code               PLT                    GOT
─────────              ─────                   ────
call puts@PLT ──────► puts@PLT:              ┌──────────────┐
                       jmp [GOT[puts]]───────►│ address of   │
                       ...                    │ puts in libc │
                                              └──────────────┘

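With lazy binding, the GOT entry initially points to a resolver. On the first call, the resolver finds the real puts, patches the GOT, and jumps there. Subsequent calls go directly through the patched GOT. (Binaries linked with -z now skip the lazy step and have every entry resolved at load time.) Details in Chapter 14.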
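The patch-on-first-call idea can be imitated in plain C with a function pointer standing in for a GOT slot. This is only an analogy, not how the dynamic loader is implemented: got_slot, resolver_stub, real_square, and the other names are invented for the illustration.

```c
/* A function pointer plays the role of one GOT slot. */
typedef int (*fn_t)(int);

static int resolve_count = 0;

/* Stands in for the real function that lives in a shared library. */
static int real_square(int x) { return x * x; }

static int resolver_stub(int x);

/* Before the first call, the slot points at the resolver. */
static fn_t got_slot = resolver_stub;

/* First call lands here: look up the target, patch the slot, jump to it. */
static int resolver_stub(int x) {
    resolve_count++;
    got_slot = real_square;
    return got_slot(x);
}

/* Plays the role of the PLT entry: always an indirect jump through the slot. */
int square_via_plt(int x) { return got_slot(x); }

/* Reports how many times the resolver ran. */
int times_resolved(void) { return resolve_count; }
```

Calling square_via_plt twice runs the resolver only once; the second call goes straight to real_square through the patched slot.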


🔧 Task: Watch symbols resolve across files

  1. Create math.c:
    int add(int a, int b) { return a + b; }
    int multiply(int a, int b) { return a * b; }
    
  2. Create app.c:
    #include <stdio.h>
    int add(int a, int b);
    int multiply(int a, int b);
    int main() {
        printf("%d %d\n", add(3, 4), multiply(5, 6));
        return 0;
    }
    
  3. Compile separately: gcc -c math.c and gcc -c app.c
  4. Run nm math.o. Both add and multiply should be T (defined)
  5. Run nm app.o. Both add and multiply should be U (undefined)
  6. Link: gcc math.o app.o -o app
  7. Run nm app | grep -E 'add|multiply' — both now T with real addresses
  8. Run ./app — output: 7 30
  9. Bonus: Delete math.o, try linking with just app.o. The error tells you exactly which symbols are missing.