Inline Assembly

Sometimes you need to drop below the language and talk directly to the CPU. Reading hardware registers, executing specific instructions the compiler does not emit, or inserting precise memory barriers -- these require inline assembly. This chapter shows how to embed assembly in both C (GCC extended asm) and Rust (the asm! macro), with real examples on x86-64.

When You Need Inline Assembly

Most code never needs it. But these situations demand it:

  • Reading CPU-specific registers (cycle counter, model info, control registers)
  • Memory and compiler barriers (preventing reordering in lock-free code)
  • Specific SIMD instructions that the compiler will not emit on its own
  • Hardware I/O (in/out instructions for port-mapped I/O)
  • Atomic operations not provided by the language or library
  • System calls (the raw syscall instruction)

Driver Prep: Kernel code and device drivers use inline assembly for all of the above. The Linux kernel's arch/x86/include/asm/ directory is full of inline assembly wrappers. Understanding the constraint system is essential.

C: GCC Extended Assembly Syntax

The basic form:

asm volatile (
    "assembly template"
    : output operands
    : input operands
    : clobber list
);

Example: Reading the CPU Cycle Counter (RDTSC)

The rdtsc instruction reads the Time Stamp Counter into EDX:EAX (high 32 bits in EDX, low 32 bits in EAX).

/* rdtsc.c */
#include <stdio.h>
#include <stdint.h>

static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    asm volatile (
        "rdtsc"
        : "=a" (lo),   /* output: EAX -> lo */
          "=d" (hi)    /* output: EDX -> hi */
        :              /* no inputs */
        :              /* no extra clobbers */
    );
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t start = read_tsc();

    /* Do some work */
    volatile int sum = 0;
    for (int i = 0; i < 1000000; i++)
        sum += i;

    uint64_t end = read_tsc();
    printf("Cycles: %lu\n", end - start);
    printf("Sum: %d\n", sum);
    return 0;
}

Compile and run:

gcc -O2 -Wall -o rdtsc rdtsc.c && ./rdtsc

The constraint letters tell GCC which registers to use:

+------------+---------------------------+
| Constraint | Meaning                   |
+------------+---------------------------+
| "=a"       | output in EAX             |
| "=d"       | output in EDX             |
| "=r"       | output in any GPR         |
| "=m"       | output in memory          |
| "r"        | input in any GPR          |
| "i"        | immediate constant        |
| "m"        | input in memory           |
+------------+---------------------------+
| Modifiers                              |
+------------+---------------------------+
| "="        | write-only output         |
| "+"        | read-write operand        |
| "&"        | early-clobber output      |
+------------+---------------------------+

Example: CPUID

The cpuid instruction returns CPU identification data. It reads EAX as a "leaf" selector and writes results to EAX, EBX, ECX, EDX.

/* cpuid.c */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

void get_cpu_vendor(char *vendor)
{
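    uint32_t eax = 0;   /* leaf 0; cpuid also overwrites EAX */
    uint32_t ebx, ecx, edx;
    asm volatile (
        "cpuid"
        : "+a" (eax),   /* in: leaf selector, out: highest basic leaf */
          "=b" (ebx),
          "=c" (ecx),
          "=d" (edx)
        :               /* no other inputs */
        :
    );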
    /* Vendor string is in EBX:EDX:ECX (yes, that order) */
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
}

int main(void)
{
    char vendor[13];
    get_cpu_vendor(vendor);
    printf("CPU Vendor: %s\n", vendor);
    return 0;
}

Compile and run:

gcc -O2 -Wall -o cpuid cpuid.c && ./cpuid

The volatile Keyword

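asm volatile tells the compiler that the block has effects beyond its declared outputs: it must not be deleted, merged with an identical block, or hoisted out of a loop. Without volatile, GCC treats an asm block as a pure function of its inputs, so it may remove the block when the outputs are unused or combine duplicates. volatile does not pin the block in place, though: GCC may still move it relative to surrounding code, so add a "memory" clobber when its order relative to loads and stores matters. Always use volatile unless the asm block is a pure function with no side effects (rare).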

Caution: Even volatile does not prevent CPU-level reordering. For that you need memory barriers (fence instructions). volatile only constrains the compiler's optimizer.

Memory Barriers

On modern out-of-order CPUs, the processor can reorder memory operations. Compilers can also reorder loads and stores. Barriers prevent both.

/* barriers.c */
#include <stdio.h>
#include <stdint.h>

/* Compiler barrier -- prevents compiler reordering, not CPU reordering */
#define compiler_barrier()  asm volatile ("" ::: "memory")

/* Full memory fence -- prevents both compiler and CPU reordering */
#define full_fence()        asm volatile ("mfence" ::: "memory")

/* Store fence -- all prior stores complete before any later store */
#define store_fence()       asm volatile ("sfence" ::: "memory")

/* Load fence -- all prior loads complete before any later load */
#define load_fence()        asm volatile ("lfence" ::: "memory")

volatile int shared_flag = 0;
volatile int shared_data = 0;

void producer(void)
{
    shared_data = 42;       /* write data first */
    store_fence();          /* ensure data is visible before flag */
    shared_flag = 1;        /* signal that data is ready */
}

void consumer(void)
{
    while (!shared_flag)    /* wait for flag */
        ;
    load_fence();           /* ensure we read data after flag */
    printf("Data: %d\n", shared_data);  /* guaranteed to see 42 */
}

int main(void)
{
    /* Single-threaded demo -- the barriers matter in multi-threaded code */
    producer();
    consumer();
    return 0;
}

Compile and run:

gcc -O2 -Wall -o barriers barriers.c && ./barriers

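The "memory" clobber tells the compiler that the asm block may read or write any memory, so it must not reorder loads/stores across it or keep memory values cached in registers over it. On x86-64 the hardware already keeps ordinary stores in order with other stores and ordinary loads in order with other loads, so for normal write-back memory it is the compiler barrier that does the work here; sfence and lfence matter mainly for non-temporal stores and write-combining memory. On weakly ordered CPUs such as ARM, a hardware barrier is needed as well.

  Without barrier:           With barrier:

  store data = 42            store data = 42
  store flag = 1             sfence
  (may be reordered!)        store flag = 1
                             (stores are ordered)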

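Driver Prep: The Linux kernel defines mb(), rmb(), wmb() (full, read, write memory barriers) and smp_mb(), smp_rmb(), smp_wmb() (SMP variants). On x86-64, mb(), rmb() and wmb() are thin inline-assembly wrappers around mfence, lfence and sfence. The SMP variants are cheaper there: smp_rmb() and smp_wmb() reduce to compiler barriers, and smp_mb() uses a locked instruction rather than mfence.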

C: Inline Assembly for a System Call

On x86-64 Linux, system calls use the syscall instruction:

/* raw_syscall.c */
#include <stdio.h>
#include <stdint.h>

/* write(fd, buf, count) -- syscall number 1 on x86-64 */
static long raw_write(int fd, const void *buf, unsigned long count)
{
    long ret;
    asm volatile (
        "syscall"
        : "=a" (ret)                /* output: return value in RAX */
        : "a" (1),                  /* input: syscall number in RAX */
          "D" ((long)fd),           /* input: arg1 in RDI */
          "S" ((long)buf),          /* input: arg2 in RSI */
          "d" ((long)count)         /* input: arg3 in RDX */
        : "rcx", "r11", "memory"   /* clobbers: syscall destroys RCX, R11 */
    );
    return ret;
}

int main(void)
{
    const char msg[] = "Hello from raw syscall!\n";
    long written = raw_write(1, msg, sizeof(msg) - 1);
    printf("Bytes written: %ld\n", written);
    return 0;
}

Compile and run:

gcc -O2 -Wall -o raw_syscall raw_syscall.c && ./raw_syscall

Output:

Hello from raw syscall!
Bytes written: 24

The x86-64 Linux syscall convention: RAX = syscall number and return value, arguments in RDI, RSI, RDX, R10, R8, R9. The CPU clobbers RCX and R11.

Try It: Implement a raw_exit(int code) function using the syscall instruction (syscall number 60 on x86-64). Call it instead of return 0 and verify the exit code with echo $?.

Rust: The asm! Macro

Rust stabilized inline assembly in Rust 1.59. The syntax is different from GCC's but the concept is the same.

Example: Reading the CPU Cycle Counter

// rdtsc_rust.rs
use std::arch::asm;

fn read_tsc() -> u64 {
    let lo: u32;
    let hi: u32;
    unsafe {
        asm!(
            "rdtsc",
            out("eax") lo,
            out("edx") hi,
            options(nostack, nomem),
        );
    }
    ((hi as u64) << 32) | (lo as u64)
}

fn main() {
    let start = read_tsc();

    let mut sum: u64 = 0;
    for i in 0..1_000_000u64 {
        // black_box keeps the optimizer from folding the loop into a constant
        sum = sum.wrapping_add(std::hint::black_box(i));
    }

    let end = read_tsc();
    println!("Cycles: {}", end - start);
    println!("Sum: {}", sum);
}

Compile and run:

rustc -O rdtsc_rust.rs && ./rdtsc_rust

Key differences from GCC syntax:

+---------------------------+--------------------------------+
| GCC (C)                   | Rust asm!                      |
+---------------------------+--------------------------------+
| "=a" (lo)                 | out("eax") lo                  |
| "=d" (hi)                 | out("edx") hi                  |
| : "memory"                | options(nomem) if no memory    |
|                           | access, otherwise omit         |
| asm volatile              | asm! is volatile by default    |
| "r" (input)               | in(reg) input                  |
+---------------------------+--------------------------------+

Example: CPUID in Rust

// cpuid_rust.rs
use std::arch::asm;

fn cpu_vendor() -> String {
    let ebx: u32;
    let ecx: u32;
    let edx: u32;

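    unsafe {
        asm!(
            // rbx is reserved by LLVM on x86-64 and cannot be named as an
            // operand, so park it in a scratch register around cpuid.
            "mov {tmp:r}, rbx",
            "cpuid",
            "xchg {tmp:r}, rbx",   // restore rbx; tmp now holds cpuid's EBX
            tmp = out(reg) ebx,
            inout("eax") 0u32 => _,
            out("ecx") ecx,
            out("edx") edx,
            options(nostack),
        );
    }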

    let mut vendor = [0u8; 12];
    vendor[0..4].copy_from_slice(&ebx.to_le_bytes());
    vendor[4..8].copy_from_slice(&edx.to_le_bytes());
    vendor[8..12].copy_from_slice(&ecx.to_le_bytes());
    String::from_utf8_lossy(&vendor).to_string()
}

fn main() {
    println!("CPU Vendor: {}", cpu_vendor());
}

Compile and run:

rustc -O cpuid_rust.rs && ./cpuid_rust

Rust asm! Operand and Option Reference

+-------------------+------------------------------------------+
| Operand           | Meaning                                  |
+-------------------+------------------------------------------+
| in(reg) expr      | Input in any general-purpose register    |
| in("eax") expr    | Input in a specific register             |
| out(reg) var      | Output to any GPR                        |
| out("edx") var    | Output to a specific register            |
| inout(reg) var    | Read-write operand                       |
| inout("eax") x=>y | Input x, output y, same register         |
| out(reg) _        | Clobbered register (output discarded)    |
+-------------------+------------------------------------------+

+-------------------+------------------------------------------+
| Option            | Meaning                                  |
+-------------------+------------------------------------------+
| nomem             | Asm does not read/write memory           |
| nostack           | Asm does not use the stack               |
| pure              | No side effects (allows optimization)    |
| preserves_flags   | Does not modify CPU flags (EFLAGS)       |
| att_syntax        | Use AT&T syntax instead of Intel         |
+-------------------+------------------------------------------+

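By default, asm! blocks are treated as volatile: the compiler will not remove them even if their outputs are unused. The pure option lifts that guarantee. It must be combined with nomem (or readonly) and promises that the outputs depend only on the inputs, which lets the compiler drop unused blocks, merge identical ones, or hoist them out of loops, much as it would a call to a pure function.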

Rust: A Raw System Call

// raw_syscall_rust.rs
use std::arch::asm;

/// Perform a raw write() system call on x86-64 Linux.
unsafe fn raw_write(fd: u64, buf: *const u8, count: u64) -> i64 {
    let ret: i64;
    asm!(
        "syscall",
        in("rax") 1u64,      // syscall number: write = 1
        in("rdi") fd,         // arg1: file descriptor
        in("rsi") buf as u64, // arg2: buffer pointer
        in("rdx") count,      // arg3: byte count
        out("rcx") _,         // clobbered by syscall
        out("r11") _,         // clobbered by syscall
        lateout("rax") ret,   // return value
        options(nostack),
    );
    ret
}

fn main() {
    let msg = b"Hello from Rust raw syscall!\n";
    let written = unsafe { raw_write(1, msg.as_ptr(), msg.len() as u64) };
    println!("Bytes written: {}", written);
}

Compile and run:

rustc -O raw_syscall_rust.rs && ./raw_syscall_rust

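Note lateout("rax") instead of out("rax"). A plain out register is reserved for the whole block, so it cannot double as an input, and the compiler rejects in("rax") next to out("rax"). lateout promises that the output is written only after all inputs have been read, which lets rax carry the syscall number in and the return value out. Writing inlateout("rax") 1u64 => ret expresses the same thing in a single operand.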

Rust Note: In practice, use std::sync::atomic with proper Ordering values (SeqCst, Acquire, Release) instead of raw fence instructions for memory barriers. The atomic types generate the correct barriers automatically. Inline assembly for barriers is only needed when interfacing with hardware or writing the lowest levels of a synchronization library.

SIMD: A Practical Example

Let us use SSSE3 to sum an array of four 32-bit integers using 128-bit SIMD registers.

/* simd_sum.c */
#include <stdio.h>
#include <stdint.h>

int32_t simd_sum_4(const int32_t vals[4])
{
    int32_t result;
    asm volatile (
        "movdqu (%1), %%xmm0\n\t"     /* load 4 ints into xmm0 */
        "phaddd %%xmm0, %%xmm0\n\t"   /* horizontal add pairs */
        "phaddd %%xmm0, %%xmm0\n\t"   /* horizontal add again */
        "movd   %%xmm0, %0\n\t"       /* extract low 32 bits */
        : "=r" (result)
        : "r" (vals)
        : "xmm0", "memory"
    );
    return result;
}

int main(void)
{
    int32_t data[4] = {10, 20, 30, 40};
    printf("SIMD sum: %d\n", simd_sum_4(data));
    printf("Expected: %d\n", 10 + 20 + 30 + 40);
    return 0;
}

Compile and run (requires SSSE3):

gcc -O2 -Wall -mssse3 -o simd_sum simd_sum.c && ./simd_sum

The equivalent in Rust:

// simd_sum_rust.rs
use std::arch::asm;

fn simd_sum_4(vals: &[i32; 4]) -> i32 {
    let result: i32;
    unsafe {
        asm!(
            "movdqu ({ptr}), %xmm0",
            "phaddd %xmm0, %xmm0",
            "phaddd %xmm0, %xmm0",
            "movd   %xmm0, {out:e}",   // :e = 32-bit register name, matching i32
            ptr = in(reg) vals.as_ptr(),
            out = out(reg) result,
            out("xmm0") _,
            options(att_syntax, nostack),
        );
    }
    result
}

fn main() {
    let data = [10i32, 20, 30, 40];
    println!("SIMD sum: {}", simd_sum_4(&data));
    println!("Expected: {}", 10 + 20 + 30 + 40);
}

Compile and run:

rustc -O -C target-feature=+ssse3 simd_sum_rust.rs && ./simd_sum_rust

Rust Note: For SIMD in production Rust code, prefer the std::arch intrinsics (like _mm_hadd_epi32) or the portable std::simd module (nightly). Use inline assembly only when the specific instruction you need has no intrinsic wrapper.

Try It: Modify the SIMD example to sum 8 integers using two movdqu loads and a paddd to combine them before the horizontal adds.

Safety Considerations

Inline assembly bypasses every safety guarantee both languages provide.

+----------------------------------+-----------------------------------+
| Risk                             | Mitigation                        |
+----------------------------------+-----------------------------------+
| Wrong register constraints       | Test on multiple opt levels       |
| Missing clobber declaration      | List ALL modified registers       |
| Stack misalignment               | Omit nostack if asm uses stack    |
| Forgetting "memory" clobber      | Add if asm touches any memory     |
| Platform-specific code           | Guard with #ifdef / #[cfg()]      |
| Compiler upgrades break asm      | Minimize asm surface area         |
+----------------------------------+-----------------------------------+

Caution: Incorrect constraints are silent killers. The assembler will not warn you. The program will appear to work, then break under different optimization levels or with different surrounding code. Always test with -O0, -O2, and -O3.

Wrap each asm block in a small, well-named inline function. Keep the asm to the one or two instructions you actually need, and let the compiler handle everything else. The compiler is better at register allocation and instruction scheduling than you are.

Knowledge Check

  1. Why is asm volatile used instead of plain asm for reading hardware counters?

  2. What happens if you forget to list a register in the clobber list that your assembly modifies?

  3. In Rust's asm! macro, what is the difference between out("rax") and lateout("rax")?

Common Pitfalls

  • Missing volatile. Without it, the compiler may eliminate or move your asm block. Use volatile for anything with side effects.
  • Incomplete clobber lists. If your assembly modifies a register and you did not declare it, the compiler may store a value there that gets silently corrupted.
  • Forgetting the "memory" clobber. If your assembly reads or writes memory through a pointer, you must include "memory" in the clobber list. In Rust the default already assumes memory access, so the equivalent mistake is adding nomem to a block that touches memory.
  • Mixing up AT&T and Intel syntax. GCC uses AT&T syntax by default (src, dst). Rust's asm! uses Intel syntax by default (dst, src). Use the att_syntax option in Rust if you prefer AT&T.
  • Writing complex logic in assembly. Keep asm blocks to one or two instructions. Let the compiler handle the rest.
  • Not guarding platform-specific code. Wrap all inline assembly in #ifdef __x86_64__ (C) or #[cfg(target_arch = "x86_64")] (Rust) so the code does not break on ARM or other architectures.