Inline Assembly
Sometimes you need to drop below the language and talk directly to the CPU.
Reading hardware registers, executing specific instructions the compiler does
not emit, or inserting precise memory barriers -- these require inline assembly.
This chapter shows how to embed assembly in both C (GCC extended asm) and
Rust (the asm! macro), with real examples on x86-64.
When You Need Inline Assembly
Most code never needs it. But these situations demand it:
- Reading CPU-specific registers (cycle counter, model info, control registers)
- Memory and compiler barriers (preventing reordering in lock-free code)
- Specific SIMD instructions that the compiler does not auto-vectorize to
- Hardware I/O (in/out instructions for port-mapped I/O)
- Atomic operations not provided by the language or library
- System calls (the raw syscall instruction)
Driver Prep: Kernel code and device drivers use inline assembly for all of the above. The Linux kernel's arch/x86/include/asm/ directory is full of inline assembly wrappers. Understanding the constraint system is essential.
C: GCC Extended Assembly Syntax
The basic form:
asm volatile (
    "assembly template"
    : output operands
    : input operands
    : clobber list
);
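A minimal sketch of the four-part form in use, assuming x86-64 and GCC or Clang. The helper name add_u64 and the operand names dst and src are hypothetical; __asm__ is the alternate spelling of asm that also compiles under strict -std=c11.

```c
#include <stdint.h>

/* Hypothetical helper: dst += src using a single ADD instruction. */
static inline uint64_t add_u64(uint64_t dst, uint64_t src)
{
#if defined(__x86_64__)
    __asm__ (
        "add %[src], %[dst]"     /* template, AT&T order: source, destination */
        : [dst] "+r" (dst)       /* outputs: dst is both read and written */
        : [src] "r" (src)        /* inputs: src is only read */
        : "cc"                   /* clobbers: ADD changes the flags */
    );
    return dst;
#else
    return dst + src;            /* portable fallback on other architectures */
#endif
}
```

The block is not marked volatile because its result depends only on its operands, so the compiler is free to drop it when the result is unused.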
Example: Reading the CPU Cycle Counter (RDTSC)
The rdtsc instruction reads the Time Stamp Counter into EDX:EAX (high 32
bits in EDX, low 32 bits in EAX).
/* rdtsc.c */
#include <inttypes.h>
#include <stdio.h>
#include <stdint.h>

static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    asm volatile (
        "rdtsc"
        : "=a" (lo),   /* output: EAX -> lo */
          "=d" (hi)    /* output: EDX -> hi */
        :              /* no inputs */
        :              /* no extra clobbers */
    );
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t start = read_tsc();
    /* Do some work. The sum exceeds INT_MAX, so use a 64-bit unsigned
       accumulator to avoid signed overflow. */
    volatile uint64_t sum = 0;
    for (uint64_t i = 0; i < 1000000; i++)
        sum += i;
    uint64_t end = read_tsc();
    /* PRIu64 is the portable printf format for uint64_t */
    printf("Cycles: %" PRIu64 "\n", end - start);
    printf("Sum: %" PRIu64 "\n", sum);
    return 0;
}
Compile and run:
gcc -O2 -Wall -o rdtsc rdtsc.c && ./rdtsc
The constraint letters tell GCC which registers to use:
+------------+---------------------------+
| Constraint | Meaning                   |
+------------+---------------------------+
| "=a"       | output in EAX             |
| "=d"       | output in EDX             |
| "=r"       | output in any GPR         |
| "=m"       | output in memory          |
| "r"        | input in any GPR          |
| "i"        | immediate constant        |
| "m"        | input in memory           |
+------------+---------------------------+
| Modifiers                              |
+------------+---------------------------+
| "="        | write-only output         |
| "+"        | read-write operand        |
| "&"        | early-clobber output      |
+------------+---------------------------+
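The "&" modifier is the least obvious entry in the table. A minimal sketch of where it matters, assuming x86-64 and GCC or Clang; sum_then_double is a hypothetical helper. The output is written by the first instruction while input b is still needed by the second, so without "&" the compiler would be allowed to place out and b in the same register.

```c
#include <stdint.h>

/* Hypothetical helper: returns (a + b) * 2 using three instructions. */
static inline uint64_t sum_then_double(uint64_t a, uint64_t b)
{
#if defined(__x86_64__)
    uint64_t out;
    __asm__ (
        "mov %[a], %[out]\n\t"     /* out is written here ...            */
        "add %[b], %[out]\n\t"     /* ... while b is still needed here   */
        "add %[out], %[out]"       /* out = out + out                    */
        : [out] "=&r" (out)        /* early-clobber: never shares a register with an input */
        : [a] "r" (a), [b] "r" (b)
        : "cc"                     /* ADD changes the flags */
    );
    return out;
#else
    return (a + b) * 2;            /* portable fallback on other architectures */
#endif
}
```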
Example: CPUID
The cpuid instruction returns CPU identification data. It reads EAX as a
"leaf" selector and writes results to EAX, EBX, ECX, EDX.
/* cpuid.c */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

void get_cpu_vendor(char *vendor)
{
    uint32_t eax = 0;        /* in: leaf 0; out: highest supported leaf */
    uint32_t ebx, ecx, edx;
    asm volatile (
        "cpuid"
        : "+a" (eax),        /* read-write: cpuid overwrites EAX */
          "=b" (ebx),
          "=c" (ecx),
          "=d" (edx)
        :                    /* no input-only operands */
        :                    /* no extra clobbers */
    );
    /* Vendor string is in EBX:EDX:ECX (yes, that order) */
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
}

int main(void)
{
    char vendor[13];
    get_cpu_vendor(vendor);
    printf("CPU Vendor: %s\n", vendor);
    return 0;
}
Compile and run:
gcc -O2 -Wall -o cpuid cpuid.c && ./cpuid
The volatile Keyword
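asm volatile tells the compiler that the block has side effects beyond its
declared outputs: do not delete it when the outputs are unused, do not hoist
it out of a loop, and do not merge two identical blocks into one. Without
volatile, the compiler may do all three. (An asm statement with no output
operands is implicitly volatile.) volatile does not pin the block in place,
though: GCC may still move it relative to surrounding code, so add a "memory"
clobber or a data dependency when ordering matters. Use volatile unless the
asm block is a pure function of its inputs (rare).
Caution: Even volatile does not prevent CPU-level reordering. For that you need memory barriers (fence instructions). volatile only constrains the compiler's optimizer.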
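For contrast, here is a sketch of the rare case where plain asm is appropriate, assuming x86 and GCC or Clang; swap_bytes32 is a hypothetical helper. BSWAP reads one register, writes the same register, and touches neither memory nor flags, so the compiler may safely drop or merge it.

```c
#include <stdint.h>

/* Hypothetical helper: reverse the byte order of a 32-bit value. */
static inline uint32_t swap_bytes32(uint32_t x)
{
#if defined(__x86_64__) || defined(__i386__)
    /* No volatile and no clobbers: the result depends only on x. */
    __asm__ ("bswap %0" : "+r" (x));
    return x;
#else
    return __builtin_bswap32(x);   /* GCC/Clang builtin as a portable fallback */
#endif
}
```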
Memory Barriers
On modern out-of-order CPUs, the processor can reorder memory operations. Compilers can also reorder loads and stores. Barriers prevent both.
/* barriers.c */
#include <stdio.h>
#include <stdint.h>

/* Compiler barrier -- prevents compiler reordering, not CPU reordering */
#define compiler_barrier() asm volatile ("" ::: "memory")
/* Full memory fence -- prevents both compiler and CPU reordering */
#define full_fence() asm volatile ("mfence" ::: "memory")
/* Store fence -- all prior stores complete before any later store */
#define store_fence() asm volatile ("sfence" ::: "memory")
/* Load fence -- all prior loads complete before any later load */
#define load_fence() asm volatile ("lfence" ::: "memory")

volatile int shared_flag = 0;
volatile int shared_data = 0;

void producer(void)
{
    shared_data = 42;   /* write data first */
    store_fence();      /* ensure data is visible before flag */
    shared_flag = 1;    /* signal that data is ready */
}

void consumer(void)
{
    while (!shared_flag)   /* wait for flag */
        ;
    load_fence();          /* ensure we read data after flag */
    printf("Data: %d\n", shared_data);   /* guaranteed to see 42 */
}

int main(void)
{
    /* Single-threaded demo -- the barriers matter in multi-threaded code */
    producer();
    consumer();
    return 0;
}
Compile and run:
gcc -O2 -Wall -o barriers barriers.c && ./barriers
The "memory" clobber tells the compiler that the asm block may read or write
any memory, so it must not reorder loads/stores across it.
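Without barrier:                With barrier:
  store data = 42                 store data = 42
  store flag = 1                  sfence
  (stores may be reordered!)      store flag = 1
                                  (stores are ordered)
What does the reordering depends on the target. x86-64 keeps ordinary stores in program order on its own, so there the risk comes from the compiler, from non-temporal stores, and from write-combining memory. Weakly ordered CPUs such as ARM can reorder plain stores in hardware.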
Driver Prep: The Linux kernel defines mb(), rmb(), wmb() (full, read, write memory barriers) and smp_mb(), smp_rmb(), smp_wmb() (SMP variants). These are thin wrappers around inline assembly: fence instructions, locked instructions, or plain compiler barriers, depending on the architecture.
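For ordinary thread-to-thread signalling in user-space C, the same data-then-flag handoff can be written without any asm. A minimal sketch, assuming a C11 compiler with <stdatomic.h>; publish, try_consume, payload, and ready are hypothetical names. This swaps the fence macros for release/acquire atomics rather than reproducing them, and leaves the choice of barrier instructions to the compiler.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;          /* plain data, published through the flag */
static atomic_bool ready;    /* static storage: starts out false */

/* Producer side: write the data, then set the flag with release ordering
   so the data write cannot become visible after the flag. */
static void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer side: read the flag with acquire ordering; if it is set,
   the payload written before the release store is visible. */
static bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

Raw fence instructions remain the right tool for device memory, non-temporal stores, and the inside of a synchronization library.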
C: Inline Assembly for a System Call
On x86-64 Linux, system calls use the syscall instruction:
/* raw_syscall.c */
#include <stdio.h>
#include <stdint.h>

/* write(fd, buf, count) -- syscall number 1 on x86-64 */
static long raw_write(int fd, const void *buf, unsigned long count)
{
    long ret;
    asm volatile (
        "syscall"
        : "=a" (ret)                /* output: return value in RAX */
        : "a" (1),                  /* input: syscall number in RAX */
          "D" ((long)fd),           /* input: arg1 in RDI */
          "S" ((long)buf),          /* input: arg2 in RSI */
          "d" ((long)count)         /* input: arg3 in RDX */
        : "rcx", "r11", "memory"    /* clobbers: syscall destroys RCX, R11 */
    );
    return ret;
}

int main(void)
{
    const char msg[] = "Hello from raw syscall!\n";
    long written = raw_write(1, msg, sizeof(msg) - 1);
    printf("Bytes written: %ld\n", written);
    return 0;
}
Compile and run:
gcc -O2 -Wall -o raw_syscall raw_syscall.c && ./raw_syscall
Output:
Hello from raw syscall!
Bytes written: 24
The x86-64 Linux syscall convention: RAX = syscall number and return value, arguments in RDI, RSI, RDX, R10, R8, R9. The CPU clobbers RCX and R11.
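One detail of the convention is easy to miss: on failure the kernel returns a negated errno value in RAX, and it is the C library wrappers that turn this into -1 plus errno. A minimal sketch of that translation, assuming x86-64 Linux; raw_write_checked is a hypothetical name, and other platforms fall back to the libc write().

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical wrapper: raw write() that reports errors the libc way. */
static long raw_write_checked(int fd, const void *buf, unsigned long count)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    __asm__ __volatile__ (
        "syscall"
        : "=a" (ret)                 /* RAX: return value or -errno */
        : "a" (1L),                  /* RAX: syscall number, write = 1 */
          "D" ((long)fd),            /* RDI: arg1 */
          "S" (buf),                 /* RSI: arg2 */
          "d" (count)                /* RDX: arg3 */
        : "rcx", "r11", "memory"     /* syscall destroys RCX and R11 */
    );
    if (ret < 0) {                   /* kernel returned -errno */
        errno = (int)-ret;
        return -1;
    }
    return ret;
#else
    return (long)write(fd, buf, count);   /* fallback: libc does the same translation */
#endif
}
```

Calling it with an invalid descriptor, for example raw_write_checked(-1, "x", 1), returns -1 and sets errno to EBADF.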
Try It: Implement a raw_exit(int code) function using the syscall instruction (syscall number 60 on x86-64). Call it instead of return 0 and verify the exit code with echo $?.
Rust: The asm! Macro
Rust stabilized inline assembly in Rust 1.59. The syntax is different from GCC's but the concept is the same.
Example: Reading the CPU Cycle Counter
// rdtsc_rust.rs
use std::arch::asm;

fn read_tsc() -> u64 {
    let lo: u32;
    let hi: u32;
    unsafe {
        asm!(
            "rdtsc",
            out("eax") lo,
            out("edx") hi,
            options(nostack, nomem),
        );
    }
    ((hi as u64) << 32) | (lo as u64)
}

fn main() {
    let start = read_tsc();
    let mut sum: u64 = 0;
    for i in 0..1_000_000u64 {
        sum = sum.wrapping_add(i);
    }
    let end = read_tsc();
    println!("Cycles: {}", end - start);
    println!("Sum: {}", sum);
}
Compile and run:
rustc -O rdtsc_rust.rs && ./rdtsc_rust
Key differences from GCC syntax:
+---------------------------+--------------------------------+
| GCC (C)                   | Rust asm!                      |
+---------------------------+--------------------------------+
| "=a" (lo)                 | out("eax") lo                  |
| "=d" (hi)                 | out("edx") hi                  |
| : "memory"                | options(nomem) if no memory    |
|                           | access, otherwise omit         |
| asm volatile              | asm! is volatile by default    |
| "r" (input)               | in(reg) input                  |
+---------------------------+--------------------------------+
Example: CPUID in Rust
// cpuid_rust.rs
use std::arch::asm;

fn cpu_vendor() -> String {
    let ebx: u32;
    let ecx: u32;
    let edx: u32;
    unsafe {
        asm!(
            // rbx is reserved by LLVM on x86-64 and cannot be named as an
            // operand, so park it in a scratch register around cpuid.
            "mov {tmp:r}, rbx",
            "cpuid",
            "xchg {tmp:r}, rbx",   // tmp gets cpuid's EBX, rbx is restored
            tmp = out(reg) ebx,
            inout("eax") 0u32 => _,
            out("ecx") ecx,
            out("edx") edx,
            options(nostack),
        );
    }
    let mut vendor = [0u8; 12];
    vendor[0..4].copy_from_slice(&ebx.to_le_bytes());
    vendor[4..8].copy_from_slice(&edx.to_le_bytes());
    vendor[8..12].copy_from_slice(&ecx.to_le_bytes());
    String::from_utf8_lossy(&vendor).to_string()
}

fn main() {
    println!("CPU Vendor: {}", cpu_vendor());
}
Compile and run:
rustc -O cpuid_rust.rs && ./cpuid_rust
Rust asm! Operand and Option Reference
+-------------------+------------------------------------------+
| Operand           | Meaning                                  |
+-------------------+------------------------------------------+
| in(reg) expr      | Input in any general-purpose register    |
| in("eax") expr    | Input in a specific register             |
| out(reg) var      | Output to any GPR                        |
| out("edx") var    | Output to a specific register            |
| inout(reg) var    | Read-write operand                       |
| inout("eax") x=>y | Input x, output y, same register         |
| out(reg) _        | Clobbered register (output discarded)    |
+-------------------+------------------------------------------+
+-------------------+------------------------------------------+
| Option            | Meaning                                  |
+-------------------+------------------------------------------+
| nomem             | Asm does not read/write memory           |
| nostack           | Asm does not use the stack               |
| pure              | No side effects (allows optimization)    |
| preserves_flags   | Does not modify CPU flags (EFLAGS)       |
| att_syntax        | Use AT&T syntax instead of Intel         |
+-------------------+------------------------------------------+
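By default, asm! blocks are treated as volatile: they are assumed to have side
effects and will not be removed or duplicated. The pure option lifts that
assumption. It must be combined with nomem or readonly and needs at least one
used output; the compiler may then drop or merge executions of the block, much
as it would for a regular pure function call.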
Rust: A Raw System Call
// raw_syscall_rust.rs
use std::arch::asm;

/// Perform a raw write() system call on x86-64 Linux.
unsafe fn raw_write(fd: u64, buf: *const u8, count: u64) -> i64 {
    let ret: i64;
    asm!(
        "syscall",
        in("rax") 1u64,        // syscall number: write = 1
        in("rdi") fd,          // arg1: file descriptor
        in("rsi") buf as u64,  // arg2: buffer pointer
        in("rdx") count,       // arg3: byte count
        out("rcx") _,          // clobbered by syscall
        out("r11") _,          // clobbered by syscall
        lateout("rax") ret,    // return value
        options(nostack),
    );
    ret
}

fn main() {
    let msg = b"Hello from Rust raw syscall!\n";
    let written = unsafe { raw_write(1, msg.as_ptr(), msg.len() as u64) };
    println!("Bytes written: {}", written);
}
Compile and run:
rustc -O raw_syscall_rust.rs && ./raw_syscall_rust
Note lateout("rax") instead of out("rax"). This tells the compiler that the
output is written late (after inputs are consumed), which is necessary because
rax is also used as an input.
Rust Note: In practice, use std::sync::atomic with proper Ordering values (SeqCst, Acquire, Release) instead of raw fence instructions for memory barriers. The atomic types generate the correct barriers automatically. Inline assembly for barriers is only needed when interfacing with hardware or writing the lowest levels of a synchronization library.
SIMD: A Practical Example
Let us use SSSE3 to sum an array of four 32-bit integers using 128-bit SIMD registers.
/* simd_sum.c */
#include <stdio.h>
#include <stdint.h>

int32_t simd_sum_4(const int32_t vals[4])
{
    int32_t result;
    asm volatile (
        "movdqu (%1), %%xmm0\n\t"     /* load 4 ints into xmm0 */
        "phaddd %%xmm0, %%xmm0\n\t"   /* horizontal add pairs */
        "phaddd %%xmm0, %%xmm0\n\t"   /* horizontal add again */
        "movd %%xmm0, %0\n\t"         /* extract low 32 bits */
        : "=r" (result)
        : "r" (vals)
        : "xmm0", "memory"
    );
    return result;
}

int main(void)
{
    int32_t data[4] = {10, 20, 30, 40};
    printf("SIMD sum: %d\n", simd_sum_4(data));
    printf("Expected: %d\n", 10 + 20 + 30 + 40);
    return 0;
}
Compile and run (requires SSSE3):
gcc -O2 -Wall -mssse3 -o simd_sum simd_sum.c && ./simd_sum
The equivalent in Rust:
// simd_sum_rust.rs
use std::arch::asm;

fn simd_sum_4(vals: &[i32; 4]) -> i32 {
    let result: i32;
    unsafe {
        asm!(
            "movdqu ({ptr}), %xmm0",   // load 4 ints into xmm0
            "phaddd %xmm0, %xmm0",     // horizontal add pairs
            "phaddd %xmm0, %xmm0",     // horizontal add again
            "movd %xmm0, {out:e}",     // extract low 32 bits (:e = 32-bit register name)
            ptr = in(reg) vals.as_ptr(),
            out = out(reg) result,
            out("xmm0") _,
            options(att_syntax, nostack),
        );
    }
    result
}

fn main() {
    let data = [10i32, 20, 30, 40];
    println!("SIMD sum: {}", simd_sum_4(&data));
    println!("Expected: {}", 10 + 20 + 30 + 40);
}
Compile and run:
rustc -O -C target-feature=+ssse3 simd_sum_rust.rs && ./simd_sum_rust
(RUSTFLAGS is read by cargo, not by rustc, so the flag is passed directly here.)
Rust Note: For SIMD in production Rust code, prefer the std::arch intrinsics (like _mm_hadd_epi32) or the portable std::simd module (nightly). Use inline assembly only when the specific instruction you need has no intrinsic wrapper.
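The same advice applies in C. A minimal sketch of the four-integer sum written with compiler intrinsics instead of asm, assuming x86-64 and GCC or Clang; intrin_sum_4 is a hypothetical name. To avoid needing -mssse3 it uses SSE2 shuffles and adds (baseline on x86-64) rather than phaddd, so it is a swapped-in technique, not a translation of the asm above.

```c
#include <stdint.h>
#if defined(__x86_64__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: sum four 32-bit integers with SSE2 intrinsics. */
static int32_t intrin_sum_4(const int32_t vals[4])
{
#if defined(__x86_64__)
    __m128i v  = _mm_loadu_si128((const __m128i *)vals);         /* [a b c d]         */
    __m128i sw = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));  /* [c d a b]         */
    __m128i s  = _mm_add_epi32(v, sw);                           /* [a+c b+d a+c b+d] */
    sw = _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1));          /* swap neighbours   */
    s  = _mm_add_epi32(s, sw);                                   /* every lane: a+b+c+d */
    return _mm_cvtsi128_si32(s);                                 /* extract lane 0    */
#else
    return vals[0] + vals[1] + vals[2] + vals[3];                /* scalar fallback   */
#endif
}
```

With intrinsics the compiler picks the registers and can schedule or fold the operations, which hand-written asm prevents.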
Try It: Modify the SIMD example to sum 8 integers using two movdqu loads and a paddd to combine them before the horizontal adds.
Safety Considerations
Inline assembly bypasses every safety guarantee both languages provide.
+----------------------------------+-----------------------------------+
| Risk                             | Mitigation                        |
+----------------------------------+-----------------------------------+
| Wrong register constraints       | Test on multiple opt levels       |
| Missing clobber declaration      | List ALL modified registers       |
| Stack misalignment               | Omit nostack if asm uses stack    |
| Forgetting "memory" clobber      | Add if asm touches any memory     |
| Platform-specific code           | Guard with #ifdef / #[cfg()]      |
| Compiler upgrades break asm      | Minimize asm surface area         |
+----------------------------------+-----------------------------------+
Caution: Incorrect constraints are silent killers. The assembler will not warn you. The program will appear to work, then break under different optimization levels or with different surrounding code. Always test with -O0, -O2, and -O3.
Wrap each asm block in a small, well-named inline function. Keep the asm to the one or two instructions you actually need, and let the compiler handle everything else. The compiler is better at register allocation and instruction scheduling than you are.
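A sketch of what such a wrapper can look like, assuming GCC or Clang; lowest_set_bit is a hypothetical name. The asm is a single instruction, the architecture guard keeps the file building elsewhere, and callers never see that asm is involved.

```c
#include <stdint.h>

/* Hypothetical helper: index of the lowest set bit. x must be nonzero,
   because BSF leaves its destination undefined for a zero source. */
static inline unsigned lowest_set_bit(uint32_t x)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t idx;
    __asm__ (
        "bsf %[in], %[out]"      /* AT&T order: source, destination */
        : [out] "=r" (idx)
        : [in] "r" (x)
        : "cc"                   /* BSF sets ZF */
    );
    return idx;
#else
    return (unsigned)__builtin_ctz(x);   /* compiler builtin on other targets */
#endif
}
```

For example, lowest_set_bit(8) yields 3 and lowest_set_bit(0x80000000u) yields 31.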
Knowledge Check
- Why is asm volatile used instead of plain asm for reading hardware counters?
- What happens if you forget to list a register in the clobber list that your assembly modifies?
- In Rust's asm! macro, what is the difference between out("rax") and lateout("rax")?
Common Pitfalls
- Missing volatile. Without it, the compiler may eliminate your asm block, hoist it out of a loop, or merge duplicates. Use volatile for anything with side effects.
- Incomplete clobber lists. If your assembly modifies a register and you did not declare it, the compiler may store a value there that gets silently corrupted.
- Forgetting the "memory" clobber. If your assembly reads or writes memory through a pointer, you must include "memory" in the clobber list. In Rust, memory access is assumed by default, so simply leave out nomem.
- Assuming AT&T vs Intel syntax. GCC uses AT&T syntax by default (src, dst). Rust's asm! uses Intel syntax by default (dst, src). Use the att_syntax option in Rust if you prefer AT&T.
- Writing complex logic in assembly. Keep asm blocks to one or two instructions. Let the compiler handle the rest.
- Not guarding platform-specific code. Wrap all inline assembly in #ifdef __x86_64__ (C) or #[cfg(target_arch = "x86_64")] (Rust) so the code does not break on ARM or other architectures.