C Optimization Techniques
Performance matters in systems programming. This chapter covers the tools and techniques that separate amateur C from production C: compiler flags, profiling, cache-aware data layout, and branch prediction hints. The rule is always the same: measure first, optimize second.
Compiler Optimization Levels
GCC and Clang accept -O flags that control how aggressively the compiler
transforms your code.
| Flag | Effect |
|---|---|
| -O0 | No optimization. Fastest compile, debuggable. |
| -O1 | Basic optimizations. Smaller code. |
| -O2 | Most optimizations. Good default for release. |
| -O3 | Aggressive: vectorization, inlining, unrolling. |
| -Os | Optimize for size (-O2 with size-increasing passes disabled). |
| -Og | Optimize for debugging experience. |
Let's see the difference on a trivial loop.
/* opt_levels.c */
#include <stdio.h>
#include <time.h>
static long sum_array(const int *arr, int n) {
long total = 0;
for (int i = 0; i < n; i++) {
total += arr[i];
}
return total;
}
int main(void) {
enum { N = 100000000 };
static int data[N];
for (int i = 0; i < N; i++)
data[i] = i & 0xFF;
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
long result = sum_array(data, N);
clock_gettime(CLOCK_MONOTONIC, &t1);
double elapsed = (t1.tv_sec - t0.tv_sec)
+ (t1.tv_nsec - t0.tv_nsec) / 1e9;
printf("sum = %ld, time = %.4f s\n", result, elapsed);
return 0;
}
Compile and run at different levels:
$ gcc -O0 -o opt0 opt_levels.c && ./opt0
sum = 12750000000, time = 0.2510 s
$ gcc -O2 -o opt2 opt_levels.c && ./opt2
sum = 12750000000, time = 0.0380 s
$ gcc -O3 -o opt3 opt_levels.c && ./opt3
sum = 12750000000, time = 0.0120 s
At -O3, the compiler auto-vectorizes the loop using SIMD instructions.
Caution:
`-O3` can change the behavior of code that relies on undefined behavior. If your program works at `-O0` but breaks at `-O2`, you have a bug, not a compiler problem.
Looking at What the Compiler Did
Use -S to see assembly, or objdump -d on the binary:
$ gcc -O2 -S -o opt2.s opt_levels.c
$ grep -A5 'sum_array' opt2.s
The Compiler Explorer (godbolt.org) is invaluable for comparing output across flags and compilers. Use it.
Profile-Guided Optimization (PGO)
PGO lets the compiler observe real execution patterns, then recompile with that data.
# Step 1: Instrument
$ gcc -O2 -fprofile-generate -o opt_pgo_gen opt_levels.c
# Step 2: Run with representative input
$ ./opt_pgo_gen
# Step 3: Recompile using the profile
$ gcc -O2 -fprofile-use -o opt_pgo opt_levels.c
PGO helps the compiler make better inlining, branching, and layout decisions. Typical improvement: 5-20% on real workloads.
Profiling with perf
Never guess where time is spent. Use perf.
$ gcc -O2 -g -o opt2 opt_levels.c
$ perf stat ./opt2
This gives you cycle counts, cache misses, branch mispredictions, and IPC (instructions per cycle).
For function-level profiling:
$ perf record -g ./opt2
$ perf report
perf report shows a call-graph breakdown. Look for the hottest functions
first.
Profiling with gprof
$ gcc -O2 -pg -o opt_gprof opt_levels.c
$ ./opt_gprof
$ gprof opt_gprof gmon.out | head -30
gprof adds instrumentation overhead but gives call counts and cumulative time.
Profiling with Valgrind/Callgrind
$ gcc -O2 -g -o opt2 opt_levels.c
$ valgrind --tool=callgrind ./opt2
$ callgrind_annotate callgrind.out.* | head -40
Callgrind simulates the CPU cache hierarchy. It's slow (20-50x), but gives exact instruction counts and cache miss data.
Try It: Compile
`opt_levels.c` at `-O0` and `-O3`. Run `perf stat` on both. Compare the "instructions" and "cache-misses" lines.
Cache-Friendly Data Layout
Modern CPUs are fast. Memory is slow. A cache miss costs 100+ cycles. Data layout determines cache behavior.
Array of Structs (AoS) vs Struct of Arrays (SoA)
AoS (Array of Structs):
+------+------+------+------+------+------+------+------+
| x[0] | y[0] | z[0] | w[0] | x[1] | y[1] | z[1] | w[1] | ...
+------+------+------+------+------+------+------+------+
SoA (Struct of Arrays):
+------+------+------+------+------+------+------+------+
| x[0] | x[1] | x[2] | x[3] | y[0] | y[1] | y[2] | y[3] | ...
+------+------+------+------+------+------+------+------+
If you iterate over all elements but only touch x, SoA wins because cache
lines contain only x values.
/* cache_layout.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define N 10000000
/* Array of Structs */
struct Particle_AoS {
float x, y, z;
float vx, vy, vz;
float mass;
float pad; /* 32 bytes total */
};
/* Struct of Arrays */
struct Particles_SoA {
float *x, *y, *z;
float *vx, *vy, *vz;
float *mass;
};
static double now(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ts.tv_sec + ts.tv_nsec / 1e9;
}
int main(void) {
/* AoS test */
struct Particle_AoS *aos = malloc(N * sizeof(*aos));
for (int i = 0; i < N; i++) {
aos[i].x = (float)i;
aos[i].mass = 1.0f;
}
double t0 = now();
float sum_aos = 0;
for (int i = 0; i < N; i++)
sum_aos += aos[i].x * aos[i].mass;
double t1 = now();
printf("AoS: sum=%.0f time=%.4f s\n", sum_aos, t1 - t0);
/* SoA test */
struct Particles_SoA soa;
soa.x = malloc(N * sizeof(float));
soa.mass = malloc(N * sizeof(float));
for (int i = 0; i < N; i++) {
soa.x[i] = (float)i;
soa.mass[i] = 1.0f;
}
t0 = now();
float sum_soa = 0;
for (int i = 0; i < N; i++)
sum_soa += soa.x[i] * soa.mass[i];
t1 = now();
printf("SoA: sum=%.0f time=%.4f s\n", sum_soa, t1 - t0);
free(aos);
free(soa.x);
free(soa.mass);
return 0;
}
$ gcc -O2 -o cache_layout cache_layout.c
$ ./cache_layout
AoS: sum=... time=0.0280 s
SoA: sum=... time=0.0090 s
SoA wins because only 8 bytes per element touch the cache (x + mass), not 32.
Driver Prep: Kernel DMA buffer layout affects device performance. The same AoS-vs-SoA trade-off applies to descriptor rings in network drivers.
Branch Prediction Hints
CPUs predict branches. Mispredictions cost ~15 cycles. You can hint the
compiler with __builtin_expect.
/* branch_hints.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
static long process(const int *data, int n) {
long sum = 0;
for (int i = 0; i < n; i++) {
if (unlikely(data[i] < 0)) {
/* Error path: rarely taken */
sum -= data[i];
} else {
sum += data[i];
}
}
return sum;
}
int main(void) {
enum { N = 100000000 };
int *data = malloc(N * sizeof(int));
/* 99.9% positive values */
for (int i = 0; i < N; i++)
data[i] = (i % 1000 == 0) ? -1 : i & 0xFF;
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
long result = process(data, N);
clock_gettime(CLOCK_MONOTONIC, &t1);
double elapsed = (t1.tv_sec - t0.tv_sec)
+ (t1.tv_nsec - t0.tv_nsec) / 1e9;
printf("sum=%ld time=%.4f s\n", result, elapsed);
free(data);
return 0;
}
The Linux kernel defines likely() and unlikely() macros everywhere. Use
them on error-checking branches.
The restrict Keyword
restrict tells the compiler that two pointers don't alias (don't point to
overlapping memory). This enables vectorization.
/* restrict_demo.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
void add_arrays(float *restrict dst,
const float *restrict a,
const float *restrict b,
int n) {
for (int i = 0; i < n; i++)
dst[i] = a[i] + b[i];
}
void add_arrays_no_restrict(float *dst,
const float *a,
const float *b,
int n) {
for (int i = 0; i < n; i++)
dst[i] = a[i] + b[i];
}
int main(void) {
enum { N = 50000000 };
float *a = malloc(N * sizeof(float));
float *b = malloc(N * sizeof(float));
float *c = malloc(N * sizeof(float));
for (int i = 0; i < N; i++) {
a[i] = (float)i;
b[i] = (float)(N - i);
}
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
add_arrays(c, a, b, N);
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("restrict: %.4f s\n",
(t1.tv_sec-t0.tv_sec) + (t1.tv_nsec-t0.tv_nsec)/1e9);
clock_gettime(CLOCK_MONOTONIC, &t0);
add_arrays_no_restrict(c, a, b, N);
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("no restrict: %.4f s\n",
(t1.tv_sec-t0.tv_sec) + (t1.tv_nsec-t0.tv_nsec)/1e9);
free(a); free(b); free(c);
return 0;
}
Without restrict, the compiler must assume dst might overlap a or b,
preventing SIMD optimization.
Caution: If you lie to the compiler with
`restrict` and the pointers actually alias, you get undefined behavior. The compiler will generate wrong code.
Loop Unrolling
The compiler can unroll loops at -O2/-O3, but you can also do it manually
or with pragmas:
/* unroll.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
long sum_unrolled(const int *arr, int n) {
long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
int i = 0;
/* Process 4 elements per iteration */
for (; i + 3 < n; i += 4) {
s0 += arr[i];
s1 += arr[i + 1];
s2 += arr[i + 2];
s3 += arr[i + 3];
}
/* Handle remainder */
for (; i < n; i++)
s0 += arr[i];
return s0 + s1 + s2 + s3;
}
long sum_simple(const int *arr, int n) {
long total = 0;
for (int i = 0; i < n; i++)
total += arr[i];
return total;
}
int main(void) {
enum { N = 100000000 };
int *data = malloc(N * sizeof(int));
for (int i = 0; i < N; i++)
data[i] = i & 0xFF;
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
long r1 = sum_simple(data, N);
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("simple: sum=%ld %.4f s\n", r1,
(t1.tv_sec-t0.tv_sec) + (t1.tv_nsec-t0.tv_nsec)/1e9);
clock_gettime(CLOCK_MONOTONIC, &t0);
long r2 = sum_unrolled(data, N);
clock_gettime(CLOCK_MONOTONIC, &t1);
printf("unrolled: sum=%ld %.4f s\n", r2,
(t1.tv_sec-t0.tv_sec) + (t1.tv_nsec-t0.tv_nsec)/1e9);
free(data);
return 0;
}
Manual unrolling with multiple accumulators (s0-s3) breaks data
dependencies and lets the CPU pipeline fill.
GCC also supports:
#pragma GCC unroll 8
for (int i = 0; i < n; i++)
total += arr[i];
Rust Optimization
Rust uses LLVM. The same optimization principles apply.
# Debug build (like -O0)
$ cargo build
# Release build (opt-level 3 by default; LTO is opt-in)
$ cargo build --release
In Cargo.toml:
[profile.release]
opt-level = 3
lto = true
codegen-units = 1
// src/main.rs — cache-friendly iteration
use std::time::Instant;

const N: usize = 10_000_000;

struct ParticlesAoS {
    data: Vec<(f32, f32, f32, f32)>, // x, y, z, mass
}

struct ParticlesSoA {
    x: Vec<f32>,
    mass: Vec<f32>,
}

fn main() {
    // AoS
    let aos = ParticlesAoS {
        data: (0..N).map(|i| (i as f32, 0.0, 0.0, 1.0)).collect(),
    };
    let t0 = Instant::now();
    let sum_aos: f32 = aos.data.iter().map(|(x, _, _, m)| x * m).sum();
    let d_aos = t0.elapsed();

    // SoA
    let soa = ParticlesSoA {
        x: (0..N).map(|i| i as f32).collect(),
        mass: vec![1.0; N],
    };
    let t0 = Instant::now();
    let sum_soa: f32 = soa.x.iter().zip(&soa.mass).map(|(x, m)| x * m).sum();
    let d_soa = t0.elapsed();

    println!("AoS: sum={sum_aos:.0} time={d_aos:?}");
    println!("SoA: sum={sum_soa:.0} time={d_soa:?}");
}
Rust Note: Rust iterators compile to the same tight loops as C
`for` loops at `-O2`. No overhead. The compiler auto-vectorizes them just like it would a C loop.
Rust: Profiling
Use cargo flamegraph or perf directly:
$ cargo build --release
$ perf stat ./target/release/myapp
$ perf record -g ./target/release/myapp
$ perf report
For Criterion-based benchmarks:
// benches/my_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_sum(c: &mut Criterion) {
    let data: Vec<i32> = (0..1_000_000).map(|i| (i & 0xFF) as i32).collect();
    c.bench_function("sum", |b| {
        b.iter(|| {
            let sum: i64 = data.iter().map(|&x| x as i64).sum();
            black_box(sum)
        })
    });
}

criterion_group!(benches, bench_sum);
criterion_main!(benches);
black_box prevents the compiler from optimizing away the computation.
Rust Note: Rust has no direct equivalent of
`restrict`. The borrow checker ensures `&mut` references are unique, which gives LLVM the same aliasing information automatically.
Measuring: The Golden Rule
Rule: If you didn't measure it, you don't know it's slow.
Rule: If you didn't measure it after, you don't know you fixed it.
Micro-benchmark checklist
- Warm up the cache first (run the function once before timing).
- Use `clock_gettime(CLOCK_MONOTONIC)` in C, `Instant::now()` in Rust.
- Run multiple iterations and take the median, not the mean.
- Disable CPU frequency scaling during benchmarks.
- Use `volatile` or `black_box` to prevent dead-code elimination.
/* prevent_dce.c — prevent dead code elimination */
#include <stdio.h>
#include <time.h>
/* Force the compiler to keep the result */
static void do_not_optimize(void *p) {
__asm__ volatile("" : : "g"(p) : "memory");
}
int main(void) {
struct timespec t0, t1;
long sum = 0;
clock_gettime(CLOCK_MONOTONIC, &t0);
for (int i = 0; i < 100000000; i++)
sum += i;
do_not_optimize(&sum);
clock_gettime(CLOCK_MONOTONIC, &t1);
double elapsed = (t1.tv_sec - t0.tv_sec)
+ (t1.tv_nsec - t0.tv_nsec) / 1e9;
printf("sum=%ld time=%.4f s\n", sum, elapsed);
return 0;
}
Try It: Compile
`prevent_dce.c` with `-O3` but without the `do_not_optimize` call. What happens to the loop? Check the assembly.
Optimization Decision Flowchart
Is it fast enough?
|
+-- YES --> Stop. Ship it.
|
+-- NO --> Profile it.
|
Where is the time?
|
+-- CPU bound --> Check -O level, restrict, SIMD
|
+-- Memory bound --> Check data layout, cache misses
|
+-- I/O bound --> Check syscall count, buffering
|
+-- Branch misses --> Check branch patterns, likely/unlikely
Quick Knowledge Check
- Why is `-O0` useful even though it produces slow code?
- What does `restrict` promise the compiler, and what happens if you lie?
- When does SoA beat AoS?
Common Pitfalls
- Optimizing without profiling. You will optimize the wrong thing.
- Benchmarking at `-O0`. That measures unoptimized code, not your algorithm.
- Forgetting warm-up. Cold caches give misleading first-run numbers.
- Using `gettimeofday` for benchmarks. It's not monotonic. Use `clock_gettime(CLOCK_MONOTONIC)`.
- Assuming `-O3` is always better than `-O2`. Aggressive inlining can blow up the instruction cache.
- `restrict` on aliased pointers. Undefined behavior, silently wrong.
- Optimizing for one CPU. Binaries built with `-march=native` may not run on other machines.