C Optimization Techniques

Performance matters in systems programming. This chapter covers the tools and techniques that separate amateur C from production C: compiler flags, profiling, cache-aware data layout, and branch prediction hints. The rule is always the same: measure first, optimize second.

Compiler Optimization Levels

GCC and Clang accept -O flags that control how aggressively the compiler transforms your code.

Flag   Effect
-O0    No optimization. Fastest compile, debuggable.
-O1    Basic optimizations. Smaller code.
-O2    Most optimizations. Good default for release.
-O3    Aggressive: vectorization, inlining, unrolling.
-Os    Optimize for size (like -O2 minus bloat).
-Og    Optimize for debugging experience.

Let's see the difference on a trivial loop.

/* opt_levels.c */
#include <stdio.h>
#include <time.h>

static long sum_array(const int *arr, int n) {
    long total = 0;
    for (int i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}

int main(void) {
    enum { N = 100000000 };
    static int data[N];

    for (int i = 0; i < N; i++)
        data[i] = i & 0xFF;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    long result = sum_array(data, N);

    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    printf("sum = %ld, time = %.4f s\n", result, elapsed);
    return 0;
}

Compile and run at different levels:

$ gcc -O0 -o opt0 opt_levels.c && ./opt0
sum = 12750000000, time = 0.2510 s

$ gcc -O2 -o opt2 opt_levels.c && ./opt2
sum = 12750000000, time = 0.0380 s

$ gcc -O3 -o opt3 opt_levels.c && ./opt3
sum = 12750000000, time = 0.0120 s

At -O3, the compiler auto-vectorizes the loop using SIMD instructions.

Caution: -O3 can change behavior of code that relies on undefined behavior. If your program works at -O0 but breaks at -O2, you have a bug, not a compiler problem.

Looking at What the Compiler Did

Use -S to see assembly, or objdump -d on the binary:

$ gcc -O2 -S -o opt2.s opt_levels.c
$ grep -A5 'sum_array' opt2.s

The Compiler Explorer (godbolt.org) is invaluable for comparing output across flags and compilers. Use it.

Profile-Guided Optimization (PGO)

PGO lets the compiler observe real execution patterns, then recompile with that data.

# Step 1: Instrument
$ gcc -O2 -fprofile-generate -o opt_pgo_gen opt_levels.c

# Step 2: Run with representative input
$ ./opt_pgo_gen

# Step 3: Recompile using the profile
$ gcc -O2 -fprofile-use -o opt_pgo opt_levels.c

PGO helps the compiler make better inlining, branching, and layout decisions. Typical improvement: 5-20% on real workloads.

Profiling with perf

Never guess where time is spent. Use perf.

$ gcc -O2 -g -o opt2 opt_levels.c
$ perf stat ./opt2

This gives you cycle counts, cache misses, branch mispredictions, and IPC (instructions per cycle).

For function-level profiling:

$ perf record -g ./opt2
$ perf report

perf report shows a call-graph breakdown. Look for the hottest functions first.

Profiling with gprof

$ gcc -O2 -pg -o opt_gprof opt_levels.c
$ ./opt_gprof
$ gprof opt_gprof gmon.out | head -30

gprof adds instrumentation overhead but gives call counts and cumulative time.

Profiling with Valgrind/Callgrind

$ gcc -O2 -g -o opt2 opt_levels.c
$ valgrind --tool=callgrind --cache-sim=yes ./opt2
$ callgrind_annotate callgrind.out.* | head -40

With --cache-sim=yes, Callgrind also simulates the CPU cache hierarchy. It's slow (20-50x), but gives exact instruction counts and cache-miss data.

Try It: Compile opt_levels.c at -O0 and -O3. Run perf stat on both. Compare the "instructions" and "cache-misses" lines.

Cache-Friendly Data Layout

Modern CPUs are fast. Memory is slow. A cache miss costs 100+ cycles. Data layout determines cache behavior.

Array of Structs (AoS) vs Struct of Arrays (SoA)

AoS (Array of Structs):
+------+------+------+------+------+------+------+------+
| x[0] | y[0] | z[0] | w[0] | x[1] | y[1] | z[1] | w[1] | ...
+------+------+------+------+------+------+------+------+

SoA (Struct of Arrays):
+------+------+------+------+------+------+------+------+
| x[0] | x[1] | x[2] | x[3] | y[0] | y[1] | y[2] | y[3] | ...
+------+------+------+------+------+------+------+------+

If you iterate over all elements but only touch x, SoA wins because cache lines contain only x values.

/* cache_layout.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

/* Array of Structs */
struct Particle_AoS {
    float x, y, z;
    float vx, vy, vz;
    float mass;
    float pad; /* 32 bytes total */
};

/* Struct of Arrays */
struct Particles_SoA {
    float *x, *y, *z;
    float *vx, *vy, *vz;
    float *mass;
};

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    /* AoS test */
    struct Particle_AoS *aos = malloc(N * sizeof(*aos));
    for (int i = 0; i < N; i++) {
        aos[i].x = (float)i;
        aos[i].mass = 1.0f;
    }

    double t0 = now();
    float sum_aos = 0;
    for (int i = 0; i < N; i++)
        sum_aos += aos[i].x * aos[i].mass;
    double t1 = now();
    printf("AoS: sum=%.0f  time=%.4f s\n", sum_aos, t1 - t0);

    /* SoA test */
    struct Particles_SoA soa;
    soa.x    = malloc(N * sizeof(float));
    soa.mass = malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) {
        soa.x[i] = (float)i;
        soa.mass[i] = 1.0f;
    }

    t0 = now();
    float sum_soa = 0;
    for (int i = 0; i < N; i++)
        sum_soa += soa.x[i] * soa.mass[i];
    t1 = now();
    printf("SoA: sum=%.0f  time=%.4f s\n", sum_soa, t1 - t0);

    free(aos);
    free(soa.x);
    free(soa.mass);
    return 0;
}

$ gcc -O2 -o cache_layout cache_layout.c
$ ./cache_layout
AoS: sum=...  time=0.0280 s
SoA: sum=...  time=0.0090 s

SoA wins because only 8 bytes per element touch the cache (x + mass), not 32.

Driver Prep: Kernel DMA buffer layout affects device performance. The same AoS-vs-SoA trade-off applies to descriptor rings in network drivers.

Branch Prediction Hints

CPUs predict branches. Mispredictions cost ~15 cycles. You can hint the compiler with __builtin_expect.

/* branch_hints.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static long process(const int *data, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (unlikely(data[i] < 0)) {
            /* Error path: rarely taken */
            sum -= data[i];
        } else {
            sum += data[i];
        }
    }
    return sum;
}

int main(void) {
    enum { N = 100000000 };
    int *data = malloc(N * sizeof(int));

    /* ~99.9% of values are non-negative */
    for (int i = 0; i < N; i++)
        data[i] = (i % 1000 == 0) ? -1 : i & 0xFF;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    long result = process(data, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("sum=%ld  time=%.4f s\n", result, elapsed);

    free(data);
    return 0;
}

The Linux kernel defines likely() and unlikely() macros everywhere. Use them on error-checking branches.

The restrict Keyword

restrict tells the compiler that two pointers don't alias (don't point to overlapping memory). This enables vectorization.

/* restrict_demo.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void add_arrays(float *restrict dst,
                const float *restrict a,
                const float *restrict b,
                int n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

void add_arrays_no_restrict(float *dst,
                            const float *a,
                            const float *b,
                            int n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

int main(void) {
    enum { N = 50000000 };
    float *a = malloc(N * sizeof(float));
    float *b = malloc(N * sizeof(float));
    float *c = malloc(N * sizeof(float));

    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = (float)(N - i);
    }

    struct timespec t0, t1;

    /* Touch c once so page faults don't land in the first timed run */
    for (int i = 0; i < N; i++)
        c[i] = 0.0f;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    add_arrays(c, a, b, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("restrict:    %.4f s\n",
           (t1.tv_sec-t0.tv_sec) + (t1.tv_nsec-t0.tv_nsec)/1e9);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    add_arrays_no_restrict(c, a, b, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("no restrict: %.4f s\n",
           (t1.tv_sec-t0.tv_sec) + (t1.tv_nsec-t0.tv_nsec)/1e9);

    free(a); free(b); free(c);
    return 0;
}

Without restrict, the compiler must assume dst might overlap a or b, preventing SIMD optimization.

Caution: If you lie to the compiler with restrict and the pointers actually alias, you get undefined behavior. The compiler will generate wrong code.

Loop Unrolling

The compiler can unroll loops at -O2/-O3, but you can also do it manually or with pragmas:

/* unroll.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

long sum_unrolled(const int *arr, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;

    /* Process 4 elements per iteration */
    for (; i + 3 < n; i += 4) {
        s0 += arr[i];
        s1 += arr[i + 1];
        s2 += arr[i + 2];
        s3 += arr[i + 3];
    }
    /* Handle remainder */
    for (; i < n; i++)
        s0 += arr[i];

    return s0 + s1 + s2 + s3;
}

long sum_simple(const int *arr, int n) {
    long total = 0;
    for (int i = 0; i < n; i++)
        total += arr[i];
    return total;
}

int main(void) {
    enum { N = 100000000 };
    int *data = malloc(N * sizeof(int));
    for (int i = 0; i < N; i++)
        data[i] = i & 0xFF;

    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    long r1 = sum_simple(data, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("simple:   sum=%ld  %.4f s\n", r1,
           (t1.tv_sec-t0.tv_sec) + (t1.tv_nsec-t0.tv_nsec)/1e9);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    long r2 = sum_unrolled(data, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("unrolled: sum=%ld  %.4f s\n", r2,
           (t1.tv_sec-t0.tv_sec) + (t1.tv_nsec-t0.tv_nsec)/1e9);

    free(data);
    return 0;
}

Manual unrolling with multiple accumulators (s0-s3) breaks data dependencies and lets the CPU pipeline fill.

GCC also supports:

#pragma GCC unroll 8
for (int i = 0; i < n; i++)
    total += arr[i];

Rust Optimization

Rust uses LLVM. The same optimization principles apply.

# Debug build (like -O0)
$ cargo build

# Release build (opt-level 3 by default; LTO off unless enabled)
$ cargo build --release

In Cargo.toml:

[profile.release]
opt-level = 3
lto = true
codegen-units = 1

// src/main.rs — cache-friendly iteration
use std::time::Instant;

const N: usize = 10_000_000;

struct ParticlesAoS {
    data: Vec<(f32, f32, f32, f32)>, // x, y, z, mass
}

struct ParticlesSoA {
    x: Vec<f32>,
    mass: Vec<f32>,
}

fn main() {
    // AoS
    let aos = ParticlesAoS {
        data: (0..N).map(|i| (i as f32, 0.0, 0.0, 1.0)).collect(),
    };

    let t0 = Instant::now();
    let sum_aos: f32 = aos.data.iter().map(|(x, _, _, m)| x * m).sum();
    let d_aos = t0.elapsed();

    // SoA
    let soa = ParticlesSoA {
        x: (0..N).map(|i| i as f32).collect(),
        mass: vec![1.0; N],
    };

    let t0 = Instant::now();
    let sum_soa: f32 = soa.x.iter().zip(&soa.mass).map(|(x, m)| x * m).sum();
    let d_soa = t0.elapsed();

    println!("AoS: sum={sum_aos:.0} time={d_aos:?}");
    println!("SoA: sum={sum_soa:.0} time={d_soa:?}");
}

Rust Note: Rust iterators compile to the same tight loops as C for loops in release builds. No overhead. The compiler auto-vectorizes them just like it would a C loop.

Rust: Profiling

Use cargo flamegraph or perf directly:

$ cargo build --release
$ perf stat ./target/release/myapp
$ perf record -g ./target/release/myapp
$ perf report

For Criterion-based benchmarks:

// benches/my_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_sum(c: &mut Criterion) {
    let data: Vec<i32> = (0..1_000_000).map(|i| (i & 0xFF) as i32).collect();
    c.bench_function("sum", |b| {
        b.iter(|| {
            let sum: i64 = data.iter().map(|&x| x as i64).sum();
            black_box(sum)
        })
    });
}

criterion_group!(benches, bench_sum);
criterion_main!(benches);

black_box prevents the compiler from optimizing away the computation.

Rust Note: Rust has no direct equivalent of restrict. The borrow checker ensures &mut references are unique, which gives LLVM the same aliasing information automatically.

Measuring: The Golden Rule

Rule: If you didn't measure it, you don't know it's slow.
Rule: If you didn't measure it after, you don't know you fixed it.

Micro-benchmark checklist

  1. Warm up the cache first (run the function once before timing).
  2. Use clock_gettime(CLOCK_MONOTONIC) in C, Instant::now() in Rust.
  3. Run multiple iterations and take the median, not the mean.
  4. Disable CPU frequency scaling during benchmarks.
  5. Use volatile or black_box to prevent dead-code elimination.

/* prevent_dce.c — prevent dead code elimination */
#include <stdio.h>
#include <time.h>

/* Force the compiler to keep the result */
static void do_not_optimize(void *p) {
    __asm__ volatile("" : : "g"(p) : "memory");
}

int main(void) {
    struct timespec t0, t1;
    long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 100000000; i++)
        sum += i;
    do_not_optimize(&sum);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("sum=%ld  time=%.4f s\n", sum, elapsed);
    return 0;
}

Try It: Compile prevent_dce.c with -O3 but without the do_not_optimize call. What happens to the loop? Check the assembly.

Optimization Decision Flowchart

Is it fast enough?
  |
  +-- YES --> Stop. Ship it.
  |
  +-- NO --> Profile it.
              |
              Where is the time?
              |
              +-- CPU bound --> Check -O level, restrict, SIMD
              |
              +-- Memory bound --> Check data layout, cache misses
              |
              +-- I/O bound --> Check syscall count, buffering
              |
              +-- Branch misses --> Check branch patterns, likely/unlikely

Quick Knowledge Check

  1. Why is -O0 useful even though it produces slow code?
  2. What does restrict promise the compiler, and what happens if you lie?
  3. When does SoA beat AoS?

Common Pitfalls

  • Optimizing without profiling. You will optimize the wrong thing.
  • Benchmarking at -O0. That measures unoptimized code full of redundant loads and stores, not your algorithm.
  • Forgetting warm-up. Cold caches give misleading first-run numbers.
  • Using gettimeofday for benchmarks. It's not monotonic. Use clock_gettime(CLOCK_MONOTONIC).
  • Assuming -O3 is always better than -O2. Aggressive inlining can blow up the instruction cache.
  • restrict on aliased pointers. Undefined behavior, silently wrong.
  • Optimizing for one CPU. -march=native won't run on other machines.