C Optimization Techniques

Performance matters in systems programming. This chapter covers the tools and techniques that separate amateur C from production C: compiler flags, profiling, cache-aware data layout, and branch prediction hints. The rule is always the same: measure first, optimize second.

Compiler Optimization Levels

GCC and Clang accept -O flags that control how aggressively the compiler transforms your code.

Flag   Effect
-O0    No optimization. Fastest compile, debuggable.
-O1    Basic optimizations. Smaller code.
-O2    Most optimizations. Good default for release.
-O3    Aggressive: vectorization, inlining, unrolling.
-Os    Optimize for size (like -O2 minus bloat).
-Og    Optimize for debugging experience.

Let's see the difference on a trivial loop.

/* opt_levels.c */
#include <stdio.h>
#include <time.h>

static long sum_array(const int *arr, int n) {
    long total = 0;
    for (int i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}

int main(void) {
    enum { N = 100000000 };
    static int data[N];

    for (int i = 0; i < N; i++)
        data[i] = i & 0xFF;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    long result = sum_array(data, N);

    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    printf("sum = %ld, time = %.4f s\n", result, elapsed);
    return 0;
}

Compile and run at different levels:

$ gcc -O0 -o opt0 opt_levels.c && ./opt0
sum = 12750000000, time = 0.2510 s

$ gcc -O2 -o opt2 opt_levels.c && ./opt2
sum = 12750000000, time = 0.0380 s

$ gcc -O3 -o opt3 opt_levels.c && ./opt3
sum = 12750000000, time = 0.0120 s

At -O3, the compiler auto-vectorizes the loop using SIMD instructions.

Caution: -O3 can change behavior of code that relies on undefined behavior. If your program works at -O0 but breaks at -O2, you have a bug, not a compiler problem.

Looking at What the Compiler Did

Use -S to see assembly, or objdump -d on the binary:

$ gcc -O2 -S -o opt2.s opt_levels.c
$ grep -A5 'sum_array' opt2.s

The Compiler Explorer (godbolt.org) is invaluable for comparing output across flags and compilers. Use it.

Profile-Guided Optimization (PGO)

PGO lets the compiler observe real execution patterns, then recompile with that data.

# Step 1: Instrument
$ gcc -O2 -fprofile-generate -o opt_pgo_gen opt_levels.c

# Step 2: Run with representative input
$ ./opt_pgo_gen

# Step 3: Recompile using the profile
$ gcc -O2 -fprofile-use -o opt_pgo opt_levels.c

PGO helps the compiler make better inlining, branching, and layout decisions. Typical improvement: 5-20% on real workloads.

Profiling with perf

Never guess where time is spent. Use perf.

$ gcc -O2 -g -o opt2 opt_levels.c
$ perf stat ./opt2

This gives you cycle counts, cache misses, branch mispredictions, and IPC (instructions per cycle).

For function-level profiling:

$ perf record -g ./opt2
$ perf report

perf report shows a call-graph breakdown. Look for the hottest functions first.

Profiling with gprof

$ gcc -O2 -pg -o opt_gprof opt_levels.c
$ ./opt_gprof
$ gprof opt_gprof gmon.out | head -30

gprof adds instrumentation overhead but gives call counts and cumulative time.

Profiling with Valgrind/Callgrind

$ gcc -O2 -g -o opt2 opt_levels.c
$ valgrind --tool=callgrind --cache-sim=yes ./opt2
$ callgrind_annotate callgrind.out.* | head -40

With --cache-sim=yes, Callgrind also simulates the CPU cache hierarchy. It's slow (20-50x), but gives exact instruction counts and cache-miss data.

Try It: Compile opt_levels.c at -O0 and -O3. Run perf stat on both. Compare the "instructions" and "cache-misses" lines.

Cache-Friendly Data Layout

Modern CPUs are fast. Memory is slow. A cache miss costs 100+ cycles. Data layout determines cache behavior.

Array of Structs (AoS) vs Struct of Arrays (SoA)

AoS (Array of Structs):
+------+------+------+------+------+------+------+------+
| x[0] | y[0] | z[0] | w[0] | x[1] | y[1] | z[1] | w[1] | ...
+------+------+------+------+------+------+------+------+

SoA (Struct of Arrays):
+------+------+------+------+------+------+------+------+
| x[0] | x[1] | x[2] | x[3] | y[0] | y[1] | y[2] | y[3] | ...
+------+------+------+------+------+------+------+------+

If you iterate over all elements but only touch x, SoA wins because cache lines contain only x values.

/* cache_layout.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

/* Array of Structs */
struct Particle_AoS {
    float x, y, z;
    float vx, vy, vz;
    float mass;
    float pad; /* 32 bytes total */
};

/* Struct of Arrays */
struct Particles_SoA {
    float *x, *y, *z;
    float *vx, *vy, *vz;
    float *mass;
};

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    /* AoS test */
    struct Particle_AoS *aos = malloc(N * sizeof(*aos));
    for (int i = 0; i < N; i++) {
        aos[i].x = (float)i;
        aos[i].mass = 1.0f;
    }

    double t0 = now();
    float sum_aos = 0;
    for (int i = 0; i < N; i++)
        sum_aos += aos[i].x * aos[i].mass;
    double t1 = now();
    printf("AoS: sum=%.0f  time=%.4f s\n", sum_aos, t1 - t0);

    /* SoA test */
    struct Particles_SoA soa;
    soa.x    = malloc(N * sizeof(float));
    soa.mass = malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) {
        soa.x[i] = (float)i;
        soa.mass[i] = 1.0f;
    }

    t0 = now();
    float sum_soa = 0;
    for (int i = 0; i < N; i++)
        sum_soa += soa.x[i] * soa.mass[i];
    t1 = now();
    printf("SoA: sum=%.0f  time=%.4f s\n", sum_soa, t1 - t0);

    free(aos);
    free(soa.x);
    free(soa.mass);
    return 0;
}

$ gcc -O2 -o cache_layout cache_layout.c
$ ./cache_layout
AoS: sum=...  time=0.0280 s
SoA: sum=...  time=0.0090 s

SoA wins because only 8 bytes per element touch the cache (x + mass), not 32.

Driver Prep: Kernel DMA buffer layout affects device performance. The same AoS-vs-SoA trade-off applies to descriptor rings in network drivers.

Branch Prediction Hints

CPUs predict branches. Mispredictions cost ~15 cycles. You can hint the compiler with __builtin_expect.

/* branch_hints.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static long process(const int *data, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (unlikely(data[i] < 0)) {
            /* Error path: rarely taken */
            sum -= data[i];
        } else {
            sum += data[i];
        }
    }
    return sum;
}

int main(void) {
    enum { N = 100000000 };
    int *data = malloc(N * sizeof(int));

    /* ~99.9% of values are non-negative */
    for (int i = 0; i < N; i++)
        data[i] = (i % 1000 == 0) ? -1 : i & 0xFF;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    long result = process(data, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("sum=%ld  time=%.4f s\n", result, elapsed);

    free(data);
    return 0;
}

The Linux kernel defines likely() and unlikely() macros everywhere. Use them on error-checking branches.

The restrict Keyword

restrict tells the compiler that two pointers don't alias (don't point to overlapping memory). This enables vectorization.

/* restrict_demo.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void add_arrays(float *restrict dst,
                const float *restrict a,
                const float *restrict b,
                int n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

void add_arrays_no_restrict(float *dst,
                            const float *a,
                            const float *b,
                            int n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

int main(void) {
    enum { N = 50000000 };
    float *a = malloc(N * sizeof(float));
    float *b = malloc(N * sizeof(float));
    float *c = malloc(N * sizeof(float));

    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = (float)(N - i);
    }

    struct timespec t0, t1;

    /* Touch c once so page faults don't land in the first timed run */
    for (int i = 0; i < N; i++)
        c[i] = 0.0f;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    add_arrays(c, a, b, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("restrict:    %.4f s\n",
           (t1.tv_sec-t0.tv_sec) + (t1.tv_nsec-t0.tv_nsec)/1e9);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    add_arrays_no_restrict(c, a, b, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("no restrict: %.4f s\n",
           (t1.tv_sec-t0.tv_sec) + (t1.tv_nsec-t0.tv_nsec)/1e9);

    free(a); free(b); free(c);
    return 0;
}

Without restrict, the compiler must assume dst might overlap a or b, preventing SIMD optimization.

Caution: If you lie to the compiler with restrict and the pointers actually alias, you get undefined behavior. The compiler will generate wrong code.

Loop Unrolling

The compiler can unroll loops at -O2/-O3, but you can also do it manually or with pragmas:

/* unroll.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

long sum_unrolled(const int *arr, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;

    /* Process 4 elements per iteration */
    for (; i + 3 < n; i += 4) {
        s0 += arr[i];
        s1 += arr[i + 1];
        s2 += arr[i + 2];
        s3 += arr[i + 3];
    }
    /* Handle remainder */
    for (; i < n; i++)
        s0 += arr[i];

    return s0 + s1 + s2 + s3;
}

long sum_simple(const int *arr, int n) {
    long total = 0;
    for (int i = 0; i < n; i++)
        total += arr[i];
    return total;
}

int main(void) {
    enum { N = 100000000 };
    int *data = malloc(N * sizeof(int));
    for (int i = 0; i < N; i++)
        data[i] = i & 0xFF;

    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    long r1 = sum_simple(data, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("simple:   sum=%ld  %.4f s\n", r1,
           (t1.tv_sec-t0.tv_sec) + (t1.tv_nsec-t0.tv_nsec)/1e9);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    long r2 = sum_unrolled(data, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("unrolled: sum=%ld  %.4f s\n", r2,
           (t1.tv_sec-t0.tv_sec) + (t1.tv_nsec-t0.tv_nsec)/1e9);

    free(data);
    return 0;
}

Manual unrolling with multiple accumulators (s0-s3) breaks data dependencies and lets the CPU pipeline fill.

GCC also supports:

#pragma GCC unroll 8
for (int i = 0; i < n; i++)
    total += arr[i];

Rust Optimization

Rust uses LLVM. The same optimization principles apply.

# Debug build (like -O0)
$ cargo build

# Release build (opt-level 3 by default; LTO off unless enabled)
$ cargo build --release

In Cargo.toml:

[profile.release]
opt-level = 3
lto = true
codegen-units = 1

// src/main.rs — cache-friendly iteration
use std::time::Instant;

const N: usize = 10_000_000;

struct ParticlesAoS {
    data: Vec<(f32, f32, f32, f32)>, // x, y, z, mass
}

struct ParticlesSoA {
    x: Vec<f32>,
    mass: Vec<f32>,
}

fn main() {
    // AoS
    let aos = ParticlesAoS {
        data: (0..N).map(|i| (i as f32, 0.0, 0.0, 1.0)).collect(),
    };

    let t0 = Instant::now();
    let sum_aos: f32 = aos.data.iter().map(|(x, _, _, m)| x * m).sum();
    let d_aos = t0.elapsed();

    // SoA
    let soa = ParticlesSoA {
        x: (0..N).map(|i| i as f32).collect(),
        mass: vec![1.0; N],
    };

    let t0 = Instant::now();
    let sum_soa: f32 = soa.x.iter().zip(&soa.mass).map(|(x, m)| x * m).sum();
    let d_soa = t0.elapsed();

    println!("AoS: sum={sum_aos:.0} time={d_aos:?}");
    println!("SoA: sum={sum_soa:.0} time={d_soa:?}");
}

Rust Note: Rust iterators compile to the same tight loops as C for loops in release builds. No overhead. The compiler auto-vectorizes them just like it would a C loop.

Rust: Profiling

Use cargo flamegraph or perf directly:

$ cargo build --release
$ perf stat ./target/release/myapp
$ perf record -g ./target/release/myapp
$ perf report

For Criterion-based benchmarks:

// benches/my_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_sum(c: &mut Criterion) {
    let data: Vec<i32> = (0..1_000_000).map(|i| (i & 0xFF) as i32).collect();
    c.bench_function("sum", |b| {
        b.iter(|| {
            let sum: i64 = data.iter().map(|&x| x as i64).sum();
            black_box(sum)
        })
    });
}

criterion_group!(benches, bench_sum);
criterion_main!(benches);

black_box prevents the compiler from optimizing away the computation.

Rust Note: Rust has no direct equivalent of restrict. The borrow checker ensures &mut references are unique, which gives LLVM the same aliasing information automatically.

Measuring: The Golden Rule

Rule: If you didn't measure it, you don't know it's slow.
Rule: If you didn't measure it after, you don't know you fixed it.

Micro-benchmark checklist

  1. Warm up the cache first (run the function once before timing).
  2. Use clock_gettime(CLOCK_MONOTONIC) in C, Instant::now() in Rust.
  3. Run multiple iterations and take the median, not the mean.
  4. Disable CPU frequency scaling during benchmarks.
  5. Use volatile or black_box to prevent dead-code elimination.

/* prevent_dce.c — prevent dead code elimination */
#include <stdio.h>
#include <time.h>

/* Force the compiler to keep the result */
static void do_not_optimize(void *p) {
    __asm__ volatile("" : : "g"(p) : "memory");
}

int main(void) {
    struct timespec t0, t1;
    long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 100000000; i++)
        sum += i;
    do_not_optimize(&sum);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("sum=%ld  time=%.4f s\n", sum, elapsed);
    return 0;
}

Try It: Compile prevent_dce.c with -O3 but without the do_not_optimize call. What happens to the loop? Check the assembly.

Optimization Decision Flowchart

Is it fast enough?
  |
  +-- YES --> Stop. Ship it.
  |
  +-- NO --> Profile it.
              |
              Where is the time?
              |
              +-- CPU bound --> Check -O level, restrict, SIMD
              |
              +-- Memory bound --> Check data layout, cache misses
              |
              +-- I/O bound --> Check syscall count, buffering
              |
              +-- Branch misses --> Check branch patterns, likely/unlikely

Quick Knowledge Check

  1. Why is -O0 useful even though it produces slow code?
  2. What does restrict promise the compiler, and what happens if you lie?
  3. When does SoA beat AoS?

Common Pitfalls

  • Optimizing without profiling. You will optimize the wrong thing.
  • Benchmarking at -O0. That measures unoptimized code full of redundant loads and stores, not your algorithm.
  • Forgetting warm-up. Cold caches give misleading first-run numbers.
  • Using gettimeofday for benchmarks. It's not monotonic. Use clock_gettime(CLOCK_MONOTONIC).
  • Assuming -O3 is always better than -O2. Aggressive inlining can blow up the instruction cache.
  • restrict on aliased pointers. Undefined behavior, silently wrong.
  • Optimizing for one CPU. -march=native won't run on other machines.