The Memory Hierarchy: Why Speed Has Layers

Type this right now

// save as cache_test.c, compile: gcc -O2 -o cache_test cache_test.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE (64 * 1024 * 1024)  // 64 Mi (~67 million) ints = 256 MB

int main() {
    int *arr = malloc(SIZE * sizeof(int));
    for (int i = 0; i < SIZE; i++) arr[i] = i;

    clock_t start;
    // Separate sums: printing both later keeps -O2 from discarding
    // either loop as dead code.
    long long seq_sum = 0, stride_sum = 0;

    // Sequential access
    start = clock();
    for (int i = 0; i < SIZE; i++) seq_sum += arr[i];
    double seq = (double)(clock() - start) / CLOCKS_PER_SEC;

    // Strided access (every 16th element = one access per cache line)
    start = clock();
    for (int i = 0; i < SIZE; i += 16) stride_sum += arr[i];
    double stride = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("Sequential: %.3f sec  (sum=%lld)\n", seq, seq_sum);
    printf("Stride-16:  %.3f sec  (sum=%lld)\n", stride, stride_sum);
    printf("Stride does 16x fewer adds but is NOT 16x faster.\n");
    free(arr);
}

Run it. The stride-16 loop does 16 times fewer additions but is nowhere near 16x faster. Why? Because memory access time dominates, and every 16th element is a new cache line.


The brutal reality: speed vs. size

    ┌───────────────┬──────────────┬─────────────┬────────────────────┐
    │    Level      │   Latency    │    Size     │  Slower than regs? │
    ├───────────────┼──────────────┼─────────────┼────────────────────┤
    │  Registers    │    ~0.3 ns   │   ~128 B    │        1x          │
    │  L1 Cache     │     ~1 ns    │   32-64 KB  │        3x          │
    │  L2 Cache     │     ~4 ns    │   256 KB    │       13x          │
    │  L3 Cache     │    ~12 ns    │   8-32 MB   │       40x          │
    │  RAM (DDR5)   │   ~100 ns    │   8-128 GB  │      300x          │
    │  SSD (NVMe)   │ ~100,000 ns  │   0.5-4 TB  │   300,000x         │
    │  HDD          │~10,000,000 ns│   1-20 TB   │ 30,000,000x        │
    └───────────────┴──────────────┴─────────────┴────────────────────┘

If a register access were 1 second, RAM would be 5 minutes. An SSD would be 3.5 days. A hard drive seek would be almost an entire year.


The pyramid

                      ▲
                     /|\          Registers: 128 B, 0.3 ns
                    / | \
                   /  |  \        L1: 32-64 KB, 1 ns
                  /   |   \
                 /    |    \      L2: 256 KB, 4 ns
                /     |     \
               /      |      \    L3: 8-32 MB, 12 ns
              /       |       \
             /        |        \  RAM: 8-128 GB, 100 ns
            /         |         \
           /          |          \ SSD/HDD: TBs, us-ms
          /───────────────────────\

    FASTER, SMALLER ▲        ▼ SLOWER, BIGGER

Physics forces this tradeoff. Faster storage requires transistors closer to the CPU, which limits capacity. The speed of light itself imposes a minimum latency for reaching distant storage.


Cache lines: the unit of transfer

The CPU never reads a single byte from RAM. It reads in cache lines, typically 64 bytes on modern x86 and ARM chips.

When you access array[0], the CPU fetches a 64-byte block containing array[0] through array[15] (assuming 4-byte ints):

    You access array[0]. RAM sends back a 64-byte cache line:
    ┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
    │ [0]│ [1]│ [2]│ [3]│ [4]│ [5]│ [6]│ [7]│ [8]│ [9]│[10]│[11]│[12]│[13]│[14]│[15]│
    └────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
    ◄──────────────────── 64 bytes ─────────────────────►

    Now array[1] through array[15] are in L1 cache: ~1 ns instead of ~100 ns.

This is spatial locality: data near what you just accessed is already cached.


Two kinds of locality

Spatial locality — accessing nearby addresses in sequence:

for (int i = 0; i < N; i++)
    sum += array[i];  // Each access is 4 bytes after the last

You pay the ~100 ns RAM penalty once per cache line and get 15 more accesses essentially free: roughly (100 + 15 × 1) / 16 ≈ 7 ns per element on average.

Temporal locality — accessing the same data again soon:

int count = 0;
for (int i = 0; i < N; i++)
    count++;  // 'count' stays in L1 the entire loop

What do you think happens? Your local variables are on the stack. Is stack memory special hardware?

Reveal: Same RAM as everything else. But the top of the stack is accessed constantly (temporal locality) in a contiguous region (spatial locality). It's always in L1 cache. That's why local variables are fast.


Why linked lists are slow

Each node is somewhere on the heap. Following each next pointer is a potential cache miss:

    Linked list (pointer chasing):
    Node A @ 0x1000     Node B @ 0x5F00      Node C @ 0x3200
    ┌─────┬──────┐      ┌─────┬──────┐       ┌─────┬──────┐
    │ val │ next─┼─────►│ val │ next─┼──────►│ val │ NULL │
    └─────┴──────┘      └─────┴──────┘       └─────┴──────┘
    Each arrow = potential cache miss = ~100 ns

    Array with 3 elements @ 0x1000:
    ┌─────┬─────┬─────┐
    │ [0] │ [1] │ [2] │   All in one cache line = ~100 ns total
    └─────┴─────┴─────┘

This is why Vec<T> in Rust and arrays in C demolish linked lists in almost every benchmark.

Fun Fact: Bjarne Stroustrup (creator of C++) showed that even for insertion in the middle — the textbook linked-list use case — std::vector beats std::list. Finding the insertion point dominates the cost: the vector's linear scan (and its O(n) memmove) is sequential and cache-friendly, while the list's traversal risks a cache miss at every node, making its O(1) pointer update irrelevant.


Cache-friendly data layout

A game with 10,000 entities. Array of Structs — the obvious way, in C:

struct Entity { float x, y, z; float vx, vy, vz; int health; };
struct Entity entities[10000];

The same layout in Rust:

struct Entity { x: f32, y: f32, z: f32, vx: f32, vy: f32, vz: f32, health: i32 }
let entities: Vec<Entity> = Vec::with_capacity(10000);

Struct of Arrays — cache-friendly when processing one field at a time, in C:

struct Entities { float x[10000], y[10000], z[10000];
                  float vx[10000], vy[10000], vz[10000]; int health[10000]; };

And in Rust:

struct Entities { x: Vec<f32>, y: Vec<f32>, z: Vec<f32>,
                  vx: Vec<f32>, vy: Vec<f32>, vz: Vec<f32>, health: Vec<i32> }

If your physics update only touches positions and velocities:

    AoS cache line (64 bytes, 28-byte structs):
    │ x₀ y₀ z₀ vx₀ vy₀ vz₀ hp₀ │ x₁ y₁ z₁ vx₁ vy₁ vz₁ hp₁ │ x₂ y₂ …
    barely 2 entities per line; the health fields are fetched but unused

    SoA cache line (64 bytes):
    │ x₀ x₁ x₂ x₃ x₄ x₅ x₆ x₇ x₈ x₉ x₁₀ x₁₁ x₁₂ x₁₃ x₁₄ x₁₅ │
    16 x-values per line, every byte useful

SoA can be 2-4x faster for batch processing.


The stride experiment

This program shows the cache hierarchy directly:

// save as stride.c, compile: gcc -O2 -o stride stride.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ARRAY_SIZE (32 * 1024 * 1024)
int main() {
    int *arr = malloc(ARRAY_SIZE * sizeof(int));
    for (int i = 0; i < ARRAY_SIZE; i++) arr[i] = 1;
    int strides[] = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024};
    for (int s = 0; s < 11; s++) {
        int stride = strides[s];
        volatile long long sum = 0;
        clock_t start = clock();
        for (int iter = 0; iter < 10; iter++)
            for (int i = 0; i < ARRAY_SIZE; i += stride) sum += arr[i];
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        long long accesses = (long long)10 * (ARRAY_SIZE / stride);
        printf("Stride %5d: %.2f ns/access\n",
               stride, elapsed * 1e9 / accesses);
    }
    free(arr);
}

Typical results:

    Stride     1: 0.68 ns/access    ◄── sequential, everything cached
    Stride    16: 2.85 ns/access    ◄── one access per cache line
    Stride   256: 8.43 ns/access
    Stride  1024: 9.02 ns/access    ◄── dominated by RAM latency

The same program in Rust produces the same results — same hardware, same cache hierarchy:

use std::time::Instant;
fn main() {
    let size: usize = 32 * 1024 * 1024;
    let arr: Vec<i32> = vec![1; size];
    let strides = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024];
    for &stride in &strides {
        let mut sum: i64 = 0;
        let start = Instant::now();
        for _ in 0..10 {
            let mut i = 0;
            while i < size { sum += arr[i] as i64; i += stride; }
        }
        let ns = start.elapsed().as_nanos() as f64 / (10 * size / stride) as f64;
        println!("Stride {:5}: {:.2} ns/access (sum={})", stride, ns, sum);
    }
}

Recap

    ┌────────────────────────────────────────────────────────────┐
    │  Memory is a hierarchy: fast-small at top, slow-huge below │
    │  CPU fetches data in 64-byte cache lines                   │
    │                                                            │
    │  Spatial locality: access nearby addresses (arrays win)    │
    │  Temporal locality: re-use recent data (loops, stack win)  │
    │                                                            │
    │  Stack is fast: always in L1 cache                         │
    │  Linked lists are slow: every node is a cache miss         │
    │  SoA can be 2-4x faster than AoS for batch processing      │
    └────────────────────────────────────────────────────────────┘

Task

  1. Compile and run stride.c (or the Rust version). Record ns/access for strides 1, 16, and 1024.
  2. Calculate: how many times slower is stride-1024 vs. stride-1?
  3. Run perf stat -e L1-dcache-load-misses ./stride and watch miss counts climb with stride.
  4. Challenge: Modify the program to use random access — allocate an array of random indices, access arr[indices[i]]. Compare ns/access to sequential. This is what pointer chasing looks like to the cache.

Next up: the actual instructions the CPU executes, and how your C and Rust code becomes assembly.