epoll: Scalable Event-Driven I/O

select and poll scan every file descriptor on every call. With 50,000 connections, most of them idle, you spend most of your CPU time checking fds that have nothing to report. epoll fixes this by maintaining a ready list inside the kernel. Only fds that actually have events appear in the results. The cost of epoll_wait scales with the number of ready fds, O(k), and is essentially independent of the total number of monitored fds.

This chapter builds a complete single-threaded event loop from scratch, covers level-triggered vs edge-triggered semantics, and connects to the Rust ecosystem through the nix and mio crates.

The Three epoll Calls

  epoll_create1(flags)          --> returns an epoll fd
  epoll_ctl(epfd, op, fd, ev)   --> add/modify/remove a watched fd
  epoll_wait(epfd, events, max, timeout) --> wait for ready fds
  Kernel                          Userspace
  +------------------+
  | epoll instance   |
  |  interest list:  |            epoll_ctl(ADD, fd=5)
  |   [fd=5, fd=9]   | <--------- epoll_ctl(ADD, fd=9)
  |                  |
  |  ready list:     |            epoll_wait() blocks...
  |   [fd=5]         | ---------> returns: fd=5 has EPOLLIN
  +------------------+

Only fd 5 is ready. The kernel does not scan fd 9 at all. With 50,000 fds and 3 ready, epoll_wait returns immediately with just those 3.

A Complete epoll Echo Server in C

/* epoll_server.c -- single-threaded echo server using epoll */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define MAX_EVENTS 64
#define BUF_SIZE   4096

static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

int main(void)
{
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_fd < 0) { perror("socket"); return 1; }

    int opt = 1;
    setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    struct sockaddr_in addr = {0};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(7878);

    if (bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind"); return 1;
    }
    if (listen(listen_fd, 128) < 0) {
        perror("listen"); return 1;
    }
    set_nonblocking(listen_fd);

    /* Create epoll instance */
    int epfd = epoll_create1(0);
    if (epfd < 0) { perror("epoll_create1"); return 1; }

    struct epoll_event ev;
    ev.events  = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    printf("epoll echo server on port 7878\n");

    for (;;) {
        int nready = epoll_wait(epfd, events, MAX_EVENTS, -1);
        if (nready < 0) {
            if (errno == EINTR) continue;
            perror("epoll_wait");
            break;
        }

        for (int i = 0; i < nready; i++) {
            int fd = events[i].data.fd;

            if (fd == listen_fd) {
                /* Accept all pending connections */
                for (;;) {
                    struct sockaddr_in client;
                    socklen_t clen = sizeof(client);
                    int conn = accept(listen_fd,
                                      (struct sockaddr *)&client, &clen);
                    if (conn < 0) {
                        if (errno == EAGAIN || errno == EWOULDBLOCK)
                            break;  /* no more pending */
                        perror("accept");
                        break;
                    }
                    set_nonblocking(conn);

                    ev.events  = EPOLLIN | EPOLLET;  /* edge-triggered */
                    ev.data.fd = conn;
                    epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);

                    char ip[INET_ADDRSTRLEN];
                    inet_ntop(AF_INET, &client.sin_addr, ip, sizeof(ip));
                    printf("+ %s:%d (fd %d)\n",
                           ip, ntohs(client.sin_port), conn);
                }
            } else {
                /* Data from a client (edge-triggered: drain fully) */
                char buf[BUF_SIZE];
                for (;;) {
                    ssize_t n = read(fd, buf, sizeof(buf));
                    if (n < 0) {
                        if (errno == EAGAIN || errno == EWOULDBLOCK)
                            break;  /* no more data right now */
                        /* Real error */
                        close(fd);
                        break;
                    }
                    if (n == 0) {
                        /* Client disconnected */
                        printf("- fd %d\n", fd);
                        close(fd);
                        break;
                    }
                    /* Echo back */
                    ssize_t written = 0;
                    while (written < n) {
                        ssize_t w = write(fd, buf + written, n - written);
                        if (w < 0) {
                            if (errno == EAGAIN) break;  /* simplification: leftover bytes are dropped */
                            close(fd);
                            goto next_event;
                        }
                        written += w;
                    }
                }
                next_event: ;
            }
        }
    }

    close(epfd);
    close(listen_fd);
    return 0;
}

Compile and run: gcc -o epoll_server epoll_server.c && ./epoll_server. Test with multiple nc 127.0.0.1 7878 sessions.

struct epoll_event and the data Union

struct epoll_event {
    uint32_t     events;   /* EPOLLIN, EPOLLOUT, EPOLLET, ... */
    epoll_data_t data;     /* user data returned with event   */
};

typedef union epoll_data {
    void    *ptr;     /* pointer to your own struct */
    int      fd;      /* file descriptor */
    uint32_t u32;
    uint64_t u64;
} epoll_data_t;

The data field is your tag. The kernel passes it back to you untouched in epoll_wait. Most simple servers use data.fd. Complex servers store a pointer to a connection struct:

struct connection {
    int fd;
    char read_buf[4096];
    size_t read_len;
    /* ... */
};

struct connection *conn = malloc(sizeof(*conn));
conn->fd = accepted_fd;

struct epoll_event ev;
ev.events  = EPOLLIN | EPOLLET;
ev.data.ptr = conn;
epoll_ctl(epfd, EPOLL_CTL_ADD, accepted_fd, &ev);

Then in the event loop: struct connection *c = events[i].data.ptr;

Level-Triggered vs Edge-Triggered

This is the most important distinction in epoll.

  Level-triggered (default):
  "Notify me AS LONG AS the fd is ready"

  Edge-triggered (EPOLLET):
  "Notify me ONCE WHEN the fd BECOMES ready"

  Scenario: 400 bytes arrive:    [####]

  Level-triggered:
    epoll_wait -> EPOLLIN   (data available)
    read(100 bytes)         (still 300 bytes left)
    epoll_wait -> EPOLLIN   (still ready -- data remains)
    read(300 bytes)
    epoll_wait -> blocks    (no more data)

  Edge-triggered:
    epoll_wait -> EPOLLIN   (data just arrived)
    read(100 bytes)         (still 300 bytes left)
    epoll_wait -> BLOCKS    (no NEW data arrived -- edge already fired)
    *** 300 bytes stranded until NEW data arrives ***

Caution: With edge-triggered mode, you MUST read until EAGAIN on every notification. If you stop reading early, the remaining data is stranded. The kernel will not notify you again until NEW data arrives.

The Edge-Triggered + Non-Blocking Pattern

This is the canonical pattern that all high-performance servers use:

  1. Set the fd to O_NONBLOCK
  2. Register with EPOLLET
  3. On EPOLLIN, loop read() until it returns EAGAIN
  4. On EPOLLOUT, loop write() until it returns EAGAIN

In code, step 3 looks like this drain loop:

  while (true) {
      n = read(fd, buf, sizeof(buf));
      if (n > 0) {
          process(buf, n);
          continue;
      }
      if (n < 0 && errno == EAGAIN) {
          break;   // <-- all data consumed, wait for next edge
      }
      if (n == 0) {
          close(fd);  // client disconnected
          break;
      }
      // n < 0 && errno != EAGAIN: real error
      close(fd);
      break;
  }

Why Edge-Triggered?

Level-triggered is simpler and less error-prone. So why bother with edge-triggered?

Thundering herd. If multiple threads each have their own epoll_wait on the same epoll fd (a common pattern), level-triggered wakes ALL of them when data arrives. Only one can read() successfully; the rest wake up for nothing. Edge-triggered fires only once, waking a single thread. (Since Linux 4.5, EPOLLEXCLUSIVE also limits these wake-ups without requiring edge-triggered mode.)

Efficiency. Level-triggered can cause redundant wake-ups. If you know you are going to drain the entire buffer anyway, edge-triggered avoids the kernel re-checking readiness on the next epoll_wait.

In practice, most applications start with level-triggered and switch to edge-triggered only when they need the performance.

EPOLLONESHOT

For multi-threaded servers where multiple threads call epoll_wait, EPOLLONESHOT disables the fd after one event fires. You must re-arm it with EPOLL_CTL_MOD after processing. This guarantees exactly one thread handles a given fd at a time.

/* Register with EPOLLONESHOT */
ev.events  = EPOLLIN | EPOLLET | EPOLLONESHOT;
ev.data.fd = conn_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, conn_fd, &ev);

/* After processing, re-arm */
ev.events  = EPOLLIN | EPOLLET | EPOLLONESHOT;
ev.data.fd = conn_fd;
epoll_ctl(epfd, EPOLL_CTL_MOD, conn_fd, &ev);

The Reactor Pattern

The event loop in the epoll server is an instance of the reactor pattern:

  +----------------------------+
  |       Event Loop           |
  |  +----------------------+  |
  |  |    epoll_wait()      |  |
  |  +----------+-----------+  |
  |             |              |
  |    +--------+--------+     |
  |    |                 |     |
  |  accept           read     |
  |  handler          handler  |
  |    |                 |     |
  |  register         process  |
  |  new fd           + reply  |
  +----------------------------+

The reactor:

  1. Waits for events (demultiplexing)
  2. Dispatches each event to a handler
  3. Handlers are non-blocking and complete quickly
  4. Returns to step 1

This single-threaded design handles thousands of connections with one thread: no locks, and no thread context switches between handlers.

In production, you use data.ptr to store per-connection state (read buffers, write queues, protocol state machines). The epoll echo server above uses data.fd for simplicity, but real servers like nginx, Redis, and memcached all use the pointer variant with handler dispatch. This is the skeleton every event-driven C server builds on.

Try It: Modify the epoll echo server to use data.ptr with a struct connection that includes a write buffer. When write() returns EAGAIN, store the remaining data and register for EPOLLOUT. When the fd becomes writable, flush the buffer and switch back to EPOLLIN.

Rust: epoll via the nix Crate

// epoll_server.rs -- event loop using nix::sys::epoll
// Cargo.toml: nix = { version = "0.29", features = ["epoll", "net", "fs"] }
use nix::sys::epoll::*;
use std::collections::HashMap;
use std::io::{self, Read, Write};
use std::net::{TcpListener, TcpStream};
use std::os::fd::{AsRawFd, RawFd};

fn set_nonblocking(stream: &TcpStream) {
    stream.set_nonblocking(true).expect("set_nonblocking");
}

fn main() -> io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:7878")?;
    listener.set_nonblocking(true)?;
    println!("Rust epoll server on port 7878");

    let epfd = Epoll::new(EpollCreateFlags::empty())
        .expect("epoll_create");

    let listen_fd = listener.as_raw_fd();
    epfd.add(
        &listener,
        EpollEvent::new(EpollFlags::EPOLLIN, listen_fd as u64),
    ).expect("epoll_add listener");

    let mut clients: HashMap<RawFd, TcpStream> = HashMap::new();
    let mut events = vec![EpollEvent::empty(); 64];

    loop {
        let n = epfd.wait(&mut events, EpollTimeout::NONE)
            .expect("epoll_wait");

        for i in 0..n {
            let fd = events[i].data() as RawFd;

            if fd == listen_fd {
                loop {
                    match listener.accept() {
                        Ok((stream, addr)) => {
                            println!("+ {}", addr);
                            set_nonblocking(&stream);
                            let raw = stream.as_raw_fd();
                            epfd.add(
                                &stream,
                                EpollEvent::new(
                                    EpollFlags::EPOLLIN | EpollFlags::EPOLLET,
                                    raw as u64,
                                ),
                            ).expect("epoll_add client");
                            clients.insert(raw, stream);
                        }
                        Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => {
                            break;
                        }
                        Err(e) => {
                            eprintln!("accept: {}", e);
                            break;
                        }
                    }
                }
            } else if let Some(stream) = clients.get_mut(&fd) {
                let mut buf = [0u8; 4096];
                let mut closed = false;
                loop {
                    match stream.read(&mut buf) {
                        Ok(0) => {
                            // Client disconnected
                            closed = true;
                            break;
                        }
                        Ok(n) => {
                            let _ = stream.write_all(&buf[..n]);
                        }
                        Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => {
                            break;
                        }
                        Err(_) => {
                            closed = true;
                            break;
                        }
                    }
                }
                // Remove only after the read loop ends: calling
                // clients.remove(&fd) while `stream` is still borrowed
                // would be a second mutable borrow and fail to compile.
                if closed {
                    println!("- fd {}", fd);
                    clients.remove(&fd);
                }
            }
        }
    }
}

Rust Note: When a TcpStream is removed from the HashMap, it is dropped, which closes the fd. The kernel automatically removes a closed fd from the epoll interest list (provided no duplicate of the fd remains open). No explicit EPOLL_CTL_DEL needed.

Rust: mio for Portable Event Loops

epoll is Linux-only. kqueue is the equivalent on macOS/BSD. The mio crate abstracts over both, providing a single API.

// mio_server.rs -- portable event loop with mio
// Cargo.toml: mio = { version = "1", features = ["net", "os-poll"] }
use mio::net::{TcpListener, TcpStream};
use mio::{Events, Interest, Poll, Token};
use std::collections::HashMap;
use std::io::{self, Read, Write};

const LISTENER: Token = Token(0);

fn main() -> io::Result<()> {
    let mut poll = Poll::new()?;
    let mut events = Events::with_capacity(128);

    let addr = "0.0.0.0:7878".parse().unwrap();
    let mut listener = TcpListener::bind(addr)?;
    poll.registry().register(
        &mut listener, LISTENER, Interest::READABLE)?;

    let mut clients: HashMap<Token, TcpStream> = HashMap::new();
    let mut next_token = 1usize;

    println!("mio server on port 7878");

    loop {
        poll.poll(&mut events, None)?;

        for event in events.iter() {
            match event.token() {
                LISTENER => {
                    loop {
                        match listener.accept() {
                            Ok((mut stream, addr)) => {
                                let token = Token(next_token);
                                next_token += 1;
                                poll.registry().register(
                                    &mut stream,
                                    token,
                                    Interest::READABLE,
                                )?;
                                println!("+ {} (token {})", addr, token.0);
                                clients.insert(token, stream);
                            }
                            Err(ref e)
                                if e.kind() == io::ErrorKind::WouldBlock =>
                            {
                                break;
                            }
                            Err(e) => return Err(e),
                        }
                    }
                }
                token => {
                    let done = if let Some(stream) = clients.get_mut(&token) {
                        let mut buf = [0u8; 4096];
                        let mut closed = false;
                        loop {
                            match stream.read(&mut buf) {
                                Ok(0) => { closed = true; break; }
                                Ok(n) => {
                                    let _ = stream.write_all(&buf[..n]);
                                }
                                Err(ref e)
                                    if e.kind() == io::ErrorKind::WouldBlock =>
                                {
                                    break;
                                }
                                Err(_) => { closed = true; break; }
                            }
                        }
                        closed
                    } else {
                        false
                    };

                    if done {
                        if let Some(mut stream) = clients.remove(&token) {
                            poll.registry().deregister(&mut stream)?;
                            println!("- token {}", token.0);
                        }
                    }
                }
            }
        }
    }
}

Connection to tokio

tokio is Rust's most popular async runtime. Under the hood, it is an epoll (Linux) / kqueue (macOS) event loop built on mio. When you write:

// Conceptual -- requires a tokio runtime driving the task
async fn handle(mut stream: tokio::net::TcpStream) {
    let (mut reader, mut writer) = stream.split();
    tokio::io::copy(&mut reader, &mut writer).await.unwrap();
}

...the .await suspends the task. tokio's reactor (an epoll event loop) resumes it when data arrives. There is no thread per connection, no manual epoll_ctl -- the async/await syntax hides the event loop plumbing you built in this chapter.

  Your code (this chapter):          tokio (same thing, hidden):
  +---------------------------+      +---------------------------+
  | epoll_wait()              |      | runtime.block_on(...)     |
  | -> fd ready               |      | -> task wakes up          |
  | -> call handler(fd)       |      | -> resume .await          |
  | -> handler does read/write|      | -> async fn does I/O      |
  | -> back to epoll_wait     |      | -> task yields at .await  |
  +---------------------------+      +---------------------------+

Driver Prep: The Linux kernel uses a similar event-driven model internally. The waitqueue mechanism wakes sleeping tasks when events occur. Kernel threads and work queues are the kernel's equivalent of the reactor pattern. Understanding epoll deeply prepares you for kernel-level event handling.

Performance Comparison

  10,000 idle connections, 100 active per second:

  select:   scans 10,000 fds per call       ~10,000 operations/call
  poll:     scans 10,000 pollfds per call    ~10,000 operations/call
  epoll:    returns only ~100 ready fds      ~100 operations/call

  At 100,000 connections: select/poll grind to a halt.
  epoll: barely notices.

This is why nginx, Redis, Node.js (libuv), and every modern event-driven server uses epoll on Linux.

Knowledge Check

  1. What is the difference between epoll_create1 and the older epoll_create?
  2. In edge-triggered mode, what happens if you read only part of the available data?
  3. Why does the reactor pattern avoid the need for mutexes?

Common Pitfalls

  • Edge-triggered without draining -- the most common epoll bug. Read until EAGAIN or you will lose data.
  • Forgetting O_NONBLOCK -- edge-triggered epoll with blocking fds hangs. A read() call blocks when there is no data, and you will never get another notification because the edge already fired.
  • Stale pointers in data.ptr -- if you free a connection struct but forget to remove the fd from epoll, the next event delivers a dangling pointer. Use-after-free.
  • EPOLL_CTL_DEL on a closed fd -- closing the fd automatically removes it from epoll (if it is the last reference). Calling EPOLL_CTL_DEL after close() returns EBADF. Close last, or skip the explicit delete.
  • Using EPOLLONESHOT without re-arming -- the fd goes silent forever. Every event handler must call EPOLL_CTL_MOD to re-enable.
  • Assuming portability -- epoll is Linux-only. Use kqueue on BSD/macOS, IOCP on Windows, or a library like mio or libuv for cross-platform code.