epoll: Scalable Event-Driven I/O
select and poll scan every file descriptor on every call. With 50,000 connections, most of them idle, you spend most of your CPU time checking fds that have nothing to report. epoll fixes this by maintaining a ready list inside the kernel: only fds that actually have events appear in the results. The cost of epoll_wait is independent of the total number of monitored fds -- it is O(k) in the number k of ready fds, no matter how many fds are being watched.
This chapter builds a complete single-threaded event loop from scratch, covers level-triggered vs edge-triggered semantics, and connects to the Rust ecosystem through the nix and mio crates.
The Three epoll Calls
epoll_create1(flags) --> returns an epoll fd
epoll_ctl(epfd, op, fd, ev) --> add/modify/remove a watched fd
epoll_wait(epfd, events, max, timeout) --> wait for ready fds
Kernel                          Userspace
+------------------+
| epoll instance |
| interest list: | epoll_ctl(ADD, fd=5)
| [fd=5, fd=9] | <--------- epoll_ctl(ADD, fd=9)
| |
| ready list: | epoll_wait() blocks...
| [fd=5] | ---------> returns: fd=5 has EPOLLIN
+------------------+
Only fd 5 is ready. The kernel does not scan fd 9 at all. With 50,000 fds and 3 ready, epoll_wait returns immediately with just those 3.
A Complete epoll Echo Server in C
/* epoll_server.c -- single-threaded echo server using epoll */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#define MAX_EVENTS 64
#define BUF_SIZE 4096
static int set_nonblocking(int fd)
{
int flags = fcntl(fd, F_GETFL, 0);
if (flags < 0) return -1;
return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
int main(void)
{
int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
if (listen_fd < 0) { perror("socket"); return 1; }
int opt = 1;
setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
struct sockaddr_in addr = {0};
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = htonl(INADDR_ANY);
addr.sin_port = htons(7878);
if (bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
perror("bind"); return 1;
}
if (listen(listen_fd, 128) < 0) {
perror("listen"); return 1;
}
set_nonblocking(listen_fd);
/* Create epoll instance */
int epfd = epoll_create1(0);
if (epfd < 0) { perror("epoll_create1"); return 1; }
struct epoll_event ev;
ev.events = EPOLLIN;
ev.data.fd = listen_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
struct epoll_event events[MAX_EVENTS];
printf("epoll echo server on port 7878\n");
for (;;) {
int nready = epoll_wait(epfd, events, MAX_EVENTS, -1);
if (nready < 0) {
if (errno == EINTR) continue;
perror("epoll_wait");
break;
}
for (int i = 0; i < nready; i++) {
int fd = events[i].data.fd;
if (fd == listen_fd) {
/* Accept all pending connections */
for (;;) {
struct sockaddr_in client;
socklen_t clen = sizeof(client);
int conn = accept(listen_fd,
(struct sockaddr *)&client, &clen);
if (conn < 0) {
if (errno == EAGAIN || errno == EWOULDBLOCK)
break; /* no more pending */
perror("accept");
break;
}
set_nonblocking(conn);
ev.events = EPOLLIN | EPOLLET; /* edge-triggered */
ev.data.fd = conn;
epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
char ip[INET_ADDRSTRLEN];
inet_ntop(AF_INET, &client.sin_addr, ip, sizeof(ip));
printf("+ %s:%d (fd %d)\n",
ip, ntohs(client.sin_port), conn);
}
} else {
/* Data from a client (edge-triggered: drain fully) */
char buf[BUF_SIZE];
for (;;) {
ssize_t n = read(fd, buf, sizeof(buf));
if (n < 0) {
if (errno == EAGAIN || errno == EWOULDBLOCK)
break; /* no more data right now */
/* Real error */
close(fd);
break;
}
if (n == 0) {
/* Client disconnected */
printf("- fd %d\n", fd);
close(fd);
break;
}
/* Echo back */
ssize_t written = 0;
while (written < n) {
ssize_t w = write(fd, buf + written, n - written);
if (w < 0) {
if (errno == EAGAIN) break;
close(fd);
goto next_event;
}
written += w;
}
}
next_event: ;
}
}
}
close(epfd);
close(listen_fd);
return 0;
}
Compile and run: gcc -o epoll_server epoll_server.c && ./epoll_server. Test with multiple nc 127.0.0.1 7878 sessions.
struct epoll_event and the data Union
struct epoll_event {
uint32_t events; /* EPOLLIN, EPOLLOUT, EPOLLET, ... */
epoll_data_t data; /* user data returned with event */
};
typedef union epoll_data {
void *ptr; /* pointer to your own struct */
int fd; /* file descriptor */
uint32_t u32;
uint64_t u64;
} epoll_data_t;
The data field is your tag. The kernel passes it back to you untouched in epoll_wait. Most simple servers use data.fd. Complex servers store a pointer to a connection struct:
struct connection {
int fd;
char read_buf[4096];
size_t read_len;
/* ... */
};
struct connection *conn = malloc(sizeof(*conn));
conn->fd = accepted_fd;
struct epoll_event ev;
ev.events = EPOLLIN | EPOLLET;
ev.data.ptr = conn;
epoll_ctl(epfd, EPOLL_CTL_ADD, accepted_fd, &ev);
Then in the event loop: struct connection *c = events[i].data.ptr;
Level-Triggered vs Edge-Triggered
This is the most important distinction in epoll.
Level-triggered (default):
"Notify me AS LONG AS the fd is ready"
Edge-triggered (EPOLLET):
"Notify me ONCE WHEN the fd BECOMES ready"
Data arrives: [####]
Level-triggered:
epoll_wait -> EPOLLIN (data available)
read(100 bytes) (still 300 bytes left)
epoll_wait -> EPOLLIN (still ready -- data remains)
read(300 bytes)
epoll_wait -> blocks (no more data)
Edge-triggered:
epoll_wait -> EPOLLIN (data just arrived)
read(100 bytes) (still 300 bytes left)
epoll_wait -> BLOCKS (no NEW data arrived -- edge already fired)
*** 300 bytes stranded in the buffer until new data arrives ***
Caution: With edge-triggered mode, you MUST read until EAGAIN on every notification. If you stop reading early, the remaining data is stranded. The kernel will not notify you again until NEW data arrives.
The Edge-Triggered + Non-Blocking Pattern
This is the canonical pattern that all high-performance servers use:
- Set the fd to O_NONBLOCK
- Register with EPOLLET
- On EPOLLIN, loop read() until it returns EAGAIN
- On EPOLLOUT, loop write() until it returns EAGAIN
for (;;) {
n = read(fd, buf, sizeof(buf));
if (n > 0) {
process(buf, n);
continue;
}
if (n < 0 && errno == EAGAIN) {
break; // <-- all data consumed, wait for next edge
}
if (n == 0) {
close(fd); // client disconnected
break;
}
// n < 0 && errno != EAGAIN: real error
close(fd);
break;
}
Why Edge-Triggered?
Level-triggered is simpler and less error-prone. So why bother with edge-triggered?
Thundering herd. If multiple threads each have their own epoll_wait on the same epoll fd (a common pattern), level-triggered wakes ALL of them when data arrives. Only one can read() successfully; the rest wake up for nothing. Edge-triggered delivers the event once, so typically only a single thread wakes. (Since Linux 4.5, EPOLLEXCLUSIVE addresses the same problem without requiring edge-triggered semantics.)
Efficiency. Level-triggered can cause redundant wake-ups. If you know you are going to drain the entire buffer anyway, edge-triggered avoids the kernel re-checking readiness on the next epoll_wait.
In practice, most applications start with level-triggered and switch to edge-triggered only when they need the performance.
EPOLLONESHOT
For multi-threaded servers where multiple threads call epoll_wait, EPOLLONESHOT disables the fd after one event fires. You must re-arm it with EPOLL_CTL_MOD after processing. This guarantees exactly one thread handles a given fd at a time.
/* Register with EPOLLONESHOT */
ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT;
ev.data.fd = conn_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, conn_fd, &ev);
/* After processing, re-arm */
ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT;
ev.data.fd = conn_fd;
epoll_ctl(epfd, EPOLL_CTL_MOD, conn_fd, &ev);
The Reactor Pattern
The event loop in the epoll server is an instance of the reactor pattern:
+----------------------------+
| Event Loop |
| +----------------------+ |
| | epoll_wait() | |
| +----------+-----------+ |
| | |
| +--------+--------+ |
| | | |
| accept read |
| handler handler |
| | | |
| register process |
| new fd + reply |
+----------------------------+
The reactor:
- Waits for events (demultiplexing)
- Dispatches each event to a handler
- Handlers are non-blocking and complete quickly
- Returns to step 1
This single-threaded design handles thousands of connections with one thread, no locks, and no thread-to-thread context switches.
In production, you use data.ptr to store per-connection state (read buffers, write queues, protocol state machines). The epoll echo server above uses data.fd for simplicity, but real servers like nginx, Redis, and memcached all use the pointer variant with handler dispatch. This is the skeleton every event-driven C server builds on.
Try It: Modify the epoll echo server to use data.ptr with a struct connection that includes a write buffer. When write() returns EAGAIN, store the remaining data and register for EPOLLOUT. When the fd becomes writable, flush the buffer and switch back to EPOLLIN.
Rust: epoll via the nix Crate
// epoll_server.rs -- event loop using nix::sys::epoll
// Cargo.toml: nix = { version = "0.29", features = ["epoll", "net", "fs"] }
use nix::sys::epoll::*;
use std::collections::HashMap;
use std::io::{self, Read, Write};
use std::net::{TcpListener, TcpStream};
use std::os::fd::{AsRawFd, RawFd};

fn set_nonblocking(stream: &TcpStream) {
    stream.set_nonblocking(true).expect("set_nonblocking");
}

fn main() -> io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:7878")?;
    listener.set_nonblocking(true)?;
    println!("Rust epoll server on port 7878");

    let epfd = Epoll::new(EpollCreateFlags::empty()).expect("epoll_create");
    let listen_fd = listener.as_raw_fd();
    epfd.add(
        &listener,
        EpollEvent::new(EpollFlags::EPOLLIN, listen_fd as u64),
    )
    .expect("epoll_add listener");

    let mut clients: HashMap<RawFd, TcpStream> = HashMap::new();
    let mut events = vec![EpollEvent::empty(); 64];

    loop {
        // EpollTimeout::NONE means no timeout: block until an event arrives
        let n = epfd.wait(&mut events, EpollTimeout::NONE).expect("epoll_wait");
        for i in 0..n {
            let fd = events[i].data() as RawFd;
            if fd == listen_fd {
                loop {
                    match listener.accept() {
                        Ok((stream, addr)) => {
                            println!("+ {}", addr);
                            set_nonblocking(&stream);
                            let raw = stream.as_raw_fd();
                            epfd.add(
                                &stream,
                                EpollEvent::new(
                                    EpollFlags::EPOLLIN | EpollFlags::EPOLLET,
                                    raw as u64,
                                ),
                            )
                            .expect("epoll_add client");
                            clients.insert(raw, stream);
                        }
                        Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => break,
                        Err(e) => {
                            eprintln!("accept: {}", e);
                            break;
                        }
                    }
                }
            } else if let Some(stream) = clients.get_mut(&fd) {
                let mut buf = [0u8; 4096];
                // Drain fully (edge-triggered); defer removal until the
                // mutable borrow of `clients` ends, or the borrow checker
                // rejects the call to clients.remove().
                let mut closed = false;
                loop {
                    match stream.read(&mut buf) {
                        Ok(0) => {
                            closed = true;
                            break;
                        }
                        Ok(n) => {
                            let _ = stream.write_all(&buf[..n]);
                        }
                        Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => break,
                        Err(_) => {
                            closed = true;
                            break;
                        }
                    }
                }
                if closed {
                    println!("- fd {}", fd);
                    clients.remove(&fd);
                }
            }
        }
    }
}
Rust Note: When a TcpStream is removed from the HashMap, it is dropped, which closes the fd. The kernel automatically removes a closed fd from the epoll interest list, so no explicit EPOLL_CTL_DEL is needed.
Rust: mio for Portable Event Loops
epoll is Linux-only. kqueue is the equivalent on macOS/BSD. The mio crate abstracts over both, providing a single API.
// mio_server.rs -- portable event loop with mio
// Cargo.toml: mio = { version = "1", features = ["net", "os-poll"] }
use mio::net::{TcpListener, TcpStream};
use mio::{Events, Interest, Poll, Token};
use std::collections::HashMap;
use std::io::{self, Read, Write};

const LISTENER: Token = Token(0);

fn main() -> io::Result<()> {
    let mut poll = Poll::new()?;
    let mut events = Events::with_capacity(128);
    let addr = "0.0.0.0:7878".parse().unwrap();
    let mut listener = TcpListener::bind(addr)?;
    poll.registry()
        .register(&mut listener, LISTENER, Interest::READABLE)?;

    let mut clients: HashMap<Token, TcpStream> = HashMap::new();
    let mut next_token = 1usize;
    println!("mio server on port 7878");

    loop {
        poll.poll(&mut events, None)?;
        for event in events.iter() {
            match event.token() {
                LISTENER => {
                    loop {
                        match listener.accept() {
                            Ok((mut stream, addr)) => {
                                let token = Token(next_token);
                                next_token += 1;
                                poll.registry().register(
                                    &mut stream,
                                    token,
                                    Interest::READABLE,
                                )?;
                                println!("+ {} (token {})", addr, token.0);
                                clients.insert(token, stream);
                            }
                            Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => {
                                break;
                            }
                            Err(e) => return Err(e),
                        }
                    }
                }
                token => {
                    let done = if let Some(stream) = clients.get_mut(&token) {
                        let mut buf = [0u8; 4096];
                        let mut closed = false;
                        loop {
                            match stream.read(&mut buf) {
                                Ok(0) => {
                                    closed = true;
                                    break;
                                }
                                Ok(n) => {
                                    let _ = stream.write_all(&buf[..n]);
                                }
                                Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => {
                                    break;
                                }
                                Err(_) => {
                                    closed = true;
                                    break;
                                }
                            }
                        }
                        closed
                    } else {
                        false
                    };
                    if done {
                        if let Some(mut stream) = clients.remove(&token) {
                            poll.registry().deregister(&mut stream)?;
                            println!("- token {}", token.0);
                        }
                    }
                }
            }
        }
    }
}
Connection to tokio
tokio is Rust's most popular async runtime. Under the hood, it is an epoll (Linux) / kqueue (macOS) event loop built on mio. When you write:
// Conceptual -- requires tokio runtime setup
async fn handle(mut stream: tokio::net::TcpStream) {
    let (mut reader, mut writer) = stream.split();
    tokio::io::copy(&mut reader, &mut writer).await.unwrap();
}
...the .await suspends the task. tokio's reactor (an epoll event loop) resumes it when data arrives. There is no thread per connection, no manual epoll_ctl -- the async/await syntax hides the event loop plumbing you built in this chapter.
Your code (this chapter): tokio (same thing, hidden):
+---------------------------+ +---------------------------+
| epoll_wait() | | runtime.block_on(...) |
| -> fd ready | | -> task wakes up |
| -> call handler(fd) | | -> resume .await |
| -> handler does read/write| | -> async fn does I/O |
| -> back to epoll_wait | | -> task yields at .await |
+---------------------------+ +---------------------------+
Driver Prep: The Linux kernel uses a similar event-driven model internally. The wait queue mechanism wakes sleeping tasks when events occur, and kernel threads and work queues are the kernel's equivalent of the reactor pattern. Understanding epoll deeply prepares you for kernel-level event handling.
Performance Comparison
10,000 idle connections, 100 active per second:
select: scans 10,000 fds per call ~10,000 operations/call
poll: scans 10,000 pollfds per call ~10,000 operations/call
epoll: returns only ~100 ready fds ~100 operations/call
At 100,000 connections: select/poll grind to a halt.
epoll: barely notices.
This is why nginx, Redis, Node.js (libuv), and every modern event-driven server uses epoll on Linux.
Knowledge Check
- What is the difference between epoll_create1 and the older epoll_create?
- In edge-triggered mode, what happens if you read only part of the available data?
- Why does the reactor pattern avoid the need for mutexes?
Common Pitfalls
- Edge-triggered without draining -- the most common epoll bug. Read until EAGAIN or the remaining data is stranded until new data arrives.
- Forgetting O_NONBLOCK -- edge-triggered epoll with blocking fds deadlocks. A read() call blocks when there is no data, and you will never get another notification because the edge already fired.
- Stale pointers in data.ptr -- if you free a connection struct but forget to remove the fd from epoll, the next event delivers a dangling pointer: use-after-free.
- EPOLL_CTL_DEL on a closed fd -- closing the fd automatically removes it from epoll (if it is the last reference). Calling EPOLL_CTL_DEL after close() returns EBADF. Delete first and close last, or skip the explicit delete.
- Using EPOLLONESHOT without re-arming -- the fd goes silent forever. Every event handler must call EPOLL_CTL_MOD to re-enable it.
- Assuming portability -- epoll is Linux-only. Use kqueue on BSD/macOS, IOCP on Windows, or a library like mio or libuv for cross-platform code.