epoll and io_uring, explained visually

Concurrency is not a language feature.

It's a kernel feature.

async/await in JS, fibers in Ruby, goroutines in Go — all of them are pretty facades over two or three Linux syscalls nobody wants to study.

Today we study them.

The starting point: blocking IO

Before any "magic", there was this:

thread ──► read(fd) ──► wait ──► wait ──► data arrives ──► returns
                                              │
                                       thread frozen the whole time

One thread per connection. read() blocks until data arrives.

10 clients? 10 threads. 10,000 clients? 10,000 threads.

Each thread reserves stack (with glibc the default follows ulimit -s, typically 8MB of virtual address space; ~2MB on some setups), pressures the scheduler, pollutes CPU cache.

client 1 ──► thread 1 ──► read() blocked
client 2 ──► thread 2 ──► read() blocked
client 3 ──► thread 3 ──► read() blocked
...
client N ──► thread N ──► read() blocked

Doesn't scale. Period.
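To put a rough number on the stack part of that cost, here is a minimal sketch. stack_reservation_bytes is a hypothetical helper, and the 2 MiB fallback for an unlimited ulimit is an assumption that matches glibc on x86-64; other libcs and architectures differ.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/resource.h>

/* Hypothetical helper: address space reserved for stacks if each of
 * `threads` threads gets the default stack size. glibc derives that
 * default from the RLIMIT_STACK soft limit (what `ulimit -s` shows). */
static uint64_t stack_reservation_bytes(uint64_t threads) {
    struct rlimit rl;
    uint64_t per_thread = 2u * 1024 * 1024;   /* assumed fallback: 2 MiB */

    if (getrlimit(RLIMIT_STACK, &rl) == 0 && rl.rlim_cur != RLIM_INFINITY)
        per_thread = (uint64_t)rl.rlim_cur;
    assert(per_thread > 0);
    return threads * per_thread;
}
```

With an 8 MiB limit, stack_reservation_bytes(10000) comes out at roughly 78 GiB. That is reserved address space, not resident RAM (pages are only backed once touched), but the scheduler and cache pressure are real either way.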

First try: select

The first idea was: what if one thread could watch many sockets?

select():

fd_set readfds;
FD_ZERO(&readfds);
FD_SET(fd1, &readfds);
FD_SET(fd2, &readfds);
// ...
select(maxfd+1, &readfds, NULL, NULL, NULL);
// returns saying: "one of these is ready"
// you find out WHICH by iterating all of them

Visually:

   single thread
        │
        ├── watches fd1
        ├── watches fd2
        ├── watches fd3
        └── watches fdN
              │
        select() blocks until any becomes ready
              │
        returns ──► you iterate ALL to find out which

Obvious problems:

1024 fd limit (fixed bitmap)
every call you hand the entire bitmap to the kernel
every call the kernel scans every fd
every call you scan every fd to find who's ready

O(n) in the call. O(n) in the response. Every call.

With 10,000 connections, a tragedy.
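Both O(n) passes are easy to see once the snippet above is wrapped in a function. A minimal sketch; first_readable_select is a hypothetical name, and it only watches for readability.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Blocks until at least one of fds[0..n-1] is readable, then returns the
 * index of the first ready one, or -1 on error. */
static int first_readable_select(const int *fds, int n) {
    fd_set readfds;
    int maxfd = -1;

    FD_ZERO(&readfds);
    for (int i = 0; i < n; i++) {            /* O(n): rebuild the bitmap */
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);   /* the 1024 wall */
        FD_SET(fds[i], &readfds);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }
    /* select() overwrites readfds, so it must be rebuilt before every call. */
    if (select(maxfd + 1, &readfds, NULL, NULL, NULL) <= 0)
        return -1;
    for (int i = 0; i < n; i++)              /* O(n): scan for the winner */
        if (FD_ISSET(fds[i], &readfds))
            return i;
    return -1;
}
```

Call it in a loop and you pay both loops on every iteration, no matter how few fds actually have data.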

Second try: poll

poll() removed the 1024 limit, using an array instead of a bitmap:

struct pollfd fds[N];
fds[0].fd = sock1; fds[0].events = POLLIN;
fds[1].fd = sock2; fds[1].events = POLLIN;
// ...
poll(fds, N, timeout);

But the fundamental problem remains: every call you pass the whole list. Every call the kernel scans everything.

userspace                    kernel
┌──────────────────┐         ┌──────────────────┐
│ array of N fds   │ ──────► │ copies, scans N  │
│                  │ ◄────── │ returns flags    │
└──────────────────┘         └──────────────────┘
   you scan N again
   to find out who's ready

Still O(n). Still doesn't scale.
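The same shape with poll(), as a minimal sketch. readable_poll and MAX_FDS are hypothetical names; the cap of 64 is an arbitrary choice for this example, not a poll() limit.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

enum { MAX_FDS = 64 };   /* arbitrary cap for this sketch */

/* Sets ready[i] to 1 for each readable fd and returns how many are ready,
 * or -1 on error. timeout_ms follows poll(): -1 blocks, 0 returns at once. */
static int readable_poll(const int *fds, int n, int timeout_ms, int *ready) {
    struct pollfd pfds[MAX_FDS];
    int count = 0;

    assert(n >= 0 && n <= MAX_FDS);
    for (int i = 0; i < n; i++) {            /* O(n): build the array */
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }
    if (poll(pfds, (nfds_t)n, timeout_ms) < 0)
        return -1;
    for (int i = 0; i < n; i++) {            /* O(n): check every revents */
        ready[i] = (pfds[i].revents & POLLIN) != 0;
        count += ready[i];
    }
    return count;
}
```

The 1024 ceiling is gone, but the two loops are exactly where they were.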

The revolution: epoll

In 2002, the Linux kernel got the abstraction that runs the modern internet: epoll.

The core idea:

The kernel keeps the list of fds. You only say who joins and who leaves. And it only gives back the ones that actually became ready.

Three syscalls:

int epfd = epoll_create1(0);              // create the "list" in the kernel

struct epoll_event ev;
ev.events = EPOLLIN;
ev.data.fd = sock;
epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);  // add one fd

struct epoll_event events[64];
int n = epoll_wait(epfd, events, 64, -1);    // wait for ready ones
// returns ONLY the ready ones

Visually:

                  ┌──────────────────────────┐
                  │  kernel: epoll instance  │
                  │  ┌────────────────────┐  │
   epoll_ctl ───► │  │  fd1 fd2 fd3 ...   │  │ ◄── add/remove
                  │  └────────────────────┘  │
                  │                          │
                  │   ┌──── ready list ───┐  │
   epoll_wait ──► │   │  fd2 fd7          │  │ ──► returns ONLY ready ones
                  │   └───────────────────┘  │
                  └──────────────────────────┘

The fd list lives in the kernel. You don't ship it back and forth. You just register changes.

When something becomes ready, it goes into the internal ready list. epoll_wait hands you just that.

O(1) per ready event. Not O(n) per call: the cost tracks how many fds are ready, not how many you watch.

That's why nginx, redis, node, Java NIO, puma (in async IO mode), all sit on top of epoll when they run on Linux. The entire modern internet revolves around it.
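Put together, the register-once, wait-many pattern looks roughly like this. A minimal sketch: epoll_watch_all and epoll_ready_fds are hypothetical helper names, and only readability is watched.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Registers fds[0..n-1] for readability on a fresh epoll instance.
 * Returns the epoll fd, or -1 on error. Done once, not per wait. */
static int epoll_watch_all(const int *fds, int n) {
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;
    for (int i = 0; i < n; i++) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i] };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
            close(epfd);
            return -1;
        }
    }
    return epfd;
}

/* Copies the ready fds (at most max) into out[] and returns how many
 * there were, or -1 on error. Nothing here loops over the watched set. */
static int epoll_ready_fds(int epfd, int *out, int max, int timeout_ms) {
    struct epoll_event events[64];

    assert(max > 0 && max <= 64);
    int n = epoll_wait(epfd, events, max, timeout_ms);
    for (int i = 0; i < n; i++)
        out[i] = events[i].data.fd;
    return n;
}
```

Watch 10,000 fds with one of them ready and the loop in epoll_ready_fds runs once.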

But epoll still has syscalls

Each epoll_wait is a syscall. Each read() after that is another syscall. Each write() another.

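Syscalls aren't free. Each one is a privilege mode switch into the kernel and back, with registers to save and restore and, on CPUs running speculative-execution mitigations, extra work on top. It's cheaper than a full context switch between threads, but on a server doing millions of req/s, syscall cost starts showing up in the profile.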
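You can measure that floor yourself. A crude sketch, assuming Linux and a hypothetical helper name ns_per_syscall; it times a syscall that does almost nothing, so what's left is mostly the entry and exit cost.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average wall-clock nanoseconds per trivial syscall over `iters` calls. */
static double ns_per_syscall(long iters) {
    struct timespec t0, t1;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);     /* raw syscall, so libc cannot shortcut it */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

The number depends heavily on CPU, kernel version, mitigation settings and whether you're inside a sandbox, so treat it as a local measurement, not a constant. Multiply it by your syscalls per second to see whether it matters for you.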

And more: epoll tells you "you can read". You still need to call read afterward. This is what's called a readiness model:

1. epoll_wait ──► "fd2 is ready to read"
2. read(fd2)  ──► pull the bytes
3. process
4. epoll_wait again

Each cycle: one epoll_wait plus at least one read() per ready fd. 2+ syscalls. Always.

If you want a million reads, that's a million syscalls. Minimum.
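Counting them makes the point. A minimal sketch of one turn of that loop; readiness_turn is a hypothetical name, and the "processing" step is left out.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* One turn of a readiness loop: wait, then read() each ready fd once.
 * Returns the number of syscalls this turn issued, or -1 on error. */
static int readiness_turn(int epfd, int timeout_ms) {
    struct epoll_event events[64];
    char buf[4096];
    int syscalls = 1;                        /* the epoll_wait itself */

    assert(epfd >= 0);
    int n = epoll_wait(epfd, events, 64, timeout_ms);
    if (n < 0)
        return -1;
    for (int i = 0; i < n; i++) {
        ssize_t got = read(events[i].data.fd, buf, sizeof buf);
        (void)got;                           /* processing would go here */
        syscalls++;                          /* one read() per ready fd */
    }
    return syscalls;
}
```

Three ready fds cost four syscalls. Batching in epoll_wait amortizes the wait, but never the reads.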

The new era: io_uring

In 2019, Jens Axboe ships io_uring. Not epoll v2. A completely different model.

Instead of readiness ("tell me when I can read"), io_uring is completion ("do this and tell me when it's done"):

epoll (readiness):                io_uring (completion):

"tell me when reading works"      "read this. tell me when done."
       │                                  │
"ok, it works now"                 "ok, here are your bytes"
       │
"call read()"                      (no separate read needed)
       │
"now call again"                   (no extra syscall needed)

How it works: two ring buffers shared between userspace and kernel:

   ┌─────────────────────── userspace ─────────────────────────┐
   │                                                            │
   │   ┌─ Submission Queue (SQ) ─┐    ┌─ Completion Queue (CQ) ┐│
   │   │   read fd1 offset 0     │    │  fd1: 1024 bytes ok    ││
   │   │   write fd2 buf X       │    │  fd2: write done       ││
   │   │   accept listen_fd      │    │  listen_fd: new conn   ││
   │   │   ...                   │    │  ...                   ││
   │   └─────────────┬───────────┘    └────────────▲──────────┘│
   │                 │                              │           │
   └─────────────────┼──────────────────────────────┼───────────┘
                     │     memory-mapped, shared    │
   ┌─────────────────▼──────────────────────────────┼───────────┐
   │   kernel reads SQ                       kernel writes CQ   │
   │   executes operations                   with results       │
   └────────────────────────────────────────────────────────────┘

You write operations to the SQ. The kernel reads them, executes, and writes results to the CQ. You read from the CQ.
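The ring mechanics themselves are small. This is a toy model in plain C, not the io_uring ABI: toy_ring, toy_push and toy_pop are made-up names, and it is single-threaded, so it skips the memory barriers a ring shared with the kernel needs.

```c
#include <assert.h>
#include <stdint.h>

enum { RING_ENTRIES = 8 };
static_assert((RING_ENTRIES & (RING_ENTRIES - 1)) == 0,
              "ring size must be a power of two so masking works");

struct toy_ring {
    uint32_t head;                 /* advanced only by the consumer */
    uint32_t tail;                 /* advanced only by the producer */
    int      slots[RING_ENTRIES];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int toy_push(struct toy_ring *r, int value) {
    if (r->tail - r->head == RING_ENTRIES)
        return -1;
    r->slots[r->tail & (RING_ENTRIES - 1)] = value;
    r->tail++;                     /* publish only after the slot is written */
    return 0;
}

/* Consumer side: returns 0 and stores the value, -1 if the ring is empty. */
static int toy_pop(struct toy_ring *r, int *value) {
    if (r->head == r->tail)
        return -1;
    *value = r->slots[r->head & (RING_ENTRIES - 1)];
    r->head++;                     /* hand the slot back to the producer */
    return 0;
}
```

For the SQ, userspace is the producer and the kernel the consumer. For the CQ, the roles flip. Each side only ever writes its own index, which is why no lock is needed.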

No syscall per operation: one io_uring_enter() can submit a whole batch and collect completions in the same call. The queues live in shared, memory-mapped memory.
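Here is what one read looks like against the real interface, using raw syscalls instead of liburing. A sketch under assumptions: uring_read_once is a hypothetical helper, it needs Linux headers that ship linux/io_uring.h, and it builds and tears down a ring for a single operation, which no real program would do. If the kernel or sandbox refuses io_uring, it returns a negative value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/io_uring.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Reads up to len bytes from offset 0 of fd through io_uring. Returns the
 * byte count, or a negative value if io_uring is unavailable or fails. */
static int uring_read_once(int fd, void *buf, unsigned len) {
    struct io_uring_params p;
    memset(&p, 0, sizeof p);
    int ring = (int)syscall(__NR_io_uring_setup, 4, &p);
    if (ring < 0)
        return -1;                 /* old kernel, seccomp filter, sysctl... */

    /* Map the SQ ring, the CQ ring and the SQE array into our address space. */
    size_t sq_len  = p.sq_off.array + p.sq_entries * sizeof(uint32_t);
    size_t cq_len  = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
    size_t sqe_len = p.sq_entries * sizeof(struct io_uring_sqe);
    int prot = PROT_READ | PROT_WRITE, flags = MAP_SHARED | MAP_POPULATE;
    uint8_t *sq = mmap(NULL, sq_len, prot, flags, ring, IORING_OFF_SQ_RING);
    uint8_t *cq = mmap(NULL, cq_len, prot, flags, ring, IORING_OFF_CQ_RING);
    struct io_uring_sqe *sqes =
        mmap(NULL, sqe_len, prot, flags, ring, IORING_OFF_SQES);
    int result = -1;
    if (sq == MAP_FAILED || cq == MAP_FAILED || (void *)sqes == MAP_FAILED)
        goto out;

    uint32_t *sq_tail  = (uint32_t *)(sq + p.sq_off.tail);
    uint32_t *sq_mask  = (uint32_t *)(sq + p.sq_off.ring_mask);
    uint32_t *sq_array = (uint32_t *)(sq + p.sq_off.array);
    uint32_t *cq_head  = (uint32_t *)(cq + p.cq_off.head);
    uint32_t *cq_tail  = (uint32_t *)(cq + p.cq_off.tail);
    uint32_t *cq_mask  = (uint32_t *)(cq + p.cq_off.ring_mask);
    struct io_uring_cqe *cqes = (struct io_uring_cqe *)(cq + p.cq_off.cqes);

    /* Submission: describe the read in a free SQE, then publish the tail. */
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    uint32_t tail = *sq_tail, idx = tail & *sq_mask;
    struct io_uring_sqe *sqe = &sqes[idx];
    memset(sqe, 0, sizeof *sqe);
    sqe->opcode = IORING_OP_READV;
    sqe->fd = fd;
    sqe->addr = (uint64_t)(uintptr_t)&iov;
    sqe->len = 1;                  /* number of iovecs, not bytes */
    sqe->off = 0;                  /* file offset */
    sqe->user_data = 42;           /* arbitrary tag, echoed in the completion */
    sq_array[idx] = idx;
    __atomic_store_n(sq_tail, tail + 1, __ATOMIC_RELEASE);

    /* One syscall: submit 1 entry and wait for 1 completion. */
    if (syscall(__NR_io_uring_enter, ring, 1, 1,
                IORING_ENTER_GETEVENTS, NULL, 0) < 0)
        goto out;

    /* Completion: the result is already in the CQ; no read() call happens. */
    uint32_t head = *cq_head;
    if (head != __atomic_load_n(cq_tail, __ATOMIC_ACQUIRE)) {
        struct io_uring_cqe *cqe = &cqes[head & *cq_mask];
        assert(cqe->user_data == 42);
        result = cqe->res;         /* bytes read, or -errno */
        __atomic_store_n(cq_head, head + 1, __ATOMIC_RELEASE);
    }
out:
    if ((void *)sqes != MAP_FAILED) munmap(sqes, sqe_len);
    if (cq != MAP_FAILED) munmap(cq, cq_len);
    if (sq != MAP_FAILED) munmap(sq, sq_len);
    close(ring);
    return result;
}
```

The interesting part is the middle: fill more SQEs before bumping the tail and the same single io_uring_enter() submits all of them. In practice you'd reach for liburing, which wraps exactly this bookkeeping.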

And the best part: zero syscall mode (SQPOLL). A kernel thread polls the SQ. You just write to the ring and the kernel picks it up on its own. (If that thread sits idle long enough it goes to sleep, and you need one io_uring_enter() call to wake it.)

normal mode:                     SQPOLL mode (zero syscall):

app writes SQ                    app writes SQ
   │                                │
io_uring_enter() ──► kernel      kernel thread is already watching
   │                                │
kernel processes                  processes without being called
   │                                │
writes CQ                         writes CQ
   │                                │
app reads CQ                      app reads CQ

A high-end server can do thousands of IOs without a single syscall. Something that looked impossible on Linux a decade ago.

The evolutionary tree

                blocking IO (1 thread per connection)
                              │
                              │  "this doesn't scale"
                              ▼
                        select() (1983)
                              │
                              │  "1024 limit, O(n)"
                              ▼
                         poll() (1986)
                              │
                              │  "still O(n) per call"
                              ▼
                         epoll (2002)
                              │
                              │  "still one syscall per op"
                              ▼
                       io_uring (2019)
                              │
                              │  zero syscall, batched, truly async
                              ▼
                          (the future)

Each jump fixed the previous bottleneck.

What else can io_uring do?

io_uring isn't just read/write. It supports:

accept, connect, send, recv
openat, close, statx
fsync, fallocate
splice, tee
timeouts, linked operations (one depends on another)
chained batch execution

It's practically a syscall VM running in the kernel.

You can say: "open this file, read 4KB, send it on this socket, close". All in one submission. Without bouncing back to userspace in between.
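Linking is a flag on the submission entry. The sketch below uses a simpler chain than the open/read/send/close one (write, then fsync), because handing a freshly opened file to the next step needs extra machinery around fixed file slots. prep_write_then_fsync is a hypothetical helper; it only fills the two entries and leaves submission to the caller.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Fills two consecutive submission entries: write iov to fd at offset 0,
 * then fsync fd. IOSQE_IO_LINK on the first entry tells the kernel not to
 * start the second until the first has completed without error. */
static void prep_write_then_fsync(struct io_uring_sqe sqes[2], int fd,
                                  const struct iovec *iov) {
    assert(sqes != NULL && iov != NULL);
    memset(sqes, 0, 2 * sizeof sqes[0]);

    sqes[0].opcode = IORING_OP_WRITEV;
    sqes[0].fd = fd;
    sqes[0].addr = (uint64_t)(uintptr_t)iov;
    sqes[0].len = 1;                   /* number of iovecs */
    sqes[0].flags = IOSQE_IO_LINK;     /* chain to the next entry */
    sqes[0].user_data = 1;             /* arbitrary tags to tell the */

    sqes[1].opcode = IORING_OP_FSYNC;  /* last link: no IOSQE_IO_LINK */
    sqes[1].fd = fd;
    sqes[1].user_data = 2;             /* two completions apart */
}
```

Both entries go into the SQ together and produce two completions. If the write fails, the fsync is cancelled instead of running, and userspace never had to get involved between the two steps.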

Where do Ruby/Rails fit?

This part hurts.

Ruby is still, in real production, predominantly on top of epoll.

Puma uses nio4r (Java NIO-style), which uses epoll
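Falcon (Async) used nio4r in Async 1.x; Async 2.x moved to the io-event gem, which on Linux typically ends up on epoll (it has an io_uring backend too)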
Ruby 3.x fibers improved things a LOT — finally concurrency without callback hell

io_uring in Ruby? Minimal. There's a rio_uring gem, a few experiments. Nothing production-grade yet.

And that's fine.

Because the uncomfortable truth is: most Ruby/Rails apps are not limited by syscall overhead.

They're limited by:

slow database queries
N+1
GC pressure
synchronous IO on a blocking thread
bad code structure

Swapping epoll for io_uring in your Puma doesn't give you 2x throughput if your request already spends 300ms in Postgres.

When io_uring actually matters

io_uring shines when:

you're doing millions of IOs/s
per-syscall latency is measurable in your profile
you have heavy storage workloads (databases, filesystems)
you're writing a high-load proxy/load balancer
you want to batch operations aggressively

Who uses it seriously today: ScyllaDB, new storage engines, some hyperscalers, some proxies (Cloudflare flirting), kernel bypass alternatives.

Who doesn't need it: 99% of the CRUD Ruby/Rails/Django/Node webapps in the world.

Senior vs junior

Junior: "I'm going to use io_uring because it's newer and faster."

Senior: "Where does my request spend time? If it's in the DB, io_uring changes nothing. If it's in syscalls, show me the profile."

Switching tech without understanding the bottleneck is the mistake that burns the most career time.

The shift

epoll teaches you:

IO isn't "wait in sequence"
one thread can watch thousands of things
the kernel is your concurrency partner

io_uring teaches you:

the syscall itself is expensive at scale
submission and completion can be decoupled
batching + shared memory is the future

Together they teach you:

Concurrency isn't in your code. It's in the kernel. Your language is just faking elegance on top of it.

Conclusion

select → poll → epoll → io_uring.

Four decades summed up in "how one thread watches many things without it costing too much".

Every time someone says "this language is asynchronous", what they're really saying is "this language calls epoll under the hood for you".

Async is not a language feature.

It's a kernel feature.

And those who understand the kernel understand async for real.

The others just memorize keywords.