epoll and io_uring, explained visually
Blocking IO → select → poll → epoll → io_uring. Every step exists because the previous one stalled. Async isn't a language feature, it's a kernel feature.
epoll and io_uring, explained visually
Concurrency is not a language feature.
It's a kernel feature.
async/await in JS, fibers in Ruby, goroutines in Go — all of them are pretty facades over two or three Linux syscalls nobody wants to study.
Today we study them.
The starting point: blocking IO
Before any "magic", there was this:
thread ──► read(fd) ──► wait ──► wait ──► data arrives ──► returns
│
thread frozen the whole time
One thread per connection. read() blocks until data arrives.
10 clients? 10 threads. 10,000 clients? 10,000 threads.
Each thread eats stack (~2MB by default on Linux), pressures the scheduler, pollutes CPU cache.
client 1 ──► thread 1 ──► read() blocked
client 2 ──► thread 2 ──► read() blocked
client 3 ──► thread 3 ──► read() blocked
...
client N ──► thread N ──► read() blocked
Doesn't scale. Period.
First try: select
The first idea was: what if one thread could watch many sockets?
select():
fd_set readfds;
FD_ZERO(&readfds);
FD_SET(fd1, &readfds);
FD_SET(fd2, &readfds);
// ...
select(maxfd+1, &readfds, NULL, NULL, NULL);
// returns saying: "one of these is ready"
// you find out WHICH by iterating all of them
Visually:
single thread
│
├── watches fd1
├── watches fd2
├── watches fd3
└── watches fdN
│
select() blocks until any becomes ready
│
returns ──► you iterate ALL to find out which
Obvious problems:
- 1024 fd limit (fixed bitmap)
- every call you hand the entire bitmap to the kernel
- every call the kernel scans every fd
- every call you scan every fd to find who's ready
O(n) in the call. O(n) in the response. Every call.
With 10,000 connections, a tragedy.
Second try: poll
poll() removed the 1024 limit, using an array instead of a bitmap:
struct pollfd fds[N];
fds[0].fd = sock1; fds[0].events = POLLIN;
fds[1].fd = sock2; fds[1].events = POLLIN;
// ...
poll(fds, N, timeout);
But the fundamental problem remains: every call you pass the whole list. Every call the kernel scans everything.
userspace kernel
┌──────────────────┐ ┌──────────────────┐
│ array of N fds │ ──────► │ copies, scans N │
│ │ ◄────── │ returns flags │
└──────────────────┘ └──────────────────┘
you scan N again
to find out who's ready
Still O(n). Still doesn't scale.
The revolution: epoll
In 2002, the kernel got the abstraction that runs the modern internet: epoll.
The core idea:
The kernel keeps the list of fds. You only say who joins and who leaves. And it only gives back the ones that actually became ready.
Three syscalls:
int epfd = epoll_create1(0); // create the "list" in the kernel
struct epoll_event ev;
ev.events = EPOLLIN;
ev.data.fd = sock;
epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev); // add one fd
epoll_event events[64];
int n = epoll_wait(epfd, events, 64, -1); // wait for ready ones
// returns ONLY the ready ones
Visually:
┌──────────────────────────┐
│ kernel: epoll instance │
│ ┌────────────────────┐ │
epoll_ctl ───► │ │ fd1 fd2 fd3 ... │ │ ◄── add/remove
│ └────────────────────┘ │
│ │
│ ┌──── ready list ───┐ │
epoll_wait ──► │ │ fd2 fd7 │ │ ──► returns ONLY ready ones
│ └───────────────────┘ │
└──────────────────────────┘
The fd list lives in the kernel. You don't ship it back and forth. You just register changes.
When something becomes ready, it goes into the internal ready list. epoll_wait hands you just that.
O(1) per event. Not O(n) per call.
That's why nginx, redis, node, Java NIO, puma (in async IO mode), all sit on top of epoll. The entire modern internet revolves around it.
But epoll still has syscalls
Each epoll_wait is a syscall. Each read() after that is another syscall. Each write() another.
Syscalls aren't free. Context switch, privilege mode change, pipeline flush. On a server doing millions of req/s, syscall cost starts showing up in the profile.
And more: epoll tells you "you can read". You still need to call read afterward. This is what's called a readiness model:
1. epoll_wait ──► "fd2 is ready to read"
2. read(fd2) ──► pull the bytes
3. process
4. epoll_wait again
Each cycle: 2+ syscalls. Always.
If you want a million reads, that's a million syscalls. Minimum.
The new era: io_uring
In 2019, Jens Axboe ships io_uring. Not epoll v2. A completely different model.
Instead of readiness ("tell me when I can read"), io_uring is completion ("do this and tell me when it's done"):
epoll (readiness): io_uring (completion):
"tell me when reading works" "read this. tell me when done."
│ │
"ok, it works now" "ok, here are your bytes"
│
"call read()" (no separate read needed)
│
"now call again" (no extra syscall needed)
How it works: two ring buffers shared between userspace and kernel:
┌─────────────────────── userspace ─────────────────────────┐
│ │
│ ┌─ Submission Queue (SQ) ─┐ ┌─ Completion Queue (CQ) ┐│
│ │ read fd1 offset 0 │ │ fd1: 1024 bytes ok ││
│ │ write fd2 buf X │ │ fd2: write done ││
│ │ accept listen_fd │ │ listen_fd: new conn ││
│ │ ... │ │ ... ││
│ └─────────────┬───────────┘ └────────────▲──────────┘│
│ │ │ │
└─────────────────┼──────────────────────────────┼───────────┘
│ memory-mapped, shared │
┌─────────────────▼──────────────────────────────┼───────────┐
│ kernel reads SQ kernel writes CQ │
│ executes operations with results │
└────────────────────────────────────────────────────────────┘
You write operations to the SQ. The kernel reads them, executes, and writes results to the CQ. You read from the CQ.
No syscall per operation. The queues live in shared, memory-mapped memory.
And the best part: zero syscall mode (SQPOLL). A kernel thread stays awake watching the SQ. You just write to the ring. The kernel picks it up on its own.
normal mode: SQPOLL mode (zero syscall):
app writes SQ app writes SQ
│ │
io_uring_enter() ──► kernel kernel thread is already watching
│ │
kernel processes processes without being called
│ │
writes CQ writes CQ
│ │
app reads CQ app reads CQ
A high-end server can do thousands of IOs without a single syscall. Something that looked impossible on Linux a decade ago.
The evolutionary tree
blocking IO (1 thread per connection)
│
│ "this doesn't scale"
▼
select() (1983)
│
│ "1024 limit, O(n)"
▼
poll() (1986)
│
│ "still O(n) per call"
▼
epoll (2002)
│
│ "still one syscall per op"
▼
io_uring (2019)
│
│ zero syscall, batched, truly async
▼
(the future)
Each jump fixed the previous bottleneck.
io_uring is all that and what else?
io_uring isn't just read/write. It supports:
- accept, connect, send, recv
- openat, close, statx
- fsync, fallocate
- splice, tee
- timeouts, linked operations (one depends on another)
- chained batch execution
It's practically a syscall VM running in the kernel.
You can say: "open this file, read 4KB, send it on this socket, close". All in one submission. Without bouncing back to userspace in between.
Where do Ruby/Rails fit?
This part hurts.
Ruby is still, in real production, predominantly on top of epoll.
- Puma uses nio4r (Java NIO-style), which uses epoll
- Falcon (Async) uses nio4r too, or modern IO.select
- Ruby 3.x fibers improved things a LOT — finally concurrency without callback hell
io_uring in Ruby? Minimal. There's a rio_uring gem, a few experiments. Nothing production-grade yet.
And that's fine.
Because the uncomfortable truth is: most Ruby/Rails apps are not limited by syscall overhead.
They're limited by:
- slow database queries
- N+1
- GC pressure
- synchronous IO on a blocking thread
- bad code structure
Swapping epoll for io_uring in your Puma doesn't give you 2x throughput if your request already spends 300ms in Postgres.
When io_uring actually matters
io_uring shines when:
- you're doing millions of IOs/s
- per-syscall latency is measurable in your profile
- you have heavy storage workloads (databases, filesystems)
- you're writing a high-load proxy/load balancer
- you want to batch operations aggressively
Who uses it seriously today: ScyllaDB, new storage engines, some hyperscalers, some proxies (Cloudflare flirting), kernel bypass alternatives.
Who doesn't need it: 99% of the CRUD Ruby/Rails/Django/Node webapps in the world.
Senior vs junior
Junior: "I'm going to use io_uring because it's newer and faster."
Senior: "Where does my request spend time? If it's in the DB, io_uring changes nothing. If it's in syscalls, show me the profile."
Switching tech without understanding the bottleneck is the illusion that burns the most career time.
The shift
epoll teaches you:
- IO isn't "wait in sequence"
- one thread can watch thousands of things
- the kernel is your concurrency partner
io_uring teaches you:
- the syscall itself is expensive at scale
- submission and completion can be decoupled
- batching + shared memory is the future
Together they teach you:
Concurrency isn't in your code. It's in the kernel. Your language is just faking elegance on top of it.
Conclusion
select → poll → epoll → io_uring.
Four decades summed up in "how one thread watches many things without it costing too much".
Every time someone says "this language is asynchronous", what they're really saying is "this language calls epoll under the hood for you".
Async is not a language feature.
It's a kernel feature.
And those who understand the kernel understand async for real.
The others just memorize keywords.
