How the Linux kernel actually receives an HTTP request

Your application doesn't receive the request.

The kernel does.

Your app just picks up the leftovers, after half the work has already been done without it having any idea.

And while you don't understand those layers, you keep saying "Rails is slow", "Node is fast", "Go scales". All guessing.

The real path of a packet

A curl http://api.example.com/users/1 doesn't land in your Puma out of nowhere.

There's a whole journey before:

   network card (NIC)
        │  packet arrives on the wire
        ▼
   NIC driver
        │  IRQ — hardware interrupt
        ▼
   softirq (NET_RX)
        │  processes in light context
        ▼
   ring buffer (RX)
        │  driver's circular queue
        ▼
   IP layer (netfilter, routing)
        │  iptables, conntrack, etc.
        ▼
   TCP layer (state machine)
        │  SYN/ACK, reassembly, congestion
        ▼
   socket buffer (sk_buff)
        │  bytes ready for userspace
        ▼
   accept queue
        │  established connections
        ▼
   your Puma's accept()
        │
        ▼
   your app finally sees the bytes

Each of those layers has a queue, a limit, a syscall, a cost.

And if any of them is saturated, your application looks slow without having even been called.

NIC and IRQ: the entry point

The network card receives an Ethernet frame.

DMA copies the frame straight into a RAM buffer.

Then the NIC fires an IRQ — a hardware interrupt. The CPU drops whatever it was doing and runs the driver's handler.

NIC ────DMA────► RAM (ring buffer)
 │
 └── IRQ ──► CPU stops everything, services it

IRQ is expensive. On a high-throughput server, millions of packets/s, one IRQ per packet would destroy the CPU.

That's why NAPI (New API) exists: after the first IRQ, the driver switches to polling mode. Instead of one interrupt per packet, it processes a batch.

And on serious servers you also use RSS (Receive Side Scaling) to spread IRQs across multiple cores — otherwise one core handles everything while the others sit idle.

If you've never looked at /proc/interrupts, never set up IRQ affinity, never tuned a ring buffer — fine, but know that someone else is doing it, and that's why their server handles 100x your load.

Softirq: the actual work

The IRQ just signals: "there's a packet". That's it.

The real processing happens in softirq — a lighter context scheduled by the kernel.

IRQ (short)
   │
   └─► wakes softirq NET_RX
              │
              └─► processes packets from ring buffer
                  ├── parse Ethernet
                  ├── parse IP
                  ├── parse TCP
                  └── deliver to socket

If softirq can't keep up with the ring buffer, packets get dropped. Silently.

# packets dropped at softirq level
cat /proc/net/softnet_stat

# drops per interface
ethtool -S eth0 | grep -i drop

Juniors never look at this. Seniors look first.

IP layer: routing and filters

The packet moves up to IP.

Here lives:

routing: where is this packet going?
netfilter: iptables, nftables, conntrack
NAT: address translation
fragmentation: reassembling large packets

conntrack is particularly treacherous. A server with lots of short connections blows the table and starts dropping:

sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max

I've seen production dying from this with the team blaming "Rails".

TCP layer: handshake and queues

This is where it gets interesting.

TCP is a state machine. And for a socket in LISTEN (your Puma waiting), the kernel keeps two queues:

                  SYN packet arrives
                          │
                          ▼
              ┌───────────────────────┐
              │   SYN queue           │  ← "half-open" connections
              │   (incomplete)        │     SYN received, ACK pending
              └───────────┬───────────┘
                          │ handshake completes
                          ▼
              ┌───────────────────────┐
              │   accept queue        │  ← ready for accept()
              │   (complete)          │
              └───────────┬───────────┘
                          │
                          ▼
                  Puma's accept()

Two queues. Two places things can go wrong.

The SYN queue fills up under SYN flood or if the backlog is too small.

The accept queue fills up if your app doesn't call accept() fast enough. When it fills, the kernel drops the final ACK of the handshake. The client thinks it's connected, but it isn't. Mysterious timeout.

# track accept queue overflow
nstat -az | grep -i listen
ss -lnt   # Recv-Q column = waiting for accept

The backlog is set by listen():

listen(sockfd, 1024);   // accept queue size

And by the sysctl net.core.somaxconn. If your Puma asks for 1024 but the sysctl is 128, you get 128. I've seen this break load tests a thousand times.

SO_REUSEPORT: the missing piece

Traditionally, one socket = one accept queue.

Multiple Puma workers share the same socket, and the kernel wakes one to do accept. But when many workers compete, you get thundering herd: all wake up, one wins, the rest go back to sleep. Waste.

SO_REUSEPORT changes the game:

without SO_REUSEPORT:        with SO_REUSEPORT:

  ┌─ single socket ─┐         ┌─ socket ─┐ ┌─ socket ─┐ ┌─ socket ─┐
  │  accept queue   │         │   queue  │ │   queue  │ │   queue  │
  └────────┬────────┘         └────┬─────┘ └────┬─────┘ └────┬─────┘
           │                       │            │            │
   ┌───┬───┼───┬───┐           worker 1      worker 2     worker 3
   w1  w2  w3  w4              (kernel hashes the 4-tuple
   (all compete)                and distributes — no competition)

Each worker has its own accept queue. The kernel distributes connections by hashing the (src_ip, src_port, dst_ip, dst_port). No competition. No thundering herd.

Serious modern servers use this by default. Puma supports it. Nginx supports it. If you're not using it, you're leaving throughput on the table.

Socket buffer: where the bytes wait

Every TCP socket has a buffer in the kernel:

   network ──► sk_buff (recv buffer) ──► app's read()/recv()
                  │
                  └── size controls TCP window

net.core.rmem_max, net.ipv4.tcp_rmem — these control how much can sit buffered.

If the app doesn't read fast enough, the buffer fills, the TCP window closes, the sender stops sending. Pure backpressure.

And that's why blocking a Puma thread with synchronous IO is so bad: the socket fills, the client waits, the accept queue fills, and the whole server chokes.

And epoll?

Before calling accept(), Puma uses epoll to know when there's something to accept.

epoll_ctl(epfd, EPOLL_CTL_ADD, listen_sock, &ev);
epoll_wait(epfd, events, max, timeout);
// returns: "there's a connection on the accept queue"
accept4(listen_sock, ...);

Without epoll, you'd be calling accept() in a loop (busy wait) or blocking one thread per socket. Both terrible.

epoll is what lets one thread watch thousands of sockets. It's the engine of Linux concurrency. Worth its own article — and it has one.

Userspace finally enters

After all that:

accept() returns ──► new file descriptor
       │
       ▼
   Puma worker grabs the fd
       │
       ▼
   read() bytes of the HTTP request
       │
       ▼
   HTTP parser (in Ruby or C)
       │
       ▼
   builds Rack env
       │
       ▼
   Rails.application.call(env)

This is where your "framework" starts.

Everything before was done by the kernel, the driver, the NIC, by Puma. You had no control over any of it.

Symptoms vs real cause

Senior vs junior here is simple:

Junior: "Rails is slow, I'm switching frameworks."

Senior: "Accept queue is overflowing, I'll raise the backlog, find out why the worker is slow to accept, and turn on SO_REUSEPORT."

Scenarios that look like "slow app" but are actually kernel:

Handshake timeouts → accept queue overflow
Connections vanishing randomly → conntrack full
Throughput hitting a ceiling → small ring buffer, IRQ pinned to one core
High latency under load → bufferbloat, RTT inflated by queueing

None of these get fixed by changing your ORM.

Commands to keep in your pocket

ss -lnt                          # listening sockets, Recv-Q, Send-Q
ss -s                            # summary of TCP states
nstat -az | grep -i drop         # drops at various layers
cat /proc/net/softnet_stat       # softirq stats
ethtool -S eth0 | grep drop      # NIC-level drops
sysctl net.core.somaxconn        # max accept queue
sysctl net.ipv4.tcp_max_syn_backlog
cat /proc/interrupts             # IRQs distributed across CPUs

Learn these. They're far more useful than 90% of the gems you install.

The shift

The HTTP request your app sees is the end of a journey, not the start.

Before your controller ever exists:

the NIC grabbed the frame
the driver moved it to RAM
softirq parsed TCP/IP
the kernel completed the handshake
the accept queue held the connection
epoll woke the worker
accept() returned an fd
read() pulled bytes off the buffer
the HTTP parser built the env

Only then does Rails.application.call(env) run.

Conclusion

Your backend framework is the last 5% of the request lifecycle.

The other 95% is kernel, driver, NIC, queue, syscall, scheduler.

People who only understand the framework debug in the dark when things get tight.

People who understand the whole stack open ss, open nstat, open /proc/net/softnet_stat, and in two minutes know whether the problem is Ruby or the kernel screaming that the accept queue is overflowing.

Linux is not the background.

Linux is the stage.

Your Rails just dances on top of it.