How to avoid retry storms in background jobs

Your external API was down for 5 minutes.

When it comes back, 100k jobs hit it at the same instant.

It goes down again. This time it's your fault.

Welcome to the retry storm.

The Sidekiq default no one read

Sidekiq, by default, retries 25 times.

The formula is roughly:

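# Sidekiq's default polynomial backoff, in seconds.
# count is 0 for the first retry; recent Sidekiq releases use rand(10) instead of rand(30).
(count ** 4) + 15 + (rand(30) * (count + 1))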

Translated:

retry 1: ~30s
retry 5: ~6min
retry 10: ~2h
retry 25: ~4 days (about 21 days after the first failure)
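
If you want to check numbers like these yourself, here is a small sketch that evaluates the formula with the random term replaced by its average. It assumes count is 0 for the first retry, which is how I understand Sidekiq passes it. The helper names (expected_delay, humanize) are mine, not Sidekiq's.

```ruby
# Expected delay (seconds) before a retry under the default formula.
# Assumptions: count is 0 for the first retry, and rand(30) is replaced
# by its average of 14.5 so the numbers are repeatable.
def expected_delay(count, avg_rand: 14.5)
  (count**4) + 15 + (avg_rand * (count + 1))
end

# Turn a number of seconds into "Xd Yh Zm" for reading.
def humanize(seconds)
  days, rest = seconds.round.divmod(86_400)
  hours, rest = rest.divmod(3_600)
  "#{days}d #{hours}h #{rest / 60}m"
end

elapsed = 0.0
schedule = (0...25).map do |count|
  delay = expected_delay(count)
  elapsed += delay
  { retry: count + 1, delay: delay, elapsed: elapsed }
end

schedule.each do |row|
  next unless [1, 5, 10, 25].include?(row[:retry])
  puts "retry #{row[:retry]}: waits #{humanize(row[:delay])}, " \
       "#{humanize(row[:elapsed])} since the first failure"
end
```

Under these assumptions the 25 waits add up to roughly 20 and a half days, which is where the "about 21 days" figure comes from.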

Looks fine. It isn't.

The problem isn't the interval. It's the synchronized volume.

What a retry storm looks like

Real scenario:

12:00:00 — external API starts returning 500
12:00:00 — 100,000 jobs were running, all fail
12:00:30 — all 100,000 enter retry simultaneously
12:00:30 — API recovers, but eats 100k requests in 1 second
12:00:31 — API goes down again
12:01:00 — second retry, +100k hitting it
...

You're not retrying. You're DDoS-ing your own dependency.

Worse: the API was down for 5 minutes once. You made it unstable for 2 hours.

Why this happens

Two reasons.

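1. Almost no jitter.

If 100k jobs failed in the same second, all of them retry inside the same narrow window. The random term in the default spreads the first retry over about 30 seconds. That is still thousands of requests per second.

Backoff without enough jitter = near-perfect synchronization.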

2. Nobody read the defaults.

Sidekiq's default was designed for ONE job that fails. Not a hundred thousand.

At scale, defaults become bombs.

Backoff with jitter — what matters

Without jitter:

retry 1: all exactly 15s after the failure
retry 2: all exactly 16s after that
retry 3: all exactly 31s after that

Synchronized spikes. Storm guaranteed.
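
To put a number on "synchronized", here is a rough simulation that counts how many retries land in the busiest second. It's a sketch, not a measurement: the job count, the 15s base delay and the three jitter windows are illustrative assumptions.

```ruby
# Rough simulation: how many retries land in the busiest single second?
# The job count and delay windows are illustrative assumptions.
def peak_per_second(delays)
  delays.map(&:floor).tally.values.max
end

jobs = 100_000
rng = Random.new(42) # fixed seed so the run is repeatable

no_jitter    = Array.new(jobs) { 15 }                # everyone waits exactly 15s
small_jitter = Array.new(jobs) { 15 + rng.rand(30) } # 15..44s window
wide_jitter  = Array.new(jobs) { rng.rand * 300 }    # anywhere in a 5-minute window

{ "no jitter" => no_jitter,
  "small jitter" => small_jitter,
  "wide jitter" => wide_jitter }.each do |label, delays|
  puts "#{label}: #{peak_per_second(delays)} retries in the worst second"
end
```

The wider the window, the flatter the peak. The total number of calls is the same in all three cases; only the shape changes.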

Full jitter

sleep_time = rand(0..base_backoff(count))

Each job sleeps between 0 and the calculated backoff. Spreads nicely.
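
A minimal sketch of full jitter in plain Ruby. base_backoff is not defined anywhere in this post, so the version below (doubling from 15s, capped at one hour) is an assumption for illustration, not Sidekiq's schedule.

```ruby
# Full jitter: wait a uniformly random time between 0 and the capped backoff.
# base_backoff, the doubling schedule and the one-hour cap are assumptions
# chosen for illustration, not Sidekiq's defaults.
def base_backoff(count, base: 15, cap: 3_600)
  [cap, base * (2**count)].min
end

def full_jitter(count, rng: Random.new)
  rng.rand(0.0..base_backoff(count).to_f)
end

rng = Random.new(7)
(0..5).each do |count|
  picked = full_jitter(count, rng: rng).round(1)
  puts "retry #{count + 1}: up to #{base_backoff(count)}s, picked #{picked}s"
end
```

The cap matters: without it the upper bound keeps doubling and a late retry can end up waiting for days.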

Decorrelated jitter (AWS Architecture Blog)

sleep_time = [cap, rand(base..(previous_sleep * 3))].min

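Each delay is drawn from the previous delay, not from the attempt count, so jobs drift apart instead of synchronizing. It's one of the variants the AWS Architecture Blog compares, alongside full jitter.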
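
Here is a sketch of that one-liner as a small stateful sequence, so you can see the previous delay being carried forward. The base of 15s and the one-hour cap are my assumptions, and the method names are made up for the example.

```ruby
# Decorrelated jitter: each delay is drawn between `base` and three times the
# previous delay, then capped. The state carried between attempts is the
# previous delay, not the attempt count.
def decorrelated_jitter(previous_sleep, base: 15, cap: 3_600, rng: Random.new)
  [cap, rng.rand(base.to_f..(previous_sleep * 3).to_f)].min
end

# Build the delays for a run of consecutive attempts, starting from `base`.
def delay_sequence(attempts, base: 15, cap: 3_600, rng: Random.new)
  sleep_time = base
  Array.new(attempts) do
    sleep_time = decorrelated_jitter(sleep_time, base: base, cap: cap, rng: rng)
  end
end

puts delay_sequence(8, rng: Random.new(1)).map { |s| s.round(1) }.inspect
puts delay_sequence(8, rng: Random.new(2)).map { |s| s.round(1) }.inspect
```

Two jobs that failed in the same second end up on different schedules after the first draw, which is the whole point.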

In Sidekiq:

class FlakyJob
  include Sidekiq::Job

  sidekiq_options retry: 10

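  # count is 0 for the first retry; the block returns the delay in seconds
  sidekiq_retry_in do |count, _exception|
    base = (count ** 4) + 15
    # random delay between 1x and 3x the base. It is driven by count,
    # so it approximates decorrelated jitter without tracking the previous delay
    rand(base..(base * 3))
  end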

  def perform(id)
    ExternalAPI.call(id)
  end
end

Tiny. Saves your life.

Circuit breaker — stop before you hit

Retry is blind.

If the API is down, retrying doesn't help. It just makes it worse.

The grown-up solution is a circuit breaker:

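require 'circuitbox'
require 'http'

class ExternalAPI
  # Raised on 5xx so server errors count toward the breaker too
  class ServerError < StandardError; end

  def self.call(id)
    circuit = Circuitbox.circuit(:external_api, {
      # http.rb raises its own error classes, not Net::OpenTimeout / Net::ReadTimeout
      exceptions: [HTTP::TimeoutError, HTTP::ConnectionError, ServerError],
      volume_threshold: 10, # minimum calls in the window before the breaker can open
      error_threshold: 50,  # percent of failed calls that opens it
      time_window: 60,      # seconds over which failures are counted
      sleep_window: 300     # seconds the breaker stays open
    })

    # Circuitbox 2.x raises from run when the circuit is open; on 1.x use run! for that
    circuit.run do
      response = HTTP.timeout(5).get("https://api.example.com/#{id}")
      raise ServerError, "HTTP #{response.status}" if response.status.server_error?
      response
    end
  end
end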

When the breaker opens:

new jobs fail instantly
nothing hits the API
after sleep_window, one request is let through to test

The difference:

without breaker: 100k useless retries hammering a dead API
with breaker: 100k jobs fail in milliseconds, go to the retry queue calmly
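
If the gem feels like magic, the state machine underneath is small. Below is a toy breaker in plain Ruby with the three classic states: closed, open, half-open. It is a single-threaded sketch under my own assumptions (consecutive-failure threshold, injected clock, made-up class name) and not how Circuitbox is implemented.

```ruby
# A toy circuit breaker: closed (calls pass), open (calls rejected),
# half-open (a trial call is allowed after sleep_window seconds).
# Class name, thresholds and the injected clock are all hypothetical.
class TinyBreaker
  class OpenError < StandardError; end

  def initialize(failure_threshold: 3, sleep_window: 300, clock: -> { Time.now.to_f })
    @failure_threshold = failure_threshold
    @sleep_window = sleep_window
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def state
    return :closed if @opened_at.nil?

    @clock.call - @opened_at >= @sleep_window ? :half_open : :open
  end

  def run
    # Fail fast: the block (and so the API) is never touched while open.
    raise OpenError, "circuit open, skipping call" if state == :open

    begin
      result = yield
    rescue StandardError
      record_failure
      raise
    end
    reset
    result
  end

  private

  def record_failure
    @failures += 1
    # A failed trial call in half-open reopens the breaker immediately.
    @opened_at = @clock.call if state == :half_open || @failures >= @failure_threshold
  end

  def reset
    @failures = 0
    @opened_at = nil
  end
end
```

Inside a job, the OpenError propagates like any other exception, so the job still lands in the retry set. It just gets there without spending a 5s timeout on a dead API.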

The cost of retrying without thinking

Real bill I once saw:

1 failing job
25 default retries
~21 days of attempts
each attempt: 5s HTTP call + Redis write

Per job: 125s of worker time, 25 round-trips to Redis, 25 external calls.

Multiply by 100k failed jobs = 2.5 million useless calls and roughly 3,500 hours of worker time, spread over 3 weeks.

Your Datadog bill loves it.
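
The multiplication, worked through. The inputs are the ones from the list above; the variable names are just for the example.

```ruby
# Cost of letting every failed job burn through all default retries.
attempts_per_job    = 25
seconds_per_attempt = 5
failed_jobs         = 100_000

worker_seconds_per_job = attempts_per_job * seconds_per_attempt       # 125
total_external_calls   = attempts_per_job * failed_jobs               # 2_500_000
total_worker_hours     = worker_seconds_per_job * failed_jobs / 3600.0

puts "per job: #{worker_seconds_per_job}s of worker time"
puts "external calls: #{total_external_calls}"
puts "worker time: #{total_worker_hours.round} hours"
```

25 attempts times 100,000 jobs is 2.5 million calls. The time is worker time spent blocked on HTTP, not CPU, but a blocked worker is still a worker you are paying for.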

When retry makes things worse

Classic cascading failure cases:

1. Slow dependency, not dead.

API responds in 30s instead of 200ms. Your workers stall waiting. Queue fills up. Retries pile on top. You lose all processing capacity, not just the jobs for that API.

2. Non-idempotent job.

Retry charges the card twice. Sends the email three times. Creates a duplicate order.

Retry without idempotency is a scheduled bug.

3. Job that writes to the resource that's dying.

Database with lock contention? You retry and pile on more locks. Database dies faster.

4. Retry fan-out.

Job A fails and retries. Each retry spawns 10 B jobs, which also fail and retry. Each B spawns 10 C jobs. That's 100 C jobs.

3 levels of fan-out in, you have 1000x the original volume. Exponential storm.
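
For case 2, the usual defense is an idempotency key: the retry carries the same key as the first attempt, and the side effect only happens once per key. A minimal sketch, with a hypothetical in-memory ledger standing in for what would really be a unique index in the database or a Redis SET NX:

```ruby
require "set"

# Hypothetical in-memory ledger. In production the "seen" check would be a
# unique database index or a Redis SET NX, not a Ruby Set.
class ChargeLedger
  attr_reader :charges

  def initialize
    @seen = Set.new
    @charges = []
  end

  # Returns true if the charge ran, false if this key was already processed.
  def charge_once(idempotency_key, amount_cents)
    return false unless @seen.add?(idempotency_key) # add? returns nil on duplicates

    @charges << amount_cents
    true
  end
end

ledger = ChargeLedger.new
puts ledger.charge_once("payment-42", 1_000) # first attempt
puts ledger.charge_once("payment-42", 1_000) # retry of the same payment
puts ledger.charges.inspect
```

The key has to come from the job's arguments (here the payment id), not from something generated inside perform, or every retry gets a fresh key.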

Dead letter queue — where the dead rest

At some point, stop.

class PaymentJob
  include Sidekiq::Job

  sidekiq_options retry: 5,
    dead: true # the default: once the 5 retries are used up, the job goes to the morgue (Sidekiq's Dead set)

  sidekiq_retries_exhausted do |msg, ex|
    DeadLetterQueue.push(
      job_class: msg['class'],
      args: msg['args'],
      error: ex.message,
      failed_at: Time.now
    )

    Alerting.notify("Job #{msg['class']} dead after 5 retries")
  end

  def perform(payment_id)
    PaymentProcessor.charge(payment_id)
  end
end

A DLQ isn't a trash queue. It's an investigation queue.
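
DeadLetterQueue in the job above isn't a Sidekiq class; it's something you bring. Here is a hypothetical in-memory stand-in that shows what "investigation queue" means in practice: group by error, look at the biggest bucket first, replay what is safe to replay. A real one would persist its entries.

```ruby
# Hypothetical in-memory stand-in for the DeadLetterQueue used above.
# A real one would persist entries (database table, Redis list, object store).
class InMemoryDeadLetterQueue
  Entry = Struct.new(:job_class, :args, :error, :failed_at, keyword_init: true)

  def initialize
    @entries = []
  end

  def push(job_class:, args:, error:, failed_at: Time.now)
    @entries << Entry.new(job_class: job_class, args: args, error: error, failed_at: failed_at)
  end

  # Count entries per error message, most common first.
  def summary
    @entries.group_by(&:error).transform_values(&:size).sort_by { |_, n| -n }.to_h
  end

  # Yield each entry; entries for which the block returns true are removed.
  def replay
    @entries.reject! { |entry| yield(entry) }
  end

  def size
    @entries.size
  end
end

dlq = InMemoryDeadLetterQueue.new
dlq.push(job_class: "PaymentJob", args: [1], error: "timeout")
dlq.push(job_class: "PaymentJob", args: [2], error: "timeout")
dlq.push(job_class: "PaymentJob", args: [3], error: "card declined")
puts dlq.summary.inspect
```

Timeouts are usually worth replaying once the dependency is back. "Card declined" is not, and no amount of retrying would have fixed it.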

Difference between senior and junior devs:

junior sets retry: 25 and forgets
senior sets retry: 5 + DLQ + alert

One learns when something breaks. The other finds out 21 days later, when the customer calls.

The full architecture

Job fails
   ↓
Error retryable? (timeout, 5xx)
   ↓ yes                    ↓ no
Circuit breaker open?       Straight to DLQ
   ↓ no        ↓ yes
Re-enqueue    Fail fast (no API call)
with backoff
+ jitter
   ↓
Exceeded max retries?
   ↓ yes
Goes to DLQ
   ↓
Alert
   ↓
A human investigates

Every box is a conscious decision. Not a default.
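
The diagram fits in one small function. This is only a sketch that mirrors the boxes: the function name, the symbols it returns and the "5xx or timeout means retryable" rule are my assumptions for illustration.

```ruby
# Mirrors the diagram above, box by box. All names here are hypothetical.
def next_action(status:, timed_out:, breaker_open:, retry_count:, max_retries: 5)
  retryable = timed_out || (500..599).cover?(status.to_i)

  return :dead_letter unless retryable                   # e.g. a 4xx: retrying will not help
  return :fail_fast if breaker_open                      # do not touch the API at all
  return :dead_letter_and_alert if retry_count >= max_retries

  :retry_with_jitter
end

puts next_action(status: 503, timed_out: false, breaker_open: false, retry_count: 0)
puts next_action(status: 404, timed_out: false, breaker_open: false, retry_count: 0)
puts next_action(status: 503, timed_out: false, breaker_open: true,  retry_count: 2)
puts next_action(status: nil, timed_out: true,  breaker_open: false, retry_count: 5)
```

Writing it down like this forces the question the defaults let you skip: which errors are actually worth retrying?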

Concurrency limits save lives

Even with jitter, if you have 50k jobs ready to run and 200 workers, you still hit hard.

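Explicit per-queue limits (the :limits: key comes from the sidekiq-limit_fetch gem; plain Sidekiq doesn't cap concurrency per queue):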

# config/sidekiq.yml
:limits:
  external_api: 20
  payments: 10
  emails: 50

Or with sidekiq-throttled:

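class ExternalAPIJob
  include Sidekiq::Job
  include Sidekiq::Throttled::Job

  sidekiq_throttle(
    concurrency: { limit: 10 },                  # at most 10 of these jobs running at once
    threshold: { limit: 100, period: 1.minute }  # at most 100 per minute (1.minute needs ActiveSupport)
  )

  def perform(id)
    ExternalAPI.call(id)
  end
end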

100 calls per minute. Doesn't matter how many jobs are queued.

The API thanks you.
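
What a threshold like that boils down to, as a sketch: a fixed-window counter that says no once the window is full. The class below is hypothetical and in-memory; as far as I know sidekiq-throttled keeps its counters in Redis so the limit holds across processes, which this toy does not.

```ruby
# Fixed-window limiter: at most `limit` calls per `period` seconds.
# Hypothetical, in-memory and single-process; for illustration only.
class WindowLimiter
  def initialize(limit:, period:, clock: -> { Time.now.to_f })
    @limit = limit
    @period = period
    @clock = clock
    @window = nil
    @count = 0
  end

  # True if the call fits in the current window, false if it must wait.
  def allow?
    window = (@clock.call / @period).floor
    if window != @window # a new window started: reset the counter
      @window = window
      @count = 0
    end
    return false if @count >= @limit

    @count += 1
    true
  end
end

limiter = WindowLimiter.new(limit: 100, period: 60)
allowed = 150.times.count { limiter.allow? }
puts "#{allowed} of 150 calls allowed in this window"
```

A job that gets a "no" should go back to the queue, not sleep inside the worker, or you have traded a storm for a stall.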

Real numbers I've changed

Before:

Sidekiq default (retry: 25)
no jitter
no circuit breaker
external API went down ~1x/week
each outage: 4h of total system instability

After:

retry: 5 + decorrelated jitter
circuitbox with 5min sleep_window
throttle: 30 req/min to the external API
DLQ + alert

Same API outage:

30s of errors
jobs go to DLQ if the API stays down >5min
the system keeps processing other queues normally

External call cost dropped 70%. On-call pages dropped 90%.

Conclusion

Retry feels free. It isn't.

Every retry: 25 in your code is a promise of DDoS against your dependencies.

If you haven't thought about jitter, circuit breaker, idempotency and DLQ, you don't have retry — you have a time bomb.

Golden rule:

Retry is not a fix. Retry is a delay strategy.

Fixing means understanding why it failed.

If you only retry, you're pushing the failure into the future, usually multiplied.

Resilient systems don't retry more. They retry better, and stop at the right time.