How to avoid retry storms in background jobs
External API goes down. 100k Sidekiq jobs decide to retry at the same time. Jitter, circuit breaker, DLQ — retry isn't a fix, it's a deferral strategy.
How to avoid retry storms in background jobs
Your external API was down for 5 minutes.
When it comes back, 100k jobs hit it at the same instant.
It goes down again. This time it's your fault.
Welcome to the retry storm.
The Sidekiq default no one read
Sidekiq, by default, retries 25 times.
The formula is roughly:
# Sidekiq's default polynomial backoff
(count ** 4) + 15 + (rand(30) * (count + 1))
Translated:
- retry 1: ~30s
- retry 5: ~10min
- retry 10: ~3h
- retry 25: ~21 days
Looks fine. It isn't.
The problem isn't the interval. It's the synchronized volume.
What a retry storm looks like
Real scenario:
12:00:00 — external API starts returning 500
12:00:00 — 100,000 jobs were running, all fail
12:00:30 — all 100,000 enter retry simultaneously
12:00:30 — API recovers, but eats 100k requests in 1 second
12:00:31 — API goes down again
12:01:00 — second retry, +100k hitting it
...
You're not retrying. You're DDoS-ing your own dependency.
Worse: the API was down for 5 minutes once. You made it unstable for 2 hours.
Why this happens
Two reasons.
1. Zero jitter.
If 100k jobs failed in the same second, all of them will retry in the same second.
Exponential backoff without jitter = perfect synchronization.
2. Nobody read the defaults.
Sidekiq's default was designed for ONE job that fails. Not a hundred thousand.
At scale, defaults become bombs.
Backoff with jitter — what matters
Without jitter:
retry 1: all at 30s
retry 2: all at 16min
retry 3: all at 1h21
Synchronized spikes. Storm guaranteed.
Full jitter
sleep_time = rand(0..base_backoff(count))
Each job sleeps between 0 and the calculated backoff. Spreads nicely.
Decorrelated jitter (AWS Architecture Blog)
sleep_time = [cap, rand(base..(previous_sleep * 3))].min
Grows faster when the system is healthy and doesn't synchronize. It's what AWS recommends.
In Sidekiq:
class FlakyJob
include Sidekiq::Job
sidekiq_options retry: 10
sidekiq_retry_in do |count, _exception|
base = (count ** 4) + 15
rand(base..(base * 3)) # decorrelated jitter
end
def perform(id)
ExternalAPI.call(id)
end
end
Tiny. Saves your life.
Circuit breaker — stop before you hit
Retry is blind.
If the API is down, retrying doesn't help. It just makes it worse.
The grown-up solution is a circuit breaker:
require 'circuitbox'
class ExternalAPI
def self.call(id)
circuit = Circuitbox.circuit(:external_api, {
exceptions: [Net::OpenTimeout, Net::ReadTimeout],
volume_threshold: 10,
error_threshold: 50,
time_window: 60,
sleep_window: 300
})
circuit.run do
HTTP.timeout(5).get("https://api.example.com/#{id}")
end
end
end
When the breaker opens:
- new jobs fail instantly
- nothing hits the API
- after
sleep_window, one request is let through to test
The difference:
- without breaker: 100k useless retries hammering a dead API
- with breaker: 100k jobs fail in milliseconds, go to the retry queue calmly
The cost of retrying without thinking
Real bill I once saw:
- 1 failing job
- 25 default retries
- ~21 days of attempts
- each attempt: 5s HTTP call + Redis write
Per job: 125s of CPU, 25 round-trips to Redis, 25 external calls.
Multiply by 100k failed jobs = 3.5 million useless calls spread over 3 weeks.
Your Datadog bill loves it.
When retry makes things worse
Classic cascading failure cases:
1. Slow dependency, not dead.
API responds in 30s instead of 200ms. Your workers stall waiting. Queue fills up. Retries pile on top. You lose all processing capacity, not just the jobs for that API.
2. Non-idempotent job.
Retry charges the card twice. Sends the email three times. Creates a duplicate order.
Retry without idempotency is a scheduled bug.
3. Job that writes to the resource that's dying.
Database with lock contention? You retry and pile on more locks. Database dies faster.
4. Retry fan-out.
Job A fails, retries. Each retry spawns 10 B jobs. Which also fail. Which also retry. Which spawn 100 C jobs.
3 levels in, you have 1000x the original volume. Exponential storm.
Dead letter queue — where the dead rest
At some point, stop.
class PaymentJob
include Sidekiq::Job
sidekiq_options retry: 5,
dead: true # goes to the morgue after 5 failures
sidekiq_retries_exhausted do |msg, ex|
DeadLetterQueue.push(
job_class: msg['class'],
args: msg['args'],
error: ex.message,
failed_at: Time.now
)
Alerting.notify("Job #{msg['class']} dead after 5 attempts")
end
def perform(payment_id)
PaymentProcessor.charge(payment_id)
end
end
A DLQ isn't a trash queue. It's an investigation queue.
Difference between senior and junior devs:
- junior sets
retry: 25and forgets - senior sets
retry: 5+ DLQ + alert
One learns when something breaks. The other finds out 21 days later, when the customer calls.
The full architecture
Job fails
↓
Error retryable? (timeout, 5xx)
↓ yes ↓ no
Circuit breaker open? Straight to DLQ
↓ no ↓ yes
Re-enqueue Fail fast (no API call)
with backoff
+ jitter
↓
Exceeded max retries?
↓ yes
Goes to DLQ
↓
Alert
↓
A human investigates
Every box is a conscious decision. Not a default.
Concurrency limits save lives
Even with jitter, if you have 50k jobs ready to run and 200 workers, you still hit hard.
Explicit per-queue limits:
# config/sidekiq.yml
:limits:
external_api: 20
payments: 10
emails: 50
Or with sidekiq-throttled:
class ExternalAPIJob
include Sidekiq::Job
include Sidekiq::Throttled::Job
sidekiq_throttle(
concurrency: { limit: 10 },
threshold: { limit: 100, period: 1.minute }
)
end
100 calls per minute. Doesn't matter how many jobs are queued.
The API thanks you.
Real numbers I've changed
Before:
- Sidekiq default (
retry: 25) - no jitter
- no circuit breaker
- external API went down ~1x/week
- each outage: 4h of total system instability
After:
retry: 5+ decorrelated jitter- circuitbox with 5min sleep_window
- throttle: 30 req/min to the external API
- DLQ + alert
Same API outage:
- 30s of errors
- jobs go to DLQ if the API stays down >5min
- the system keeps processing other queues normally
External call cost dropped 70%. On-call pages dropped 90%.
Conclusion
Retry feels free. It isn't.
Every retry: 25 in your code is a promise of DDoS against your dependencies.
If you haven't thought about jitter, circuit breaker, idempotency and DLQ, you don't have retry — you have a time bomb.
Golden rule:
Retry is not a fix. Retry is a delay strategy.
Fixing means understanding why it failed.
If you only retry, you're pushing the failure into the future, usually multiplied.
Resilient systems don't retry more. They retry better, and stop at the right time.
