githublinkedinemail
← /articles

Lock contention: the silent killer in PostgreSQL

A 10ms migration that freezes production for 40 minutes. ALTER TABLE grabbing ACCESS EXCLUSIVE. pg_locks + pg_stat_activity. Locks are invisible — until they aren't.

5 minpostgres

Lock contention: the silent killer in PostgreSQL

The app is slow.

CPU low. IO low. Pool with connections to spare.

And still, requests piling up.

Welcome to the world of locks.


Why locks are different from everything else

Slowness from a slow query you see in the APM.

Slowness from N+1 you see in the log.

Lock contention you don't see anywhere until someone opens pg_stat_activity and finds out half the app is waiting on the other half to finish.

Locks don't consume CPU. They don't consume IO. They only consume time.

And time, in production, is money and the user's patience.


The Postgres lock modes

Postgres has 8 table-level lock modes. Memorize the extremes, look up the middle.

ACCESS SHARE              ← SELECT
ROW SHARE                 ← SELECT FOR UPDATE
ROW EXCLUSIVE             ← INSERT, UPDATE, DELETE
SHARE UPDATE EXCLUSIVE    ← VACUUM, CREATE INDEX CONCURRENTLY
SHARE                     ← CREATE INDEX
SHARE ROW EXCLUSIVE       ← rare
EXCLUSIVE                 ← REFRESH MATERIALIZED VIEW CONCURRENTLY
ACCESS EXCLUSIVE          ← ALTER TABLE, DROP TABLE, TRUNCATE, REINDEX

The simple rule:

  • ACCESS SHARE is the lightest. Coexists with almost everyone.
  • ACCESS EXCLUSIVE is the heaviest. Blocks literally everything, even SELECT.

And guess what your migration does when you run an innocent ALTER TABLE?


The migration that froze production

Real case. Migration that looks trivial:

class AddStatusToOrders < ActiveRecord::Migration[7.0]
  def change
    add_column :orders, :status, :string, default: 'pending', null: false
  end
end

In dev: 10ms.

In prod, with 80M rows: 42 minutes with the table locked.

And while the column was being backfilled with the default, no SELECT on orders ran. Checkout, listings, admin panel — all stopped.

The reason? ALTER TABLE ... ADD COLUMN ... DEFAULT 'pending' on older Postgres rewrites the whole table while holding ACCESS EXCLUSIVE.

The right way, on modern Postgres:

ALTER TABLE orders ADD COLUMN status text;
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
-- backfill in batches
UPDATE orders SET status = 'pending' WHERE status IS NULL AND id BETWEEN 1 AND 10000;
-- ...
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;

Each step holds the lock for milliseconds. Not 42 minutes.


Row-level locks: the other enemy

SELECT ... FOR UPDATE locks the row until the transaction ends.

BEGIN;
SELECT * FROM accounts WHERE id = 42 FOR UPDATE;
-- meanwhile nobody else updates this account
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;

Useful to avoid race conditions on balances, stock, counters.

Dangerous when:

  • the transaction is long
  • the FOR UPDATE lives inside a loop
  • you forget the COMMIT (transaction left open by accident)

Difference between senior and junior:

  • junior uses FOR UPDATE everywhere "just in case"
  • senior uses it only where there's real contention, and times the transaction

The query that saves your Friday

Memorize this one. Seriously.

SELECT
  blocked.pid           AS blocked_pid,
  blocked.usename       AS blocked_user,
  blocked.query         AS blocked_query,
  blocking.pid          AS blocking_pid,
  blocking.usename      AS blocking_user,
  blocking.query        AS blocking_query,
  blocking.state        AS blocking_state,
  now() - blocked.query_start AS blocked_duration
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE NOT blocked.pid = blocking.pid;

One row: who's waiting, who's holding.

If anything shows up, somebody is blocking somebody.

If blocking_state is idle in transaction, you found your culprit: somebody opened a transaction and forgot to close it.


pg_locks by hand

SELECT
  l.pid,
  l.locktype,
  l.mode,
  l.granted,
  a.query,
  a.state,
  now() - a.xact_start AS tx_age
FROM pg_locks l
JOIN pg_stat_activity a USING (pid)
WHERE NOT l.granted
ORDER BY tx_age DESC;

Shows ungranted locks — i.e., processes waiting.

If the queue has 30 entries all waiting on the same pid, you have your killer.

Kill without mercy:

SELECT pg_cancel_backend(12345);    -- asks politely
SELECT pg_terminate_backend(12345); -- yanks the cable

The timeouts you should have set yesterday

Postgres lets you wait for a lock forever. By default. Yes, forever.

Configure this:

SET lock_timeout = '3s';
SET statement_timeout = '30s';
SET idle_in_transaction_session_timeout = '10s';

What each does:

lock_timeout
  └── give up waiting for a lock after N ms

statement_timeout
  └── kill a query running longer than N ms

idle_in_transaction_session_timeout
  └── kill an open transaction with no activity

In Rails:

# config/database.yml
production:
  variables:
    lock_timeout: 3000
    statement_timeout: 30000
    idle_in_transaction_session_timeout: 10000

In migrations, even more aggressive:

class AddIndexConcurrently < ActiveRecord::Migration[7.0]
  disable_ddl_transaction!

  def change
    execute "SET lock_timeout = '2s'"
    add_index :orders, :user_id, algorithm: :concurrently
  end
end

A migration that can't grab the lock in 2s fails fast, instead of taking the app down.


Mental hierarchy for diagnosis

App slow but no CPU/IO?
   ↓
Check pg_stat_activity
   ↓
Queries with state = 'active' for more than X seconds?
   ├── yes → classic slow query → EXPLAIN ANALYZE
   └── no
       ↓
Any ungranted locks in pg_locks?
   ├── yes → lock contention → find the blocker
   └── no → look elsewhere (network, app, GC)

Old 'idle in transaction' sessions?
   └── yes → open transaction → kill it and check the code

90% of "Postgres is slow and nobody understands why" lands on one of these branches.


Classic sins

1. ALTER TABLE without lock_timeout.

You think it'll take 1s. It doesn't. The table locks. Everything locks.

2. Rails transaction holding a lock while calling an external API.

Order.transaction do
  order.lock!
  PaymentGateway.charge(order)  # 8s of HTTP
  order.update!(paid: true)
end

For 8s, nobody touches this order. Worse: the pool connection is hostage too.

3. Migration combining ADD COLUMN NOT NULL DEFAULT on a big table.

We've seen it. Don't.

4. CREATE INDEX instead of CREATE INDEX CONCURRENTLY.

The first blocks writes on the table. The second doesn't.

5. A worker that does SELECT FOR UPDATE and processes slowly.

The row stays hostage for the entire job duration.


The correct workflow for schema changes in production

1. Assess impact (table size, traffic)
   ↓
2. Break into small steps (ADD, BACKFILL in batches, NOT NULL)
   ↓
3. Always SET lock_timeout before DDL
   ↓
4. CREATE INDEX CONCURRENTLY, always
   ↓
5. Run during off-peak
   ↓
6. Have a rollback plan
   ↓
7. Monitor pg_locks DURING the migration

Not paranoia. The bare minimum.


Conclusion

Lock contention is the kind of problem that:

  • doesn't show up in the APM
  • doesn't show up in CPU/memory metrics
  • only shows up when somebody opens pg_stat_activity

The difference between senior and junior here is brutal:

  • junior debugs by reading the application log
  • senior opens pg_locks in 30 seconds

Learn the lock modes. Set the timeouts. Treat DDL like surgery.

Locks are invisible until they aren't.

And when they aren't, it's usually in production, Friday night.

Lock contention: the silent killer in PostgreSQL
Lock contention: the silent killer in PostgreSQL