A dead letter queue (DLQ) is a queue for messages or jobs that have repeatedly failed processing. Instead of being dropped after N retries, failed items land in the DLQ for manual inspection.
Classic pattern:
- Job lands in main queue
- Worker tries processing
- Fail? Retry with exponential backoff (1s, 2s, 4s, 8s...)
- After 5-10 failed attempts → move to DLQ
- DLQ has alerting (Slack, email) and a UI for inspection
- Engineer reviews failed jobs, fixes root cause, replays from DLQ → main queue
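The steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production worker: `Job`, `process_with_dlq`, `replay`, and the constants are hypothetical names, `MAX_ATTEMPTS = 5` is simply the low end of the 5-10 range, and a real system would use a durable broker instead of `deque`.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List

MAX_ATTEMPTS = 5           # assumption: low end of the 5-10 range
BASE_DELAY_SECONDS = 1.0   # first retry delay; doubles on each retry


@dataclass
class Job:
    payload: dict
    attempts: int = 0
    errors: List[str] = field(default_factory=list)  # kept for inspection


def backoff_delay(attempt: int) -> float:
    """Delay after failed attempt number `attempt` (1-based): 1s, 2s, 4s, 8s..."""
    return BASE_DELAY_SECONDS * 2 ** (attempt - 1)


def process_with_dlq(
    job: Job,
    handler: Callable[[dict], None],
    dlq: Deque[Job],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `handler` with retries; move the job to `dlq` once the budget is spent."""
    while job.attempts < MAX_ATTEMPTS:
        job.attempts += 1
        try:
            handler(job.payload)
            return True
        except Exception as exc:
            job.errors.append(repr(exc))
            if job.attempts < MAX_ATTEMPTS:
                sleep(backoff_delay(job.attempts))
    dlq.append(job)  # retry budget exhausted: park the job instead of dropping it
    return False


def replay(dlq: Deque[Job], main_queue: Deque[Job]) -> int:
    """Move every dead-lettered job back to the main queue with a fresh budget."""
    moved = 0
    while dlq:
        job = dlq.popleft()
        job.attempts = 0
        main_queue.append(job)
        moved += 1
    return moved
```

The `sleep` parameter is injectable so the backoff schedule can be checked without waiting. The `errors` list is what an engineer would read in the inspection UI before deciding to call `replay`.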
What ends up there in practice:
- Permanent failures (404, malformed data, banned account)
- Schema drift (target API changed format)
- Network outages longer than retry budget
- Software bugs that surfaced only in production
Without a DLQ, failed jobs are silently lost: a customer asks "where is my report?" and it turns out the pipeline has been failing for 3 days without anyone noticing. With a DLQ, every failed job is visible, recoverable, and auditable.
Implementation: RabbitMQ has a built-in DLX (dead letter exchange). AWS SQS has a redrive policy. Redis Streams has no built-in DLQ, so it needs manual logic. PostgreSQL works with a status column. See our engineering principle "Dead-letter everything".
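As an illustration of the status-column variant, here is a rough sketch. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server; the `jobs` table, its columns, the status values, and `MAX_ATTEMPTS` are all assumptions made for the example, not an existing schema.

```python
import sqlite3

MAX_ATTEMPTS = 5  # assumption: retry budget before a job is marked dead

# Hypothetical schema: 'dead' rows play the role of the DLQ.
SCHEMA = """
CREATE TABLE jobs (
    id         INTEGER PRIMARY KEY,
    payload    TEXT    NOT NULL,
    status     TEXT    NOT NULL DEFAULT 'pending',  -- pending | done | dead
    attempts   INTEGER NOT NULL DEFAULT 0,
    last_error TEXT
)
"""


def record_failure(conn: sqlite3.Connection, job_id: int, error: str) -> None:
    """Count one failed attempt; flip the job to 'dead' when the budget is spent."""
    # Note: sqlite3 uses '?' placeholders; a PostgreSQL driver such as psycopg uses '%s'.
    conn.execute(
        """
        UPDATE jobs
        SET attempts   = attempts + 1,
            last_error = ?,
            status     = CASE WHEN attempts + 1 >= ? THEN 'dead' ELSE 'pending' END
        WHERE id = ?
        """,
        (error, MAX_ATTEMPTS, job_id),
    )
    conn.commit()


def dead_jobs(conn: sqlite3.Connection) -> list:
    """Rows an engineer would inspect: the DLQ view of the table."""
    return conn.execute(
        "SELECT id, attempts, last_error FROM jobs WHERE status = 'dead' ORDER BY id"
    ).fetchall()


def replay_dead(conn: sqlite3.Connection) -> int:
    """Return dead jobs to 'pending' with a fresh retry budget; returns rows moved."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'pending', attempts = 0 WHERE status = 'dead'"
    )
    conn.commit()
    return cur.rowcount
```

Keeping dead jobs in the same table means inspection is a plain `SELECT` and replay is a single `UPDATE`, at the cost of the main table growing with failed rows.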