01 — Problem
What was hard about this
Async workloads — image transforms, PDF rendering, webhook delivery — need queue-based processing. Production task queues fail in subtle ways: silent message drops under broker restarts, duplicate execution when a worker dies mid-task, and cascading failures when a downstream API gets slow and consumes every worker thread. A naive Celery setup hits all three within the first month of real traffic.
02 — Architecture
How the pieces fit
03 — Decisions
Trade-offs I'd defend in an interview
01Dead-letter routing instead of silent drops
Celery's default behavior after exhausted retries is to log and move on — messages disappear. I added an explicit dead-letter exchange in RabbitMQ so failed jobs land in a queue I can inspect, replay, or alert on. The DLQ has its own dashboard panel; if it grows, on-call gets paged.
02Idempotency keys, not natural-key dedup
Natural-key dedup (e.g. 'has this user_id+image_id been processed?') breaks down when retries cross worker boundaries. Clients pass an idempotency key on enqueue; Redis SETNX with a TTL gates duplicate enqueues. Survives worker crashes mid-task and broker restarts.
03Circuit breakers in Redis, not in-process
An in-process circuit breaker (e.g. pybreaker) doesn't share state across worker processes — each one has to fail independently before tripping. Backing the breaker state in Redis makes it cluster-wide: one worker tripping the breaker protects the whole pool from hammering a sick upstream.
04RabbitMQ over Kafka or SQS
Kafka's strengths (high-throughput log, replay) didn't match the workload — these are jobs, not events. SQS is fine but FIFO queue limits + no native DLQ pattern made it awkward. RabbitMQ gives me priority queues, native DLQ exchanges, and a well-understood operational model.
04 — Outcomes
What shipped
- Zero message loss across 10K+ test jobs — the DLQ caught every failure that would have been silent
- 100% duplicate elimination measured by reprocessing the same idempotency key across forced worker crashes
- Sub-50ms p99 API latency under concurrent load (rate limiter short-circuits before queueing on overload)
- 8 services orchestrated via Docker Compose; Terraform deploys the same topology to ECS Fargate
05 — Next
What I'd do if this had another sprint
- Replace Docker Compose with EKS to practice K8s operational patterns
- Add OpenTelemetry distributed tracing — correlation IDs are propagated but not yet emitted as spans
- Add a Kafka topic for fan-out events (e.g. job completed → downstream consumers)
- Publish load-test results from k6 with p50/p95/p99 graphs as part of the README