Distributed Task Queue | Malav Gajera

01 — Problem

What was hard about this

Async workloads — image transforms, PDF rendering, webhook delivery — need queue-based processing. Production task queues fail in subtle ways: silent message drops under broker restarts, duplicate execution when a worker dies mid-task, and cascading failures when a downstream API gets slow and consumes every worker thread. A naive Celery setup hits all three within the first month of real traffic.

02 — Architecture

How the pieces fit

Loading diagram…

Producer dedupes via idempotency key, RabbitMQ retries with exponential backoff before falling to a DLQ, Redis-backed circuit breakers shed load when downstream APIs slow down.

03 — Decisions

Trade-offs I'd defend in an interview

01Dead-letter routing instead of silent drops

Celery's default behavior after exhausted retries is to log and move on — messages disappear. I added an explicit dead-letter exchange in RabbitMQ so failed jobs land in a queue I can inspect, replay, or alert on. The DLQ has its own dashboard panel; if it grows, on-call gets paged.

02Idempotency keys, not natural-key dedup

Natural-key dedup (e.g. 'has this user_id+image_id been processed?') breaks down when retries cross worker boundaries. Clients pass an idempotency key on enqueue; Redis SETNX with a TTL gates duplicate enqueues. Survives worker crashes mid-task and broker restarts.

03Circuit breakers in Redis, not in-process

An in-process circuit breaker (e.g. pybreaker) doesn't share state across worker processes — each one has to fail independently before tripping. Backing the breaker state in Redis makes it cluster-wide: one worker tripping the breaker protects the whole pool from hammering a sick upstream.

04RabbitMQ over Kafka or SQS

Kafka's strengths (high-throughput log, replay) didn't match the workload — these are jobs, not events. SQS is fine but FIFO queue limits + no native DLQ pattern made it awkward. RabbitMQ gives me priority queues, native DLQ exchanges, and a well-understood operational model.

04 — Outcomes

What shipped

Zero message loss across 10K+ test jobs — the DLQ caught every failure that would have been silent
100% duplicate elimination measured by reprocessing the same idempotency key across forced worker crashes
Sub-50ms p99 API latency under concurrent load (rate limiter short-circuits before queueing on overload)
8 services orchestrated via Docker Compose; Terraform deploys the same topology to ECS Fargate

05 — Next

What I'd do if this had another sprint

Replace Docker Compose with EKS to practice K8s operational patterns
Add OpenTelemetry distributed tracing — correlation IDs are propagated but not yet emitted as spans
Add a Kafka topic for fan-out events (e.g. job completed → downstream consumers)
Publish load-test results from k6 with p50/p95/p99 graphs as part of the README

06 — Visual proof

See it in code

Back to all projects Want to talk about how this would fit your team?