Resilient Cloud Deployment Platform

01 — Problem

What was hard about this

A 'real' production deployment surface — not a toy demo — has to handle the hard parts: AZ failure, blue-green cutover, private-subnet databases, IAM that follows least privilege, and infrastructure-as-code you can hand to the next engineer without a runbook. Most class projects stop at 'deploys to EC2'. This one had to survive an availability zone going dark.

02 — Architecture

How the pieces fit

Three-AZ Auto Scaling Group behind an ALB. RDS lives in private subnets, reachable only from the app tier. Email verification runs as a serverless side-channel (SNS → Lambda → DynamoDB → SendGrid) so signup latency stays low.

03 — Decisions

Trade-offs I'd defend in an interview

01 3 AZs over 2

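AWS bills for cross-AZ traffic, so running in 3 AZs costs more than running in 2. But with 2 AZs, losing one halves capacity instantly; with 3, you keep 67% (two of three zones). Put another way, riding out a zone loss at peak load means provisioning 2x peak across 2 AZs but only 1.5x across 3. For a service that's supposed to survive a real outage, the math favors 3, and production AWS deployments that need to ride out a zone failure commonly run in three.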
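
The capacity arithmetic can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the peak of 6 instances is made up, and the sketch assumes instances are spread evenly across zones.

```python
import math


def _check(num_azs: int) -> None:
    # Surviving the loss of one zone needs at least one other zone.
    if num_azs < 2:
        raise ValueError("need at least 2 AZs to survive losing one")


def surviving_fraction(num_azs: int) -> float:
    """Fraction of capacity left after one AZ fails, assuming an even spread."""
    _check(num_azs)
    return (num_azs - 1) / num_azs


def instances_per_az(peak_instances: int, num_azs: int) -> int:
    """Instances to run in each AZ so the surviving zones alone cover peak load."""
    _check(num_azs)
    return math.ceil(peak_instances / (num_azs - 1))


for n in (2, 3):
    total = instances_per_az(6, n) * n
    print(f"{n} AZs: keep {surviving_fraction(n):.0%} after a zone loss, "
          f"provision {total} instances to cover a peak of 6")
```

Under these assumptions, 2 AZs need 12 instances provisioned to absorb a zone failure at a peak of 6, while 3 AZs need 9, which offsets part of the extra cross-AZ traffic cost.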

02 Custom Packer AMIs over containers

ECS/EKS would be cleaner long-term, but the goal here was to learn the IaC-from-scratch surface: VPC, subnets, route tables, SGs, IAM, ASG, launch templates. Packer bakes the FastAPI app + deps + systemd unit into an AMI; the ASG launches new instances from each AMI version. Slower than container redeploys, but every primitive is in Terraform.

03 Blue-green via ASG instance refresh

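Each new AMI triggers an ASG instance refresh: new instances launch from the new AMI, the ALB drains connections from the old ones as they are replaced, and the cutover completes only after the replacements pass health checks. If health checks fail, the refresh halts and the instances not yet replaced keep serving. Strictly speaking, instance refresh is a rolling replacement rather than a full parallel blue-green fleet swap, but it is one of the simplest CI/CD patterns that gives real zero-downtime semantics on EC2.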
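
The halt-on-failure behavior can be modeled with a small simulation. This is a simplified sketch, not the ASG's actual algorithm: `rolling_refresh`, its parameters, and the default of 67% minimum healthy capacity are all hypothetical, and one health check per batch stands in for the ALB's per-instance checks.

```python
import math
from typing import Callable, List


def rolling_refresh(fleet: List[str], new_version: str,
                    is_healthy: Callable[[str], bool],
                    min_healthy_pct: int = 67) -> List[str]:
    """Replace instances batch by batch; halt at the first failed health check."""
    fleet = list(fleet)  # work on a copy so the caller's list is untouched
    must_stay_up = math.ceil(len(fleet) * min_healthy_pct / 100)
    # Number of instances that may be swapped out at the same time.
    batch = max(1, len(fleet) - must_stay_up)
    for start in range(0, len(fleet), batch):
        # Model "launch replacements, then health-check them" as a single call.
        if not is_healthy(new_version):
            # Halt: everything not yet replaced keeps serving the old version.
            return fleet
        for i in range(start, min(start + batch, len(fleet))):
            fleet[i] = new_version
    return fleet


print(rolling_refresh(["v1"] * 6, "v2", lambda version: True))
print(rolling_refresh(["v1"] * 6, "v2", lambda version: False))
```

A refresh with passing health checks ends with every instance on `v2`; one whose first check fails leaves the whole fleet on `v1`.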

04 DynamoDB for verification tokens, not RDS

Tokens are short-lived (TTL-expiring), high-write, low-read, and the access pattern is just 'lookup by token string'. RDS would mean attaching the Lambda to the VPC so it can reach the database in its private subnets (more networking config, connection management, and potentially slower cold starts). DynamoDB is fully managed, has built-in TTL, and Lambda reaches it over the regional AWS API endpoint with millisecond latency and no VPC attachment.
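
A minimal sketch of what one token item might look like, in DynamoDB's low-level attribute-value format. The attribute names (`token`, `email`, `expires_at`), the 15-minute lifetime, and the helper names are assumptions for illustration; the project's actual schema is not shown here. The result could be passed as the `Item` argument of a `put_item` call.

```python
import secrets
import time
from typing import Optional

# Hypothetical attribute name; it must match the attribute configured for TTL on the table.
TTL_ATTRIBUTE = "expires_at"


def build_token_item(email: str, ttl_seconds: int = 900,
                     now: Optional[float] = None) -> dict:
    """Build one verification-token item in DynamoDB attribute-value format."""
    issued = int(now if now is not None else time.time())
    return {
        "token": {"S": secrets.token_urlsafe(32)},  # partition key: lookup by token string
        "email": {"S": email},
        # DynamoDB TTL expects a Number attribute holding epoch seconds.
        TTL_ATTRIBUTE: {"N": str(issued + ttl_seconds)},
    }


def is_expired(item: dict, now: Optional[float] = None) -> bool:
    """TTL deletion is not immediate, so readers should still compare timestamps."""
    current = now if now is not None else time.time()
    return int(item[TTL_ATTRIBUTE]["N"]) <= current


item = build_token_item("user@example.com")
print(sorted(item), is_expired(item))
```

The explicit `is_expired` check matters because DynamoDB removes expired items in the background some time after the timestamp passes, so a lookup can still return a token that is past its expiry.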

05 SNS/Lambda for email: out of the request path

Sending email synchronously during signup ties your p99 latency to SendGrid's worst day. By publishing to SNS instead and letting Lambda handle email, the signup request returns the moment the user row is persisted. If email is slow or failing, the user still gets a fast signup; verification just takes longer.
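
The split between the request path and the email path can be sketched as follows. The function names, message fields, and injected callables (`save_user`, `publish`, `send_email`) are hypothetical stand-ins for the database write, the SNS publish, and the SendGrid call; only the `Records[].Sns.Message` event shape is what Lambda actually receives from an SNS trigger.

```python
import json
from typing import Callable


def signup(email: str, token: str,
           save_user: Callable[[str], None],
           publish: Callable[[str], None]) -> dict:
    """Request path: persist the user, hand the email work to SNS, return."""
    save_user(email)  # the database write the response waits on
    publish(json.dumps({"email": email, "token": token}))  # quick enqueue, no email sent here
    return {"status": "created", "email": email}


def process_sns_event(event: dict,
                      send_email: Callable[[str, str], None]) -> int:
    """Lambda side: unpack each SNS record and send the verification email."""
    handled = 0
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])  # SNS delivers the body as a JSON string
        send_email(message["email"], message["token"])
        handled += 1
    return handled


# Wire the two halves together with in-memory stand-ins.
saved, queued, sent = [], [], []
print(signup("user@example.com", "tok123", saved.append, queued.append))
fake_event = {"Records": [{"Sns": {"Message": queued[0]}}]}
print(process_sns_event(fake_event, lambda e, t: sent.append((e, t))), sent)
```

Because `signup` never calls the email provider, a slow or failing provider only delays `process_sns_event`, not the response to the user.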

04 — Outcomes

What shipped

Multi-AZ VPC across 3 zones; verified failover by terminating instances in a single AZ
Terraform manages every resource across 3 repos — webapp, infra, serverless
Zero-downtime CI/CD: PR validation → AMI bake → ASG rolling refresh
Serverless email side-channel keeps signup p99 latency decoupled from SendGrid

05 — Next

What I'd do if this had another sprint

Add CloudFront in front of the ALB for global edge caching of static assets
Migrate to ECS Fargate to drop the AMI bake step and tighten the deploy loop
Run a real GameDay: simulate a full AZ outage as a chaos-engineering experiment and measure recovery time
Publish a cost teardown — $/month per traffic tier, separated by service

06 — Visual proof

See it in code