01 — Problem
What was hard about this
A 'real' production deployment surface — not a toy demo — has to handle the hard parts: AZ failure, blue-green cutover, private-subnet databases, IAM that follows least privilege, and infrastructure-as-code you can hand to the next engineer without a runbook. Most class projects stop at 'deploys to EC2'. This one had to survive an availability zone going dark.
02 — Architecture
How the pieces fit
03 — Decisions
Trade-offs I'd defend in an interview
01. 3 AZs over 2
AWS bills for cross-AZ traffic, so 3 AZs cost more than 2. But with 2 AZs, losing one halves capacity instantly; with 3, you keep about 67%. For a service that's supposed to survive a real outage, the math favors 3, which is also a common choice for production AWS deployments.
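The capacity math above can be worked through in a few lines. This is a minimal sketch that assumes instances are spread evenly across AZs; the function names are illustrative and not part of the project.

```python
from fractions import Fraction


def capacity_after_az_loss(az_count: int) -> Fraction:
    """Fraction of capacity left when one of `az_count` evenly loaded AZs fails."""
    if az_count < 2:
        raise ValueError("need at least 2 AZs to survive losing one")
    return Fraction(az_count - 1, az_count)


def overprovision_factor(az_count: int) -> Fraction:
    """Fleet size, relative to peak demand, so the surviving AZs still cover peak."""
    return 1 / capacity_after_az_loss(az_count)


for n in (2, 3):
    kept = float(capacity_after_az_loss(n))
    needed = float(overprovision_factor(n))
    print(f"{n} AZs: keep {kept:.0%}, provision {needed:.0%} of peak")
```

Read the other way around: to ride out an AZ loss at full load, a 2-AZ fleet has to be sized at 200% of peak, while a 3-AZ fleet needs 150%. That partly offsets the extra cross-AZ traffic cost.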
02. Custom Packer AMIs over containers
ECS/EKS would be cleaner long-term, but the goal here was to learn the IaC-from-scratch surface: VPC, subnets, route tables, SGs, IAM, ASG, launch templates. Packer bakes the FastAPI app + deps + systemd unit into an AMI; the ASG launches new instances from each AMI version. Slower than container redeploys, but every primitive is in Terraform.
03. Blue-green via ASG instance refresh
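Each new AMI triggers an ASG instance refresh: the ASG replaces instances in batches while keeping a minimum share of the fleet healthy, the ALB drains connections from instances being retired, and the refresh only proceeds as replacements pass health checks. Strictly speaking this is a rolling replacement rather than two full parallel fleets, but it gives the same gradual, health-gated cutover. If health checks fail, the refresh stops and the remaining old instances keep serving. This is one of the simplest CI/CD patterns that gives real zero-downtime semantics on EC2.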
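As an illustration of what starting such a refresh looks like, here is a hedged sketch using the boto3 `start_instance_refresh` call. How the actual pipeline triggers the refresh is not stated here, so the ASG name, the 90% healthy floor, and the 120-second warmup are all assumed values.

```python
from typing import Any, Dict


def build_refresh_request(
    asg_name: str, min_healthy_pct: int = 90, warmup_s: int = 120
) -> Dict[str, Any]:
    """Build the keyword arguments for AutoScaling start_instance_refresh."""
    if not 0 <= min_healthy_pct <= 100:
        raise ValueError("min_healthy_pct must be between 0 and 100")
    return {
        "AutoScalingGroupName": asg_name,
        "Strategy": "Rolling",
        "Preferences": {
            # Share of the fleet that must stay in service during replacement.
            "MinHealthyPercentage": min_healthy_pct,
            # Seconds to wait before a new instance counts toward that share.
            "InstanceWarmup": warmup_s,
        },
    }


def start_refresh(asg_name: str) -> str:
    """Start the refresh and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the builder runs without it

    client = boto3.client("autoscaling")
    response = client.start_instance_refresh(**build_refresh_request(asg_name))
    return response["InstanceRefreshId"]


# "webapp-asg" is a hypothetical group name used only for this example.
print(build_refresh_request("webapp-asg"))
```

A higher `MinHealthyPercentage` slows the rollout but leaves more capacity serving while instances are swapped.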
04. DynamoDB for verification tokens, not RDS
Tokens are short-lived (TTL-expiring) and high-write, low-read, and the access pattern is just 'lookup by token string'. RDS would mean attaching the Lambda to the VPC to reach the private-subnet database (slower cold starts, more config). DynamoDB is fully managed, has built-in TTL, and Lambda reaches it over the public AWS API with millisecond latency.
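A token item for this pattern might look like the sketch below. The table name, attribute names, and 15-minute lifetime are assumptions made for illustration. What DynamoDB itself requires is that the TTL attribute be a Number holding a Unix epoch time in seconds.

```python
import secrets
import time
from typing import Dict, Optional

TOKEN_TTL_SECONDS = 15 * 60  # assumed lifetime, not taken from the project


def new_token_item(email: str, now: Optional[float] = None) -> Dict[str, Dict[str, str]]:
    """Build a DynamoDB item (low-level attribute format) for a fresh token."""
    issued = int(time.time() if now is None else now)
    return {
        "token": {"S": secrets.token_urlsafe(32)},  # partition key: lookup by token
        "email": {"S": email},
        "expires_at": {"N": str(issued + TOKEN_TTL_SECONDS)},  # TTL attribute
    }


def is_expired(item: Dict[str, Dict[str, str]], now: Optional[float] = None) -> bool:
    """Check expiry on read; TTL deletion is a background process, not instant."""
    current = int(time.time() if now is None else now)
    return current >= int(item["expires_at"]["N"])


def store_token(item: Dict[str, Dict[str, str]], table_name: str = "verification_tokens") -> None:
    """Write the item (needs boto3 and AWS credentials; table name is hypothetical)."""
    import boto3  # third-party; imported lazily so the helpers above run without it

    boto3.client("dynamodb").put_item(TableName=table_name, Item=item)
```

Because TTL deletes can lag behind the expiry time, the `is_expired` check on read is what actually enforces the token lifetime; TTL just keeps the table from growing.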
05. SNS/Lambda for email: out of the request path
Sending email synchronously during signup ties your p99 latency to SendGrid's worst day. By publishing to SNS instead and letting Lambda handle email, the signup request returns the moment the user row is persisted. If email is slow or failing, the user still gets a fast signup; verification just takes longer.
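The hand-off can be sketched as two small pieces: the message the signup path publishes, and a Lambda handler that unpacks it. The payload fields and the `send_verification_email` stub are hypothetical. The `Records[].Sns.Message` path is the standard shape of an SNS event delivered to Lambda.

```python
import json
from typing import Any, Callable, Dict


def build_signup_message(email: str, token: str) -> str:
    """Serialize the payload the signup path would publish to the SNS topic."""
    return json.dumps({"email": email, "token": token})


def publish_signup(topic_arn: str, email: str, token: str) -> None:
    """Publish and return; email delivery happens later, off the request path."""
    import boto3  # third-party; imported lazily so the rest runs without it

    boto3.client("sns").publish(
        TopicArn=topic_arn, Message=build_signup_message(email, token)
    )


def send_verification_email(email: str, token: str) -> None:
    """Hypothetical stand-in for the real email provider call."""
    print(f"would email {email} a link containing token {token}")


def handler(
    event: Dict[str, Any],
    context: Any,
    send: Callable[[str, str], None] = send_verification_email,
) -> Dict[str, int]:
    """Lambda entry point: one SNS record per signup."""
    sent = 0
    for record in event.get("Records", []):
        payload = json.loads(record["Sns"]["Message"])
        send(payload["email"], payload["token"])
        sent += 1
    return {"sent": sent}
```

The `send` parameter exists only so the handler can be exercised locally with a fake sender; Lambda itself calls `handler(event, context)`.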
04 — Outcomes
What shipped
- Multi-AZ VPC across 3 zones; verified failover by terminating instances in a single AZ
- Terraform manages every resource across 3 repos — webapp, infra, serverless
- Zero-downtime CI/CD: PR validation → AMI bake → ASG rolling refresh
- Serverless email side-channel keeps signup p99 latency decoupled from SendGrid
05 — Next
What I'd do if this had another sprint
- Add CloudFront in front of the ALB for global edge caching of static assets
- Migrate to ECS Fargate to drop the AMI bake step and tighten the deploy loop
- Run a real GameDay: simulate the loss of an AZ as a chaos-engineering experiment and measure recovery time
- Publish a cost teardown — $/month per traffic tier, separated by service