Featured Project · 5 min read

Resilient Cloud Deployment Platform

Full AWS Stack — Terraform IaC, Multi-AZ, Blue-Green CI/CD

Production-grade cloud platform spanning 3 repositories: a FastAPI web service, Terraform IaC for the full AWS stack, and serverless Lambda functions for event-driven email verification. Designed to survive an AZ outage and ship via PR-triggered rolling deploys.

AWS · Terraform · FastAPI · Packer · GitHub Actions · Lambda · SNS · DynamoDB · RDS PostgreSQL · S3

01 — Problem

What was hard about this

A 'real' production deployment surface — not a toy demo — has to handle the hard parts: AZ failure, blue-green cutover, private-subnet databases, IAM that follows least privilege, and infrastructure-as-code you can hand to the next engineer without a runbook. Most class projects stop at 'deploys to EC2'. This one had to survive an availability zone going dark.

02 — Architecture

How the pieces fit

Three-AZ Auto Scaling Group behind an ALB. RDS lives in private subnets, reachable only from the app tier. Email verification runs as a serverless side-channel (SNS → Lambda → DynamoDB → SendGrid) so signup latency stays low.

03 — Decisions

Trade-offs I'd defend in an interview

01. 3 AZs over 2

AWS bills for cross-AZ data transfer, so spreading across 3 AZs costs more than 2. But with 2 AZs, losing one halves capacity instantly; with 3, you keep about 67%. For a service that's supposed to survive a real outage, the math favors 3, and three-AZ layouts are a common choice for production workloads on AWS.
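
The capacity argument can be checked with a few lines of arithmetic. This is a minimal sketch that assumes instances are spread evenly across AZs; the function names and the peak load of 6 instances are hypothetical, chosen only for illustration.

```python
import math


def surviving_fraction(az_count: int) -> float:
    """Fraction of capacity left after losing one AZ, assuming an even spread."""
    if az_count < 2:
        raise ValueError("need at least 2 AZs to survive losing one")
    return (az_count - 1) / az_count


def instances_per_az(peak_instances: int, az_count: int) -> int:
    """Instances to run in each AZ so peak load still fits after one AZ fails."""
    if az_count < 2:
        raise ValueError("need at least 2 AZs to survive losing one")
    return math.ceil(peak_instances / (az_count - 1))


if __name__ == "__main__":
    peak = 6  # hypothetical number of instances needed at peak load
    for n in (2, 3):
        per_az = instances_per_az(peak, n)
        print(f"{n} AZs: keep {surviving_fraction(n):.0%}, "
              f"run {per_az} per AZ, {per_az * n} total")
```

Under these assumptions, a fleet sized to absorb one AZ loss needs 12 instances across 2 AZs but only 9 across 3, so the extra AZ also reduces the standby capacity you pay for.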

02. Custom Packer AMIs over containers

ECS/EKS would be cleaner long-term, but the goal here was to learn the IaC-from-scratch surface: VPC, subnets, route tables, SGs, IAM, ASG, launch templates. Packer bakes the FastAPI app + deps + systemd unit into an AMI; the ASG launches new instances from each AMI version. Slower than container redeploys, but every primitive is in Terraform.

03. Blue-green via ASG instance refresh

Each new AMI triggers an ASG instance refresh, which replaces the fleet in rolling batches: new instances launch from the new AMI, the ALB drains connections from the old ones, and a batch counts as done only after its instances pass health checks. If health checks fail, the refresh halts and the instances not yet replaced keep serving. Strictly speaking this is a rolling replacement rather than two parallel fleets, but it is the simplest CI/CD pattern that gives real zero-downtime semantics on EC2.
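
As a rough illustration of the deploy step, the sketch below builds the request for the Auto Scaling `start_instance_refresh` API and sends it through a client passed in by the caller (for example `boto3.client("autoscaling")`). The ASG name, the 90% healthy floor, and the 120-second warmup are assumptions for the example, not values taken from the project.

```python
from typing import Any


def build_refresh_request(asg_name: str, min_healthy_pct: int = 90,
                          warmup_seconds: int = 120) -> dict[str, Any]:
    """Build keyword arguments for an Auto Scaling instance refresh."""
    if not 0 <= min_healthy_pct <= 100:
        raise ValueError("min_healthy_pct must be between 0 and 100")
    return {
        "AutoScalingGroupName": asg_name,
        "Strategy": "Rolling",
        "Preferences": {
            # Share of desired capacity that must stay healthy while
            # instances are being replaced.
            "MinHealthyPercentage": min_healthy_pct,
            # Seconds a new instance gets to warm up before it is
            # counted as ready.
            "InstanceWarmup": warmup_seconds,
        },
    }


def start_refresh(autoscaling_client: Any, asg_name: str) -> str:
    """Start the refresh and return its ID so CI can poll for completion."""
    response = autoscaling_client.start_instance_refresh(
        **build_refresh_request(asg_name)
    )
    return response["InstanceRefreshId"]
```

Passing the client in keeps the function easy to exercise in CI with a stub; a pipeline step would typically poll the returned refresh ID until it reports success or failure.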

04. DynamoDB for verification tokens, not RDS

Tokens are short-lived (TTL-expiring), high-write, and low-read, and the access pattern is just 'lookup by token string'. Using RDS would mean attaching the Lambda to the VPC so it can reach the private-subnet database, which adds networking config and connection management for short-lived functions. DynamoDB is fully managed, has built-in TTL, and Lambda reaches it over the public AWS API with millisecond latency.
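
A minimal sketch of what a token item could look like, assuming a table keyed on `token` with TTL enabled on an `expires_at` attribute. The attribute names and the 15-minute lifetime are hypothetical. DynamoDB TTL expects the expiry as a Unix epoch time in seconds, and expired items are deleted in the background rather than at the exact expiry moment, so the lookup path should still compare timestamps itself.

```python
import secrets
import time
from typing import Optional

TOKEN_TTL_SECONDS = 15 * 60  # hypothetical lifetime for a verification link


def build_token_item(email: str, now: Optional[float] = None,
                     ttl_seconds: int = TOKEN_TTL_SECONDS) -> dict:
    """Create the item to store: random token plus epoch-seconds expiry."""
    issued_at = int(time.time()) if now is None else int(now)
    return {
        "token": secrets.token_urlsafe(32),      # partition key (assumed)
        "email": email,
        "expires_at": issued_at + ttl_seconds,   # TTL attribute (assumed name)
    }


def is_token_valid(item: Optional[dict], now: Optional[float] = None) -> bool:
    """Reject missing items and items past expiry that TTL has not yet removed."""
    if item is None:
        return False
    current = int(time.time()) if now is None else int(now)
    return current < item["expires_at"]
```

The `now` parameter exists only to make the expiry logic deterministic in tests; production callers would leave it unset.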

05. SNS/Lambda for email — out of the request path

Sending email synchronously during signup ties your p99 latency to SendGrid's worst day. By publishing to SNS instead and letting Lambda handle email, the signup request returns the moment the user row is persisted. If email is slow or failing, the user still gets a fast signup; verification just takes longer.
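
The hand-off between the two halves can be sketched as follows: the web service publishes a small JSON message, and the Lambda unpacks it from the SNS event. The message fields (`email`, `token`) and function names are assumptions for illustration; the SNS client is passed in (for example `boto3.client("sns")`), and the actual SendGrid call is left out.

```python
import json
from typing import Any


def publish_signup_event(sns_client: Any, topic_arn: str,
                         email: str, token: str) -> str:
    """Called by the web service after the user row is persisted."""
    message = json.dumps({"email": email, "token": token})
    response = sns_client.publish(TopicArn=topic_arn, Message=message)
    return response["MessageId"]


def extract_signups(event: dict) -> list[dict]:
    """Called inside the Lambda handler: decode each SNS record's payload."""
    return [
        json.loads(record["Sns"]["Message"])
        for record in event.get("Records", [])
    ]


def handler(event: dict, context: Any) -> dict:
    """Hypothetical Lambda entry point; sending the email is omitted here."""
    signups = extract_signups(event)
    # For each signup: store the token, then send the verification email.
    return {"processed": len(signups)}
```

Because the signup request only waits on the publish call, a slow or failing email provider shows up as delayed verification rather than as a slow signup response.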

04 — Outcomes

What shipped

  • Multi-AZ VPC across 3 zones; verified failover by terminating instances in a single AZ
  • Terraform manages every resource across 3 repos — webapp, infra, serverless
  • Zero-downtime CI/CD: PR validation → AMI bake → ASG rolling refresh
  • Serverless email side-channel keeps signup p99 latency decoupled from SendGrid

05 — Next

What I'd do if this had another sprint

  • Add CloudFront in front of the ALB for global edge caching of static assets
  • Migrate to ECS Fargate to drop the AMI bake step and tighten the deploy loop
  • Run a real GameDay: simulate a full AZ outage with chaos-engineering tooling and measure recovery time
  • Publish a cost teardown — $/month per traffic tier, separated by service

06 — Visual proof

See it in code