Part 1 — What ELB Is and How It Works (Clear Foundation)
What ELB does (in one breath)
ELB is AWS’s managed traffic director. You define listeners (ports/protocols), rules (how to route), and target groups (the things that serve traffic). ELB then probes health, spreads requests across Availability Zones, terminates TLS if you want, and emits metrics/logs for ops.
The four ELB types (when to pick which)
- ALB (Application LB) — Layer 7 (HTTP/1.1, HTTP/2, gRPC, WebSockets). Path/host/header routing, redirects, OIDC/Cognito auth, Lambda targets. Default for web/APIs.
- NLB (Network LB) — Layer 4 (TCP/UDP/TLS). Static IPs/EIPs, ultra-low latency, client IP preservation, TLS pass-through or termination. Use for non-HTTP or when you need static IPs.
- GWLB (Gateway LB) — Inserts security/inspection appliances using GENEVE. Use for centralized inspection.
- CLB (Classic LB) — Legacy. Avoid for new builds.
Core pieces you’ll see in the example
- Listener: “what to do with :80 / :443” (redirect, forward, fixed response, authenticate).
- Rule (ALB): match on host/path/headers/query → action (forward to a target group).
- Target Group: the pool of EC2/ECS/EKS IPs (or Lambda). Holds health checks, stickiness, slow-start, deregistration delay.
- Cross-zone LB: any node can send to any healthy target in any AZ (smoother failover & utilization).
- Security: TLS via ACM, WAF for L7 filtering, Shield for DDoS, Security Groups for least-privilege.
- Observability: CloudWatch metrics (TargetResponseTime, 4xx/5xx counts), access logs to S3, traces in your app.
How a request flows (ALB)
- DNS resolves app.example.com → the ALB.
- The :443 listener terminates TLS (ACM cert) and evaluates its rules.
- A rule forwards to a target group; only healthy targets receive traffic.
- ALB keeps probing; unhealthy targets are drained and skipped.
- Metrics/logs are emitted; you watch latency and error rates.
Part 2 — Real-World Build: ALB for a Web + API on ECS (Explained as We Go)
Scenario: A regional app with web UI and API on ECS Fargate. We want HTTPS, path-based routing, health checks that reflect real dependencies, WAF, access logs, and safe canaries for zero-downtime deploys.
Assumptions to keep this focused: you already have a VPC with two public subnets (for the ALB) and two private subnets (for ECS tasks), a hosted zone for example.com, and an ACM cert for app.example.com in the same region as the ALB.
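The snippets below also lean on a handful of input variables. Here is a minimal set of declarations so the examples stand alone; the names simply mirror the var.* references used in later steps.
# Inputs referenced throughout Part 2
variable "vpc_id" { type = string }
variable "public_subnet_ids" { type = list(string) }
variable "private_subnet_ids" { type = list(string) }
variable "hosted_zone_id" { type = string }
variable "acm_certificate_arn" { type = string }
variable "ecs_cluster_id" { type = string }
variable "web_taskdef_arn" { type = string }
variable "api_taskdef_arn" { type = string }
variable "wafv2_web_acl_arn" { type = string }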
Step 0 — Security groups (why first?)
We lock down traffic up-front so later steps don’t accidentally expose tasks.
Explanation:
- The ALB needs to accept :443 from the internet.
- Tasks should only accept traffic from the ALB, not from the world.
# SG for the ALB: allow 443 from anywhere; egress to tasks
resource "aws_security_group" "alb" {
name = "sg-alb"
description = "Ingress 443 from internet"
vpc_id = var.vpc_id
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
# SG for ECS tasks: only allow port 80 from the ALB SG
resource "aws_security_group" "tasks" {
name = "sg-tasks"
vpc_id = var.vpc_id
ingress {
from_port = 80
to_port = 80
protocol = "tcp"
security_groups = [aws_security_group.alb.id] # only ALB can call tasks
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
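If you want the ALB's egress to follow the same least-privilege idea instead of the allow-all rule above, here is a sketch of a standalone egress rule scoped to the task SG on port 80; you would drop the broad inline egress block from aws_security_group.alb so Terraform doesn't manage the same rules twice.
# Optional tightening: ALB may only reach the task SG on port 80
resource "aws_security_group_rule" "alb_to_tasks" {
  type                     = "egress"
  security_group_id        = aws_security_group.alb.id
  from_port                = 80
  to_port                  = 80
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.tasks.id
}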
Step 1 — Target groups (split web vs API so they scale/fail independently)
Explanation:
- Separate TGs let you set different health checks (e.g., API must check DB).
- We’ll later route /api/* to tg-api and everything else (the default) to tg-web.
resource "aws_lb_target_group" "web" {
name = "tg-web"
vpc_id = var.vpc_id
port = 80
protocol = "HTTP"
target_type = "ip" # Fargate (awsvpc) tasks register by IP
health_check {
path = "/healthz"
matcher = "200-399"
interval = 15
timeout = 5
healthy_threshold = 2
unhealthy_threshold = 2
}
# Sticky only if you truly need in-memory session continuity
stickiness {
type = "lb_cookie"
cookie_duration = 300
enabled = false
}
deregistration_delay = 30 # drain during deploys
slow_start = 0 # set >0 to ramp new tasks gently
}
resource "aws_lb_target_group" "api" {
name = "tg-api"
vpc_id = var.vpc_id
port = 80
protocol = "HTTP"
target_type = "ip" # Fargate (awsvpc) tasks register by IP
health_check {
path = "/healthz?db=true" # fail when DB isn’t healthy
matcher = "200-399"
}
deregistration_delay = 30
}
Step 2 — Create the ALB (public, cross-zone on)
Explanation:
- Put the ALB in two public subnets for resilience.
- Cross-zone evens out traffic across AZs and smooths failover.
resource "aws_lb" "app" {
name = "app-alb"
load_balancer_type = "application"
security_groups = [aws_security_group.alb.id]
subnets = var.public_subnet_ids
enable_cross_zone_load_balancing = true # ALBs have cross-zone on by default; kept explicit for clarity
idle_timeout = 60 # OK for typical APIs; raise for WebSockets or move to NLB
}
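If the ALB lives in its own module or stack, a couple of outputs keep consumers from hard-coding anything; the Route 53 step later uses exactly these attributes.
output "alb_dns_name" {
  value = aws_lb.app.dns_name
}
output "alb_zone_id" {
  value = aws_lb.app.zone_id
}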
Step 3 — Listeners: force HTTPS, then route by path
Explanation:
- Always redirect HTTP→HTTPS.
- Use the ACM certificate on the HTTPS listener.
- The default action forwards to web; /api/* goes to api.
# :80 → 301 redirect to HTTPS
resource "aws_lb_listener" "http" {
load_balancer_arn = aws_lb.app.arn
port = 80
protocol = "HTTP"
default_action {
type = "redirect"
redirect {
port = "443"
protocol = "HTTPS"
status_code = "HTTP_301"
}
}
}
# :443 → TLS termination + routing
resource "aws_lb_listener" "https" {
load_balancer_arn = aws_lb.app.arn
port = 443
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-TLS13-1-2-2021-06"
certificate_arn = var.acm_certificate_arn
# Default: forward to web
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.web.arn
}
}
# Rule: /api/* → API target group
resource "aws_lb_listener_rule" "api_path" {
listener_arn = aws_lb_listener.https.arn
priority = 10
action {
type = "forward"
target_group_arn = aws_lb_target_group.api.arn
}
condition {
path_pattern {
values = ["/api/*"]
}
}
}
Step 4 — Attach ECS services to the TGs (the key bit of service wiring)
Explanation:
- Register/deregister tasks automatically so ELB knows where to send traffic.
- Use private subnets for tasks; the ALB sits in public ones.
# web
resource "aws_ecs_service" "web" {
name = "web"
cluster = var.ecs_cluster_id
task_definition = var.web_taskdef_arn
desired_count = 2
launch_type = "FARGATE"
network_configuration {
subnets = var.private_subnet_ids
security_groups = [aws_security_group.tasks.id]
}
load_balancer {
target_group_arn = aws_lb_target_group.web.arn
container_name = "web"
container_port = 80
}
health_check_grace_period_seconds = 30
}
# api
resource "aws_ecs_service" "api" {
name = "api"
cluster = var.ecs_cluster_id
task_definition = var.api_taskdef_arn
desired_count = 2
launch_type = "FARGATE"
network_configuration {
subnets = var.private_subnet_ids
security_groups = [aws_security_group.tasks.id]
}
load_balancer {
target_group_arn = aws_lb_target_group.api.arn
container_name = "api"
container_port = 80
}
health_check_grace_period_seconds = 30
}
Step 5 — Add WAF and access logs (security + observability before traffic)
Explanation:
- WAF blocks common attacks and bad bots at L7.
- Access logs let you debug 4xx/5xx quickly and analyze usage.
# Access logs from the ALB to S3 (add lifecycle rules to the bucket to control cost)
resource "aws_s3_bucket" "alb_logs" {
bucket = "my-alb-logs-123456"
}
# Turn on logging by adding an access_logs block to the existing aws_lb.app
# resource from Step 2 (a separate aws_lb resource would create a second ALB):
resource "aws_lb" "app" {
# ...same arguments as in Step 2...
access_logs {
bucket = aws_s3_bucket.alb_logs.bucket
prefix = "alb"
enabled = true
}
}
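One thing the snippet above glosses over: ALB log delivery only works if the bucket policy allows it. A minimal sketch using the regional ELB service account follows; regions launched after August 2022 use the logdelivery.elasticloadbalancing.amazonaws.com service principal instead.
# Allow the regional ELB service account to write logs under the "alb" prefix
data "aws_elb_service_account" "current" {}
data "aws_caller_identity" "current" {}

resource "aws_s3_bucket_policy" "alb_logs" {
  bucket = aws_s3_bucket.alb_logs.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { AWS = data.aws_elb_service_account.current.arn }
      Action    = "s3:PutObject"
      Resource  = "${aws_s3_bucket.alb_logs.arn}/alb/AWSLogs/${data.aws_caller_identity.current.account_id}/*"
    }]
  })
}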
# Attach an existing WAFv2 WebACL to the ALB
resource "aws_wafv2_web_acl_association" "waf_to_alb" {
resource_arn = aws_lb.app.arn
web_acl_arn = var.wafv2_web_acl_arn
}
Step 6 — DNS: point the name at the ALB (with an ALIAS)
Explanation:
- Use an alias record to the ALB: alias queries are free, evaluate_target_health means no separate Route 53 health check to pay for, and the record tracks the ALB's IPs as they change.
resource "aws_route53_record" "app" {
zone_id = var.hosted_zone_id
name = "app.example.com"
type = "A"
alias {
name = aws_lb.app.dns_name
zone_id = aws_lb.app.zone_id
evaluate_target_health = true
}
}
Step 7 — Canary deploys (graduate change safely)
Explanation:
- Create a second green target group.
- Shift a small weight (say 10%) to green, watch metrics, then proceed or roll back instantly.
# A second target group for the new API version
resource "aws_lb_target_group" "api_green" {
name = "tg-api-green"
vpc_id = var.vpc_id
port = 80
protocol = "HTTP"
target_type = "ip" # Fargate (awsvpc) tasks register by IP
health_check {
path = "/healthz?db=true"
matcher = "200-399"
}
}
# Replace the /api/* rule from Step 3 with a weighted forward (90% old, 10% green); two rules can't share priority 10 on the same listener
resource "aws_lb_listener_rule" "api_canary" {
listener_arn = aws_lb_listener.https.arn
priority = 10
action {
type = "forward"
forward {
target_group {
arn = aws_lb_target_group.api.arn
weight = 90
}
target_group {
arn = aws_lb_target_group.api_green.arn
weight = 10
}
stickiness {
enabled = true
duration = 300 # optional: keeps each client on one version during the canary
}
}
}
condition {
path_pattern {
values = ["/api/*"]
}
}
}
How to roll back fast: change the weights to 100/0 (old/new) and apply. No redeploy required.
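A small convenience, if you like: drive the two weights from variables (names here are illustrative) so the rollback really is a one-line change followed by an apply.
# Illustrative: flip to 100/0 to roll back, 0/100 to promote green
variable "api_stable_weight" {
  type    = number
  default = 90
}
variable "api_canary_weight" {
  type    = number
  default = 10
}
# ...then reference var.api_stable_weight / var.api_canary_weight in the
# target_group weights of aws_lb_listener_rule.api_canary above.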
Step 8 — Sanity checks (catch common issues early)
Explanation:
- Test through the ALB and directly against tasks (via a bastion) using the correct Host header; many 502s are “wrong virtual host” problems.
# Through the ALB
curl -I https://app.example.com/healthz
curl -I 'https://app.example.com/api/healthz?db=true'
# From inside the VPC to a task IP (simulate what the ALB asks for)
curl -H 'Host: app.example.com' http://10.0.12.34/healthz
curl -H 'Host: app.example.com' 'http://10.0.23.45/api/healthz?db=true'
Watch in CloudWatch: TargetResponseTime, HTTPCode_Target_5XX_Count, HealthyHostCount, and request counts per target group.
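To turn that watching into paging, here is a minimal alarm sketch on target 5XXs for the API target group; the threshold, period, and notification target are assumptions to tune.
# Alarm when the API target group returns a burst of 5XXs (values illustrative)
resource "aws_cloudwatch_metric_alarm" "api_5xx" {
  alarm_name          = "alb-api-target-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 5
  threshold           = 10
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  dimensions = {
    LoadBalancer = aws_lb.app.arn_suffix
    TargetGroup  = aws_lb_target_group.api.arn_suffix
  }
  # alarm_actions = [var.sns_topic_arn]  # hypothetical notification target
}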
Mini-Alternate: When NLB Is the Right Tool (explained briefly)
When to use NLB: non-HTTP protocols, static IPs/EIPs, client IP preservation, or very long-lived TCP/UDP flows.
# NLB terminating TLS and forwarding plain TCP to targets (use a TLS target group instead if you need to re-encrypt to the target)
resource "aws_lb" "edge_nlb" {
name = "edge-nlb"
load_balancer_type = "network"
subnets = var.public_subnet_ids
enable_cross_zone_load_balancing = true
}
resource "aws_lb_target_group" "tcp" {
name = "tg-tcp"
vpc_id = var.vpc_id
port = 8443
protocol = "TCP"
health_check { protocol = "TCP" }
}
resource "aws_lb_listener" "tls" {
load_balancer_arn = aws_lb.edge_nlb.arn
port = 443
protocol = "TLS"
certificate_arn = var.acm_certificate_arn
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.tcp.arn
}
}
Notes: if your app needs the original client IP, enable Proxy Protocol v2 on the target group and have the app parse the PROXY header (sketch below); or run TLS pass-through (a plain TCP listener) and terminate at the target.
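The Proxy Protocol route is a single target-group attribute. A sketch of the same TCP target group with it enabled follows; your app must then parse the PROXY header on every new connection.
# TCP target group with Proxy Protocol v2, so targets can recover client IP/port
resource "aws_lb_target_group" "tcp_ppv2" {
  name              = "tg-tcp-ppv2"
  vpc_id            = var.vpc_id
  port              = 8443
  protocol          = "TCP"
  proxy_protocol_v2 = true
  health_check { protocol = "TCP" }
}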
Troubleshooting Cheat-Sheet (context + fix)
- ALB 502 (Bad Gateway): oversized/malformed headers, app closed early, gRPC mismatch. Fix: reduce header sizes; check app proxy; confirm gRPC compatibility.
- ALB 503 (Service Unavailable): no healthy targets or capacity. Fix: verify health checks, target registration, SGs, AZ coverage.
- ALB 504 (Gateway Timeout): idle timeout hit. Fix: increase timeout or move long-lived streams to NLB.
- Green locally, red on ALB: wrong Host header or path in the health check. Fix: reproduce with the exact headers the ALB sends; align health endpoints with what users need.
- Scaling lags: autoscale off the right signals (RequestCountPerTarget, queue depth, p95 latency), not just CPU; see the sketch below.
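For the ECS services in Part 2, here is a target-tracking sketch keyed to requests per target. The var.ecs_cluster_name variable and the 200 target value are assumptions you would replace with your own.
# Target tracking on ALB requests per target for the API service (values illustrative)
resource "aws_appautoscaling_target" "api" {
  service_namespace  = "ecs"
  resource_id        = "service/${var.ecs_cluster_name}/api" # cluster *name*, not ARN (hypothetical variable)
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 10
}

resource "aws_appautoscaling_policy" "api_req_per_target" {
  name               = "api-requests-per-target"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.api.service_namespace
  resource_id        = aws_appautoscaling_target.api.resource_id
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 200 # requests per target; tune to your latency budget
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.app.arn_suffix}/${aws_lb_target_group.api.arn_suffix}"
    }
  }
}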