Part 1 — What ELB Is and How It Works (Clear Foundation)
What ELB does (in one breath)
ELB is AWS’s managed traffic director. You define listeners (ports/protocols), rules (how to route), and target groups (the things that serve traffic). ELB then probes health, spreads requests across Availability Zones, terminates TLS if you want, and emits metrics/logs for ops.
The four ELB types (when to pick which)
- ALB (Application LB) — Layer 7 (HTTP/1.1, HTTP/2, gRPC, WebSockets). Path/host/header routing, redirects, OIDC/Cognito auth, Lambda targets. Default for web/APIs.
- NLB (Network LB) — Layer 4 (TCP/UDP/TLS). Static IPs/EIPs, ultra-low latency, client IP preservation, TLS pass-through or termination. Use for non-HTTP or when you need static IPs.
- GWLB (Gateway LB) — Inserts security/inspection appliances using GENEVE. Use for centralized inspection.
- CLB (Classic LB) — Legacy. Avoid for new builds.
Core pieces you’ll see in the example
- Listener: “what to do with :80 / :443” (redirect, forward, fixed response, authenticate).
- Rule (ALB): match on host/path/headers/query → action (forward to a target group).
- Target Group: the pool of EC2/ECS/EKS IPs (or Lambda). Holds health checks, stickiness, slow-start, deregistration delay.
- Cross-zone LB: any node can send to any healthy target in any AZ (smoother failover & utilization).
- Security: TLS via ACM, WAF for L7 filtering, Shield for DDoS, Security Groups for least-privilege.
- Observability: CloudWatch metrics (TargetResponseTime, 4xx/5xx counts), access logs to S3, traces in your app.
How a request flows (ALB)
- DNS resolves app.example.com → the ALB.
- The :443 listener terminates TLS (ACM cert) and evaluates its rules.
- A rule forwards to a target group; only healthy targets receive traffic.
- ALB keeps probing; unhealthy targets are drained and skipped.
- Metrics/logs are emitted; you watch latency and error rates.
Part 2 — Real-World Build: ALB for a Web + API on ECS (Explained as We Go)
Scenario: A regional app with web UI and API on ECS Fargate. We want HTTPS, path-based routing, health checks that reflect real dependencies, WAF, access logs, and safe canaries for zero-downtime deploys.
Assumptions to keep this focused: you already have a VPC with two public subnets (for the ALB) and two private subnets (for ECS tasks), a hosted zone for example.com, and an ACM cert for app.example.com in the same region as the ALB.
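The snippets below also lean on a handful of input variables. Here is a minimal set of declarations so the examples stand alone; the names simply mirror the var.* references used in later steps.
# Inputs referenced throughout Part 2
variable "vpc_id" { type = string }
variable "public_subnet_ids" { type = list(string) }
variable "private_subnet_ids" { type = list(string) }
variable "hosted_zone_id" { type = string }
variable "acm_certificate_arn" { type = string }
variable "ecs_cluster_id" { type = string }
variable "web_taskdef_arn" { type = string }
variable "api_taskdef_arn" { type = string }
variable "wafv2_web_acl_arn" { type = string }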
Step 0 — Security groups (why first?)
We lock down traffic up-front so later steps don’t accidentally expose tasks.
Explanation:
- The ALB needs to accept :443 from the internet.
- Tasks should only accept traffic from the ALB, not from the world.
# SG for the ALB: allow 443 from anywhere; egress to tasks
resource "aws_security_group" "alb" {
name = "sg-alb"
description = "Ingress 443 from internet"
vpc_id = var.vpc_id
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
# SG for ECS tasks: only allow port 80 from the ALB SG
resource "aws_security_group" "tasks" {
name = "sg-tasks"
vpc_id = var.vpc_id
ingress {
from_port = 80
to_port = 80
protocol = "tcp"
security_groups = [aws_security_group.alb.id] # only ALB can call tasks
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
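If you want the ALB's egress to follow the same least-privilege idea instead of the allow-all rule above, here is a sketch of a standalone egress rule scoped to the task SG on port 80; you would drop the broad inline egress block from aws_security_group.alb so Terraform doesn't manage the same rules twice.
# Optional tightening: ALB may only reach the task SG on port 80
resource "aws_security_group_rule" "alb_to_tasks" {
  type                     = "egress"
  security_group_id        = aws_security_group.alb.id
  from_port                = 80
  to_port                  = 80
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.tasks.id
}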
Step 1 — Target groups (split web vs API so they scale/fail independently)
Explanation:
- Separate TGs let you set different health checks (e.g., API must check DB).
- We’ll later route /api/* to tg-api and everything else (the default) to tg-web.
resource "aws_lb_target_group" "web" {
name = "tg-web"
vpc_id = var.vpc_id
port = 80
protocol = "HTTP"
target_type = "ip" # Fargate (awsvpc) tasks register by IP
health_check {
path = "/healthz"
matcher = "200-399"
interval = 15
timeout = 5
healthy_threshold = 2
unhealthy_threshold = 2
}
# Sticky only if you truly need in-memory session continuity
stickiness {
type = "lb_cookie"
cookie_duration = 300
enabled = false
}
deregistration_delay = 30 # drain during deploys
slow_start = 0 # set >0 to ramp new tasks gently
}
resource "aws_lb_target_group" "api" {
name = "tg-api"
vpc_id = var.vpc_id
port = 80
protocol = "HTTP"
target_type = "ip" # Fargate (awsvpc) tasks register by IP
health_check {
path = "/healthz?db=true" # fail when DB isn’t healthy
matcher = "200-399"
}
deregistration_delay = 30
}
Step 2 — Create the ALB (public, cross-zone on)
Explanation:
- Put the ALB in two public subnets for resilience.
- Cross-zone evens out traffic across AZs and smooths failover.
resource "aws_lb" "app" {
name = "app-alb"
load_balancer_type = "application"
security_groups = [aws_security_group.alb.id]
subnets = var.public_subnet_ids
enable_cross_zone_load_balancing = true # ALBs have cross-zone on by default; kept explicit for clarity
idle_timeout = 60 # OK for typical APIs; raise for WebSockets or move to NLB
}
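If the ALB lives in its own module or stack, a couple of outputs keep consumers from hard-coding anything; the Route 53 step later uses exactly these attributes.
output "alb_dns_name" {
  value = aws_lb.app.dns_name
}
output "alb_zone_id" {
  value = aws_lb.app.zone_id
}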
Step 3 — Listeners: force HTTPS, then route by path
Explanation:
- Always redirect HTTP→HTTPS.
- Use the ACM certificate on the HTTPS listener.
- The default action forwards to web; /api/* goes to api.
# :80 → 301 redirect to HTTPS
resource "aws_lb_listener" "http" {
load_balancer_arn = aws_lb.app.arn
port = 80
protocol = "HTTP"
default_action {
type = "redirect"
redirect {
port = "443"
protocol = "HTTPS"
status_code = "HTTP_301"
}
}
}
# :443 → TLS termination + routing
resource "aws_lb_listener" "https" {
load_balancer_arn = aws_lb.app.arn
port = 443
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-TLS13-1-2-2021-06"
certificate_arn = var.acm_certificate_arn
# Default: forward to web
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.web.arn
}
}
# Rule: /api/* → API target group
resource "aws_lb_listener_rule" "api_path" {
listener_arn = aws_lb_listener.https.arn
priority = 10
action {
type = "forward"
target_group_arn = aws_lb_target_group.api.arn
}
condition {
path_pattern {
values = ["/api/*"]
}
}
}
Step 4 — Attach ECS services to the TGs (the key bit of service wiring)
Explanation:
- Register/deregister tasks automatically so ELB knows where to send traffic.
- Use private subnets for tasks; the ALB sits in public ones.
# web
resource "aws_ecs_service" "web" {
name = "web"
cluster = var.ecs_cluster_id
task_definition = var.web_taskdef_arn
desired_count = 2
launch_type = "FARGATE"
network_configuration {
subnets = var.private_subnet_ids
security_groups = [aws_security_group.tasks.id]
}
load_balancer {
target_group_arn = aws_lb_target_group.web.arn
container_name = "web"
container_port = 80
}
health_check_grace_period_seconds = 30
}
# api
resource "aws_ecs_service" "api" {
name = "api"
cluster = var.ecs_cluster_id
task_definition = var.api_taskdef_arn
desired_count = 2
launch_type = "FARGATE"
network_configuration {
subnets = var.private_subnet_ids
security_groups = [aws_security_group.tasks.id]
}
load_balancer {
target_group_arn = aws_lb_target_group.api.arn
container_name = "api"
container_port = 80
}
health_check_grace_period_seconds = 30
}
Step 5 — Add WAF and access logs (security + observability before traffic)
Explanation:
- WAF blocks common attacks and bad bots at L7.
- Access logs let you debug 4xx/5xx quickly and analyze usage.
# Access logs from the ALB to S3 (add lifecycle rules to the bucket to control cost)
resource "aws_s3_bucket" "alb_logs" {
bucket = "my-alb-logs-123456"
}
# Turn on logging by adding an access_logs block to the existing aws_lb.app
# resource from Step 2 (a separate aws_lb resource would create a second ALB):
resource "aws_lb" "app" {
# ...same arguments as in Step 2...
access_logs {
bucket = aws_s3_bucket.alb_logs.bucket
prefix = "alb"
enabled = true
}
}
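One thing the snippet above glosses over: ALB log delivery only works if the bucket policy allows it. A minimal sketch using the regional ELB service account follows; regions launched after August 2022 use the logdelivery.elasticloadbalancing.amazonaws.com service principal instead.
# Allow the regional ELB service account to write logs under the "alb" prefix
data "aws_elb_service_account" "current" {}
data "aws_caller_identity" "current" {}

resource "aws_s3_bucket_policy" "alb_logs" {
  bucket = aws_s3_bucket.alb_logs.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { AWS = data.aws_elb_service_account.current.arn }
      Action    = "s3:PutObject"
      Resource  = "${aws_s3_bucket.alb_logs.arn}/alb/AWSLogs/${data.aws_caller_identity.current.account_id}/*"
    }]
  })
}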
# Attach an existing WAFv2 WebACL to the ALB
resource "aws_wafv2_web_acl_association" "waf_to_alb" {
resource_arn = aws_lb.app.arn
web_acl_arn = var.wafv2_web_acl_arn
}
Step 6 — DNS: point the name at the ALB (with an ALIAS)
Explanation:
- Use an alias record to the ALB: alias queries are free, evaluate_target_health means no separate Route 53 health check to pay for, and the record tracks the ALB's IPs as they change.
resource "aws_route53_record" "app" {
zone_id = var.hosted_zone_id
name = "app.example.com"
type = "A"
alias {
name = aws_lb.app.dns_name
zone_id = aws_lb.app.zone_id
evaluate_target_health = true
}
}
Step 7 — Canary deploys (graduate change safely)
Explanation:
- Create a second green target group.
- Shift a small weight (say 10%) to green, watch metrics, then proceed or roll back instantly.
# A second target group for the new API version
resource "aws_lb_target_group" "api_green" {
name = "tg-api-green"
vpc_id = var.vpc_id
port = 80
protocol = "HTTP"
target_type = "ip" # Fargate (awsvpc) tasks register by IP
health_check {
path = "/healthz?db=true"
matcher = "200-399"
}
}
# Replace the /api/* rule from Step 3 with a weighted forward (90% old, 10% green); two rules can't share priority 10 on the same listener
resource "aws_lb_listener_rule" "api_canary" {
listener_arn = aws_lb_listener.https.arn
priority = 10
action {
type = "forward"
forward {
target_group {
arn = aws_lb_target_group.api.arn
weight = 90
}
target_group {
arn = aws_lb_target_group.api_green.arn
weight = 10
}
stickiness {
enabled = true
duration = 300 # optional: keeps each client on one version during the canary
}
}
}
condition {
path_pattern {
values = ["/api/*"]
}
}
}
How to roll back fast: change the weights to 100/0 (old/new) and apply. No redeploy required.
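A small convenience, if you like: drive the two weights from variables (names here are illustrative) so the rollback really is a one-line change followed by an apply.
# Illustrative: flip to 100/0 to roll back, 0/100 to promote green
variable "api_stable_weight" {
  type    = number
  default = 90
}
variable "api_canary_weight" {
  type    = number
  default = 10
}
# ...then reference var.api_stable_weight / var.api_canary_weight in the
# target_group weights of aws_lb_listener_rule.api_canary above.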
Step 8 — Sanity checks (catch common issues early)
Explanation:
- Test through the ALB and directly against tasks (via a bastion) using the correct Host header; many 502s are “wrong virtual host” problems.
# Through the ALB
curl -I https://app.example.com/healthz
curl -I 'https://app.example.com/api/healthz?db=true'
# From inside the VPC to a task IP (simulate what the ALB asks for)
curl -H 'Host: app.example.com' http://10.0.12.34/healthz
curl -H 'Host: app.example.com' 'http://10.0.23.45/api/healthz?db=true'
Watch in CloudWatch: TargetResponseTime, HTTPCode_Target_5XX_Count, HealthyHostCount, and request counts per target group.
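To turn that watching into paging, here is a minimal alarm sketch on target 5XXs for the API target group; the threshold, period, and notification target are assumptions to tune.
# Alarm when the API target group returns a burst of 5XXs (values illustrative)
resource "aws_cloudwatch_metric_alarm" "api_5xx" {
  alarm_name          = "alb-api-target-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 5
  threshold           = 10
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  dimensions = {
    LoadBalancer = aws_lb.app.arn_suffix
    TargetGroup  = aws_lb_target_group.api.arn_suffix
  }
  # alarm_actions = [var.sns_topic_arn]  # hypothetical notification target
}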
Mini-Alternate: When NLB Is the Right Tool (explained briefly)
When to use NLB: non-HTTP protocols, static IPs/EIPs, client IP preservation, or very long-lived TCP/UDP flows.
# NLB terminating TLS and forwarding plain TCP to targets (use a TLS target group instead if you need to re-encrypt to the target)
resource "aws_lb" "edge_nlb" {
name = "edge-nlb"
load_balancer_type = "network"
subnets = var.public_subnet_ids
enable_cross_zone_load_balancing = true
}
resource "aws_lb_target_group" "tcp" {
name = "tg-tcp"
vpc_id = var.vpc_id
port = 8443
protocol = "TCP"
health_check { protocol = "TCP" }
}
resource "aws_lb_listener" "tls" {
load_balancer_arn = aws_lb.edge_nlb.arn
port = 443
protocol = "TLS"
certificate_arn = var.acm_certificate_arn
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.tcp.arn
}
}
Notes: if your app needs the original client IP, enable Proxy Protocol v2 on the target group and have the app parse the PROXY header (sketch below); or run TLS pass-through (a plain TCP listener) and terminate at the target.
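The Proxy Protocol route is a single target-group attribute. A sketch of the same TCP target group with it enabled follows; your app must then parse the PROXY header on every new connection.
# TCP target group with Proxy Protocol v2, so targets can recover client IP/port
resource "aws_lb_target_group" "tcp_ppv2" {
  name              = "tg-tcp-ppv2"
  vpc_id            = var.vpc_id
  port              = 8443
  protocol          = "TCP"
  proxy_protocol_v2 = true
  health_check { protocol = "TCP" }
}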
Troubleshooting Cheat-Sheet (context + fix)
- ALB 502 (Bad Gateway): oversized/malformed headers, app closed early, gRPC mismatch. Fix: reduce header sizes; check app proxy; confirm gRPC compatibility.
- ALB 503 (Service Unavailable): no healthy targets or capacity. Fix: verify health checks, target registration, SGs, AZ coverage.
- ALB 504 (Gateway Timeout): idle timeout hit. Fix: increase timeout or move long-lived streams to NLB.
- Green locally, red on ALB: wrong Host header or path in the health check. Fix: reproduce with the exact headers the ALB sends; align health endpoints with what users need.
- Scaling lags: autoscale off the right signals (RequestCountPerTarget, queue depth, p95 latency), not just CPU; see the sketch below.
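For the ECS services in Part 2, here is a target-tracking sketch keyed to requests per target. The var.ecs_cluster_name variable and the 200 target value are assumptions you would replace with your own.
# Target tracking on ALB requests per target for the API service (values illustrative)
resource "aws_appautoscaling_target" "api" {
  service_namespace  = "ecs"
  resource_id        = "service/${var.ecs_cluster_name}/api" # cluster *name*, not ARN (hypothetical variable)
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 10
}

resource "aws_appautoscaling_policy" "api_req_per_target" {
  name               = "api-requests-per-target"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.api.service_namespace
  resource_id        = aws_appautoscaling_target.api.resource_id
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 200 # requests per target; tune to your latency budget
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.app.arn_suffix}/${aws_lb_target_group.api.arn_suffix}"
    }
  }
}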