
A Deep Dive into AWS Elastic Load Balancing (ELB) — With Step-by-Step, Explained Code

Part 1 — What ELB Is and How It Works (Clear Foundation)

What ELB does (in one breath)

ELB is AWS’s managed traffic director. You define listeners (ports/protocols), rules (how to route), and target groups (the things that serve traffic). ELB then probes health, spreads requests across Availability Zones, terminates TLS if you want, and emits metrics/logs for ops.

The four ELB types (when to pick which)
  • ALB (Application LB) — Layer 7 (HTTP/1.1, HTTP/2, gRPC, WebSockets). Path/host/header routing, redirects, OIDC/Cognito auth, Lambda targets. Default for web/APIs.
  • NLB (Network LB) — Layer 4 (TCP/UDP/TLS). Static IPs/EIPs, ultra-low latency, client IP preservation, TLS pass-through or termination. Use for non-HTTP or when you need static IPs.
  • GWLB (Gateway LB) — Inserts security/inspection appliances using GENEVE. Use for centralized inspection.
  • CLB — Legacy. Avoid for new builds.

Core pieces you’ll see in the example
  • Listener: “what to do with :80 / :443” (redirect, forward, fixed response, authenticate).
  • Rule (ALB): match on host/path/headers/query → action (forward to a target group).
  • Target Group: the pool of EC2/ECS/EKS IPs (or Lambda). Holds health checks, stickiness, slow-start, deregistration delay.
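  • Cross-zone LB: any load balancer node can send to any healthy target in any enabled AZ (smoother failover & utilization). Always on for ALB at the load balancer level; off by default for NLB and GWLB.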
  • Security: TLS via ACM, WAF for L7 filtering, Shield for DDoS, Security Groups for least-privilege.
  • Observability: CloudWatch metrics (TargetResponseTime, 4xx/5xx), access logs to S3, traces in your app.

How a request flows (ALB)
  1. DNS resolves app.example.com → ALB.
  2. Listener :443 terminates TLS (ACM cert), evaluates rules.
  3. Rule sends to a target group; only healthy targets receive traffic.
  4. ALB keeps probing; targets that fail health checks stop receiving new requests until they pass again.
  5. Metrics/logs are emitted; you watch latency and error rates.
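
To make steps 2 and 3 concrete, here is a minimal Python sketch of how a listener could pick a target group and skip unhealthy targets. It is an illustration only, not how ALB is implemented: the rule table, the function names, and the use of fnmatchcase to approximate ALB path wildcards are all assumptions made for this example.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table: (priority, path pattern, target group name).
RULES = [
    (10, "/api/*", "tg-api"),
]
DEFAULT_TARGET_GROUP = "tg-web"  # stands in for the listener's default action


def pick_target_group(path, rules=RULES, default=DEFAULT_TARGET_GROUP):
    """Return the target group of the first matching rule, lowest priority number first."""
    for _priority, pattern, target_group in sorted(rules):
        # fnmatchcase is case-sensitive; it only approximates ALB wildcard matching.
        if fnmatchcase(path, pattern):
            return target_group
    return default


def healthy_targets(targets):
    """Keep only the targets currently marked healthy (dict of name -> bool)."""
    return [name for name, healthy in targets.items() if healthy]
```

With this table, "/api/users" resolves to tg-api, while "/" and "/api" (no trailing slash) fall through to tg-web.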

Part 2 — Real-World Build: ALB for a Web + API on ECS (Explained as We Go)

Scenario: A regional app with web UI and API on ECS Fargate. We want HTTPS, path-based routing, health checks that reflect real dependencies, WAF, access logs, and safe canaries for zero-downtime deploys.

Assumptions to keep this focused: You already have a VPC with two public subnets (for the ALB) and two private subnets (for ECS tasks), a hosted zone for example.com, and an ACM cert for app.example.com in the same region as the ALB.


Step 0 — Security groups (why first?)

We lock down traffic up-front so later steps don’t accidentally expose tasks.

Explanation:

  • The ALB needs to accept :443 from the internet, plus :80 so the redirect listener in Step 3 is reachable.
  • Tasks should only accept traffic from the ALB, not from the world.
# SG for the ALB: allow 80 (redirect only) and 443 from anywhere; egress to tasks
resource "aws_security_group" "alb" {
  name        = "alb-sg" # AWS rejects group names that start with "sg-"
  description = "Ingress 80/443 from internet"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# SG for ECS tasks: only allow port 80 from the ALB SG
resource "aws_security_group" "tasks" {
  name   = "tasks-sg"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id] # only ALB can call tasks
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Step 1 — Target groups (split web vs API so they scale/fail independently)

Explanation:

  • Separate TGs let you set different health checks (e.g., API must check DB).
  • We’ll later route /api/* to TG-api and / to TG-web.
resource "aws_lb_target_group" "web" {
  name        = "tg-web"
  vpc_id      = var.vpc_id
  port        = 80
  protocol    = "HTTP"
  target_type = "ip" # Fargate tasks (awsvpc networking) register by IP, not instance ID
  health_check {
    path                = "/healthz"
    matcher             = "200-399"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
  # Sticky only if you truly need in-memory session continuity
  stickiness {
    type            = "lb_cookie"
    cookie_duration = 300
    enabled         = false
  }
  deregistration_delay = 30 # drain during deploys
  slow_start           = 0  # set >0 to ramp new tasks gently
}

resource "aws_lb_target_group" "api" {
  name        = "tg-api"
  vpc_id      = var.vpc_id
  port        = 80
  protocol    = "HTTP"
  target_type = "ip" # Fargate tasks (awsvpc networking) register by IP, not instance ID
  health_check {
    path    = "/healthz?db=true" # fail when DB isn’t healthy
    matcher = "200-399"
  }
  deregistration_delay = 30
}
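
If the threshold settings feel abstract, the following Python sketch models how consecutive probe results flip a target between healthy and unhealthy. It is a simplified model under stated assumptions: the function names are made up for this example, the target is assumed to start unhealthy, and the real service has extra states (such as initial and draining) that are left out.

```python
def track_health(probe_results, healthy_threshold=2, unhealthy_threshold=2,
                 initially_healthy=False):
    """Return the target state after each probe (True = probe succeeded)."""
    healthy = initially_healthy
    streak = 0  # consecutive probes that disagree with the current state
    states = []
    for ok in probe_results:
        if ok == healthy:
            streak = 0  # probe agrees with the current state; reset the counter
        else:
            streak += 1
            needed = healthy_threshold if ok else unhealthy_threshold
            if streak >= needed:
                healthy = ok
                streak = 0
        states.append("healthy" if healthy else "unhealthy")
    return states


def seconds_to_mark_unhealthy(interval, unhealthy_threshold):
    """Rough time to notice a dead target, ignoring probe timeouts and jitter."""
    return interval * unhealthy_threshold
```

With the tg-web settings above (interval 15, unhealthy threshold 2), a dead task is taken out of rotation after roughly 30 seconds. A single failed probe between successes does not change the state.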

Step 2 — Create the ALB (public, cross-zone on)

Explanation:

  • Put the ALB in two public subnets for resilience.
  • Cross-zone evens out traffic across AZs and smooths failover. ALBs always have it on at the load balancer level.
resource "aws_lb" "app" {
  name               = "app-alb"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.public_subnet_ids
  enable_cross_zone_load_balancing = true # always on for ALBs; listed only to make it explicit
  idle_timeout       = 60 # OK for typical APIs; raise for WebSockets or move to NLB
}

Step 3 — Listeners: force HTTPS, then route by path

Explanation:

  • Always redirect HTTP→HTTPS.
  • Use the ACM certificate on the HTTPS listener.
  • Default route goes to web; /api/* goes to api.
# :80 → 301 redirect to HTTPS
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.app.arn
  port              = 80
  protocol          = "HTTP"
  default_action {
    type = "redirect"
    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}

# :443 → TLS termination + routing
resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.app.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = var.acm_certificate_arn

  # Default: forward to web
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web.arn
  }
}

# Rule: /api/* → API target group
resource "aws_lb_listener_rule" "api_path" {
  listener_arn = aws_lb_listener.https.arn
  priority     = 10
  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.api.arn
  }
  condition {
    path_pattern {
      values = ["/api/*"]
    }
  }
}

Step 4 — Attach ECS services to the TGs (the key bit of service wiring)

Explanation:

  • Register/deregister tasks automatically so ELB knows where to send traffic.
  • Use private subnets for tasks; the ALB sits in public ones.
# web
resource "aws_ecs_service" "web" {
  name            = "web"
  cluster         = var.ecs_cluster_id
  task_definition = var.web_taskdef_arn
  desired_count   = 2
  launch_type     = "FARGATE"
  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [aws_security_group.tasks.id]
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.web.arn
    container_name   = "web"
    container_port   = 80
  }
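  health_check_grace_period_seconds = 30
  # The target group must be attached to the load balancer before ECS can register tasks with it
  depends_on = [aws_lb_listener.https]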
}

# api
resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = var.ecs_cluster_id
  task_definition = var.api_taskdef_arn
  desired_count   = 2
  launch_type     = "FARGATE"
  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [aws_security_group.tasks.id]
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 80
  }
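  health_check_grace_period_seconds = 30
  # tg-api is only attached to the ALB through the /api/* rule, so wait for that rule
  depends_on = [aws_lb_listener_rule.api_path]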
}

Step 5 — Add WAF and access logs (security + observability before traffic)

Explanation:

  • WAF blocks common attacks and bad bots at L7.
  • Access logs let you debug 4xx/5xx quickly and analyze usage.
# Access logs from ALB to S3. The bucket also needs a policy that allows
# Elastic Load Balancing to write to it; add lifecycle rules to control cost.
resource "aws_s3_bucket" "alb_logs" {
  bucket = "my-alb-logs-123456"
}

# Add this block inside resource "aws_lb" "app" from Step 2.
# Do not declare a second aws_lb just for logging.
#   access_logs {
#     bucket  = aws_s3_bucket.alb_logs.bucket
#     prefix  = "alb"
#     enabled = true
#   }

# Attach an existing WAFv2 WebACL (REGIONAL scope) to the ALB
resource "aws_wafv2_web_acl_association" "waf_to_alb" {
  resource_arn = aws_lb.app.arn
  web_acl_arn  = var.wafv2_web_acl_arn
}
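
Once access logs land in S3, a few lines of Python are enough to pull out the fields you usually need when chasing 5xx responses. This is a hedged sketch: the field list covers only the leading fields of an ALB log entry, the sample line is synthetic with made-up values, and shlex is used as a convenient way to respect the quoted request field rather than as an official parser.

```python
import shlex

# Leading fields of an ALB access log entry, in order (later fields are ignored here).
FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time", "response_processing_time",
    "elb_status_code", "target_status_code", "received_bytes", "sent_bytes", "request",
]

# Synthetic, truncated entry for illustration only.
SAMPLE = (
    'https 2024-01-01T00:00:00.000000Z app/app-alb/0123456789abcdef '
    '203.0.113.10:54321 10.0.12.34:80 0.001 0.250 0.000 502 502 120 300 '
    '"GET https://app.example.com:443/api/orders HTTP/1.1"'
)


def parse_entry(line):
    """Split one log line on whitespace, keeping quoted fields together."""
    return dict(zip(FIELDS, shlex.split(line)))


def is_5xx(entry):
    """True when the load balancer returned a 5xx status to the client."""
    return entry["elb_status_code"].startswith("5")
```

Grouping the entries where is_5xx is true by their target field is a quick way to see whether one task or the whole target group is failing.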

Step 6 — DNS: point the name at the ALB (with an ALIAS)

Explanation:

  • Use an ALIAS record to the ALB: Route 53 does not charge for alias queries to AWS resources, and the record follows the ALB’s IP addresses as they change.
resource "aws_route53_record" "app" {
  zone_id = var.hosted_zone_id
  name    = "app.example.com"
  type    = "A"
  alias {
    name                   = aws_lb.app.dns_name
    zone_id                = aws_lb.app.zone_id
    evaluate_target_health = true
  }
}

Step 7 — Canary deploys (graduate change safely)

Explanation:

  • Create a second green target group.
  • Shift a small weight (say 10%) to green, watch metrics, then proceed or roll back instantly.
# A second target group for the new API version
resource "aws_lb_target_group" "api_green" {
  name        = "tg-api-green"
  vpc_id      = var.vpc_id
  port        = 80
  protocol    = "HTTP"
  target_type = "ip" # Fargate tasks register by IP
  health_check {
    path    = "/healthz?db=true"
    matcher = "200-399"
  }
}

# Weighted forward for /api/* (90% old, 10% green). This replaces
# aws_lb_listener_rule.api_path from Step 3: two rules on one listener
# cannot share priority 10, so update any references to the old rule.
resource "aws_lb_listener_rule" "api_canary" {
  listener_arn = aws_lb_listener.https.arn
  priority     = 10

  action {
    type = "forward"
    forward {
      target_group {
        arn    = aws_lb_target_group.api.arn
        weight = 90
      }
      target_group {
        arn    = aws_lb_target_group.api_green.arn
        weight = 10
      }
      # optional: keep each client on the same target group for 5 minutes
      stickiness {
        enabled  = true
        duration = 300
      }
    }
  }
  condition {
    path_pattern {
      values = ["/api/*"]
    }
  }
}

How to roll back fast: change weights to 100/0 (old/new) and apply. No redeploy required.
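
To get a feel for what a 90/10 split means in practice, here is a small Python simulation of weighted target group selection. It is a sketch under assumptions: the function name is made up, requests are treated as independent random draws, and the stickiness option is ignored, so it shows the expected proportions rather than ALB’s exact algorithm.

```python
import random
from collections import Counter


def split_traffic(weights, requests, seed=0):
    """Simulate `requests` independent picks among target groups by weight."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    groups = list(weights)
    picks = rng.choices(groups, weights=[weights[g] for g in groups], k=requests)
    return Counter(picks)


canary = split_traffic({"tg-api": 90, "tg-api-green": 10}, requests=10_000)
rollback = split_traffic({"tg-api": 100, "tg-api-green": 0}, requests=1_000)
```

In the canary run about one request in ten lands on tg-api-green; with weights 100/0 the green group receives nothing, which is why a rollback needs only a weight change.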


Step 8 — Sanity checks (catch common issues early)

Explanation:

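  • Test through the ALB and directly against tasks (via a bastion). Forwarded requests carry the client’s Host header, while ALB health checks send an IP address as Host rather than your domain; virtual-host mismatches are a common reason a target looks fine locally but fails behind the ALB.
# Through the ALB (quote URLs so the shell does not treat "?" as a glob)
curl -I https://app.example.com/healthz
curl -I 'https://app.example.com/api/healthz?db=true' # assumes the API also serves its health endpoint under /api/

# From inside the VPC to a task IP: first a forwarded request (client Host header)
curl -H 'Host: app.example.com' http://10.0.12.34/healthz
# then the path the tg-api health check probes (Host is an IP address, as in the ALB's probes)
curl 'http://10.0.23.45/healthz?db=true'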

Watch in CloudWatch: TargetResponseTime, HTTPCode_Target_5XX_Count, HealthyHostCount, and request counts per target group.


Mini-Alternate: When NLB Is the Right Tool (explained briefly)

When to use NLB: non-HTTP protocols, static IPs/EIPs, client IP preservation, or very long-lived TCP/UDP flows.

# NLB with TLS termination forwarding to a TCP target group (the NLB decrypts; use protocol = "TLS" on the target group if you need re-encryption to targets)
resource "aws_lb" "edge_nlb" {
  name               = "edge-nlb"
  load_balancer_type = "network"
  subnets            = var.public_subnet_ids
  enable_cross_zone_load_balancing = true
}

resource "aws_lb_target_group" "tcp" {
  name     = "tg-tcp"
  vpc_id   = var.vpc_id
  port     = 8443
  protocol = "TCP"
  health_check { protocol = "TCP" }
}

resource "aws_lb_listener" "tls" {
  load_balancer_arn = aws_lb.edge_nlb.arn
  port              = 443
  protocol          = "TLS"
  certificate_arn   = var.acm_certificate_arn
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.tcp.arn
  }
}

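Notes: NLB can preserve the client IP by itself (the default for instance targets). Where preservation is off or unavailable, enable Proxy Protocol v2 and have the app parse it. For end-to-end encryption, use a TCP listener for TLS pass-through and terminate at the target.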
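
For a sense of what "have the app parse it" involves, here is a minimal Python sketch that reads a Proxy Protocol v2 header for the TCP over IPv4 case. It is deliberately narrow: the function name is hypothetical, other address families and the LOCAL command are rejected, and any TLV extensions after the addresses are skipped rather than decoded.

```python
import socket
import struct

PP2_SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"  # fixed 12-byte magic that starts every v2 header


def parse_proxy_v2(data):
    """Parse a Proxy Protocol v2 header (PROXY command, TCP over IPv4 only)."""
    if len(data) < 16 or data[:12] != PP2_SIGNATURE:
        raise ValueError("not a Proxy Protocol v2 header")
    ver_cmd, fam_proto, length = struct.unpack("!BBH", data[12:16])
    if ver_cmd >> 4 != 2:
        raise ValueError("unsupported protocol version")
    if ver_cmd & 0x0F != 0x1 or fam_proto != 0x11:
        raise ValueError("this sketch handles only PROXY over TCP/IPv4")
    if length < 12 or len(data) < 16 + length:
        raise ValueError("truncated header")
    src_ip, dst_ip, src_port, dst_port = struct.unpack("!4s4sHH", data[16:28])
    return {
        "client_ip": socket.inet_ntoa(src_ip),
        "client_port": src_port,
        "dest_ip": socket.inet_ntoa(dst_ip),
        "dest_port": dst_port,
        # Application bytes start after the full header, including any TLVs.
        "payload_offset": 16 + length,
    }
```

Most servers and proxies already have a switch for this, so prefer the built-in support where it exists and treat hand-rolled parsing as a last resort.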


Troubleshooting Cheat-Sheet (context + fix)

  • ALB 502 (Bad Gateway): oversized/malformed headers, app closed early, gRPC mismatch. Fix: reduce header sizes; check app proxy; confirm gRPC compatibility.
  • ALB 503 (Service Unavailable): no healthy targets or capacity. Fix: verify health checks, target registration, SGs, AZ coverage.
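  • ALB 504 (Gateway Timeout): the target did not respond before the idle timeout, or the ALB could not open a connection to it. Fix: speed up slow endpoints or raise the timeout, check SGs/NACLs, or move long-lived streams to NLB.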
  • Green locally, red on ALB: wrong Host: or path in health check. Fix: reproduce with exact headers; align health endpoints with what users need.
  • Scaling lags: autoscale off the right signals (RequestCountPerTarget, queue depth, p95), not just CPU.
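
The last bullet is easier to reason about with numbers. The Python sketch below shows the proportional arithmetic behind scaling on requests per target. It is a simplified model, not the actual autoscaling algorithm: the function name and the min/max defaults are assumptions, and real target tracking also applies cooldowns and scales in more cautiously than it scales out.

```python
import math


def desired_task_count(current_tasks, requests_per_target, target_value,
                       min_tasks=1, max_tasks=100):
    """Tasks needed so each one handles about `target_value` requests."""
    total_requests = current_tasks * requests_per_target
    desired = math.ceil(total_requests / target_value)
    # Clamp to the configured capacity bounds.
    return max(min_tasks, min(max_tasks, desired))
```

For example, 2 tasks each seeing 1500 requests against a target of 1000 per task works out to 3 tasks, even if CPU still looks comfortable.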
