Guideline rule set for implementing resilient, self-healing, multi-region cloud architectures with Infrastructure-as-Code, Kubernetes, and automation scripts.
Your production system is a ticking time bomb. Every second of downtime costs your business thousands, yet most infrastructure still relies on single points of failure and manual recovery processes. You're one AWS region outage away from explaining to executives why your "highly available" system went dark for hours.
Traditional infrastructure approaches create hidden risks that compound over time.
When Fastly went down in 2021, it took major platforms like GitHub, Reddit, and Shopify offline for hours. When AWS US-East-1 had issues in 2017, it cascaded through services worldwide. Your infrastructure needs to survive these realities.
These Cursor Rules transform how you architect production systems by embedding high availability and disaster recovery patterns directly into your Infrastructure-as-Code. Instead of bolting on resilience as an afterthought, you build it into every component from day one.
Multi-Region Infrastructure by Default
Self-Healing Kubernetes Workloads
Zero-Touch Disaster Recovery
# Typical infrastructure problems
- Single region deployment "because it's simpler"
- Manual backup processes that get forgotten
- Load balancers without proper health checks
- No automated testing of recovery procedures
- Stateful applications that can't scale horizontally
# Auto-generated multi-region network module
module "network_ha" {
  source             = "./modules/network-ha"
  availability_zones = data.aws_availability_zones.available.names
  project            = var.project
  environment        = var.environment

  # Creates VPC + subnets across all AZs automatically
  # NAT Gateways in each AZ for redundancy
}

module "lb_ha" {
  source = "./modules/lb-ha"

  # Route 53 with health checks + failover records
  # ALB with cross-zone load balancing
  # Target groups distributed across regions
}
# Kubernetes workloads with built-in resilience
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-service
      containers:
        - name: api
          image: <your-api-image>  # your container image
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            timeoutSeconds: 5
            failureThreshold: 3
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-service
Instead of scrambling through runbooks during outages, your infrastructure handles failures automatically.
Stop spending 40% of your time on infrastructure maintenance:
# Automated DR validation runs weekly
import logging
from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)

# get_dynamodb_items(region) is a project helper that returns the replicated table's items
@retry(stop=stop_after_attempt(5), wait=wait_exponential())
def validate_cross_region_replication():
    """Verify data consistency across regions"""
    primary_data = get_dynamodb_items('us-east-1')
    secondary_data = get_dynamodb_items('eu-west-1')
    assert primary_data == secondary_data
    logger.info("Cross-region replication validated successfully")
Monthly chaos engineering experiments validate your assumptions.
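As a lightweight starting point before wiring up a dedicated tool such as LitmusChaos, a pod-delete experiment can be scripted directly against the Kubernetes API. The namespace, label selector, and replica count below are assumptions for illustration.

# Minimal chaos sketch: delete one random api-service pod and confirm self-healing.
# NAMESPACE, the label selector, and the expected replica count are illustrative assumptions.
import random
import time

from kubernetes import client, config

NAMESPACE = "prod"
SELECTOR = "app=api-service"

config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items
victim = random.choice(pods)
core.delete_namespaced_pod(victim.metadata.name, NAMESPACE)

time.sleep(60)  # give the Deployment controller time to reschedule
ready = [
    p for p in core.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items
    if p.status.phase == "Running"
]
assert len(ready) >= 3, "api-service did not self-heal back to 3 running replicas"

A full chaos run would also watch latency and alert firing during the disruption, which is what the scheduled LitmusChaos experiments cover.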
# Initialize Terraform workspace
git clone <your-infrastructure-repo>
cd infrastructure/envs/prod/us-east-1
# Deploy networking and load balancing
terraform init
terraform plan -var-file="../../terraform.tfvars"
terraform apply -var-file="../../terraform.tfvars"
# Apply Kubernetes manifests with HA patterns
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/deployments/
kubectl apply -f k8s/services/
kubectl apply -f k8s/pdb/
# Verify pod distribution across zones
kubectl get pods -o wide --show-labels
# Set up Prometheus monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
# Configure automated backups
kubectl apply -f backup/trilio-backup-policy.yaml
# Schedule chaos engineering tests
kubectl apply -f chaos/litmus-experiments/
Your infrastructure should be your competitive advantage, not your biggest risk. These rules give you the battle-tested patterns that companies like Netflix and Spotify use to serve millions of users with minimal downtime.
Stop hoping your infrastructure survives the next outage. Start building systems that thrive during chaos.
You are an expert in Terraform, Kubernetes, AWS (Route 53, ALB, DynamoDB Global Tables), Azure (Traffic Manager, Cosmos DB), GCP, Linux, Prometheus/Grafana, Python & Bash automation, Trilio, Portworx, Rubrik, Commvault DR.
Key Principles
- Design for failure: assume every component can and will fail—build redundant, loosely-coupled, stateless services.
- Multi-region by default; isolate blast-radius through fault domains, availability zones, and compartments.
- Everything as code (IaC): Terraform modules & Kubernetes manifests live in VCS; no manual console changes.
- Idempotent, immutable deployments; favor blue/green or canary rollouts for changes.
- Automate everything: provisioning, validation, backups, DR drills, monitoring, alerting.
- Fail fast & recover fast: RPO ≤ 5 min, RTO ≤ 15 min unless stricter SLA is documented.
- Security is part of availability: patch automatically, use least-privilege IAM, enable encryption in transit/at rest.
- Continuous verification: chaos engineering, scheduled game-days, automated unit/integration tests.
Terraform (HCL)
- Directory layout: modules/<feature>, envs/<env>/<region>/{main,variables,outputs,backend}.tf
- Declare required_providers & required_version in versions.tf; pin exact minor versions.
- Use remote backend (S3/GCS/Azure RM) with versioning & state locking; enable encryption.
- Enforce explicit lifecycle_rules on critical state buckets (retain ≥ 90 days); a validation sketch follows this list.
- Resource naming: <project>-<env>-<region>-<resource>-<idx>
- Inputs: snake_case; outputs: snake_case; locals for derived values; all variables have description, type, default.
- Wrap HA resources in modules:
• network-ha (multi-AZ VPC + subnets + NAT GWs)
• lb-ha (ALB/ELB + Route 53 latency routing)
• db-global (DynamoDB Global Tables / Cosmos DB w/ auto-failover)
- Use count/for_each for region maps; drive availability_zone lists from data sources.
- Annotate critical resources with lifecycle { create_before_destroy = true } to avoid downtime.
- Add terraform.tfvars.example + terragrunt.hcl (if used) to show env overrides.
- Enable AWS ALB access_logs, set idle_timeout = 30, deregistration_delay = 30.
- Output explicit DNS names & health-check endpoints for external monitors.
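The backend and lifecycle rules above lend themselves to automated checks. A minimal boto3 sketch (the bucket name and the 90-day threshold are assumed examples, not part of the rules):

# Sketch: verify the Terraform state bucket follows the backend rules above.
# Bucket name and retention threshold are illustrative assumptions.
import boto3

BUCKET = "myproject-prod-us-east-1-tfstate"  # hypothetical bucket name
s3 = boto3.client("s3")

# Versioning must be enabled so previous state files can be recovered
assert s3.get_bucket_versioning(Bucket=BUCKET).get("Status") == "Enabled"

# Default encryption must be configured (call raises if no encryption config exists)
s3.get_bucket_encryption(Bucket=BUCKET)

# At least one lifecycle rule must keep noncurrent state versions for >= 90 days
rules = s3.get_bucket_lifecycle_configuration(Bucket=BUCKET)["Rules"]
assert any(
    rule.get("NoncurrentVersionExpiration", {}).get("NoncurrentDays", 0) >= 90
    for rule in rules
), "state bucket does not retain old versions for 90 days"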
Kubernetes YAML (K8s native)
- All Deployments have replicas ≥ 3, spread across zones using topologySpreadConstraints; an audit sketch follows this list.
- Use PodDisruptionBudget (minAvailable = replicas-1) for stateful workloads.
- Liveness & readiness probes required; timeoutSeconds ≤ 5; failureThreshold ≤ 3.
- StatefulSet data volumes backed by Portworx PX-ReplicationFactor = 3 or EBS Multi-AZ where supported.
- Enable TrilioVault CRDs for scheduled backups (cron @hourly) of namespaces labelled backup=true.
- Standard labels: app.kubernetes.io/{name,instance,version,component,part-of}.
- Resource limits/requests mandatory; target utilisation ≤ 70 % to leave headroom for failover.
- Use HorizontalPodAutoscaler + ClusterAutoscaler; maxSurge 1, maxUnavailable 0 for rolling updates.
- ConfigMaps and Secrets mounted read-only; set immutable: true on Secrets and ConfigMaps that do not change at runtime.
- Namespace naming: <project>-<env>-<tier>; DR target cluster uses identical namespace map.
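A short audit script, sketched here with the Kubernetes Python client and assuming access via the current kubeconfig context, can enforce the replica, zone-spread, and probe rules above:

# Sketch: audit Deployments against the HA rules above (replicas, zone spread, probes).
# Thresholds mirror the list above; cluster access comes from the local kubeconfig.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

violations = []
for dep in apps.list_deployment_for_all_namespaces().items:
    name = f"{dep.metadata.namespace}/{dep.metadata.name}"
    if (dep.spec.replicas or 0) < 3:
        violations.append(f"{name}: fewer than 3 replicas")
    if not dep.spec.template.spec.topology_spread_constraints:
        violations.append(f"{name}: no topologySpreadConstraints")
    for c in dep.spec.template.spec.containers:
        if c.liveness_probe is None or c.readiness_probe is None:
            violations.append(f"{name}/{c.name}: missing liveness or readiness probe")

for v in violations:
    print(v)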
Python Automation Scripts (boto3 / azure-mgmt / google-cloud)
- All functions are typed; mypy --strict passes.
- Use tenacity.retry(stop=stop_after_attempt(5), wait=wait_exponential()) around cloud API calls.
- logger = logging.getLogger(__name__); propagate JSON logs @ INFO for normal ops, DEBUG for drills.
- CLI entrypoints via Typer; --dry-run flag required for any destructive action.
- Validate required env vars at startup; exit with code 78 (config error) when missing.
- Shield critical operations with circuit-breaker (pybreaker) to prevent cascading failures.
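A minimal sketch tying these conventions together; the failover logic, env var names, and role of the command are placeholders, not a prescribed implementation:

# Sketch of the automation-script conventions above: Typer CLI, --dry-run,
# env-var validation (exit 78), tenacity retries, and a pybreaker circuit breaker.
import logging
import os
import sys

import pybreaker
import typer
from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)
app = typer.Typer()
breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=60)

REQUIRED_ENV = ("AWS_REGION", "DR_REGION")  # hypothetical required settings


@breaker
@retry(stop=stop_after_attempt(5), wait=wait_exponential())
def promote_secondary(region: str) -> None:
    """Placeholder for the real cloud API calls that promote the DR region."""
    logger.info("Promoting secondary region %s", region)


@app.command()
def failover(dry_run: bool = typer.Option(False, "--dry-run")) -> None:
    missing = [v for v in REQUIRED_ENV if v not in os.environ]
    if missing:
        logger.error("Missing required env vars: %s", missing)
        sys.exit(78)  # EX_CONFIG
    if dry_run:
        logger.info("Dry run: would fail over to %s", os.environ["DR_REGION"])
        return
    promote_secondary(os.environ["DR_REGION"])


if __name__ == "__main__":
    app()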
Error Handling and Validation
- Check prerequisites at function/module start; return early when unmet.
- All modules expose health-check endpoints; external monitors poll every 30 s (timeout 5 s, 3 failures => incident); a poller sketch follows this list.
- Alarm rules: p95 latency >2× baseline for 5 min OR error rate >1 % triggers autoscaling; >5 % pages on-call.
- Use infrastructure-level retries (Route 53 health checks, K8s controller autorestart) before paging humans.
- Record synthetic transaction every minute from 3 regions; failover initiates when ≥ 2 regions red.
- Capture & tag all exceptions; send to a centralized aggregator (e.g., Sentry) along with request_id.
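A poller sketch for the external health-check policy above; the endpoint URL and the open_incident helper are hypothetical:

# Sketch: poll every 30 s with a 5 s timeout; 3 consecutive failures open an incident.
import time

import requests


def open_incident(endpoint: str) -> None:
    # Placeholder for paging / incident-creation integration
    print(f"INCIDENT: {endpoint} failed 3 consecutive health checks")


def monitor(endpoint: str = "https://api.example.com/health") -> None:
    failures = 0
    while True:
        try:
            resp = requests.get(endpoint, timeout=5)
            failures = 0 if resp.status_code == 200 else failures + 1
        except requests.RequestException:
            failures += 1
        if failures >= 3:
            open_incident(endpoint)
            failures = 0
        time.sleep(30)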
Framework-Specific Rules
AWS
- Route 53: latency routing with health checks; failover record sets for primary/secondary regions.
- ALB: cross-zone load-balancing ON; target groups split across AZs.
- DynamoDB: use Global Tables; point SDK to dynamodb.<region>.amazonaws.com; enable continuous backups (verified in the sketch after this list).
- RDS: Multi-AZ, automatic minor-version upgrade window Sun 02:00-04:00 UTC.
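A boto3 sketch that checks the DynamoDB rules above; the table name and regions are assumed examples:

# Sketch: confirm a table has a Global Table replica and point-in-time recovery.
import boto3

TABLE = "myproject-prod-tickets"  # hypothetical table name
ddb = boto3.client("dynamodb", region_name="us-east-1")

replicas = ddb.describe_table(TableName=TABLE)["Table"].get("Replicas", [])
replica_regions = {r["RegionName"] for r in replicas}
assert "eu-west-1" in replica_regions, "secondary region replica missing"

backups = ddb.describe_continuous_backups(TableName=TABLE)
pitr = backups["ContinuousBackupsDescription"]["PointInTimeRecoveryDescription"]
assert pitr["PointInTimeRecoveryStatus"] == "ENABLED", "point-in-time recovery is not enabled"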
Azure
- Deploy Azure Traffic Manager in performance mode; enable real-time metrics.
- Cosmos DB: consistency = Session; multi-region write = true; automatic failover ON.
Kubernetes
- Velero or Trilio for cluster-wide backups to object storage (versioned, immutable). Verify restore weekly, as in the sketch below.
- Portworx DR migration: storkctl generate clusterpair; schedule policy Interval = 15m, Retain = 12.
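A sketch of the weekly restore verification, driving the Velero CLI from Python; the backup name, source namespace, and scratch namespace are assumptions:

# Sketch: create a backup, restore it into a scratch namespace, then smoke-test it.
import subprocess
import time


def run(*args: str) -> None:
    subprocess.run(["velero", *args], check=True)


backup = f"weekly-verify-{int(time.time())}"
run("backup", "create", backup, "--include-namespaces", "myproject-prod-api", "--wait")
run("restore", "create", f"{backup}-restore", "--from-backup", backup,
    "--namespace-mappings", "myproject-prod-api:restore-verify", "--wait")
# A follow-up step would smoke-test the restore-verify namespace, then delete it.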
Testing & Drills
- Use LitmusChaos experiments: pod-delete, node-taint, network-loss weekly in staging, monthly in prod.
- Table-top DR drill each quarter; full-scale region evacuation annually.
- After drill, run terraform validate + terratest integration suite; target <2 h total recovery.
- CI pipeline stages: fmt → validate → tflint → docs generation → terratest → deploy.
Performance & Scalability
- Enable AWS Target Tracking scaling for ALB request count (target = 100 req/TG/instance); see the sketch after this list.
- Use Kubernetes VPA for non-HA batch workloads; cap at 80 % of node resources.
- Prefetch CDN/Edge via Lambda@Edge or CloudFront Functions for 95 % cache hit-ratio.
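A boto3 sketch of the ALB request-count target-tracking policy above; the Auto Scaling group name and ResourceLabel are placeholders:

# Sketch: attach a target-tracking policy keyed on ALB requests per target.
import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.put_scaling_policy(
    AutoScalingGroupName="myproject-prod-us-east-1-asg",  # hypothetical ASG name
    PolicyName="alb-request-count-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            # Format: app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>
            "ResourceLabel": "app/myproject-prod-alb/1234567890abcdef/targetgroup/api/abcdef1234567890",
        },
        "TargetValue": 100.0,
    },
)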
Security Considerations
- Enable AWS GuardDuty / Azure Defender; alerts integrated with same pager rotation.
- Encrypt at rest: AWS KMS CMKs, Azure Key Vault keys; rotate annually.
- Use HashiCorp Vault for dynamic DB creds; TTL ≤ 60 min.
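A minimal hvac sketch of the dynamic-credential rule; the Vault address and token come from the environment, the role name is hypothetical, and the 60-minute TTL is configured on the Vault role itself:

# Sketch: fetch short-lived DB credentials from Vault's database secrets engine.
import os

import hvac

client = hvac.Client(url=os.environ["VAULT_ADDR"], token=os.environ["VAULT_TOKEN"])
creds = client.secrets.database.generate_credentials(name="api-readwrite")  # hypothetical role
username = creds["data"]["username"]
password = creds["data"]["password"]
# Lease TTL is governed by the role's configuration (<= 60 min per the rule above).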
Observability
- Prometheus scrape_interval 15s; retain 15d. Alertmanager routes: severity=critical => pager; warning => Slack.
- Loki for logs; trace context via OpenTelemetry; link metrics-logs-traces with request_id header.
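The error-rate thresholds from the alarm rules can be evaluated against Prometheus's HTTP API; the Prometheus URL and PromQL expression below are assumptions:

# Sketch: evaluate the 1 % / 5 % error-rate thresholds via the Prometheus HTTP API.
import requests

PROM = "http://prometheus.monitoring:9090"  # hypothetical in-cluster address
QUERY = 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'

resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=5)
result = resp.json()["data"]["result"]
error_rate = float(result[0]["value"][1]) if result else 0.0

if error_rate > 0.05:
    severity = "critical"  # pages on-call
elif error_rate > 0.01:
    severity = "warning"   # routes to Slack / triggers autoscaling review
else:
    severity = "ok"
print(f"error_rate={error_rate:.2%} severity={severity}")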
Documentation
- README.md per module with diagrams & RACI matrix.
- Architecture docs in /docs generated from Terraform graph + Mermaid.
Common Pitfalls & Anti-Patterns
- Single-region Route 53 records ➜ ALWAYS use multi-region.
- Manual snapshot triggers ➜ automate via cron + tags (sketch below).
- StatefulPods without PDB ➜ causes planned-maintenance downtime.
- Hard-coding AZ names ➜ use data.aws_availability_zones instead.
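A sketch of the cron + tags fix: snapshot every EBS volume tagged backup=true. The tag key is an assumption; run the script from cron or EventBridge rather than triggering snapshots by hand.

# Sketch: automated snapshots for all volumes carrying the backup=true tag.
import boto3

ec2 = boto3.client("ec2")
volumes = ec2.describe_volumes(
    Filters=[{"Name": "tag:backup", "Values": ["true"]}]
)["Volumes"]

for vol in volumes:
    ec2.create_snapshot(
        VolumeId=vol["VolumeId"],
        Description="automated backup",
        TagSpecifications=[{
            "ResourceType": "snapshot",
            "Tags": [{"Key": "backup", "Value": "true"}],
        }],
    )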