Guideline rule set for implementing resilient, self-healing, multi-region cloud architectures with Infrastructure-as-Code, Kubernetes, and automation scripts.
Your production system is a ticking time bomb. Every second of downtime costs your business thousands, yet most infrastructure still relies on single points of failure and manual recovery processes. You're one AWS region outage away from explaining to executives why your "highly available" system went dark for hours.
Traditional infrastructure approaches create hidden risks that compound over time.
When Fastly went down in 2021, it took major platforms like GitHub, Reddit, and Shopify offline for hours. When AWS US-East-1 had issues in 2017, it cascaded through services worldwide. Your infrastructure needs to survive these realities.
These Cursor Rules transform how you architect production systems by embedding high availability and disaster recovery patterns directly into your Infrastructure-as-Code. Instead of bolting on resilience as an afterthought, you build it into every component from day one.
Multi-Region Infrastructure by Default
Self-Healing Kubernetes Workloads
Zero-Touch Disaster Recovery
# Typical infrastructure problems
- Single region deployment "because it's simpler"
- Manual backup processes that get forgotten
- Load balancers without proper health checks
- No automated testing of recovery procedures
- Stateful applications that can't scale horizontally
# Auto-generated multi-region network module
module "network_ha" {
  source             = "./modules/network-ha"
  availability_zones = data.aws_availability_zones.available.names
  project            = var.project
  environment        = var.environment

  # Creates VPC + subnets across all AZs automatically
  # NAT Gateways in each AZ for redundancy
}

module "lb_ha" {
  source = "./modules/lb-ha"

  # Route 53 with health checks + failover records
  # ALB with cross-zone load balancing
  # Target groups distributed across regions
}
# Kubernetes workloads with built-in resilience
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-service
      containers:
        - name: api
          image: <your-api-image>  # your container image
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            timeoutSeconds: 5
            failureThreshold: 3
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-service
Instead of scrambling through runbooks during outages, your infrastructure handles failures automatically.
Stop spending 40% of your time on infrastructure maintenance:
# Automated DR validation runs weekly
import logging
from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)

# get_dynamodb_items(region) is a project helper that returns the replicated table's items
@retry(stop=stop_after_attempt(5), wait=wait_exponential())
def validate_cross_region_replication():
    """Verify data consistency across regions"""
    primary_data = get_dynamodb_items('us-east-1')
    secondary_data = get_dynamodb_items('eu-west-1')
    assert primary_data == secondary_data
    logger.info("Cross-region replication validated successfully")
Monthly chaos engineering experiments validate your assumptions.
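As a lightweight starting point before wiring up a dedicated tool such as LitmusChaos, a pod-delete experiment can be scripted directly against the Kubernetes API. The namespace, label selector, and replica count below are assumptions for illustration.

# Minimal chaos sketch: delete one random api-service pod and confirm self-healing.
# NAMESPACE, the label selector, and the expected replica count are illustrative assumptions.
import random
import time

from kubernetes import client, config

NAMESPACE = "prod"
SELECTOR = "app=api-service"

config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items
victim = random.choice(pods)
core.delete_namespaced_pod(victim.metadata.name, NAMESPACE)

time.sleep(60)  # give the Deployment controller time to reschedule
ready = [
    p for p in core.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items
    if p.status.phase == "Running"
]
assert len(ready) >= 3, "api-service did not self-heal back to 3 running replicas"

A full chaos run would also watch latency and alert firing during the disruption, which is what the scheduled LitmusChaos experiments cover.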
# Initialize Terraform workspace
git clone <your-infrastructure-repo>
cd infrastructure/envs/prod/us-east-1
# Deploy networking and load balancing
terraform init
terraform plan -var-file="../../terraform.tfvars"
terraform apply -var-file="../../terraform.tfvars"
# Apply Kubernetes manifests with HA patterns
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/deployments/
kubectl apply -f k8s/services/
kubectl apply -f k8s/pdb/
# Verify pod distribution across zones
kubectl get pods -o wide --show-labels
# Set up Prometheus monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
# Configure automated backups
kubectl apply -f backup/trilio-backup-policy.yaml
# Schedule chaos engineering tests
kubectl apply -f chaos/litmus-experiments/
Your infrastructure should be your competitive advantage, not your biggest risk. These rules give you the battle-tested patterns that companies like Netflix and Spotify use to serve millions of users with minimal downtime.
Stop hoping your infrastructure survives the next outage. Start building systems that thrive during chaos.
You are an expert in Terraform, Kubernetes, AWS (Route 53, ALB, DynamoDB Global Tables), Azure (Traffic Manager, Cosmos DB), GCP, Linux, Prometheus/Grafana, Python & Bash automation, Trilio, Portworx, Rubrik, Commvault DR.
Key Principles
- Design for failure: assume every component can and will fail—build redundant, loosely-coupled, stateless services.
- Multi-region by default; isolate blast-radius through fault domains, availability zones, and compartments.
- Everything as code (IaC): Terraform modules & Kubernetes manifests live in VCS; no manual console changes.
- Idempotent, immutable deployments; favor blue/green or canary rollouts for changes.
- Automate everything: provisioning, validation, backups, DR drills, monitoring, alerting.
- Fail fast & recover fast: RPO ≤ 5 min, RTO ≤ 15 min unless stricter SLA is documented.
- Security is part of availability: patch automatically, use least-privilege IAM, enable encryption in transit/at rest.
- Continuous verification: chaos engineering, scheduled game-days, automated unit/integration tests.
Terraform (HCL)
- Directory layout: modules/<feature>, envs/<env>/<region>/{main,variables,outputs,backend}.tf
- Declare required_providers & required_version in versions.tf; pin exact minor versions.
- Use remote backend (S3/GCS/Azure RM) with versioning & state locking; enable encryption.
- Enforce explicit lifecycle_rules on critical state buckets (retain ≥ 90 days); a validation sketch follows this list.
- Resource naming: <project>-<env>-<region>-<resource>-<idx>
- Inputs: snake_case; outputs: snake_case; locals for derived values; all variables have description, type, default.
- Wrap HA resources in modules:
• network-ha (multi-AZ VPC + subnets + NAT GWs)
• lb-ha (ALB/ELB + Route 53 latency routing)
• db-global (DynamoDB Global Tables / Cosmos DB w/ auto-failover)
- Use count/for_each for region maps; drive availability_zone lists from data sources.
- Annotate critical resources with lifecycle { create_before_destroy = true } to avoid downtime.
- Add terraform.tfvars.example + terragrunt.hcl (if used) to show env overrides.
- Enable AWS ALB access_logs, set idle_timeout = 30, deregistration_delay = 30.
- Output explicit DNS names & health-check endpoints for external monitors.
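The backend and lifecycle rules above lend themselves to automated checks. A minimal boto3 sketch (the bucket name and the 90-day threshold are assumed examples, not part of the rules):

# Sketch: verify the Terraform state bucket follows the backend rules above.
# Bucket name and retention threshold are illustrative assumptions.
import boto3

BUCKET = "myproject-prod-us-east-1-tfstate"  # hypothetical bucket name
s3 = boto3.client("s3")

# Versioning must be enabled so previous state files can be recovered
assert s3.get_bucket_versioning(Bucket=BUCKET).get("Status") == "Enabled"

# Default encryption must be configured (call raises if no encryption config exists)
s3.get_bucket_encryption(Bucket=BUCKET)

# At least one lifecycle rule must keep noncurrent state versions for >= 90 days
rules = s3.get_bucket_lifecycle_configuration(Bucket=BUCKET)["Rules"]
assert any(
    rule.get("NoncurrentVersionExpiration", {}).get("NoncurrentDays", 0) >= 90
    for rule in rules
), "state bucket does not retain old versions for 90 days"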
Kubernetes YAML (K8s native)
- All Deployments have replicas ≥ 3, spread across zones using topologySpreadConstraints; an audit sketch follows this list.
- Use PodDisruptionBudget (minAvailable = replicas-1) for stateful workloads.
- Liveness & readiness probes required; timeoutSeconds ≤ 5; failureThreshold ≤ 3.
- StatefulSet data volumes backed by Portworx PX-ReplicationFactor = 3 or EBS Multi-AZ where supported.
- Enable TrilioVault CRDs for scheduled backups (cron @hourly) of namespaces labelled backup=true.
- Standard labels: app.kubernetes.io/{name,instance,version,component,part-of}.
- Resource limits/requests mandatory; target utilisation ≤ 70 % to leave headroom for failover.
- Use HorizontalPodAutoscaler + ClusterAutoscaler; maxSurge 1, maxUnavailable 0 for rolling updates.
- ConfigMaps and Secrets mounted read-only; set immutable: true on Secrets and ConfigMaps that do not change at runtime.
- Namespace naming: <project>-<env>-<tier>; DR target cluster uses identical namespace map.
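A short audit script, sketched here with the Kubernetes Python client and assuming access via the current kubeconfig context, can enforce the replica, zone-spread, and probe rules above:

# Sketch: audit Deployments against the HA rules above (replicas, zone spread, probes).
# Thresholds mirror the list above; cluster access comes from the local kubeconfig.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

violations = []
for dep in apps.list_deployment_for_all_namespaces().items:
    name = f"{dep.metadata.namespace}/{dep.metadata.name}"
    if (dep.spec.replicas or 0) < 3:
        violations.append(f"{name}: fewer than 3 replicas")
    if not dep.spec.template.spec.topology_spread_constraints:
        violations.append(f"{name}: no topologySpreadConstraints")
    for c in dep.spec.template.spec.containers:
        if c.liveness_probe is None or c.readiness_probe is None:
            violations.append(f"{name}/{c.name}: missing liveness or readiness probe")

for v in violations:
    print(v)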
Python Automation Scripts (boto3 / azure-mgmt / google-cloud)
- All functions are typed; mypy --strict passes.
- Use tenacity.retry(stop=stop_after_attempt(5), wait=wait_exponential()) around cloud API calls.
- logger = logging.getLogger(__name__); propagate JSON logs @ INFO for normal ops, DEBUG for drills.
- CLI entrypoints via Typer; --dry-run flag required for any destructive action.
- Validate required env vars at startup; exit with code 78 (config error) when missing.
- Shield critical operations with circuit-breaker (pybreaker) to prevent cascading failures.
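A minimal sketch tying these conventions together; the failover logic, env var names, and role of the command are placeholders, not a prescribed implementation:

# Sketch of the automation-script conventions above: Typer CLI, --dry-run,
# env-var validation (exit 78), tenacity retries, and a pybreaker circuit breaker.
import logging
import os
import sys

import pybreaker
import typer
from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)
app = typer.Typer()
breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=60)

REQUIRED_ENV = ("AWS_REGION", "DR_REGION")  # hypothetical required settings


@breaker
@retry(stop=stop_after_attempt(5), wait=wait_exponential())
def promote_secondary(region: str) -> None:
    """Placeholder for the real cloud API calls that promote the DR region."""
    logger.info("Promoting secondary region %s", region)


@app.command()
def failover(dry_run: bool = typer.Option(False, "--dry-run")) -> None:
    missing = [v for v in REQUIRED_ENV if v not in os.environ]
    if missing:
        logger.error("Missing required env vars: %s", missing)
        sys.exit(78)  # EX_CONFIG
    if dry_run:
        logger.info("Dry run: would fail over to %s", os.environ["DR_REGION"])
        return
    promote_secondary(os.environ["DR_REGION"])


if __name__ == "__main__":
    app()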
Error Handling and Validation
- Check prerequisites at function/module start; return early when unmet.
- All modules expose health-check endpoints; external monitors poll every 30 s (timeout 5 s, 3 failures => incident); a poller sketch follows this list.
- Alarm rules: p95 latency >2× baseline for 5 min OR error rate >1 % triggers autoscaling; >5 % pages on-call.
- Use infrastructure-level retries (Route 53 health checks, K8s controller autorestart) before paging humans.
- Record synthetic transaction every minute from 3 regions; failover initiates when ≥ 2 regions red.
- Capture & tag all exceptions; send to a centralized aggregator (e.g., Sentry) along with request_id.
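A poller sketch for the external health-check policy above; the endpoint URL and the open_incident helper are hypothetical:

# Sketch: poll every 30 s with a 5 s timeout; 3 consecutive failures open an incident.
import time

import requests


def open_incident(endpoint: str) -> None:
    # Placeholder for paging / incident-creation integration
    print(f"INCIDENT: {endpoint} failed 3 consecutive health checks")


def monitor(endpoint: str = "https://api.example.com/health") -> None:
    failures = 0
    while True:
        try:
            resp = requests.get(endpoint, timeout=5)
            failures = 0 if resp.status_code == 200 else failures + 1
        except requests.RequestException:
            failures += 1
        if failures >= 3:
            open_incident(endpoint)
            failures = 0
        time.sleep(30)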
Framework-Specific Rules
AWS
- Route 53: latency routing with health checks; failover record sets for primary/secondary regions.
- ALB: cross-zone load-balancing ON; target groups split across AZs.
- DynamoDB: use Global Tables; point SDK to dynamodb.<region>.amazonaws.com; enable continuous backups (verified in the sketch after this list).
- RDS: Multi-AZ, automatic minor-version upgrade window Sun 02:00-04:00 UTC.
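A boto3 sketch that checks the DynamoDB rules above; the table name and regions are assumed examples:

# Sketch: confirm a table has a Global Table replica and point-in-time recovery.
import boto3

TABLE = "myproject-prod-tickets"  # hypothetical table name
ddb = boto3.client("dynamodb", region_name="us-east-1")

replicas = ddb.describe_table(TableName=TABLE)["Table"].get("Replicas", [])
replica_regions = {r["RegionName"] for r in replicas}
assert "eu-west-1" in replica_regions, "secondary region replica missing"

backups = ddb.describe_continuous_backups(TableName=TABLE)
pitr = backups["ContinuousBackupsDescription"]["PointInTimeRecoveryDescription"]
assert pitr["PointInTimeRecoveryStatus"] == "ENABLED", "point-in-time recovery is not enabled"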
Azure
- Deploy Azure Traffic Manager in performance mode; enable real-time metrics.
- Cosmos DB: consistency = Session; multi-region write = true; automatic failover ON.
Kubernetes
- Velero or Trilio for cluster-wide backups to object storage (versioned, immutable). Verify restore weekly, as in the sketch below.
- Portworx DR migration: storkctl generate clusterpair; schedule policy Interval = 15m, Retain = 12.
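A sketch of the weekly restore verification, driving the Velero CLI from Python; the backup name, source namespace, and scratch namespace are assumptions:

# Sketch: create a backup, restore it into a scratch namespace, then smoke-test it.
import subprocess
import time


def run(*args: str) -> None:
    subprocess.run(["velero", *args], check=True)


backup = f"weekly-verify-{int(time.time())}"
run("backup", "create", backup, "--include-namespaces", "myproject-prod-api", "--wait")
run("restore", "create", f"{backup}-restore", "--from-backup", backup,
    "--namespace-mappings", "myproject-prod-api:restore-verify", "--wait")
# A follow-up step would smoke-test the restore-verify namespace, then delete it.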
Testing & Drills
- Use LitmusChaos experiments: pod-delete, node-taint, network-loss weekly in staging, monthly in prod.
- Table-top DR drill each quarter; full-scale region evacuation annually.
- After drill, run terraform validate + terratest integration suite; target <2 h total recovery.
- CI pipeline stages: fmt → validate → tflint → docs generation → terratest → deploy.
Performance & Scalability
- Enable AWS Target Tracking scaling for ALB request count (target = 100 req/TG/instance); see the sketch after this list.
- Use Kubernetes VPA for non-HA batch workloads; cap at 80 % of node resources.
- Prefetch CDN/Edge via Lambda@Edge or CloudFront Functions for 95 % cache hit-ratio.
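A boto3 sketch of the ALB request-count target-tracking policy above; the Auto Scaling group name and ResourceLabel are placeholders:

# Sketch: attach a target-tracking policy keyed on ALB requests per target.
import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.put_scaling_policy(
    AutoScalingGroupName="myproject-prod-us-east-1-asg",  # hypothetical ASG name
    PolicyName="alb-request-count-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            # Format: app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>
            "ResourceLabel": "app/myproject-prod-alb/1234567890abcdef/targetgroup/api/abcdef1234567890",
        },
        "TargetValue": 100.0,
    },
)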
Security Considerations
- Enable AWS GuardDuty / Azure Defender; alerts integrated with same pager rotation.
- Encrypt at rest: AWS KMS CMKs, Azure Key Vault keys; rotate annually.
- Use HashiCorp Vault for dynamic DB creds; TTL ≤ 60 min.
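A minimal hvac sketch of the dynamic-credential rule; the Vault address and token come from the environment, the role name is hypothetical, and the 60-minute TTL is configured on the Vault role itself:

# Sketch: fetch short-lived DB credentials from Vault's database secrets engine.
import os

import hvac

client = hvac.Client(url=os.environ["VAULT_ADDR"], token=os.environ["VAULT_TOKEN"])
creds = client.secrets.database.generate_credentials(name="api-readwrite")  # hypothetical role
username = creds["data"]["username"]
password = creds["data"]["password"]
# Lease TTL is governed by the role's configuration (<= 60 min per the rule above).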
Observability
- Prometheus scrape_interval 15s; retain 15d. Alertmanager routes: severity=critical => pager; warning => Slack.
- Loki for logs; trace context via OpenTelemetry; link metrics-logs-traces with request_id header.
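The error-rate thresholds from the alarm rules can be evaluated against Prometheus's HTTP API; the Prometheus URL and PromQL expression below are assumptions:

# Sketch: evaluate the 1 % / 5 % error-rate thresholds via the Prometheus HTTP API.
import requests

PROM = "http://prometheus.monitoring:9090"  # hypothetical in-cluster address
QUERY = 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'

resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=5)
result = resp.json()["data"]["result"]
error_rate = float(result[0]["value"][1]) if result else 0.0

if error_rate > 0.05:
    severity = "critical"  # pages on-call
elif error_rate > 0.01:
    severity = "warning"   # routes to Slack / triggers autoscaling review
else:
    severity = "ok"
print(f"error_rate={error_rate:.2%} severity={severity}")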
Documentation
- README.md per module with diagrams & RACI matrix.
- Architecture docs in /docs generated from Terraform graph + Mermaid.
Common Pitfalls & Anti-Patterns
- Single-region Route 53 records ➜ ALWAYS use multi-region.
- Manual snapshot triggers ➜ automate via cron + tags (sketch below).
- StatefulPods without PDB ➜ causes planned-maintenance downtime.
- Hard-coding AZ names ➜ use data.aws_availability_zones instead.
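A sketch of the cron + tags fix: snapshot every EBS volume tagged backup=true. The tag key is an assumption; run the script from cron or EventBridge rather than triggering snapshots by hand.

# Sketch: automated snapshots for all volumes carrying the backup=true tag.
import boto3

ec2 = boto3.client("ec2")
volumes = ec2.describe_volumes(
    Filters=[{"Name": "tag:backup", "Values": ["true"]}]
)["Volumes"]

for vol in volumes:
    ec2.create_snapshot(
        VolumeId=vol["VolumeId"],
        Description="automated backup",
        TagSpecifications=[{
            "ResourceType": "snapshot",
            "Tags": [{"Key": "backup", "Value": "true"}],
        }],
    )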