Comprehensive coding & infrastructure rules for building, deploying, and operating horizontally scaled, load-balanced MCP (Model Context Protocol) back-ends.
Stop debugging mysterious MCP server failures at 3 AM. These load balancing rules transform fragile, single-instance MCP deployments into bulletproof, auto-scaling infrastructure that handles traffic spikes and server failures gracefully.
You've built an MCP server that works perfectly, until it doesn't. Your current deployment probably looks familiar: a single instance with no failover, capacity planned by hand, and no visibility into why requests fail.
The result? Sleepless nights, frustrated users, and infrastructure that becomes more fragile as your system grows.
These Cursor Rules provide battle-tested patterns for deploying horizontally scaled MCP servers with intelligent load balancing, automatic failover, and comprehensive monitoring. You'll build infrastructure that scales automatically and fails gracefully.
Before: One MCP server crash = complete outage
After: Automatic failover to healthy instances with <1s detection time
Before: Manual capacity planning and over-provisioning
After: HPA scales from 3 to 20 instances based on RPS and CPU metrics
Before: Mystery failures with no visibility
After: Structured logs with trace IDs, circuit breaker metrics, and distributed tracing
Before: White-knuckle deployments with potential downtime
After: Zero-downtime rolling updates with automatic rollback on health check failures
Your MCP service normally handles 100 RPS but suddenly receives 1000 RPS from a new client integration.
With These Rules:
# HPA automatically scales based on custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "50"  # Target 50 RPS per pod
Result: Kubernetes automatically scales from 3 to 20 pods in under 60 seconds. NGINX load balancer distributes traffic using least-connections algorithm, maintaining sub-200ms response times.
You need to deploy a hotfix during peak hours without impacting users.
With These Rules:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # Never reduce capacity
      maxSurge: 1        # Add one pod at a time
Result: Zero-downtime deployment with connection draining. Users experience no interruption while the fix rolls out.
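Connection draining works because Kubernetes removes the pod from Service endpoints while the old container keeps serving. A minimal sketch of the supporting Deployment fields (the sleep duration and grace period are illustrative values, not prescribed by the rules below):

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30  # time allowed for in-flight requests
      containers:
        - name: mcp-server
          lifecycle:
            preStop:
              exec:
                # Pause before SIGTERM so the load balancer stops routing
                # to this pod while it can still answer requests.
                command: ["sh", "-c", "sleep 5"]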
Your Redis cache becomes unavailable, threatening to cascade failures across all MCP instances.
With These Rules:
// Circuit breaker protects against cascade failures
breaker := gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:        "redis",
    MaxRequests: 3,                // probes allowed while half-open
    Interval:    30 * time.Second, // failure-count reset window while closed
    Timeout:     5 * time.Second,  // how long to stay open before half-open
    ReadyToTrip: func(c gobreaker.Counts) bool {
        return c.ConsecutiveFailures >= 3 // trip after 3 consecutive failures
    },
})

// Graceful degradation instead of complete failure
result, err := breaker.Execute(func() (interface{}, error) {
    return redisClient.Get(ctx, key).Result()
})
if err != nil {
    // Serve from local cache or return a computed result
    return fallbackHandler(ctx, request)
}
Result: The circuit breaker trips after 3 consecutive Redis failures; MCP servers continue operating with degraded functionality instead of crashing.
// cmd/server/main.go
package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    router := http.NewServeMux() // register MCP handlers here

    srv := &http.Server{
        Addr:              ":8080",
        ReadHeaderTimeout: 5 * time.Second,
        WriteTimeout:      30 * time.Second,
        IdleTimeout:       60 * time.Second,
        Handler:           router,
    }

    // Graceful shutdown for Kubernetes: drain connections on SIGTERM.
    go func() {
        c := make(chan os.Signal, 1)
        signal.Notify(c, os.Interrupt, syscall.SIGTERM)
        <-c
        ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
        defer cancel()
        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("shutdown: %v", err)
        }
    }()

    if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
        log.Fatalf("listen: %v", err)
    }
}
// internal/handler/health.go
func (h *Handler) HealthCheck(w http.ResponseWriter, r *http.Request) {
    // Fail the probe if a critical dependency is down.
    if err := h.redis.Ping(r.Context()).Err(); err != nil {
        http.Error(w, "Redis unavailable", http.StatusServiceUnavailable)
        return
    }
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(http.StatusOK)
    _ = json.NewEncoder(w).Encode(map[string]string{"status": "healthy"})
}
# deployments/k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
        - name: mcp-server
          image: mcp-server:latest
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 2
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
# configs/nginx.prod.conf
upstream mcp_servers {
    least_conn;
    keepalive 32;

    server mcp-server-1:8080 max_fails=3 fail_timeout=30s;
    server mcp-server-2:8080 max_fails=3 fail_timeout=30s;
    server mcp-server-3:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;

    location / {
        proxy_pass http://mcp_servers;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_next_upstream error timeout http_500 http_502 http_503;
    }
}
// internal/metrics/prometheus.go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests processed",
        },
        []string{"code", "method", "route"},
    )

    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "route"},
    )
)

func init() {
    // Register collectors so they appear on /metrics.
    prometheus.MustRegister(httpRequestsTotal, httpRequestDuration)
}
These rules transform your MCP deployment from a fragile single instance into enterprise-grade infrastructure that scales automatically, fails gracefully, and provides complete operational visibility. Implement them once, and focus on building features instead of fighting infrastructure fires.
You are an expert in Go, Kubernetes, NGINX, HAProxy, Prometheus, Grafana, Redis, and MCP architecture.
Key Principles
- Design for horizontal scalability; add new MCP instances without changing code.
- Minimize blast radius: isolate failures with circuit breakers and health checks.
- Principle of Least Privilege for every component (pods, load balancers, databases).
- Prefer immutable infrastructure: rebuild containers instead of patching in place.
- Automate everything: CI/CD pipelines, linting, security scanning, performance tests.
- Observability first: every feature ships with metrics, structured logs, and traces.
Go (backend service)
- Modules: always maintain go.mod/go.sum in repo root; pin exact versions.
- Directory layout: cmd/ internal/ pkg/ api/ configs/ deployments/.
- API layer: accept a context.Context in every public function to propagate deadlines & cancellation from the load balancer.
- Use net/http with http.Server{ReadHeaderTimeout:5s, WriteTimeout:30s, IdleTimeout: 60s}; never leave defaults.
- Implement graceful shutdown (server.Shutdown(ctx)) on SIGTERM so Kubernetes can drain connections.
- Use UUIDv4 (github.com/google/uuid) for session IDs; never reuse request-scoped IDs across sessions.
- Error values: wrap with fmt.Errorf("tag: %w", err) and classify (ErrBadRequest, ErrUnauthorized, ErrInternal); see the sketch after this list.
- Panic only in truly unrecoverable paths; recover in main() and emit a fatal log + metrics increment.
- Validation layer: use github.com/xeipuuv/gojsonschema for JSON-schema input validation.
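A minimal sketch of the classification pattern from the error-values rule above; the package path and Wrap helper are illustrative, and the multi-%w form requires Go 1.20+:

// internal/apierr/apierr.go (illustrative path)
package apierr

import (
    "errors"
    "fmt"
)

// Sentinel errors that handlers map to HTTP status codes.
var (
    ErrBadRequest   = errors.New("bad request")
    ErrUnauthorized = errors.New("unauthorized")
    ErrInternal     = errors.New("internal error")
)

// Wrap tags an error and chains it to a sentinel so callers can test
// with errors.Is without losing the original cause.
func Wrap(sentinel error, tag string, err error) error {
    return fmt.Errorf("%s: %w: %w", tag, sentinel, err)
}

A handler can then return apierr.Wrap(apierr.ErrBadRequest, "decode request", err), and middleware maps errors.Is(err, apierr.ErrBadRequest) to a 400.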
Error Handling and Validation
- First lines of every handler: 1) decode & validate request, 2) authZ check, 3) rate-limit.
- Return early on validation/authorization failures; avoid nested ifs.
- Rate limiting: implement a token bucket in Redis keyed by userID+IP; return 429 once the bucket is empty (see the sketch after this list).
- Authorization middleware queries PDP (Policy Decision Point) and caches allow/deny for TTL≤30s.
- Circuit breaker (github.com/sony/gobreaker): isolate downstream Redis/DB to prevent cascading outages.
- Emit structured log fields: trace_id, user_id, remote_ip, latency_ms, status_code.
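A minimal token-bucket sketch for the rate-limiting rule above, using go-redis v9 with a Lua script for atomicity; the key layout, refill parameters, and 60s idle expiry are illustrative assumptions:

// internal/ratelimiter/limiter.go (illustrative path)
package ratelimiter

import (
    "context"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

// tokenBucket refills tokens at rate/sec up to burst, atomically in Redis.
var tokenBucket = redis.NewScript(`
local key   = KEYS[1]
local rate  = tonumber(ARGV[1])
local burst = tonumber(ARGV[2])
local now   = tonumber(ARGV[3])

local data   = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(data[1]) or burst
local ts     = tonumber(data[2]) or now

tokens = math.min(burst, tokens + (now - ts) * rate)
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end
redis.call("HSET", key, "tokens", tokens, "ts", now)
redis.call("EXPIRE", key, 60)
return allowed
`)

// Allow reports whether the caller may proceed; handlers return 429 otherwise.
func Allow(ctx context.Context, rdb redis.Scripter, userID, ip string, rate, burst int) (bool, error) {
    key := fmt.Sprintf("mcp:ratelimit:%s:%s", userID, ip) // keyed by userID+IP, per the rule
    now := float64(time.Now().UnixMilli()) / 1000.0
    n, err := tokenBucket.Run(ctx, rdb, []string{key}, rate, burst, now).Int()
    if err != nil {
        return false, err
    }
    return n == 1, nil
}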
Kubernetes (orchestration)
- Deployment spec uses readinessProbe & livenessProbe on /healthz (200 OK) with 2s timeout; maxUnavailable=0 during RollingUpdate.
- Resources: set limits & requests; CPU request ≥ 250m, memory request ≥ 256Mi; keep ratio ≤ 1:2 req:limit.
- Autoscaling: HPA on custom metric http_requests_per_second and CPU≥70%.
- PodDisruptionBudget: minAvailable: 80% so cluster operations never drop capacity below a safe level (manifest sketch after this list).
- Env vars over ConfigMaps for non-secret configs; Secrets for tokens/keys (mount as tmpfs volume).
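A minimal PodDisruptionBudget sketch for the rule above; the name and label selector assume the mcp-server Deployment shown earlier:

# deployments/k8s/pdb.yaml (illustrative path)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mcp-server-pdb
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: mcp-server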
NGINX / HAProxy (software load balancer)
- Algorithm: least_conn for variable session lengths; fall back to round_robin.
- Keepalive ≥ 32 to reuse upstream TCP connections.
- Enable the PROXY protocol to preserve the original client IP end-to-end; the Go server must parse X-Forwarded-For.
- Rate limiting: limit_req_zone $binary_remote_addr zone=one:10m rate=100r/s; with limit_req zone=one burst=20 nodelay; (expanded below).
- Health checks: proxy_next_upstream error timeout http_500 http_502 http_503 non_idempotent.
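The rate-limit rule above expands to roughly the following NGINX config; the 10m zone size and the 429 overflow status are illustrative choices:

# http context: shared zone keyed by client address, 100 req/s steady rate
limit_req_zone $binary_remote_addr zone=one:10m rate=100r/s;

server {
    location / {
        limit_req zone=one burst=20 nodelay;  # absorb short bursts without queuing delay
        limit_req_status 429;                 # reject overflow with 429 instead of 503
        proxy_pass http://mcp_servers;
    }
}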
Cloud LB (AWS/GCP/Azure)
- Use a network-level (L4) LB in front; pass traffic to NGINX when L7 features are needed.
- Enable sticky sessions ONLY when stateful affinity is mandatory; prefer stateless.
- Configure TLS termination at LB; enforce TLS1.2+, disable weak ciphers (RC4,3DES).
Redis (distributed cache & rate limiter)
- Use Redis Cluster with at least 3 masters; enable client-side key hashing to avoid a single-slot hotspot.
- All cached items carry namespace prefix (e.g., "mcp:auth:session:<uuid>").
- Set timeouts: DialTimeout=500ms, ReadTimeout=200ms, WriteTimeout=200ms (client sketch below).
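A minimal go-redis v9 client setup matching the timeout and namespacing rules above; the node addresses, ctx, sessionID, and payload variables are illustrative:

rdb := redis.NewClusterClient(&redis.ClusterOptions{
    Addrs:        []string{"redis-0:6379", "redis-1:6379", "redis-2:6379"},
    DialTimeout:  500 * time.Millisecond,
    ReadTimeout:  200 * time.Millisecond,
    WriteTimeout: 200 * time.Millisecond,
})

// Namespace-prefixed key per the convention above.
err := rdb.Set(ctx, "mcp:auth:session:"+sessionID, payload, 30*time.Minute).Err()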
Observability (Prometheus & Grafana)
- Expose /metrics; include: process_cpu_seconds_total, go_goroutines, http_requests_total{code,route,method}, redis_latency_ms.
- Define SLO: 99.9% of requests < 250ms; Grafana alert after 3 consecutive 5-min windows breaching SLO.
- Log format: JSON with RFC3339 timestamp, severity, component, msg, fields (see the sketch below).
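A minimal structured-logging sketch using the standard library's log/slog (Go 1.21+), which emits JSON with an RFC3339 time field by default; traceID, userID, start, and status are illustrative variables:

logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
logger.Info("request completed",
    "component", "handler",
    "trace_id", traceID,
    "user_id", userID,
    "remote_ip", r.RemoteAddr,
    "latency_ms", time.Since(start).Milliseconds(),
    "status_code", status,
)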
Testing
- Unit tests ≥ 90% pkg/ coverage; run go test -race in CI.
- Integration tests spin up ephemeral Redis & NGINX via docker-compose (compose sketch after this list).
- Chaos tests: inject 500ms latency and 50% packet loss between LB and MCP; assert circuit breaker trips.
- Load test: k6 script, target 2× expected peak RPS for 1h; success criteria <1% error, P99 latency <2× baseline.
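A minimal docker-compose sketch for the ephemeral test dependencies above; image tags and host ports are illustrative:

# test/integration/docker-compose.yaml (illustrative path)
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  nginx:
    image: nginx:1.25-alpine
    volumes:
      - ../../configs/nginx.prod.conf:/etc/nginx/conf.d/default.conf:ro
    ports:
      - "8081:80"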
Performance & Scaling
- Prefer bulkheads: separate pools for slow and fast endpoints (e.g., /export large files vs /api quick calls).
- Tune GOMAXPROCS to CPU cores; use runtime/pprof to profile hotspots before scaling wider.
- Cache idempotent GETs at NGINX with a 7200s TTL; add a Cache-Control: public, max-age=7200 header (config sketch below).
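A minimal NGINX caching sketch for the rule above; the cache path, zone size, and /api/ location are illustrative:

# http context
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=mcp_cache:10m max_size=1g;

server {
    location /api/ {
        proxy_cache mcp_cache;
        proxy_cache_methods GET HEAD;  # only idempotent methods
        proxy_cache_valid 200 2h;      # 7200s TTL
        add_header Cache-Control "public, max-age=7200";
        proxy_pass http://mcp_servers;
    }
}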
Security Hardening
- Mutual TLS between LB and MCP when traffic crosses VPC boundaries.
- Set Content-Security-Policy and X-Frame-Options: DENY headers on responses (middleware sketch after this list).
- Run gosec and trivy scans on every commit; block merge on critical findings.
- Rotate secrets using Kubernetes Secrets + external secret manager (AWS Secrets Manager, GCP Secret Manager).
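A minimal Go middleware sketch adding the headers named above; the CSP policy value is an illustrative default, not a prescription:

// SecurityHeaders adds hardening headers to every response.
func SecurityHeaders(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Security-Policy", "default-src 'self'")
        w.Header().Set("X-Frame-Options", "DENY")
        next.ServeHTTP(w, r)
    })
}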
Directory & File Naming
- Directories lowercase with dashes: mcp-server, rate-limiter, kube-deploy.
- Config files: environment suffix, e.g., nginx.prod.conf, hpa-dev.yaml.
Example File Structure
mcp-server/
  cmd/
    server/
      main.go
  internal/
    auth/
    handler/
    rate-limiter/
  api/
    openapi.yaml
  configs/
    nginx.prod.conf
    hpa.yaml
  deployments/
    k8s/
      deployment.yaml
      service.yaml
  test/
    integration/
  Dockerfile
  Makefile
  README.md
Follow these rules to deliver a secure, observable, and elastically-scalable MCP backend capable of handling high throughput with minimal latency.