Comprehensive Rules for designing, configuring, and operating microservices on Kubernetes using a service-mesh (Istio/Linkerd/Consul/Cilium) with full observability, zero-trust security, and Git-Ops workflows.
Your microservices architecture is drowning you in operational complexity. Service discovery breaks during deployments. Security policies are inconsistent across teams. Debugging distributed failures takes hours, not minutes. You're spending more time managing infrastructure than shipping features.
The Problem: As your microservices scale beyond 10-15 services, you hit a complexity wall that conventional container orchestration alone can't solve.
The Real Impact: Teams spend 40-60% of their time on infrastructure concerns instead of business logic. Mean time to resolution for production issues averages 2-4 hours because debugging requires manual correlation across multiple systems.
These Cursor Rules transform your microservices from a collection of loosely connected services into a cohesive, observable, and secure distributed system by moving all cross-cutting infrastructure concerns into the service mesh layer.
What This Gives You:
# Adding a new service dependency
1. Update service discovery configuration
2. Implement retry logic and circuit breakers
3. Add authentication/authorization logic
4. Configure monitoring and alerting
5. Test failure scenarios manually
# One VirtualService declares routing, retries, and fault injection
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-service
spec:
  hosts:
    - checkout
  http:
    # Header-based canary: requests carrying x-canary: "true" go straight to v2
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: checkout
            subset: v2
          weight: 100
    # Catch-all route (kept last): 90/10 weighted split between v1 and v2
    - fault:
        delay:
          percentage:
            value: 0.1   # percent of requests, i.e. 0.1%
          fixedDelay: 5s
      route:
        - destination:
            host: checkout
            subset: v1
          weight: 90
        - destination:
            host: checkout
            subset: v2
          weight: 10
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: 5xx,connect-failure
Without Service Mesh: Incident starts with "payments are failing." You spend 90 minutes correlating logs across 8 services, checking network configs, and manually tracing request flows.
With These Rules: Jaeger trace shows the exact failure point in 30 seconds. Grafana dashboard reveals the payment service's circuit breaker tripped due to database connection timeouts. Root cause identified in under 5 minutes.
Without Service Mesh: Canary deployment requires custom load balancer rules, manual traffic shifting scripts, and prayer that your rollback procedure works when things go wrong.
With These Rules:
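# Shift 10% of traffic to v2, then watch the error rate.
# If the error rate reaches 1%, the analysis fails and the rollout aborts automatically.
# A p99 latency check (> 300ms) would be a second metric of the same shape.
# Abridged: the Rollout's selector/template and the metric's provider query are omitted.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {}   # hold at 10% until promoted
      analysis:
        templates:
          - templateName: error-rate
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  metrics:
    - name: error-rate
      interval: 30s
      count: 5
      successCondition: result[0] < 0.01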
Without Service Mesh: Manual audit of authentication logic across 25+ services, inconsistent TLS configurations, and shared secrets stored in various config systems.
With These Rules: Single PeerAuthentication policy enforces mTLS cluster-wide. AuthorizationPolicy provides zero-trust networking with namespace-level isolation. Automatic certificate rotation with audit trails.
# Istio with full observability
istioctl install --set values.pilot.env.EXTERNAL_ISTIOD=false
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.19/samples/addons/prometheus.yaml
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.19/samples/addons/grafana.yaml
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.19/samples/addons/jaeger.yaml
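# Namespace configuration
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    istio-injection: enabled
  annotations:
    # Linkerd-only proxy resource overrides; Istio ignores these.
    # With Istio, use the pod-level sidecar.istio.io/proxyCPU and
    # sidecar.istio.io/proxyMemory annotations instead.
    config.linkerd.io/proxy-cpu-request: "50m"
    config.linkerd.io/proxy-memory-request: "64Mi"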
# Default deny-all policy
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: default-deny
  namespace: production
spec: {}
---
# Strict mTLS enforcement
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT
# Circuit breaker and outlier detection
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    # Istio has no circuitBreaker field; outlierDetection ejects failing hosts
    outlierDetection:
      consecutive5xxErrors: 5
      consecutiveGatewayErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 100
Week 1: Automatic mTLS eliminates manual certificate management. Distributed tracing reduces debugging time by 70%.
Week 2: Circuit breakers prevent cascade failures. Mean time to detection drops from 15 minutes to 2 minutes with automated alerting.
Month 1: Canary deployments become routine. Zero-downtime deployments increase from 60% to 95% success rate.
Month 3: Developer velocity increases 40% as teams focus on business logic instead of infrastructure concerns. Production incidents decrease by 50% due to proactive circuit breaker protection.
These rules don't just add service mesh capabilities—they transform how your team operates distributed systems. You'll move from reactive fire-fighting to proactive system management, from manual deployment procedures to automated rollouts, and from hours-long debugging sessions to minute-level root cause analysis.
Your microservices architecture becomes what it was meant to be: a way to ship features faster, not slower. Install these rules and start building the infrastructure your team deserves.
You are an expert in:
- Kubernetes ≥1.27
- Service-Mesh platforms: Istio 1.19+, Linkerd 2.14+, Consul Service Mesh 1.16+, Cilium Service Mesh ≥1.14
- Cloud-native observability stack: Prometheus, Grafana, Loki, OpenTelemetry Collector
- Git-Ops & declarative tooling: Kustomize, Helm, ArgoCD, Flux
Key Principles
- Apply zero-trust networking by default (strict mTLS, least-privilege AuthorizationPolicy).
- Keep all infrastructure definitions declarative and version-controlled (Git-Ops).
- Decouple network concerns from code; rely on mesh features (retries, time-outs, circuit breaking).
- Prefer small, single-purpose workloads that expose health probes and structured logs.
- Automate deployment, validation, rollback, and certificate rotation.
- Treat observability (metrics, logs, traces) as a first-class feature; fail the build if telemetry is missing.
- Fail fast: validate manifests with kube-linter and `istioctl analyze` in CI before merge.
YAML & Kubernetes Manifests
- Indent with **2 spaces**, never tabs; resource names in `kebab-case` (field names keep the API's camelCase).
- One logical resource per file; filename mirrors Kind and name: `virtualservice-checkout.yaml`.
- Mandatory labels: `app.kubernetes.io/name`, `app.kubernetes.io/component`, `app.kubernetes.io/version`, `env`.
- Use annotations for mesh controls (e.g., `sidecar.istio.io/inject: "true"`).
- Use anchors & aliases for port lists and common metadata to eliminate repetition.
- Keep container images immutable (`image: registry.example.com/app:v1.2.3`), never `latest`.
- Default requests/limits for sidecars: cpu 50m/200m, memory 64Mi/256Mi; override only with justification.
Example (strict mTLS namespace):
```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    mode: STRICT
```
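The manifest conventions above can be combined in a single workload. The following is a minimal sketch rather than a reference implementation: the `checkout` name, the `api` component, and port 8080 are hypothetical, and the `sidecar.istio.io/proxy*` annotations are Istio's pod-level overrides for sidecar requests and limits.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout                      # hypothetical workload name
  namespace: payments
  labels: &common-labels              # anchor reused below to avoid repetition
    app.kubernetes.io/name: checkout
    app.kubernetes.io/component: api
    app.kubernetes.io/version: "1.2.3"
    env: prod
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: checkout
  template:
    metadata:
      labels: *common-labels
      annotations:
        sidecar.istio.io/inject: "true"
        # Sidecar defaults from the rules above: cpu 50m/200m, memory 64Mi/256Mi
        sidecar.istio.io/proxyCPU: "50m"
        sidecar.istio.io/proxyCPULimit: "200m"
        sidecar.istio.io/proxyMemory: "64Mi"
        sidecar.istio.io/proxyMemoryLimit: "256Mi"
    spec:
      containers:
        - name: checkout
          image: registry.example.com/app:v1.2.3   # pinned tag, never latest
          ports:
            - containerPort: 8080                  # assumed application port
```

The selector deliberately matches only `app.kubernetes.io/name`, so bumping the version label on a release does not change the immutable selector.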
Error Handling & Validation
- Configure **circuit breakers** and **outlier detection** via `DestinationRule`:
  - `consecutive5xxErrors: 5`
  - `interval: 5s`, `baseEjectionTime: 30s`, `maxEjectionPercent: 100`
- Define **retries** in `VirtualService` (`attempts: 3`, `perTryTimeout: 2s`, `retryOn: 5xx,connect-failure`).
- Point liveness probes at **/healthz** and readiness probes at **/readyz**; probe timeouts ≤1s.
- Emit structured JSON logs containing `trace_id`, `span_id`, and `severity`.
- Enforce schema validation (`kubeconform -strict`, `istioctl validate`) in CI; reject manifests with unknown fields.
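The probe rule can be sketched as a container spec. This is an illustrative fragment under assumptions: the Pod name, port 8080, and the probe periods are hypothetical; only the endpoint paths and the 1 s timeout come from the rules above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: checkout-probe-demo           # hypothetical name
  labels:
    app.kubernetes.io/name: checkout
spec:
  containers:
    - name: checkout
      image: registry.example.com/app:v1.2.3
      ports:
        - containerPort: 8080         # assumed application port
      livenessProbe:                  # restart the container if it stops answering
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10             # assumed period
        timeoutSeconds: 1
      readinessProbe:                 # remove the pod from endpoints while not ready
        httpGet:
          path: /readyz
          port: 8080
        periodSeconds: 5              # assumed period
        timeoutSeconds: 1
```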
Istio Rules
- **Traffic routing**: Use `VirtualService` for canary (% based) and header-based routing. Keep route rules deterministically ordered; final catch-all route last.
- **DestinationRule**: always pair with a `VirtualService`; specify `tls.mode: ISTIO_MUTUAL`.
- **AuthorizationPolicy**: default deny; allow by namespace selector or JWT claim.
- **Telemetry API**: keep prod telemetry volume low, e.g. sample 5 ‰ of traces (`randomSamplingPercentage: 0.5`) and restrict access logs with a `filter` expression.
- **Sidecar** resource: Scope egress to namespace hosts only to shrink Envoy config.
- **EnvoyFilter**: last resort only; document the rationale in an RFC before use.
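Two of these rules, namespace-scoped allow policies and egress scoping, might look like the following. Treat it as a sketch under assumptions: the `checkout` and `payments` namespaces, the `payment-service` label, and the `/charge` path are hypothetical.

```yaml
# Allow only workloads from the checkout namespace to call payment-service.
# Everything else stays blocked by the namespace's default-deny policy.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-checkout-to-payments
  namespace: payments
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: payment-service
  action: ALLOW
  rules:
    - from:
        - source:
            namespaces: ["checkout"]
      to:
        - operation:
            methods: ["POST"]
            paths: ["/charge"]        # hypothetical endpoint
---
# Limit the Envoy config pushed to sidecars in this namespace to
# same-namespace hosts plus the control-plane namespace.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: payments
spec:
  egress:
    - hosts:
        - "./*"
        - "istio-system/*"
```

Source-namespace matching relies on mTLS identities, so it only works as intended where strict mTLS is enforced.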
Linkerd Rules
- Proxy auto-injection via namespace annotation `linkerd.io/inject: enabled`.
- Use `ServiceProfile` to declare expected routes and time-outs.
- Enable **Tap** only in non-prod clusters; auto-disable through policy in prod.
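A `ServiceProfile` for a single route could be sketched as below. The service name, the route, and the budget values are hypothetical; the profile name must be the service's fully qualified DNS name.

```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  # Must match the FQDN of the Service this profile describes
  name: checkout.production.svc.cluster.local
  namespace: production
spec:
  routes:
    - name: POST /orders              # hypothetical route
      condition:
        method: POST
        pathRegex: /orders
      timeout: 2s
      isRetryable: false              # POST is not idempotent here, so no retries
  retryBudget:                        # caps retries for routes that are retryable
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s
```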
Consul Service Mesh Rules
- Use `ProxyDefaults` for uniform mTLS and time-outs.
- Register services with health checks; a failing check should take the instance out of service discovery results within 5 s.
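A `ProxyDefaults` resource along these lines is one way to express uniform settings. This is a hedged sketch: the timeout values are hypothetical, and `mutualTLSMode` is assumed to be available in your Consul version (1.16+), so check the field names against the CRD installed in your cluster.

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global                        # ProxyDefaults must be named "global"
spec:
  mutualTLSMode: strict               # assumption: supported by your Consul version
  config:
    protocol: http
    local_connect_timeout_ms: 1000    # hypothetical values
    local_request_timeout_ms: 2000
```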
Cilium Mesh Rules
- Define **NetworkPolicies** using Cilium CRDs for L3-L7 enforcement.
- Enable Hubble for flow visibility; retain 24 h of flow logs.
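An L7-aware policy using Cilium's CRD might look like this. It is a minimal sketch: the labels, port 8080, and the `/charge` path are hypothetical.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: payments-allow-checkout
  namespace: payments
spec:
  endpointSelector:                   # pods this policy protects
    matchLabels:
      app.kubernetes.io/name: payment-service
  ingress:
    - fromEndpoints:                  # L3: only checkout pods in the same namespace
        - matchLabels:
            app.kubernetes.io/name: checkout
      toPorts:
        - ports:                      # L4: a single TCP port
            - port: "8080"
              protocol: TCP
          rules:
            http:                     # L7: only POST /charge is allowed
              - method: POST
                path: "/charge"
```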
Observability
- Prometheus scrape configs auto-generated by mesh add-ons; verify target count in CI.
- Use OpenTelemetry sidecar/agent; export traces to OTLP endpoint.
- Grafana dashboards: import `Istio Control Plane`, `Istio Service Mesh`, and custom RED metrics.
- Alerting: page when p50 latency exceeds 500 ms for 5 min, or when the error rate stays at or above 5 % for 2 min.
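The alerting thresholds can be expressed as Prometheus alerting rules over Istio's standard metrics. The group name, alert names, and `severity` label are hypothetical, and the sketch assumes destination-side reporting.

```yaml
groups:
  - name: mesh-slo-alerts             # hypothetical group name
    rules:
      - alert: MeshHighMedianLatency
        expr: |
          histogram_quantile(
            0.50,
            sum by (le, destination_service) (
              rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])
            )
          ) > 500
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "p50 latency above 500 ms for {{ $labels.destination_service }}"
      - alert: MeshHighErrorRate
        expr: |
          sum by (destination_service) (
            rate(istio_requests_total{reporter="destination", response_code=~"5.."}[2m])
          )
          /
          sum by (destination_service) (
            rate(istio_requests_total{reporter="destination"}[2m])
          ) >= 0.05
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "5xx rate at or above 5 % for {{ $labels.destination_service }}"
```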
Security
- Rotate the root CA every 90 d and workload certs every 24 h (24 h is the Istiod default for workload certs).
- Disallow plain HTTP via `PeerAuthentication` + `Strict-Transport-Security` header.
- Scan container images (Trivy) and Envoy filters for CVEs during CI.
Performance & Resource Optimization
- Set `values.global.proxy.resources` in the `IstioOperator` or Helm values to enforce global sidecar requests and limits.
- Enable **Locality-aware load balancing** for multi-zone clusters.
- Benchmark with `fortio` or `wrk2`; target p99 latency <300 ms.
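For locality-aware load balancing, a `DestinationRule` along these lines is a reasonable starting point. The `checkout` host is hypothetical; outlier detection is included because Istio needs it to decide when to fail over to another locality.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout                      # hypothetical service
  namespace: production
spec:
  host: checkout
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true                 # prefer endpoints in the caller's zone
    outlierDetection:                 # required for failover between localities
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
```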
Testing Strategies
- Chaos engineering: employ Litmus or ChaosMesh to kill sidecars, corrupt DNS, inject latency.
- Validate circuit breakers by flooding service until breaker trips; assert HTTP 503.
- Use `istioctl experimental wait` to verify config sync before running integration tests.
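A latency-injection experiment with Chaos Mesh could be sketched as follows. The namespaces, target label, and the delay and duration values are hypothetical.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payments-latency
  namespace: chaos-testing            # hypothetical namespace for experiments
spec:
  action: delay                       # inject network latency
  mode: all                           # affect every pod the selector matches
  selector:
    namespaces:
      - stage
    labelSelectors:
      app.kubernetes.io/name: payment-service
  delay:
    latency: "200ms"
    jitter: "20ms"
  duration: "5m"                      # experiment ends on its own after 5 minutes
```

Running this in a stage cluster while load is applied shows whether the configured retries and outlier detection keep error rates within budget.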
CI/CD & Automation
- Kustomize overlays per environment (`base/`, `overlays/stage`, `overlays/prod`).
- ArgoCD sync-windows block changes during peak traffic.
- Require human approval for canary >25 % traffic shift in prod.
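A sync window that blocks changes during peak traffic can be declared on an ArgoCD `AppProject`. This sketch is abridged (source repos and destinations are omitted), and the project name, schedule, and time zone are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: mesh                          # hypothetical project
  namespace: argocd
spec:
  syncWindows:
    - kind: deny                      # block syncs during the window
      schedule: "0 9 * * 1-5"         # weekdays from 09:00 (assumed peak hours)
      duration: 8h
      timeZone: "UTC"
      applications:
        - "*"
      manualSync: true                # still allow a deliberate manual sync
```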
Common Pitfalls & Anti-patterns
- Anti: Skipping `DestinationRule` → causes TLS mismatch errors.
- Anti: Global wildcard `AuthorizationPolicy` allow → breaks zero trust.
- Anti: Unbounded retries; always cap attempts and budget.
File/Directory Layout (example)
```
service-mesh/
├─ base/
│  ├─ namespace.yaml
│  ├─ peer-auth.yaml
│  └─ kustomization.yaml
├─ overlays/
│  ├─ stage/
│  │  └─ kustomization.yaml
│  └─ prod/
│     └─ kustomization.yaml
└─ dashboards/
   └─ istio-service-latency.json
```
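The two kinds of `kustomization.yaml` in this layout might contain something like the following. It is a sketch shown as two YAML documents for brevity (in the repository they are separate files), and the `env: prod` label pair is an assumption.

```yaml
# service-mesh/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - peer-auth.yaml
---
# service-mesh/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                        # reuse everything defined in base/
labels:
  - pairs:
      env: prod                       # assumed environment label
```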
Versioning & Documentation
- Adopt semantic versioning (`vMAJOR.MINOR.PATCH`) for app and mesh bundles.
- Document each VirtualService with use-case, owner, last reviewed date.
Checklist Before Merge
- [ ] `kubectl kustomize . | kubeconform -strict`
- [ ] `istioctl analyze` returns no error/warning.
- [ ] Trivy scan passes (no HIGH or CRITICAL findings).
- [ ] Grafana dashboards updated if metrics schema changed.
- [ ] CHANGELOG.md entry added.