Actionable coding rules for implementing the Circuit Breaker pattern with Resilience4j in Java/Spring Boot microservices and an Istio service mesh.
Your microservices are one cascading failure away from a complete outage: when one service slows down, the backlog ripples through every caller until the whole application stack goes dark. You need resilience patterns that fail fast and recover gracefully.
You've built a solid microservices architecture, but without proper circuit breakers you are still exposed to critical failure modes: slow downstream dependencies, saturated thread pools, and retry storms that turn one outage into many.
In effect, you're running a distributed system with no safety mechanisms.
These Cursor Rules implement the Circuit Breaker pattern using Resilience4j and Istio, creating multiple layers of protection that automatically detect failures and prevent cascade effects. Instead of letting one service failure ripple through your entire system, you get controlled degradation with automatic recovery.
The rules combine application-level circuit breakers with service mesh-level outlier detection, giving you both fine-grained control and infrastructure-level protection.
Incident Response Time: 80% Reduction
Development Velocity: 3x Faster
System Reliability: 99.9% Uptime
Before: Payment service calls take 30 seconds to timeout when the external processor is down, saturating all thread pools and bringing down checkout.
After: Circuit breaker detects payment failures within 10 seconds, switches to fallback (order queuing), and checkout remains operational.
@Service
public class PaymentService {

    @CircuitBreaker(name = "paymentProcessor", fallbackMethod = "fallbackPayment")
    @Retry(name = "paymentProcessor")
    public PaymentResult processPayment(PaymentRequest request) {
        return externalPaymentApi.charge(request);
    }

    // Fallback must match the original signature plus a trailing exception parameter.
    public PaymentResult fallbackPayment(PaymentRequest request, Exception ex) {
        log.warn("Payment circuit open, queuing order: {}", request.getOrderId());
        return PaymentResult.queued(request.getOrderId());
    }
}
Before: Guessing at failure thresholds leads to either false positives (breaker trips on occasional glitches) or delayed detection (breaker stays closed during real outages).
After: Data-driven threshold tuning using real incident metrics and automated testing.
@Test
void breakerOpensUnderRealLoadConditions() {
    CircuitBreaker cb = CircuitBreaker.of("paymentProcessor",
            CircuitBreakerConfig.custom()
                    .failureRateThreshold(50)
                    .slidingWindowSize(100)
                    .minimumNumberOfCalls(20)
                    .build());

    // Simulate real failure pattern from production incident
    simulateHighFailureRate(cb, 60); // 60% failure rate

    assertEquals(State.OPEN, cb.getState());
}
Before: Application circuit breakers and Kubernetes health checks work independently, creating inconsistent failure handling.
After: Synchronized circuit breaker configuration across application code and Istio service mesh.
# application.yaml
resilience4j:
  circuitbreaker:
    instances:
      inventoryService:
        failureRateThreshold: 50
        slidingWindowSize: 60
        waitDurationInOpenState: 30s
---
# istio-destination-rule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: inventory-outlier-detection
spec:
  host: inventory
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
implementation "io.github.resilience4j:resilience4j-spring-boot3"
implementation "org.springframework.boot:spring-boot-starter-actuator"
Create application.yaml with traffic-appropriate thresholds:
resilience4j:
  circuitbreaker:
    configs:
      default:
        registerHealthIndicator: true
        slidingWindowType: TIME_BASED
        slidingWindowSize: 60
        minimumNumberOfCalls: 100
        failureRateThreshold: 50
        waitDurationInOpenState: 30s
    instances:
      inventoryService:
        baseConfig: default
        minimumNumberOfCalls: 20 # Low traffic service
      paymentProcessor:
        baseConfig: default
        failureRateThreshold: 30 # Critical service, fail faster
@Service
public class InventoryService {

    private final InventoryClient inventoryClient;               // downstream client (type assumed)
    private final CircuitBreakerRegistry circuitBreakerRegistry; // auto-configured by resilience4j-spring-boot3

    public InventoryService(InventoryClient inventoryClient, CircuitBreakerRegistry circuitBreakerRegistry) {
        this.inventoryClient = inventoryClient;
        this.circuitBreakerRegistry = circuitBreakerRegistry;
    }

    @CircuitBreaker(name = "inventoryService", fallbackMethod = "fallbackInventory")
    public Inventory getInventory(String sku) {
        return inventoryClient.fetchInventory(sku);
    }

    public Inventory fallbackInventory(String sku, Exception ex) {
        log.warn("Inventory circuit open for sku: {}, state: {}",
                sku, circuitBreakerRegistry.circuitBreaker("inventoryService").getState());
        return Inventory.unavailable(sku);
    }
}
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: inventory-outlier-detection
spec:
  host: inventory
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
# prometheus-alerts.yaml
groups:
  - name: circuit-breaker
    rules:
      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state{state="open"} == 1
        for: 30s
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is open"
      - alert: HighCircuitBreakerOpenRate
        expr: increase(resilience4j_circuitbreaker_state_transition_total{to="open"}[5m]) > 3
        annotations:
          summary: "High circuit breaker open rate for {{ $labels.name }}"
Your microservices architecture becomes antifragile—not only surviving failures but using them to strengthen system resilience. Start with your most critical service dependencies and expand coverage based on actual incident patterns.
You are an expert in Java 17+, Spring Boot 3.x, Resilience4j, Hystrix (legacy), and Istio service mesh.
Key Principles
- Fail fast, degrade gracefully: prefer immediate rejection to long waits.
- Tune based on real incident metrics, not single glitches.
- Combine Circuit Breaker with retries, timeouts, bulkheads (see the decorator sketch after this list).
- Keep breaker state in process memory; never persist to a DB.
- Visibility first: every state change MUST be exported as a metric/event.
- Prefer declarative configuration; keep breaker config outside business code.
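A minimal sketch of combining a circuit breaker with retry and bulkhead via the Resilience4j Decorators helper. The "inventoryService" name, the remoteCall() stub, and the class name are assumptions; in Spring Boot the same layering is normally declared with the annotations and YAML shown later rather than coded by hand.
```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;

import java.util.function.Supplier;

// Sketch only: programmatic combination of the three policies around one call.
class ResilienceDecoratorsSketch {

    // Stand-in for the real downstream call (assumption for this sketch).
    static String remoteCall() {
        return "inventory-payload";
    }

    public static void main(String[] args) {
        CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("inventoryService");
        Retry retry = Retry.ofDefaults("inventoryService");
        Bulkhead bulkhead = Bulkhead.ofDefaults("inventoryService");

        Supplier<String> guarded = Decorators
                .ofSupplier(ResilienceDecoratorsSketch::remoteCall)
                .withCircuitBreaker(circuitBreaker) // record success/failure against the breaker
                .withBulkhead(bulkhead)             // cap concurrent in-flight calls
                .withRetry(retry)                   // retry transient failures
                .decorate();

        System.out.println(guarded.get());
    }
}
```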
Java
- Use records or immutable POJOs for breaker config DTOs.
- Place circuit-breaker annotated public methods in @Service classes only.
- Name breaker instances <downstreamService>Service (e.g., inventoryService).
- Use enums for explicit State { CLOSED, OPEN, HALF_OPEN }.
- Keep @CircuitBreaker, @Retry, @Bulkhead on separate wrapper methods to avoid tangled policies (see the wrapper sketch after this list).
- All package names lowercase: com.company.<domain>.resilience.
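One hedged reading of the wrapper-method rule: give each policy its own Spring-proxied bean method, so the annotations compose cleanly and self-invocation through `this` does not bypass the proxies. The class and type names (InventoryGateway, InventoryCall, InventoryClient, Inventory) are illustrative assumptions.
```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import org.springframework.stereotype.Service;

// Illustrative types; in a real module these live in their own files.
record Inventory(String sku, boolean available) {
    static Inventory unavailable(String sku) {
        return new Inventory(sku, false);
    }
}

interface InventoryClient {
    Inventory fetchInventory(String sku);
}

// Outer wrapper: retry policy only.
@Service
class InventoryGateway {

    private final InventoryCall inventoryCall;

    InventoryGateway(InventoryCall inventoryCall) {
        this.inventoryCall = inventoryCall;
    }

    @Retry(name = "inventoryService")
    public Inventory getInventory(String sku) {
        return inventoryCall.fetch(sku);
    }
}

// Inner wrapper: circuit breaker with its own fallback.
@Service
class InventoryCall {

    private final InventoryClient inventoryClient;

    InventoryCall(InventoryClient inventoryClient) {
        this.inventoryClient = inventoryClient;
    }

    @CircuitBreaker(name = "inventoryService", fallbackMethod = "fallback")
    public Inventory fetch(String sku) {
        return inventoryClient.fetchInventory(sku);
    }

    public Inventory fallback(String sku, Exception ex) {
        return Inventory.unavailable(sku);
    }
}
```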
Error Handling and Validation
- Handle parameter validation BEFORE circuit-breaker invocation.
- When fallbackMethod is triggered, log WARN with correlation-id & state.
- Fallbacks must never swallow InterruptedException.
- Return neutral or cached data in fallback; never null.
- Ensure TimeoutException and CallNotPermittedException are mapped to meaningful HTTP 503/429 (see the handler sketch below).
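A sketch of that mapping using a standard @RestControllerAdvice. The exact status codes (503 vs 429), the Retry-After value, and the messages are team choices, not mandated by these rules.
```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

import java.util.concurrent.TimeoutException;

@RestControllerAdvice
public class ResilienceExceptionHandler {

    // Breaker is OPEN: tell clients to back off instead of retrying immediately.
    @ExceptionHandler(CallNotPermittedException.class)
    public ResponseEntity<String> handleOpenBreaker(CallNotPermittedException ex) {
        return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE)
                .header("Retry-After", "30")
                .body("Dependency temporarily unavailable");
    }

    // Downstream timed out: surface 503 (or 429 if callers should throttle themselves).
    @ExceptionHandler(TimeoutException.class)
    public ResponseEntity<String> handleTimeout(TimeoutException ex) {
        return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE)
                .body("Dependency timed out");
    }
}
```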
Resilience4j (Spring Boot)
- Dependency: implementation "io.github.resilience4j:resilience4j-spring-boot3"
- Annotate:
@CircuitBreaker(name = "inventoryService", fallbackMethod = "fallbackInventory")
public Inventory getInventory(String sku) { ... }
- application.yaml skeleton:
```yaml
resilience4j:
  circuitbreaker:
    configs:
      default:
        registerHealthIndicator: true
        slidingWindowType: TIME_BASED
        slidingWindowSize: 60
        minimumNumberOfCalls: 100
        failureRateThreshold: 50
        waitDurationInOpenState: 30s
    instances:
      inventoryService:
        baseConfig: default
```
- Publish Micrometer metrics; enable endpoint /actuator/metrics/resilience4j.circuitbreaker.state.
- Add a HealthIndicator to surface OPEN breakers in /actuator/health (see the sketch below).
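With registerHealthIndicator: true, Resilience4j already contributes per-instance indicators; the sketch below is only an optional, hand-rolled aggregate that lists every OPEN breaker by name (class and bean name are assumptions).
```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.List;

@Component
public class OpenBreakersHealthIndicator implements HealthIndicator {

    private final CircuitBreakerRegistry registry;

    public OpenBreakersHealthIndicator(CircuitBreakerRegistry registry) {
        this.registry = registry;
    }

    @Override
    public Health health() {
        // Collect the names of all breakers currently in the OPEN state.
        List<String> open = new ArrayList<>();
        for (CircuitBreaker cb : registry.getAllCircuitBreakers()) {
            if (cb.getState() == CircuitBreaker.State.OPEN) {
                open.add(cb.getName());
            }
        }
        return open.isEmpty()
                ? Health.up().build()
                : Health.down().withDetail("openCircuitBreakers", open).build();
    }
}
```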
Hystrix (legacy/support)
- Only use for existing code. Migrate new modules to Resilience4j.
- Keep thread-isolation strategy set to SEMAPHORE unless blocking IO.
Istio / Service Mesh
- DestinationRule example:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: inventory-cb
spec:
  host: inventory
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```
- Mirror application-level breaker thresholds with mesh-level OutlierDetection.
- Use distributed tracing (Zipkin/Jaeger) to correlate OPEN state across pods.
Testing
- Unit: JUnit 5 + Resilience4j Test.
```java
@Test
void breakerOpensAfterFailures() {
    // Default config needs 100 recorded calls before evaluating, so use a small window here.
    CircuitBreaker cb = CircuitBreaker.of("demo", CircuitBreakerConfig.custom()
            .slidingWindowSize(5)
            .minimumNumberOfCalls(5)
            .failureRateThreshold(50)
            .build());
    IntStream.range(0, 5).forEach(i -> cb.onError(0, TimeUnit.MILLISECONDS, new IOException()));
    assertEquals(State.OPEN, cb.getState());
}
```
- Load: k6 or Gatling to verify OPEN/HALF_OPEN under stress.
- Chaos: use ChaosMonkey for Spring Boot or LitmusChaos to kill downstream pods.
Monitoring & Metrics
- Track: calls, failure_rate, slow_call_rate, state (0=CLOSED,1=OPEN,2=HALF_OPEN), buffered_calls.
- Grafana dashboard: alert if OPEN for > X seconds OR > Y opens/hour (see the state-transition logging sketch below).
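A sketch backing the "every state change MUST be exported" principle and the tracking points above: attach a listener to each registered breaker and log its state transitions. The bean name and log format are assumptions; the listener APIs are standard Resilience4j event publishers.
```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class CircuitBreakerEventLogger {

    private static final Logger log = LoggerFactory.getLogger(CircuitBreakerEventLogger.class);

    public CircuitBreakerEventLogger(CircuitBreakerRegistry registry) {
        // Wire up breakers that already exist and any registered later.
        registry.getAllCircuitBreakers().forEach(this::logTransitions);
        registry.getEventPublisher().onEntryAdded(entry -> logTransitions(entry.getAddedEntry()));
    }

    private void logTransitions(CircuitBreaker circuitBreaker) {
        // Every CLOSED/OPEN/HALF_OPEN transition becomes a WARN log line for alerting.
        circuitBreaker.getEventPublisher().onStateTransition(event ->
                log.warn("Circuit breaker '{}' moved {} -> {}",
                        event.getCircuitBreakerName(),
                        event.getStateTransition().getFromState(),
                        event.getStateTransition().getToState()));
    }
}
```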
Performance & Tuning
- Start with 50 % failureRateThreshold, 60 s slidingWindow, 30 s waitDuration.
- Reduce minimumNumberOfCalls in low-traffic services (<20 req/min) to 20.
- Consider TIME_BASED windows in bursty traffic; COUNT_BASED in steady traffic (see the config sketch after this list).
- Revisit thresholds on a regular cadence from observed metrics; Resilience4j does not tune them automatically.
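A hedged config sketch of the starting values above for both window types. The pairings (which window type and minimum-call count for which service) are illustrative defaults, not prescriptions.
```java
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;

import java.time.Duration;

// Suggested starting points; adjust per service from real incident metrics.
class TuningDefaults {

    // Bursty traffic: time-based window evaluates the last 60 seconds of calls.
    static final CircuitBreakerConfig TIME_BASED_DEFAULTS = CircuitBreakerConfig.custom()
            .slidingWindowType(SlidingWindowType.TIME_BASED)
            .slidingWindowSize(60)                           // seconds
            .failureRateThreshold(50)                        // percent
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .minimumNumberOfCalls(100)
            .build();

    // Steady traffic: count-based window evaluates the last N calls instead.
    static final CircuitBreakerConfig COUNT_BASED_DEFAULTS = CircuitBreakerConfig.custom()
            .slidingWindowType(SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(100)                          // calls
            .failureRateThreshold(50)
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .minimumNumberOfCalls(20)                        // low-traffic override (see cheat sheet)
            .build();
}
```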
Security
- Sanitize fallback responses; never leak stack traces.
- Log correlation-id, breaker name, state, but redact PII.
Common Pitfalls
- BREAKER STORM: multiple services cascading OPEN; avoid by staggering retry/backoff policies (see the jittered backoff sketch after this list).
- Infinite HALF_OPEN: ensure max permitted calls >0.
- Metrics ignored: breaker configured but no alerts; add SLO-linked alerts.
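The jittered backoff sketch referenced above: exponential backoff with randomization via Resilience4j's IntervalFunction, so retries from many callers do not align and re-open each other's breakers. The retry name and concrete numbers are examples only.
```java
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

class StaggeredRetrySketch {

    static Retry buildRetry() {
        RetryConfig config = RetryConfig.custom()
                .maxAttempts(3)
                // 500 ms initial wait, doubling each attempt, with 50% randomization (jitter).
                .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(500, 2.0, 0.5))
                .build();
        return Retry.of("paymentProcessor", config);
    }
}
```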
Directory Layout Example
src/main/java
└── com/company/inventory
├── controller
├── service
│ └── InventoryService.java (with breaker)
└── resilience
└── BreakerConfig.java
Cheat-Sheet Thresholds
+-----------------------+-----------+
| Traffic | minCalls |
+-----------------------+-----------+
| < 100 req/min | 20 |
| 100-1000 req/min | 50 |
| > 1000 req/min | 100 |
+-----------------------+-----------+
Adjust every sprint based on the Micrometer metrics scraped by Prometheus.