Opinionated, actionable Rules for designing, implementing, and operating resilient distributed transactions in Java/Spring-Boot microservices using Sagas, Outbox, Kafka, Temporal, and cloud orchestration services.
Transform chaotic microservice failures into predictable, recoverable workflows with battle-tested distributed transaction patterns.
You're not building a simple CRUD app anymore. Your microservices need to coordinate across databases, message queues, and external APIs—and when something fails (and it will), you need more than crossed fingers and manual rollbacks.
The real problems hitting your production systems aren't theoretical; they're the 3 AM alerts that wake up your team.
This ruleset transforms Cursor into your distributed transaction architect, implementing proven patterns that companies like Netflix, Uber, and Amazon use to handle millions of transactions daily.
What you get:
// Before: Fragile, blocking distributed calls
@Transactional
public void processOrder(CreateOrderCommand command) {
    Order order = orderService.createOrder(command);
    paymentService.processPayment(order.getPaymentInfo()); // Blocks the thread
    inventoryService.reserveItems(order.getItems());       // Can fail silently
    // No compensation logic; manual cleanup required
}
// After: Resilient saga with automatic compensation (Temporal)
@WorkflowMethod
public void processOrderSaga(CreateOrderCommand command) {
    Saga saga = new Saga(new Saga.Options.Builder().build());
    try {
        // Run each forward step, then register its compensation with the saga
        OrderCreatedEvent order = orderService.createOrder(command);
        saga.addCompensation(() -> orderService.cancelOrder(order.getOrderId()));

        PaymentProcessedEvent payment = paymentService.processPayment(order.getPaymentInfo());
        saga.addCompensation(() -> paymentService.refundPayment(payment.getPaymentId()));

        inventoryService.reserveItems(order.getItems());
        saga.addCompensation(() -> inventoryService.releaseItems(order.getItems()));
    } catch (Exception e) {
        saga.compensate(); // Automatic rollback: compensations run in reverse order
        throw e;
    }
}
// Automatic correlation across all services
@KafkaListener(topics = "ordering.order.created.v1")
public void handleOrderCreated(OrderCreatedEvent event) {
    MDC.put("trace_id", event.getTraceId());
    MDC.put("idempotency_key", event.getIdempotencyKey());
    log.info("Processing order: {}", event.getOrderId());
    // All downstream calls automatically inherit the trace context
}
// Resilient payment flow with automatic retries and compensation
// Temporal workflow implementation; registered with a worker rather than exposed as a Spring bean
public class PaymentSagaOrchestrator {

    @WorkflowMethod
    public PaymentResult processPayment(ProcessPaymentCommand command) {
        var options = ActivityOptions.newBuilder()
                .setRetryOptions(RetryOptions.newBuilder()
                        .setInitialInterval(Duration.ofSeconds(1))
                        .setMaximumAttempts(5)
                        .setBackoffCoefficient(2.0)
                        .build())
                .setStartToCloseTimeout(Duration.ofMinutes(2))
                .build();
        var activities = Workflow.newActivityStub(PaymentActivities.class, options);

        // Each step automatically retries with exponential backoff
        var preAuth = activities.preAuthorizePayment(command);
        var charge = activities.capturePayment(preAuth.getAuthId());
        activities.sendPaymentConfirmation(charge);
        return PaymentResult.success(charge.getTransactionId());
    }
}
// Outbox pattern: state change and event committed atomically, then delivered reliably
@Transactional
public void createOrder(CreateOrderCommand command) {
    // Single local transaction keeps the order and its outbox row consistent
    Order order = orderRepository.save(new Order(command));

    // Outbox entry ensures the event is delivered even if Kafka is down
    outboxRepository.save(OutboxEntry.builder()
            .aggregateId(order.getId())
            .eventType("OrderCreatedEvent")
            .payload(orderMapper.toEvent(order))
            .build());

    // A separate process (Debezium CDC) publishes the row to Kafka:
    // at-least-once delivery, zero message loss; idempotent consumers handle duplicates
}
// Choreographed saga using domain events
@Component
public class InventoryService {

    @KafkaListener(topics = "ordering.order.created.v1")
    @Retryable(value = {OptimisticLockingFailureException.class})
    public void handleOrderCreated(OrderCreatedEvent event) {
        try {
            reserveInventory(event.getItems(), event.getOrderId());
            publishEvent(InventoryReservedEvent.from(event));
        } catch (InsufficientInventoryException e) {
            publishEvent(InventoryReservationFailedEvent.from(event, e));
            // The order service receives the failure event and cancels the order
        }
    }
}
.cursorrules file
# Cursor will automatically suggest saga patterns
# when you type: "create order processing workflow"
# Example transformation:
# 1. Highlight your existing @Transactional method
# 2. Ask Cursor: "Convert this to a resilient saga pattern"
# 3. Cursor generates complete saga implementation with compensation
// Cursor auto-generates this boilerplate
@Configuration
public class ObservabilityConfig {

    @Bean
    public TraceIdGenerator traceIdGenerator() {
        return () -> UUID.randomUUID().toString();
    }

    @Bean
    public MeterRegistry meterRegistry() {
        return new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
    }
}
// Cursor suggests Resilience4j patterns automatically
@Component
public class PaymentServiceClient {

    @CircuitBreaker(name = "payment-service", fallbackMethod = "fallbackPayment")
    @Retry(name = "payment-service")
    @TimeLimiter(name = "payment-service")
    public CompletableFuture<PaymentResult> processPayment(PaymentRequest request) {
        // @TimeLimiter requires an async return type, so wrap the blocking REST call
        return CompletableFuture.supplyAsync(
                () -> restTemplate.postForObject("/payments", request, PaymentResult.class));
    }

    // Fallback must mirror the signature plus a Throwable; the body here is illustrative
    private CompletableFuture<PaymentResult> fallbackPayment(PaymentRequest request, Throwable cause) {
        return CompletableFuture.completedFuture(PaymentResult.unavailable(cause.getMessage()));
    }
}
// Cursor generates complete test infrastructure
@SpringBootTest
@Testcontainers
class DistributedTransactionIntegrationTest {

    @Container
    static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:15");

    @Container
    static KafkaContainer kafka = new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"));

    @Test
    void shouldHandlePaymentFailureWithCompensation() {
        // Test complete saga failure and recovery
        // Cursor auto-generates failure scenarios
    }
}
"We went from spending 2-3 days per week on distributed transaction failures to maybe 2-3 hours per month. The automatic compensation saves us from so many 3 AM incidents." — Senior Backend Engineer
"The observability patterns made debugging distributed failures actually manageable. We can trace a transaction across 8 services in minutes." — Platform Team Lead
Stop fighting distributed transaction complexity. Start building resilient systems that your team can actually debug, maintain, and scale. Your future self (and your on-call rotation) will thank you.
You are an expert in Java 17+, Spring Boot 3.x, Kafka, Temporal, AWS Step Functions, PostgreSQL, Debezium, Docker, and Kubernetes.
Key Principles
- Prefer eventual consistency + compensating actions over global ACID; reserve 2PC for the rare hard-consistency edge-cases.
- Idempotency everywhere: every externally accessible command must be safe to retry.
- Fail fast, recover asynchronously; never block business threads for distributed locks.
- Design for partial failure: timeouts, retries with back-off, circuit breakers.
- Every state change emits an immutable event that is durably persisted before fan-out (Outbox or Transactional Messaging).
- Observability is non-negotiable: correlate logs, metrics, and traces with the same transaction ID.
- Document the chosen transaction pattern (Saga choreography/orchestration, 2PC, OCC) in the README of each service.
Java (Language-Specific Rules)
- Use Spring Boot 3.x + Java 17 records for immutable value objects.
- Enable preview features (--enable-preview) when using structured concurrency from Project Loom.
- Annotate aggregate-root repositories/services with @Transactional(propagation = Propagation.MANDATORY) so the caller controls the transaction scope.
- Use sealed interfaces for command & event hierarchies; store the type discriminator in JSON (see the sketch after this list).
- Prefer Optional over null; use orElseThrow(IllegalStateException::new) instead of bare Optional::get.
- Naming: commands use verb-noun (CreateOrderCommand), events use past tense (OrderCreatedEvent).
- Do not expose JpaEntity classes through API; convert to DTOs via MapStruct.
- All REST endpoints are POST except true idempotent GET/PUT/DELETE; POST requires Idempotency-Key header.
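A minimal sketch of the sealed-hierarchy rule above, assuming Jackson for serialization; OrderEvent and its record fields are illustrative names:

import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;
import java.util.UUID;

// The type discriminator lives in the "type" field of the serialized JSON
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, include = JsonTypeInfo.As.PROPERTY, property = "type")
@JsonSubTypes({
    @JsonSubTypes.Type(value = OrderEvent.OrderCreatedEvent.class, name = "OrderCreatedEvent"),
    @JsonSubTypes.Type(value = OrderEvent.OrderCancelledEvent.class, name = "OrderCancelledEvent")
})
public sealed interface OrderEvent {
    UUID orderId();

    record OrderCreatedEvent(UUID orderId, UUID paymentId) implements OrderEvent {}
    record OrderCancelledEvent(UUID orderId, String reason) implements OrderEvent {}
}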
Error Handling and Validation
- Detect failures early:
• Database or broker outage ⇒ use Resilience4j time limiter (2 × P99 latency).
• Downstream service 5xx ⇒ exponential back-off (jittered, capped at 32 s, max 5 attempts).
- Compensation first: implement the compensating action before the happy path, wrap forward calls in try/catch, and on failure publish a CompensateXCommand to the workflow queue.
- Persist idempotency keys with a unique constraint; on a duplicate key, return HTTP 409 with a Location header pointing at the existing resource (see the sketch after this list).
- Never swallow exceptions in @KafkaListener; wrap and re-throw to trigger DLQ.
- Validate DTOs with jakarta.validation; reject at edge (controller); never reach service layer with invalid state.
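A sketch of the duplicate-key handling above, assuming a Spring MVC controller and an orderCommandHandler/idempotencyRepository pair (both illustrative) that insert the key inside the command's transaction:

@PostMapping("/orders")
public ResponseEntity<OrderResponse> createOrder(
        @RequestHeader("Idempotency-Key") String idempotencyKey,
        @Valid @RequestBody CreateOrderRequest request) {
    try {
        // The handler persists the order and the idempotency key in one transaction;
        // the key column carries a unique constraint.
        Order order = orderCommandHandler.handle(request, idempotencyKey);
        return ResponseEntity.created(URI.create("/orders/" + order.getId()))
                .body(OrderResponse.from(order));
    } catch (DataIntegrityViolationException duplicate) {
        // Unique constraint fired: the command was already processed, point at the existing resource
        UUID existingOrderId = idempotencyRepository.findOrderIdByKey(idempotencyKey);
        return ResponseEntity.status(HttpStatus.CONFLICT)
                .location(URI.create("/orders/" + existingOrderId))
                .build();
    }
}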
Framework-Specific Rules (Spring Boot 3.x, Kafka, Temporal)
Spring / Outbox
- Outbox table schema: id UUID PK, aggregate_id, event_type, payload JSONB, created_at TIMESTAMP WITH TIME ZONE (entity sketch after this list).
- Use @Transactional on command handler; insert domain state + outbox row in the same transaction.
- OutboxRelay (Spring Batch or Debezium) publishes to Kafka; ensure exactly-once semantics via transactional producer.
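A sketch of an entity matching the outbox schema above; mapping the JSONB column via Hibernate 6's @JdbcTypeCode(SqlTypes.JSON) is an assumption about the persistence setup:

import org.hibernate.annotations.JdbcTypeCode;
import org.hibernate.type.SqlTypes;

@Entity
@Table(name = "outbox")
public class OutboxEntry {

    @Id
    private UUID id;

    @Column(name = "aggregate_id", nullable = false)
    private UUID aggregateId;

    @Column(name = "event_type", nullable = false)
    private String eventType;

    @JdbcTypeCode(SqlTypes.JSON)
    @Column(name = "payload", columnDefinition = "jsonb", nullable = false)
    private String payload;

    @Column(name = "created_at", nullable = false)
    private OffsetDateTime createdAt = OffsetDateTime.now();

    // builder/getters omitted; rows are insert-only and cleaned up after the relay publishes them
}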
Kafka
- Use the idempotent producer (enable.idempotence=true) plus transactions; transactional.id = ${spring.application.name}-${instanceId} (see the producer sketch after this list).
- Topic naming: <bounded-context>.<aggregate>.<event>.v1 (e.g., ordering.order.created.v1).
- Partitioning key: aggregate_id to guarantee event ordering per aggregate.
- Consumers run with max.poll.interval.ms = 5 min, max.poll.records = 50 for back-pressure.
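A sketch of the producer settings above using Spring Kafka; how instanceId is resolved (here a property with a default) is an assumption:

@Bean
public ProducerFactory<String, Object> producerFactory(
        @Value("${spring.application.name}") String appName,
        @Value("${instance.id:0}") String instanceId) {
    Map<String, Object> props = new HashMap<>();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class);
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
    props.put(ProducerConfig.ACKS_CONFIG, "all");

    var factory = new DefaultKafkaProducerFactory<String, Object>(props);
    // transactional.id prefix = ${spring.application.name}-${instanceId}; Spring appends a per-producer suffix
    factory.setTransactionIdPrefix(appName + "-" + instanceId + "-");
    return factory;
}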
Temporal (Saga Orchestration)
- Each business transaction ⇒ one Workflow class; each service call ⇒ separate Activity with retryPolicy { initialInterval 1 s, maxAttempts 5, backoffCoefficient 2.0 }.
- Compensation: define a reverse Activity; register it with Saga.addCompensation() and trigger rollback with Saga.compensate() on failure.
- Correlate Temporal WorkflowId with incoming Idempotency-Key.
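A sketch of the WorkflowId correlation rule; OrderSagaWorkflow, the task-queue name, and the injected WorkflowClient are illustrative:

public void startOrderSaga(CreateOrderCommand command, String idempotencyKey) {
    WorkflowOptions options = WorkflowOptions.newBuilder()
            .setTaskQueue("order-saga")
            // WorkflowId == Idempotency-Key: a replayed request cannot start a second saga
            .setWorkflowId(idempotencyKey)
            .setWorkflowIdReusePolicy(WorkflowIdReusePolicy.WORKFLOW_ID_REUSE_POLICY_REJECT_DUPLICATE)
            .build();
    OrderSagaWorkflow workflow = workflowClient.newWorkflowStub(OrderSagaWorkflow.class, options);
    WorkflowClient.start(workflow::processOrderSaga, command); // asynchronous start, returns immediately
}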
AWS Step Functions (optional cloud orchestration)
- State machine definition stored in IaC (CloudFormation/SAM); version-control alongside code.
- Use heartbeat timeouts on Task states to detect frozen Lambdas early.
- Catch/Retry blocks map 1-to-1 with business compensation logic; always model a Fail state.
Additional Sections
Testing
- Integration tests use Testcontainers (Kafka, PostgreSQL, Temporal Server) and run inside GitHub Actions.
- Fault-injection: Chaos Monkey for Spring Boot toggled during the nightly build; Toxiproxy to simulate broker latency.
- Idempotency test: send the same POST 3× with same Idempotency-Key ⇒ expect exactly one row in DB and identical 200 response.
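A sketch of the idempotency test above, assuming a TestRestTemplate, an /orders endpoint that replays the original response for duplicates, and an injected orderRepository:

@Test
void samePostSentThreeTimesCreatesExactlyOneOrder() {
    var headers = new HttpHeaders();
    headers.set("Idempotency-Key", UUID.randomUUID().toString());
    var request = new HttpEntity<>(new CreateOrderRequest("sku-42", 1), headers);

    var first = restTemplate.postForEntity("/orders", request, OrderResponse.class);
    var second = restTemplate.postForEntity("/orders", request, OrderResponse.class);
    var third = restTemplate.postForEntity("/orders", request, OrderResponse.class);

    assertThat(orderRepository.count()).isEqualTo(1);         // exactly one row
    assertThat(second.getBody()).isEqualTo(first.getBody());  // replays are identical
    assertThat(third.getBody()).isEqualTo(first.getBody());
}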
Performance
- Batch up to 500 outbox rows per relay flush; align batch size with broker message.max.bytes.
- Co-locate saga orchestrator with data locality (same AZ) to avoid cross-AZ hop latency.
- Use OCC (version column) on hot aggregates; rely on retry rather than pessimistic locks.
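A sketch of the OCC rule: a @Version column on a hot aggregate plus retry-on-conflict at the caller; InventoryItem and the repository are illustrative:

@Entity
public class InventoryItem {

    @Id
    private UUID id;

    private int availableQuantity;

    @Version // optimistic lock: a concurrent update bumps the version and the loser's commit fails
    private long version;

    public void reserve(int quantity) {
        if (availableQuantity < quantity) {
            throw new InsufficientInventoryException(id, quantity);
        }
        availableQuantity -= quantity;
    }
}

// Retry the whole transaction on conflict instead of taking a pessimistic lock
@Retryable(value = OptimisticLockingFailureException.class, maxAttempts = 3)
@Transactional
public void reserveInventory(UUID itemId, int quantity) {
    inventoryRepository.findById(itemId).orElseThrow().reserve(quantity);
}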
Security
- All service-to-service traffic via mTLS (Istio sidecar); certificates rotated every 24 h.
- Hash the Idempotency-Key (SHA-256) before storing it to avoid leaking client entropy (see the sketch after this list).
- Principle of Least Privilege: producer ACL limited to its own topics.
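A sketch of the key-hashing rule (SHA-256 is a one-way hash, not encryption); HexFormat requires Java 17:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

static String hashIdempotencyKey(String rawKey) {
    try {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(rawKey.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(hash); // store the hash, never the raw client-supplied key
    } catch (NoSuchAlgorithmException e) {
        throw new IllegalStateException("SHA-256 unavailable", e); // mandated by the JCA spec, should not happen
    }
}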
Observability
- Use the OpenTelemetry Java Agent; propagate the traceparent header and Idempotency-Key into logs (see the filter sketch after this list).
- Log in JSON; field names: timestamp, level, service, trace_id, span_id, idempotency_key, message.
- Alert on: saga step retries > 3, outbox relay lag > 30 s, DLQ > 0.
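A sketch of the correlation rules above: a servlet filter that copies the active OpenTelemetry trace/span ids and the Idempotency-Key header into the MDC so every JSON log line carries them; the header name and filter registration are assumptions:

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;

@Component
public class CorrelationMdcFilter extends OncePerRequestFilter {

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        SpanContext spanContext = Span.current().getSpanContext(); // populated by the OTel Java Agent
        MDC.put("trace_id", spanContext.getTraceId());
        MDC.put("span_id", spanContext.getSpanId());
        String idempotencyKey = request.getHeader("Idempotency-Key");
        if (idempotencyKey != null) {
            MDC.put("idempotency_key", idempotencyKey);
        }
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.clear(); // avoid leaking correlation ids across pooled threads
        }
    }
}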
CI/CD
- Run ./gradlew spotlessApply, ./gradlew test, ./gradlew bootBuildImage.
- Canary deploy behind Kubernetes progressive delivery (Argo Rollouts) monitoring custom SLI: saga_failure_rate < 0.5 %.