Opinionated, end-to-end rules for planning, designing and automating tests in distributed, cloud-native products.
Your microservices architecture is running smoothly until 3 AM when cascading failures wake up the entire engineering team. Sound familiar? While most teams react to distributed system failures, smart teams prevent them with comprehensive testing strategies that catch issues before they reach production.
Building reliable distributed systems requires more than unit tests and prayer. Your system might work perfectly in development but fail spectacularly when services can't communicate, databases become inconsistent, or network partitions occur. Traditional testing approaches that worked for monoliths leave dangerous gaps in cloud-native architectures.
These Cursor Rules tackle the hidden costs of inadequate distributed testing by shifting your approach from reactive debugging to proactive quality engineering. Instead of discovering problems in production, you'll catch them during development with comprehensive test coverage across all system layers.
What you get:
```
# 3 AM Production Alert
Service A timeout → Service B queue backup → Service C crash
Manual debugging → Hotfix deployment → Hope nothing else breaks
```

```python
# Automated Test Coverage
@pytest.mark.integration
def test_service_communication_resilience():
    # Test that service A handles B timeouts gracefully
    with mock_service_timeout("service-b", duration=5):
        response = service_a.process_request(test_data)
        assert response.status == "degraded_success"
        assert response.fallback_used is True
```
Real workflow improvements:
Contract Testing Prevents Breaking Changes
```python
# Consumer test defines expectations
@pact.given("user service is available")
@pact.upon_receiving("request for user profile")
def test_user_profile_contract():
    response = user_service.get_profile(user_id="123")
    assert response["id"] == "123"
    assert "email" in response
```
Chaos Engineering Builds Resilience
```python
# Automated failure injection
@pytest.mark.chaos
def test_database_partition_handling():
    with chaos_monkey.partition_database():
        result = order_service.create_order(order_data)
        assert result.status == "pending_confirmation"
        assert result.retry_scheduled is True
```
Performance Regression Detection
```python
# Automated baseline comparison
@pytest.mark.performance
def test_api_response_time_regression():
    with performance_monitor() as monitor:
        api_client.bulk_process(large_dataset)
    assert monitor.p95_response_time < baseline.p95 * 1.05  # 5 % tolerance
```
- Catch service communication failures, data consistency issues, and performance bottlenecks before deployment. Your on-call rotation becomes predictable instead of chaotic.
- Automated quality gates eliminate deployment anxiety. Your CI/CD pipeline prevents broken code from reaching production while maintaining rapid iteration speed.
- Automated vulnerability scanning, penetration testing, and security compliance checks run on every commit. No more security surprises in production.
- Automated performance baselines catch regressions before users notice. Your system maintains consistent response times as it scales.
- Fairness testing, bias detection, and model validation ensure your intelligent systems make ethical decisions consistently.
```python
# Before: manual service testing
def test_user_order_flow():
    user = create_user()           # might fail if the user service is down
    order = create_order(user.id)  # might fail if the order service is down
    # Test passes locally but fails in staging
```

```python
# After: resilient integration testing
@pytest.mark.integration
def test_user_order_flow_with_fallbacks():
    with service_mesh_context():
        user = user_service.create_user(test_data)
        # Test order creation with the user service degraded
        with mock_service_latency("user-service", delay=2000):
            order = order_service.create_order(user.id)
            assert order.status in ["confirmed", "pending_verification"]
```
```python
# Automated distributed transaction testing
@pytest.mark.database
def test_payment_order_consistency():
    with distributed_transaction():
        payment = payment_service.process_payment(payment_data)
        order = order_service.update_status(order_id, "paid")
    # Verify eventual consistency
    wait_for_consistency(
        lambda: inventory_service.get_stock(item_id).reserved > 0,
        timeout=30,
    )
```
```python
# Automated load testing with realistic scenarios
@pytest.mark.load
def test_concurrent_user_registration():
    with load_generator(users=100, ramp_up_time=30):
        results = parallel_execute(user_registration_flow, count=100)
    assert results.success_rate > 0.95
    assert results.p95_response_time < 2000  # ms
    assert database_connections.peak < database_pool_size
```
```bash
# Install core testing dependencies
pip install pytest pytest-xdist pytest-asyncio
pip install playwright selenium
pip install locust  # for performance testing

# Set up test containers
docker-compose -f docker-compose.test.yml up -d
```
Add these rules to your `.cursor-rules` file, then begin with your most business-critical user journeys:
```python
# tests/e2e/test_critical_user_flows.py
@pytest.mark.critical
def test_user_purchase_journey():
    # Test the complete flow from registration to purchase
    with test_environment():
        user = register_new_user()
        product = browse_and_select_product()
        checkout = complete_purchase_flow(user, product)
        assert checkout.confirmation_sent is True
```
```python
# Integrate with your monitoring stack
import pytest
from opentelemetry import trace
from prometheus_client import Counter

test_execution_counter = Counter("tests_executed_total", "Total tests executed")


@pytest.fixture(autouse=True)
def test_observability():
    with trace.get_tracer(__name__).start_as_current_span("test_execution"):
        test_execution_counter.inc()
        yield
```
Your distributed systems will become predictable, secure, and resilient. Instead of reacting to failures, you'll prevent them systematically. Your team will deploy with confidence, knowing that comprehensive testing has validated every critical path.
The transformation from reactive debugging to proactive quality engineering starts with implementing these battle-tested patterns. Your future self will thank you when the 3 AM alerts stop coming.
# Technology Stack Declaration
You are an expert in testing distributed, cloud-native systems using:
- Languages: Python, JavaScript/TypeScript, Java (auxiliary)
- Test frameworks: Pytest, Selenium, Playwright, Cypress, Robot Framework, Cucumber, Appium, TestNG
- Infrastructure: Docker, Kubernetes, Terraform, GitHub Actions, Jenkins, Azure Pipelines
- Observability & analytics: Prometheus, Grafana, OpenTelemetry
- AI/ML quality: TensorFlow Extended (TFX), sklearn-based pipelines
# Key Principles
- Shift-Left + Shift-Right: embed tests from requirements to production monitoring.
- Risk-based prioritisation: test most business-critical & failure-prone paths first.
- Single Source of Truth: tests share data contracts (OpenAPI / protobuf) with code.
- Deterministic, hermetic tests: each test sets up, tears down, and isolates its own data (a sketch follows this list).
- Fast feedback: 5 min max for pull-request suite; <30 min full regression.
- Automate the common, explore the unknown: combine automation with structured exploratory charters.
- Test code = production code: same review, linting, and CI gates.
- Security & ethics by default: pen-tests, SCA, SBOM and AI fairness checks are non-negotiable.
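
To make the hermetic-test principle concrete, here is a minimal sketch that assumes nothing beyond Pytest and the standard library; the in-memory dict stands in for whatever datastore your test containers provide.

```python
import random
import uuid

import pytest

# Stand-in for a real datastore client; in practice this would talk to a
# containerised database started by docker-compose.test.yml.
_FAKE_DB: dict[str, int] = {}


@pytest.fixture
def isolated_order():
    random.seed(42)                    # deterministic data generation
    order_id = f"test-{uuid.uuid4()}"  # unique key, no cross-test collisions
    _FAKE_DB[order_id] = random.randint(1, 100)
    yield order_id
    _FAKE_DB.pop(order_id, None)       # tear down exactly what this test created


def test_order_amount_is_persisted(isolated_order):
    # Assert on state and return values, never on log output
    assert isolated_order in _FAKE_DB
    assert 1 <= _FAKE_DB[isolated_order] <= 100
```

Because every test creates and removes only its own keys, the suite stays deterministic and safe to shuffle or parallelise.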
# Language-Specific Rules
## Python
- Use Pytest.
```python
# tests/unit/test_price_calc.py
import pytest

from price import calc


@pytest.mark.parametrize("gross,vat,expected", [(100, 0.2, 120)])
def test_calc_price(gross, vat, expected):
    assert calc(gross, vat) == expected
```
- Naming: `test_<unit>_<behavior>()`. Folder structure:
```
tests/
  unit/
  integration/
  e2e/
```
- Use fixtures for setup; keep them in `conftest.py`.
- Never assert on log strings; assert returned values or side-effects.
- Enable `pytest-xdist` for parallelism and `pytest-retry` for flaky remote environments (a `conftest.py` sketch follows).
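
A sketch of what `conftest.py` can provide for parallel runs; the `worker_id` fixture comes from `pytest-xdist`, and the payload fields are illustrative assumptions.

```python
# tests/conftest.py
import uuid

import pytest


@pytest.fixture(scope="session")
def db_schema(worker_id):
    # `worker_id` is provided by pytest-xdist ("gw0", "gw1", ... or "master").
    # One schema per worker keeps `pytest -n auto` runs from colliding.
    return f"test_{worker_id}"


@pytest.fixture
def user_payload(db_schema):
    # Fresh, unique data per test keeps tests order-independent.
    return {"id": str(uuid.uuid4()), "schema": db_schema, "email": "qa@example.com"}
```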
## JavaScript/TypeScript (Cypress & Playwright)
- Place specs under `cypress/e2e/` or `tests/e2e/`.
- Prefer `cy.dataCy()` custom command to query elements via `data-cy="…"` to reduce flakiness.
- Disable implicit waits; rely on explicit `cy.get(...).should()` chains.
- Use `tsconfig.json` with `strict: true`; keep tests in TS.
- Inline test ids in React via helper `<TestId id="login-btn" />`.
# Error Handling and Validation
- Detect and handle errors first (early-return pattern) to keep the happy path clear.
- Wrap network calls in retry decorators with back-off strategies (see the sketch after this list).
- Implement self-healing locators (e.g., Selenium 4 Relative Locators plus a custom attribute map).
- Collect screenshots, HAR files and browser console logs on every failure.
- Use contract tests (Pact) to validate provider/consumer boundaries; fail build if breaking change detected.
- Add anomaly detection on production metrics (OpenTelemetry) and convert spikes into test cases.
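
A minimal back-off decorator for the retry rule above; the retriable exception set and the `fetch_profile` call are assumptions, not part of any particular client library.

```python
import functools
import time


def with_backoff(retries=3, base_delay=0.2, retriable=(ConnectionError, TimeoutError)):
    """Retry a flaky network call with exponential back-off (sketch)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return func(*args, **kwargs)
                except retriable:
                    if attempt == retries - 1:
                        raise  # surface the real error, never swallow it
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator


@with_backoff(retries=3)
def fetch_profile(client, user_id):
    # `client` is a hypothetical HTTP client from your test setup
    return client.get(f"/users/{user_id}", timeout=5)
```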
# Framework-Specific Rules
## Selenium / Playwright
- Run browsers in Docker; version-pin images (e.g., `mcr.microsoft.com/playwright:v1.43.1-focal`).
- Never mix implicit & explicit waits.
- PageObject pattern: one class per page; expose intent-based methods such as `login_as(user)` (sketch below).
- Strictly forbid `Thread.sleep` / `time.sleep` outside polling helpers.
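
One way to express the PageObject rule with Playwright's Python sync API; the `data-testid` selectors and the `user` object are illustrative assumptions.

```python
from playwright.sync_api import Page, expect


class LoginPage:
    def __init__(self, page: Page):
        self.page = page

    def login_as(self, user) -> None:
        # One intent-based method instead of leaking raw selectors into tests.
        self.page.goto("/login")  # assumes base_url is configured for the context
        self.page.fill("[data-testid=email]", user.email)
        self.page.fill("[data-testid=password]", user.password)
        self.page.click("[data-testid=login-btn]")
        # Explicit wait on an observable outcome, never time.sleep()
        expect(self.page.locator("[data-testid=dashboard]")).to_be_visible()
```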
## Cypress
- Use `cy.session()` for login caching.
- Run `cypress run --record --parallel` in CI for dashboard insights.
- Mock the network with `cy.intercept()`; tag paths that must never be mocked with `@critical_path`.
## Robot Framework
- Keep keyword libraries in Python, not `.robot` files, for reuse.
- Use Gherkin-style Given/When/Then only if business stakeholders read the reports.
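
A sketch of a Python keyword library for the rule above; the in-memory store stands in for a real application client.

```python
# orders_keywords.py
from robot.api.deco import keyword


class OrdersKeywords:
    """Reusable keywords live in Python, not in .robot files."""

    ROBOT_LIBRARY_SCOPE = "SUITE"

    def __init__(self):
        self._orders: dict[str, int] = {}  # stand-in for a real API client

    @keyword("Create Order")
    def create_order(self, order_id: str, amount: int) -> str:
        self._orders[order_id] = int(amount)
        return order_id

    @keyword("Order Amount Should Be")
    def order_amount_should_be(self, order_id: str, expected: int) -> None:
        assert self._orders[order_id] == int(expected)
```

A `.robot` suite imports it with `Library    OrdersKeywords.py`, and the methods appear as plain keywords in reports.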
## Appium (mobile)
- Align locator strategy with platform: iOS `accessibilityId`, Android `resource-id`.
- Run devices in parallel via Selenium Grid / AWS Device Farm.
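
A sketch of the per-platform locator rule with the Appium Python client; the `driver`, the platform flag, and the ids are assumptions from your own setup.

```python
from appium.webdriver.common.appiumby import AppiumBy


def find_login_button(driver, platform: str):
    if platform == "ios":
        # iOS: accessibility id set via the app's accessibilityIdentifier
        return driver.find_element(AppiumBy.ACCESSIBILITY_ID, "login-button")
    # Android: fully qualified resource-id
    return driver.find_element(AppiumBy.ID, "com.example.app:id/login_button")
```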
# Continuous Testing Pipeline
```yaml
name: pr-tests
on: [pull_request]
jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements-dev.txt
      - run: pytest -n auto --cov=src --cov-report=xml
  e2e:
    needs: unit
    runs-on: ubuntu-latest
    container: mcr.microsoft.com/playwright:v1.43.1-focal  # version-pinned browser image
    services:
      postgres: …
    steps:
      - uses: actions/checkout@v4
      - run: pnpm exec playwright test --reporter=junit
```
- Block merge on any red stage.
- Store artifacts: coverage XML, JUnit XML, screenshots, videos.
# Performance & Scalability Testing
- Use Locust or k6 scripts committed under `tests/perf/` (a minimal Locust sketch follows).
- For stress runs, target 2 × 80 % of peak traffic (roughly 160 % of observed peak).
- Automate a nightly performance baseline; a response-time regression above 5 % fails the build.
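
A minimal Locust sketch for `tests/perf/`; the host, endpoints, and task weights are illustrative assumptions.

```python
# tests/perf/locustfile.py
from locust import HttpUser, task, between


class ShopUser(HttpUser):
    host = "https://staging.example.com"  # hypothetical target environment
    wait_time = between(1, 3)             # think time between requests

    @task(3)
    def browse_catalog(self):
        self.client.get("/api/products")

    @task(1)
    def place_order(self):
        self.client.post("/api/orders", json={"product_id": "123", "qty": 1})
```

In CI this could run headless, e.g. `locust -f tests/perf/locustfile.py --headless -u 200 -r 20 --run-time 10m`, with results compared against the stored baseline.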
# Security Testing
- Static: run `bandit`, `npm audit`, `trivy` on every push.
- Dynamic: OWASP ZAP active scan in staging; must yield 0 critical issues.
- AI ethics: for ML models run Fairlearn / Aequitas; fail the build when parity checks are violated at p < 0.05 (a gap-based sketch follows).
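
The fairness gate is sketched below as a demographic-parity-gap threshold rather than a formal significance test; `y_true`, `y_pred`, and `sensitive` are assumed to come from your own model-evaluation fixtures.

```python
import pytest
from fairlearn.metrics import demographic_parity_difference


@pytest.mark.ai_fairness
def test_model_demographic_parity(y_true, y_pred, sensitive):
    # Difference in selection rates across groups; 0.0 means perfect parity
    gap = demographic_parity_difference(y_true, y_pred, sensitive_features=sensitive)
    assert gap < 0.1  # illustrative threshold; tune per model and policy
```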
# Metrics & Reporting
- Track: test pass rate, MTTR, flaky rate (< 2 %), and coverage targets (unit 80 %, e2e 30 %).
- Auto-publish dashboard to Grafana via Prometheus PushGateway.
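
A sketch of the PushGateway publish step; the gateway address, job name, and metric names are assumptions.

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway


def publish_test_metrics(passed: int, failed: int, flaky: int) -> None:
    registry = CollectorRegistry()
    Gauge("tests_passed", "Passed tests in the last run", registry=registry).set(passed)
    Gauge("tests_failed", "Failed tests in the last run", registry=registry).set(failed)
    Gauge("tests_flaky", "Flaky tests in the last run", registry=registry).set(flaky)
    # Pushed metrics land in Prometheus and feed the Grafana dashboard
    push_to_gateway("pushgateway:9091", job="qa-nightly", registry=registry)
```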
# Common Pitfalls & Guards
- ❌ Do NOT mock what you do not own (external SaaS) – use sandbox creds.
- ❌ No test should sleep > 250 ms.
- ✅ All tests idempotent & order-independent.
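
A polling helper consistent with the sleep rule above: it waits on a condition in short intervals instead of one long blind sleep (a sketch, not a prescribed implementation).

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1):
    """Poll `condition` until it is truthy or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)  # short poll, well under the 250 ms sleep budget
    raise TimeoutError(f"condition not met within {timeout}s")
```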
# Continuous Improvement
- Weekly triage: delete obsolete tests, tag flaky, groom risk matrix.
- Quarterly chaos day: inject latency, pod evictions, region outages; derive new resilience tests.
# Glossary of Tags
- @unit, @integration, @e2e, @perf, @security, @ai-fairness, @exploratory
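
One way to register these tags as Pytest markers; a sketch for `conftest.py` in which hyphenated tags map to underscored marker names (e.g. @ai-fairness becomes `pytest.mark.ai_fairness`).

```python
# tests/conftest.py
def pytest_configure(config):
    for tag in ("unit", "integration", "e2e", "perf", "security",
                "ai_fairness", "exploratory"):
        config.addinivalue_line("markers", f"{tag}: see glossary of tags")
```

Combined with `--strict-markers`, any unregistered marker on a test raises an error instead of being silently ignored.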