Comprehensive rules for implementing best-in-class observability—covering instrumentation, data collection, alerting, dashboards, and continuous improvement—using OpenTelemetry and modern AIOps tooling.
Traditional monitoring tells you something broke. Modern observability tells you why it broke, where it broke, and how to fix it — before your users notice.
You're shipping code faster than ever, but production incidents still consume 30% of your engineering time. Your monitoring dashboard shows green lights while users experience 500ms delays. You're drowning in alerts that fire after the damage is done, and your postmortems always end with "we need better visibility."
The real problem? Your systems are opaque boxes. You can see the inputs and outputs, but the critical path between them is invisible. When issues arise, you're debugging with print statements and educated guesses instead of surgical precision.
These Cursor Rules transform your applications into self-diagnosing systems using OpenTelemetry and modern AIOps tooling. Instead of reactive firefighting, you get proactive system intelligence that correlates logs, metrics, and traces to surface root causes in under 5 minutes.
What you're building:
Stop spending hours correlating logs across systems. Distributed tracing automatically connects your Node.js API call to the Python ML service to the Go data pipeline — with full context propagation and error correlation.

```
// Before: Debugging across service boundaries
console.log('User order failed, checking logs...')
// 2 hours later: Still don't know if it's auth, payment, or inventory

// After: Complete request flow visibility
span.setAttributes({
  'order.id': orderId,
  'user.tier': 'premium',
  'payment.provider': 'stripe'
})
```
Connect every line of code to business outcomes. See how your checkout optimization affects conversion rates, or how database query performance impacts customer satisfaction scores.
Datadog Watchdog and Dynatrace Davis automatically detect when your 95th percentile latency spikes 15% above normal — before it hits your SLO threshold. No manual rule configuration required.
Ship every deployment with complete telemetry. Your CI/CD pipeline fails if critical endpoints lack instrumentation, ensuring new features are observable from day one.
Before: "Our checkout API is slow sometimes"
After: Complete request flow in 30 seconds

```
# Every request automatically tracked
with tracer.start_as_current_span("process_checkout") as span:
    span.set_attributes({
        "checkout.amount": order.total,
        "checkout.payment_method": payment_method,
        "user.tier": user.subscription_tier
    })

    # Downstream calls automatically traced
    payment_result = payment_service.charge(order.total)

# Grafana dashboard shows:
# - 99% of requests complete in <200ms
# - 1% timeout at payment service after 10s
# - Only affects orders >$500 (rate limiting)
```
Before: 3 AM page "Service is down"
After: Automated root cause analysis

```
// Instrumented database connections
db, err := sql.Open("postgres", dsn)
if err != nil {
    span.RecordError(err)
    span.SetStatus(codes.Error, "database connection failed")
    return err
}

// AI detection: "Database connection pool exhausted
// correlates with 10x increase in background job queue
// processing. Likely cause: scheduled ETL job not
// properly paginating large dataset."
```

```
# JavaScript/TypeScript
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node

# Python
pip install opentelemetry-distro opentelemetry-instrumentation-flask

# Go
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp
```

```
// observability/otel/init.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { Resource } from '@opentelemetry/resources';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';

const sdk = new NodeSDK({
  resource: new Resource({
    'service.name': process.env.SERVICE_NAME,
    'service.version': process.env.GIT_COMMIT,
    'deployment.environment': process.env.NODE_ENV
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT
  })
});

sdk.start();
```

```
# Enrich spans with business meaning
@tracer.start_as_current_span("user_registration")
def register_user(email, plan_tier):
    span = trace.get_current_span()
    span.set_attributes({
        "user.email_domain": email.split('@')[1],
        "user.plan_tier": plan_tier,
        "registration.source": "web_form"
    })

    # Automatic correlation with downstream services
    result = email_service.send_welcome(email)
    analytics_service.track_conversion(email, plan_tier)
```

```
# alerts/slo-alerts.yaml
- alert: CheckoutLatencyBreach
  expr: histogram_quantile(0.95, sum(rate(checkout_duration_seconds_bucket[5m])) by (le)) > 0.5
  for: 2m
  labels:
    severity: P1
    team: payments
  annotations:
    summary: "Checkout 95th percentile latency exceeding 500ms SLO"
    runbook: "https://wiki.company.com/checkout-latency-runbook"
```

```
# .github/workflows/deploy.yml
- name: Observability Check
  run: |
    # Fail if critical endpoints lack instrumentation
    npm run test:observability
    # Validate alert rules syntax
    promtool check rules alerts/*.yaml
    # Ensure dashboard JSON is valid
    jq empty dashboards/*.json
```

```
# observability/chaos/integration.py
from opentelemetry.trace import Status, StatusCode

def chaos_monkey_with_observability():
    """Run chaos experiments with observability gates"""
    with tracer.start_as_current_span("chaos_experiment") as span:
        span.set_attributes({
            "chaos.experiment": "network_partition",
            "chaos.duration": "30s",
            "chaos.target": "payment_service"
        })

        # Run experiment
        result = chaos_toolkit.run_experiment("network_partition.yaml")

        # Observability gate: roll back if error rate > threshold
        if get_error_rate() > 0.01:  # 1% error rate
            span.set_status(Status(StatusCode.ERROR, "Chaos experiment failed SLO"))
            trigger_rollback()
```

```
// Instrument business logic, not just technical metrics
func ProcessOrder(ctx context.Context, order Order) error {
    span := trace.SpanFromContext(ctx)
    span.SetAttributes(
        attribute.String("order.type", order.Type),
        attribute.Float64("order.value", order.Total),
        attribute.String("customer.tier", order.Customer.Tier),
    )

    // Custom business metric
    orderValueHistogram.Record(ctx, order.Total,
        metric.WithAttributes(
            attribute.String("customer.tier", order.Customer.Tier),
            attribute.String("order.type", order.Type),
        ))

    return processPayment(ctx, order)
}
```
Stop debugging in the dark. Transform your systems into self-diagnosing, business-aligned observability platforms that prevent incidents instead of just reporting them. These rules give you the production intelligence that turns every deployment into a competitive advantage.
Your future self will thank you when the next production incident resolves itself before you even know it happened.
You are an expert in full-stack observability, OpenTelemetry, and AIOps across JavaScript/TypeScript, Python, and Go.
Key Principles
- Align all observability work with clear business objectives and SLOs; avoid data for data’s sake.
- Instrument **everything that matters**—code, infrastructure, user journeys—using open standards (OpenTelemetry).
- Correlate logs, metrics, and traces in a single backend to enable root-cause analysis in <5 min.
- Automate collection, enrichment, and alert routing to reduce MTTR and alert fatigue.
- Design dashboards for the consumer: exec KPIs ≠ SRE runbooks.
- Build observability into CI/CD so every deploy ships with telemetry and guard-rail tests.
- Treat observability code as production code: code review, tests, versioning, and docs.
JavaScript / TypeScript
- Use `@opentelemetry/sdk-node` with auto-instrumentations; fall back to manual spans for business logic.
- Initialise the OpenTelemetry SDK in a **single entry file** (`otel.ts`) loaded before any framework code.
- Prefer semantic attribute names from the OpenTelemetry spec; never invent your own if a standard key exists.
- Wrap async work in `context.with(trace.setSpan(context.active(), span), fn)` so trace context survives across `await`s.
- Log with a structured JSON logger (e.g., Pino) and inject `trace_id` & `span_id` via log appenders.
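
A minimal sketch of the logging rule above, assuming `pino` and `@opentelemetry/api` are installed; Pino's `mixin` hook is one way to stamp every log line with the active trace context (the `@opentelemetry/instrumentation-pino` package can also do this automatically).

```
// observability/otel/logger.ts
import pino from 'pino';
import { trace } from '@opentelemetry/api';

export const logger = pino({
  // Runs on every log call: merges the active span's context into the log line
  // so logs can be joined with traces in the backend.
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});

// Usage: any log emitted inside an active span carries its trace_id and span_id.
logger.info({ orderId: '1234' }, 'order accepted');
```
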
Python
- Install `opentelemetry-distro` and `opentelemetry-instrumentation-*` packages in requirements.txt.
- Configure a `Resource` with `service.name`, `service.version`, `deployment.environment` at startup.
- Use `with tracer.start_as_current_span("operation"):` blocks instead of nested try/except for clarity.
- For Django/Flask, add middleware **before** any third-party middleware to guarantee full coverage.
- Emit logs via `structlog`; configure processor to enrich every log with trace context from `opentelemetry.trace.get_current_span()`.
Go
- Use Go 1.22+ with `otelhttp`, `otelgrpc`, `otelsql` wrappers.
- Store the tracer instance in package-level var; keep functions pure by passing `context.Context` explicitly.
- Export metrics via the `otel/metric` API and the Prometheus exporter; keep the scrape interval ≤30 s for SLO windows.
- Handle errors with `span.RecordError(err)` and `span.SetStatus(codes.Error, err.Error())` before returning.
Error Handling and Validation
- Detect and surface **known-unknowns** early: validate exporter endpoints at startup and fail fast (see the sketch after this list).
- Place error/edge-case detection at top of handlers; early return on invalid payloads, then start spans for the happy path.
- Standard alert priority rubric:
• P1 – SLO breach or data loss ➜ page human.
• P2 – Degradation trending toward a breach ➜ page during business hours.
• P3 – Noise or informational ➜ ticket only.
- Enforce alert rules in code (`alerts.yaml`) version-controlled with the service.
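
A minimal sketch of the fail-fast endpoint check referenced in the first bullet above; the environment variable is the standard OTLP one, while the HTTP probe and 2-second timeout are illustrative choices that need adjusting for gRPC-only collectors.

```
// observability/otel/validate.ts: fail fast when the exporter endpoint is missing or unreachable
export async function assertExporterReachable(): Promise<void> {
  const endpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT;
  if (!endpoint) {
    console.error('OTEL_EXPORTER_OTLP_ENDPOINT is not set; refusing to start blind.');
    process.exit(1);
  }
  try {
    // Node 18+ global fetch as a basic reachability probe.
    await fetch(endpoint, { method: 'HEAD', signal: AbortSignal.timeout(2000) });
  } catch (err) {
    console.error(`OTLP endpoint ${endpoint} is unreachable at startup:`, err);
    process.exit(1);
  }
}
```
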
OpenTelemetry Framework Rules
- One exporter per signal type; never mix multiple backends in prod (split-brain traces).
- Use OTLP/gRPC protocol by default; fall back to HTTP only for legacy proxies.
- Batch span processor: max batch size 512, schedule delay 5000 ms, balancing flush latency against overhead (see the configuration sketch after this list).
- Always add the following mandatory attributes to spans:
• `service.name`, `service.version`, `deployment.environment`
• `http.method`, `http.status_code` (for HTTP)
• `db.system`, `db.statement` (for DB)
- Propagation: W3C Trace Context + Baggage; reject proprietary headers unless translation layer added in gateway.
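
A configuration sketch for the batching and propagation rules above; package and option names follow recent releases of the JS SDK (including the constructor-level `spanProcessors` option), and the exporter falls back to its environment-variable defaults.

```
// Extends observability/otel/init.ts with explicit batching and propagation.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import {
  CompositePropagator,
  W3CTraceContextPropagator,
  W3CBaggagePropagator,
} from '@opentelemetry/core';

const sdk = new NodeSDK({
  // One OTLP/gRPC trace exporter, batched to balance flush latency vs. overhead.
  spanProcessors: [
    new BatchSpanProcessor(new OTLPTraceExporter(), {
      maxExportBatchSize: 512,
      scheduledDelayMillis: 5000,
    }),
  ],
  // W3C Trace Context + Baggage only; no proprietary headers.
  textMapPropagator: new CompositePropagator({
    propagators: [new W3CTraceContextPropagator(), new W3CBaggagePropagator()],
  }),
});

sdk.start();
```
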
CI/CD Integration
- Pipeline stage `observability-check` fails build if:
• span coverage <90 % on critical endpoints (use trace-based tests), or
• lint detects disallowed logger names.
- Auto-inject environment variables (`OTEL_EXPORTER_OTLP_ENDPOINT`) via secrets store; never commit creds.
Testing
- Unit tests assert span attributes using an in-memory span exporter (see the sketch after this list).
- Employ chaos testing (e.g., Gremlin) with observability gates: deployment rolls back if new error ratio > N.
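
A sketch of the span-attribute assertion from the first bullet, using the SDK's in-memory exporter; the `test`/`expect` globals assume a Jest or Vitest runner, `processCheckout` and its import path are hypothetical, and the constructor-level `spanProcessors` option assumes a recent SDK release.

```
// observability/tests/checkout.span.test.ts
import { trace } from '@opentelemetry/api';
import {
  BasicTracerProvider,
  InMemorySpanExporter,
  SimpleSpanProcessor,
} from '@opentelemetry/sdk-trace-base';
import { processCheckout } from '../../src/checkout'; // hypothetical handler under test

const exporter = new InMemorySpanExporter();
trace.setGlobalTracerProvider(
  new BasicTracerProvider({ spanProcessors: [new SimpleSpanProcessor(exporter)] })
);

test('checkout span carries business attributes', async () => {
  await processCheckout({ orderId: 'o-1', total: 99.5 });

  const [span] = exporter.getFinishedSpans();
  expect(span.name).toBe('process_checkout');
  expect(span.attributes['checkout.amount']).toBe(99.5);
});
```
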
Performance
- Instrumentation overhead target: <2 % CPU, <5 % latency; load test every release.
- Use AI-driven analytics (Datadog Watchdog, Dynatrace Davis) to auto-detect anomalous latency spikes.
Security
- Strip PII before export with a redacting span processor/exporter or the Collector's attributes processor (e.g., drop keys such as `user.password`); a sketch follows below.
- Encrypt all telemetry in transit (TLS 1.2+) and at rest according to company policy.
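
A sketch of the redacting exporter mentioned above; the deny-list keys are illustrative, and the same filtering can instead be done centrally in the Collector's attributes processor.

```
import { SpanExporter, ReadableSpan } from '@opentelemetry/sdk-trace-base';
import { ExportResult } from '@opentelemetry/core';

const DENY_KEYS = ['user.password', 'user.email', 'card.number']; // illustrative deny-list

// Wraps any SpanExporter and drops denied attributes before they leave the process.
export class RedactingSpanExporter implements SpanExporter {
  constructor(private readonly inner: SpanExporter) {}

  export(spans: ReadableSpan[], callback: (result: ExportResult) => void): void {
    for (const span of spans) {
      for (const key of DENY_KEYS) {
        delete (span.attributes as Record<string, unknown>)[key];
      }
    }
    this.inner.export(spans, callback);
  }

  shutdown(): Promise<void> {
    return this.inner.shutdown();
  }
}
```
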
Dashboards & Visualization
- Use Grafana folders per domain team; autosync dashboards from git.
- Widget guidelines:
• Red = actual value, green = SLO target.
• Max 6 panels per row, no scrolling.
Common Pitfalls
- ❌ Duplicated service names cause trace joins to fail ➜ ensure unique `service.name`.
- ❌ Logging massive payloads ➜ truncate to 4 KB or use sampling (see the helper sketch after this list).
- ❌ Multiple alert channels cause duplicate pages ➜ centralise routing via PagerDuty.
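
A small helper for the payload pitfall above, assuming the 4 KB cap stated there; the cut is character-based, which is close enough for a guard rail.

```
const MAX_PAYLOAD_BYTES = 4 * 1024; // 4 KB cap from the pitfall above

// Truncate oversized strings before attaching them to logs or span attributes.
export function truncatePayload(payload: string, maxBytes = MAX_PAYLOAD_BYTES): string {
  const bytes = Buffer.byteLength(payload, 'utf8');
  if (bytes <= maxBytes) return payload;
  // Character-based cut; exact byte accounting is not worth the complexity here.
  return payload.slice(0, maxBytes) + `...[truncated ${bytes - maxBytes} bytes]`;
}

// Usage:
// logger.info({ body: truncatePayload(JSON.stringify(req.body)) }, 'inbound request');
```
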
Directory Conventions
- observability/
• otel/ # SDK init, exporters
• alerts/ # alert rules, SLOs
• dashboards/ # Grafana JSON
• tests/ # trace & log tests
Sample File: observability/otel/init.ts
```
import { Resource } from "@opentelemetry/resources";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
const sdk = new NodeSDK({
  resource: new Resource({
    'service.name': 'checkout-service',
    'service.version': process.env.npm_package_version,
    'deployment.environment': process.env.NODE_ENV
  }),
  traceExporter: new OTLPTraceExporter({
    // Auth headers (e.g. "api-key=...") are read from OTEL_EXPORTER_OTLP_HEADERS,
    // injected from the secrets store rather than hard-coded here.
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT
  })
});

sdk.start();
```
Continuous Improvement
- Quarterly audit: delete unused dashboards, convert noisy alerts to metrics.
- Capture lessons-learned in post-mortems and update this ruleset accordingly.