Comprehensive Rules for implementing robust, secure, and observable logging/monitoring pipelines in production workloads.
Your application crashes at 3 AM. Your logs are scattered across twelve different files, half unstructured, with missing context. By the time you piece together what happened, the issue has cost your business thousands of dollars and your users' trust.
Most development teams treat logging as an afterthought—dropping print() statements during debugging and calling it observability. But when production breaks, you discover the harsh truth:
You're not just fighting bugs—you're fighting your own logging infrastructure.
These Cursor Rules transform your Python applications into observable, debuggable systems that tell you exactly what's happening, when, and why. Built on OpenTelemetry standards and production-hardened patterns, they eliminate the guesswork from incident response.
What you get:
Before: 15 minutes jumping between tools to understand one error
```python
# Scattered, useless logs
print(f"Error in payment processing")
logger.error("Database connection failed")
```
After: Complete incident context in one place
```python
log.error(
    "payment_processing_failed",
    user_id=user.id,
    payment_method_id=payment.id,
    error_code="DB_CONNECTION_TIMEOUT",
    retry_count=3,
    # Hex-encode the trace ID so it matches what the tracing backend displays
    trace_id=format(trace.get_current_span().get_span_context().trace_id, "032x"),
)
```
Before: 200+ daily alerts, 95% noise
After: Severity-mapped alerts that fire only on actionable conditions
Scenario: spike in authentication failures at 2:47 PM
```python
log.warning(
    "authentication_failure",
    threat=True,
    user_id=hash_pii(user_id),
    source_ip=request.remote_addr,
    failure_reason="invalid_token",
    attempts_last_hour=get_failed_attempts(user_id),
)
```
Automatic routing to SIEM, PII protection, and correlation across services.
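The `hash_pii` helper used in these snippets is not provided by any library; a minimal sketch, assuming a salted SHA-256 mask satisfies your compliance policy:

```python
# Hypothetical helper: one-way mask for identifiers before they reach the log pipeline.
import hashlib
import os

PII_SALT = os.environ.get("PII_HASH_SALT", "")  # salt supplied by deployment tooling

def hash_pii(value: str) -> str:
    """Salted SHA-256 digest: values stay correlatable but are not reversible."""
    digest = hashlib.sha256(f"{PII_SALT}:{value}".encode("utf-8")).hexdigest()
    return digest[:16]  # truncate for readability; adjust to your policy
```

Keep the salt out of source control so hashed values cannot be brute-forced from known identifiers.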
Your payment completion rate dropped 5% overnight. With these rules:
```python
# Every payment attempt gets correlated logging
@trace_payment_flow
def process_payment(payment_request):
    log.info(
        "payment_started",
        payment_id=payment_request.id,
        amount_cents=payment_request.amount,
        payment_method=payment_request.method,
    )
    try:
        result = payment_gateway.charge(payment_request)
        log.info(
            "payment_completed",
            payment_id=payment_request.id,
            gateway_response_time_ms=result.duration,
            transaction_id=result.transaction_id,
        )
        return result
    except PaymentGatewayTimeout as e:
        log.error(
            "payment_gateway_timeout",
            payment_id=payment_request.id,
            gateway="stripe",
            timeout_duration_ms=e.duration,
            retry_scheduled=True,
        )
        raise  # let the caller (or retry machinery) handle the failure
```
Debug workflow: Query Loki for payment_gateway_timeout → see all failures clustered around 2:15 AM → correlation with gateway status page → root cause identified in 2 minutes.
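The `@trace_payment_flow` decorator above is assumed rather than provided; one possible sketch using the OpenTelemetry API:

```python
# Hypothetical decorator: wrap a payment function in a span so every log line
# emitted inside it can be correlated by trace_id.
import functools
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def trace_payment_flow(func):
    @functools.wraps(func)
    def wrapper(payment_request, *args, **kwargs):
        with tracer.start_as_current_span("process_payment") as span:
            span.set_attribute("payment.id", str(payment_request.id))
            return func(payment_request, *args, **kwargs)
    return wrapper
```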
Suspicious login activity detected:
```python
@monitor_authentication
def authenticate_user(credentials):
    try:
        user = validate_credentials(credentials)
        log.info(
            "user_authenticated",
            user_id=user.id,
            login_method=credentials.method,
            source_ip=get_client_ip(),
        )
        return user
    except InvalidCredentials:
        log.warning(
            "authentication_failed",
            threat=True,  # Auto-routes to security team
            attempted_username=hash_pii(credentials.username),
            source_ip=get_client_ip(),
            user_agent=request.headers.get('User-Agent'),
            failure_count=get_recent_failures(credentials.username),
        )
        raise
```
Security workflow: Threat-tagged logs auto-route to security Slack → query shows 50 failed attempts from same IP → automatic IP blocking triggered → incident contained in < 5 minutes.
```python
@trace_database_calls
def get_user_orders(user_id):
    # bind() returns a new logger that carries user_id on every subsequent event
    bound_log = log.bind(user_id=user_id)
    start_time = time.time()
    orders = db.query("""
        SELECT * FROM orders
        WHERE user_id = %s
        ORDER BY created_at DESC
    """, user_id)
    query_duration = (time.time() - start_time) * 1000
    bound_log.info(
        "user_orders_retrieved",
        order_count=len(orders),
        query_duration_ms=query_duration,
        slow_query=query_duration > 100,
    )
    return orders
```
Performance workflow: Grafana dashboard shows P95 latency increase → drill down to slow_query=true logs → identify specific queries → optimize indexes → deploy fix within one sprint.
```bash
pip install structlog opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
```
Create logging_config.py:
```python
import structlog
import logging
from structlog.processors import TimeStamper, JSONRenderer

def configure_logging():
    # Give the stdlib root logger a handler so INFO-level events are actually emitted
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    structlog.configure(
        processors=[
            TimeStamper(fmt="iso"),
            structlog.processors.add_log_level,
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            JSONRenderer(),
        ],
        logger_factory=structlog.stdlib.LoggerFactory(),
        wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
    )
    return structlog.get_logger()

log = configure_logging()
```
```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize before any imports that might log
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Export to your observability backend
otlp_exporter = OTLPSpanExporter(endpoint="https://your-otlp-endpoint")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
```
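A brief usage illustration, reusing the `tracer` and `log` objects configured above, showing how spans and log lines end up correlated:

```python
# Any log emitted inside the span can carry the same trace_id the backend shows.
with tracer.start_as_current_span("checkout") as span:
    ctx = span.get_span_context()
    log.info(
        "checkout_started",
        trace_id=format(ctx.trace_id, "032x"),
        span_id=format(ctx.span_id, "016x"),
    )
```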
Copy the provided rules into your Cursor configuration; they enforce the structured-logging, tracing, and alerting conventions spelled out in the rules section at the end of this document.
Set up Fluentd/Vector configuration:
```
# Accept only JSON logs
<source>
  @type tail
  path /var/log/app/*.log
  pos_file /var/log/fluentd/app.log.pos
  tag app
  <parse>
    @type json
  </parse>
</source>

# Enrich with Kubernetes metadata
<filter app>
  @type kubernetes_metadata
</filter>

# Copy the stream, then keep only threat=true events in the security index
<match app>
  @type copy
  <store>
    @type relabel
    @label @SECURITY
  </store>
  <store>
    @type relabel
    @label @GENERAL
  </store>
</match>

<label @SECURITY>
  <filter **>
    @type grep
    <regexp>
      key threat
      pattern /^true$/
    </regexp>
  </filter>
  <match **>
    @type elasticsearch
    logstash_format true
    logstash_prefix security-logs
    logstash_dateformat %Y.%m
  </match>
</label>

<label @GENERAL>
  <match **>
    @type elasticsearch
    logstash_format true
    logstash_prefix app-logs
    logstash_dateformat %Y.%m
  </match>
</label>
```
Before implementation:
After implementation:
Your team stops being reactive firefighters and becomes proactive system architects. Instead of dreading production issues, you have complete visibility into system behavior with the confidence to ship faster.
The bottom line: These rules don't just improve your logging—they transform how your team builds, ships, and maintains production software. You'll wonder how you ever debugged anything without them.
Ready to eliminate your next 3 AM debugging session? Implement these rules and experience production observability that actually works.
You are an expert in production-grade logging, monitoring, and observability across cloud-native systems. Stack: Python, OpenTelemetry, Fluentd/Vector, Grafana Loki, Elasticsearch/OpenSearch, Prometheus, Datadog, Splunk, AWS/GCP logging services.
Key Principles
- Prefer structured, machine-readable logs (JSON). No free-form strings.
- A log line MUST contain: ISO-8601 timestamp, service_name, environment, trace_id, span_id, severity, message (a processor sketch that stamps the service fields follows this list).
- Centralize storage; never leave logs on ephemeral nodes.
- Logs are append-only; treat them as immutable evidence.
- Least-privilege: expose logs via RBAC/ABAC only.
- Alert only on actionable, SLO-related conditions to curb noise.
- Aggregate metrics, logs, and traces using OpenTelemetry for single source of truth.
- Always test logging & alerting pathways in staging before prod roll-out.
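As referenced above, a small structlog processor can stamp the mandatory service fields onto every event; a sketch, with placeholder environment variable names:

```python
# Hypothetical processor: add service metadata to every structlog event.
import os

def add_service_context(logger, method_name, event_dict):
    # SERVICE_NAME / DEPLOY_ENV are assumed to be set by your deployment tooling.
    event_dict.setdefault("service_name", os.getenv("SERVICE_NAME", "unknown"))
    event_dict.setdefault("environment", os.getenv("DEPLOY_ENV", "dev"))
    return event_dict

# Register it alongside the other entries in structlog.configure(processors=[...]).
```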
Python
- Use the standard "logging" module wrapped by structlog.
```python
import structlog, logging
from structlog.processors import TimeStamper, JSONRenderer
structlog.configure(
processors=[
TimeStamper(fmt="iso"),
structlog.processors.add_log_level,
structlog.processors.StackInfoRenderer(),
structlog.processors.format_exc_info,
JSONRenderer(),
],
logger_factory=structlog.stdlib.LoggerFactory(),
)
log = structlog.get_logger()
log.info("user_login", user_id=user.id, ip=ip)
```
- NEVER use `print()` for operational logs.
- Use `logging.setLogRecordFactory()` to inject trace_id/span_id from contextvars (see the sketch after this list).
- Map Python levels to RFC5424 severities: DEBUG=7, INFO=6, WARNING=4, ERROR=3, CRITICAL=2.
- Use `warnings.filterwarnings("error")` in CI so stray warnings (e.g., deprecations) become test failures instead of noise in production logs.
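A possible sketch of the record-factory injection mentioned above (OpenTelemetry keeps its active span in contextvars, so reading the current span is sufficient):

```python
# Attach trace/span IDs from the active OpenTelemetry context to every LogRecord.
import logging
from opentelemetry import trace

_default_factory = logging.getLogRecordFactory()

def _record_factory(*args, **kwargs):
    record = _default_factory(*args, **kwargs)
    ctx = trace.get_current_span().get_span_context()
    # A zero trace_id means there is no active span; hex-encode otherwise.
    record.trace_id = format(ctx.trace_id, "032x") if ctx.trace_id else None
    record.span_id = format(ctx.span_id, "016x") if ctx.span_id else None
    return record

logging.setLogRecordFactory(_record_factory)
```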
Error Handling and Validation
- Guard every external call with try/except and log structured context + stacktrace on ERROR.
- Place validation at function start; emit WARN with `validation_error=true` for bad inputs, raise afterwards.
- For security failures (authz, tampering) log at WARN with `threat=true` and push to SIEM channel.
- Fatal, process-terminating exceptions ⇒ log at CRITICAL, then `sys.exit(1)`.
- Sample repetitive errors with a token bucket (e.g., 5 msgs/sec) so bursts don't overwhelm your sink (sketch below).
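An illustrative token-bucket sampler for the last rule; the class is a hypothetical helper, not part of structlog:

```python
# Allow roughly `rate` identical error events per second; drop the excess.
import time

class TokenBucketSampler:
    def __init__(self, rate=5.0, capacity=5.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

sampler = TokenBucketSampler(rate=5)
# if sampler.allow(): log.error("upstream_unavailable", dependency="billing-api")
```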
OpenTelemetry (Framework-Specific Rules)
- Always start `TracerProvider()` early (before first import that might log).
- Export traces AND logs: set `OTEL_LOGS_EXPORTER=otlp`.
- Use resource attributes: `service.name`, `service.version`, `deployment.environment` (example after this list).
- Propagate context inside Celery/RQ tasks via `opentelemetry-instrumentation-*`.
- Combine Prometheus + Grafana for metrics, Loki for logs; link via `trace_id` label for click-through.
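A minimal sketch of the resource-attributes rule above (attribute values are placeholders):

```python
# Tag every span with service identity so backends can group and filter.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({
    "service.name": "payments-api",          # placeholder
    "service.version": "1.4.2",              # placeholder
    "deployment.environment": "production",
})
trace.set_tracer_provider(TracerProvider(resource=resource))
```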
Fluentd / Vector
- Accept only JSON; drop non-conforming lines.
- Enrich with Kubernetes metadata (`pod`, `namespace`, `container`).
- Route security logs to dedicated index with 90-day retention; app logs 30-day.
- Compress and chunk outputs (Gzip, max 2 MiB) before shipping to S3/GCS.
Additional Sections
Testing
- Integration tests MUST assert that logs are produced:
```python
from structlog.testing import capture_logs

with capture_logs() as captured:
    func_under_test()
assert any(entry["event"] == "order_created" for entry in captured)
```
- Periodically replay recorded incidents (chaos tests) and verify alerts fire in <5 min.
Performance
- Offload log writes from the request path with the stdlib `QueueHandler`/`QueueListener` pair or a non-blocking rotating handler such as `ConcurrentRotatingFileHandler` (sketch after this list).
- For >1k log lines/sec, enable sampling: DEBUG 1%, INFO 10%.
- Use non-blocking exporters (OTLP over gRPC with `BatchSpanProcessor`).
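A minimal non-blocking handler setup for the first rule above, using only the standard library (the file path is illustrative):

```python
# Hand records to a background thread so request threads never block on disk I/O.
import logging
import queue
from logging.handlers import QueueHandler, QueueListener

log_queue = queue.Queue(-1)  # unbounded
file_handler = logging.FileHandler("/var/log/app/app.log")  # illustrative path
listener = QueueListener(log_queue, file_handler)
listener.start()

logging.getLogger().addHandler(QueueHandler(log_queue))
# Call listener.stop() during shutdown to flush any queued records.
```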
Security & Compliance
- NEVER log PII or secrets. Mask with SHA-256 or last-4 pattern.
- Redact JWT, OAuth tokens via middleware before the logger.
- Store prod logs in write-only bucket; deletion keys isolated.
- Enable at-rest encryption (AES-256) & TLS 1.2+ in transit.
- Keep audit logs for ≥ 1 year (GDPR/PCI-DSS). Use separate immutable index.
Alerting
- Use severity mapping:
- CRITICAL: immediate pager.
- ERROR: ticket within 30 min.
- WARN: daily digest.
- Configure alert routing via labels: `team=payments` → Slack #payments-alerts.
Dashboards
- Create Golden Signals dashboard: latency, traffic, errors, saturation.
- Include log-derived metrics: `error_rate`, `authn_failures`, `payments_timeouts`.
Common Pitfalls
- Forgetting to propagate trace context across threads/processes → use `contextvars`.
- Logging inside tight loops without sampling.
- Disabling TLS on internal log shipping (MITM risk).
- Hard-coding log levels; instead use env var `LOG_LEVEL`.
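One way to honor `LOG_LEVEL` with the structlog setup shown earlier; a sketch, not the only option:

```python
# Resolve the log level from the environment instead of hard-coding it.
import logging
import os

level = getattr(logging, os.getenv("LOG_LEVEL", "INFO").upper(), logging.INFO)
# Pass it into the existing configuration:
# structlog.configure(wrapper_class=structlog.make_filtering_bound_logger(level), ...)
```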
File/Directory Conventions
- `/etc/app/logging.yaml` – unified logging config loaded at start.
- `/var/log/app/YYYY/MM/DD/*.log.gz` – rotated, compressed local buffer.
- `/observability/otel-config.yml` – OpenTelemetry collector piping flows.
Reference Tooling
- Datadog: enable `DD_LOGS_INJECTION=true`.
- Splunk HEC: batch size 1000, back off 2^n seconds on 5xx (sketch after this list).
- AWS CloudWatch: use `aws logs put-subscription-filter` to push to Lambda-based SIEM.
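A hypothetical sender illustrating the Splunk HEC back-off rule; the endpoint URL, token handling, and batch format are assumptions to adapt to your HEC deployment:

```python
# Post a batch of HEC event envelopes, backing off 2^n seconds on 5xx responses.
import time
import requests  # assumed to be available

def send_to_hec(events, url, token, max_retries=5):
    headers = {"Authorization": f"Splunk {token}"}
    payload = "\n".join(events)  # newline-delimited JSON envelopes
    for attempt in range(max_retries):
        resp = requests.post(url, data=payload, headers=headers, timeout=10)
        if resp.status_code < 500:
            return resp  # success, or a client error not worth retrying
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("Splunk HEC still returning 5xx after retries")
```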