Comprehensive Rules for designing, implementing, and operating production-grade AI monitoring/observability pipelines with Python, Prometheus, Grafana, OpenTelemetry, and MLOps best practices.
Your AI models are running in production, making decisions that impact revenue, customer experience, and compliance. But here's the uncomfortable truth: most AI systems fail silently. By the time you notice accuracy degradation or bias creep, it's already cost you customers, money, or worse.
Every day your models run unmonitored, you're accumulating technical debt and business risk.
The problem isn't just monitoring—it's monitoring that actually helps you make decisions. Generic observability tools miss the nuances of AI systems: concept drift, fairness metrics, model-specific performance patterns, and the critical link between technical metrics and business outcomes.
These Cursor Rules transform your development workflow into a production-grade AI monitoring powerhouse. Instead of bolting on monitoring as an afterthought, you'll build observability into every model from day one—with automated drift detection, compliance tracking, and business impact correlation.
What you get:
```python
# Before: Monitoring as an afterthought
def predict(data):
    return model.predict(data)
```

```python
# After: Observability-first development
@monitor_inference(model_name="churn_v2", track_drift=True)
def predict(data: PredictionRequest) -> PredictionResponse:
    with tracer.start_span("feature_vectorize"):
        features = vectorize(data)
    with tracer.start_span("model_infer") as span:
        prediction = model.predict(features)
        span.set_attribute("confidence", prediction.confidence)
    return prediction
```
Impact: Catch issues in development instead of discovering them in production.
```python
# Link technical metrics to business outcomes automatically
@dataclass(frozen=True)
class BusinessMetric:
    churn_prevention_lift: float        # Revenue impact
    customer_satisfaction_delta: float  # User experience
    compliance_score: float             # Risk mitigation

# Every model change shows business impact immediately
register_business_kpi(
    technical_metric="accuracy",
    business_impact=lambda acc: calculate_revenue_impact(acc)
)
```
Result: Stakeholders see value, not just vanity metrics. Make data-driven decisions about model improvements.
```python
# Bias detection runs automatically on every prediction
bias_monitor = BiasMonitor(
    protected_attributes=["gender", "age", "ethnicity"],
    fairness_metrics=[DemographicParity(), EqualOpportunity()],
    alert_threshold=0.05
)

# Compliance reports generate themselves
@scheduler.scheduled_job("cron", hour=2)  # 2 AM daily
def generate_compliance_report():
    report = ComplianceReporter.generate_daily_report()
    if report.violations:
        alert_compliance_team(report)
```
Benefit: Pass audits without scrambling. Proactive compliance instead of reactive damage control.
```python
# Automatic drift detection and retraining pipeline
@drift_detector.on_drift_detected(threshold=0.1, consecutive_hours=3)
async def handle_drift(drift_report: DriftReport):
    # Create retraining ticket automatically
    ticket = await create_github_issue(
        title=f"Drift detected: {drift_report.model_name}",
        labels=["drift", "auto-retrain"],
        body=drift_report.detailed_analysis
    )
    # Start canary retraining if conditions met
    if drift_report.severity > 0.2:
        await trigger_canary_retrain(drift_report.model_name)
```
Outcome: Models self-heal before users notice problems. Reduce manual intervention by 80%.
Before: "Why is our churn model performing poorly?"
After: Single command investigation
```bash
# One command shows complete health picture
cursor-ai-monitor investigate --model churn_v2 --timerange 7d

# Output includes:
# - Drift analysis with root cause
# - Business impact quantification
# - Recommended remediation steps
# - Compliance status
```
Before: Deploy and hope
After: Production-ready from commit
```python
# Single decorator enables complete observability
@production_monitor(
    drift_detection=True,
    bias_monitoring=True,
    business_kpis=["revenue_impact", "csat_score"],
    auto_alerts=True
)
class ChurnPredictor:
    def predict(self, customer_data):
        # Your prediction logic
        pass
```
Before: Weeks of manual report generation
After: Audit-ready in minutes
```python
# Continuous compliance tracking
compliance_report = ComplianceReporter.generate_audit_package(
    models=["churn_v2", "recommendation_v1"],
    timerange="90d",
    standards=["NIST_AI_RMF", "EU_AI_Act", "IEEE_7003"]
)
# Generates: bias analysis, drift reports, decision logs, privacy impact assessments
```
```bash
# Copy the rules to your Cursor settings
mkdir -p ~/.cursor/rules
curl -o ~/.cursor/rules/ai-monitoring.json \
  https://raw.githubusercontent.com/your-repo/cursor-rules/main/ai-monitoring.json

# Install required dependencies
pip install fastapi prometheus-client opentelemetry-api pydantic
```
```python
# Add to any existing model service
from monitoring import ProductionMonitor

@ProductionMonitor.instrument(
    model_name="your_model",
    track_drift=True,
    monitor_bias=True
)
def your_prediction_function(data):
    # Your existing code stays the same
    return model.predict(data)
```
```bash
# Dashboards create themselves based on your models
python -m monitoring.setup --auto-discover-models
# Creates: Grafana dashboards, Prometheus alerts, compliance reports
```
```python
# Link technical performance to business outcomes
register_business_impact(
    model="churn_prediction",
    kpi_mapping={
        "accuracy": lambda acc: (acc - 0.8) * 1000000,  # Revenue per accuracy point
        "fairness": lambda fair: calculate_compliance_risk(fair)
    }
)
```
- Week 1: Complete visibility into model performance
- Weeks 2-4: Proactive issue resolution
- Month 2+: Self-optimizing AI operations
Quantified Benefits:
The difference isn't just better monitoring—it's transforming AI operations from reactive crisis management to predictive, automated optimization. Your models become self-aware, self-healing systems that maintain performance while you focus on building the next breakthrough.
Stop debugging AI systems in production. Start building them to monitor themselves.
You are an expert in Python, FastAPI, Prometheus, Grafana, OpenTelemetry, Kubernetes, AWS CloudWatch, and Fiddler AI.
Key Principles
- Treat monitoring as a first-class feature; instrument code while the model is being built, not after release.
- Align every technical metric (e.g., latency, concept-drift, bias) with a clear business KPI (e.g., churn, revenue impact, CSAT).
- Automate everything that can be automated (dashboards, alerts, tests) but keep a human escalation path for high-risk decisions.
- Prefer immutable, append-only logs and event streams for forensic analysis and compliance.
- Build "detect → diagnose → remediate" feedback loops; remediation can be automated retraining or rollback.
- Design for multi-tenant, multi-model environments; namespacing is mandatory.
- Security, privacy, and fairness are non-negotiable—treat them as monitored dimensions.
Python
- Always enable type hints and strict mypy checking (mypy --strict). Example:
```python
def log_inference(metric: str, value: float, labels: dict[str, str]) -> None:
...
```
- Use `dataclass(frozen=True)` or `pydantic.BaseModel` to define metric payloads; frozen dataclasses guarantee immutability and Pydantic models add runtime validation (a sketch follows the file layout below).
- Follow the functional style; pure functions for feature extraction and metric calculation, side-effects isolated in adapters (e.g., Prometheus client).
- Adopt snake_case for functions/variables, PascalCase for types, UPPER_SNAKE for constants (e.g., DRIFT_THRESHOLD).
- All modules must expose a `__all__` list; undocumented symbols are private.
- Default logging level is INFO; never print()—use the stdlib `logging` with `extra={"model":"model_name"}`.
- File layout:
model_service/
├─ api.py # FastAPI endpoints + metrics route
├─ core.py # prediction logic
├─ monitoring/
│ ├─ metrics.py # Prometheus collectors
│ ├─ drift.py # concept drift detection
│ ├─ bias.py # bias & fairness checks
│ └─ alerts.py # alert routing
├─ tests/ # pytest suites
└─ Dockerfile
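A minimal sketch of these conventions, assuming a hypothetical `monitoring/metrics.py` module; `InferenceMetric` and `log_inference` are illustrative names, not part of any library:

```python
from __future__ import annotations

import logging
from dataclasses import dataclass

__all__ = ["InferenceMetric", "log_inference"]

logger = logging.getLogger(__name__)

DRIFT_THRESHOLD: float = 0.1  # constant in UPPER_SNAKE; real values belong in a versioned config


@dataclass(frozen=True)
class InferenceMetric:
    """Immutable metric payload emitted once per prediction."""

    name: str
    value: float
    labels: dict[str, str]


def log_inference(metric: InferenceMetric) -> None:
    """Adapter boundary: pure callers build the payload, this function performs the side effect."""
    logger.info(
        "inference_metric %s=%.6f", metric.name, metric.value,
        extra={"model": metric.labels.get("model", "unknown")},
    )
```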
Error Handling and Validation
- Validate inbound inference payloads with Pydantic schemas; reject 4xx on schema violations.
- Guard rails first: if an input falls outside the training domain (e.g., feature z-score > 3), short-circuit and apply the fallback policy.
- Wrap external calls (DB, feature store) in `tenacity` retries with exponential backoff; never block the prediction thread longer than 500 ms.
- Use early returns for error branches; happy path last.
- Raise domain-specific exceptions (`DriftDetectedError`, `BiasAlertError`) that a global FastAPI handler converts into Prometheus counters and human-readable logs.
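One way to wire this up, sketched with `prometheus_client` and a FastAPI exception handler; the counter name and label values are illustrative, not prescribed by either library:

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from prometheus_client import Counter

app = FastAPI()

# Illustrative counter; real label values should follow the labeling rules below.
MONITORING_EXCEPTIONS = Counter(
    "monitoring_exceptions_total",
    "Domain exceptions raised by the model service",
    ["exception", "model"],
)


class DriftDetectedError(Exception):
    """Raised when the online drift score crosses its configured threshold."""


class BiasAlertError(Exception):
    """Raised when a fairness metric violates its configured bound."""


@app.exception_handler(DriftDetectedError)
async def drift_handler(request: Request, exc: DriftDetectedError) -> JSONResponse:
    # Convert the domain error into a counter increment plus a human-readable response.
    MONITORING_EXCEPTIONS.labels(exception="DriftDetectedError", model="churn_v2").inc()
    return JSONResponse(status_code=503, content={"detail": "model temporarily degraded"})
```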
Prometheus / Grafana (Framework-Specific Rules)
- Expose `/metrics` in FastAPI using `prometheus_fastapi_instrumentator`; register custom collectors for the following (a wiring sketch follows this list):
• model_inference_latency_seconds (histogram)
• model_accuracy (gauge) – update asynchronously via batch job
• model_drift_score (gauge) – Kolmogorov–Smirnov distance
• model_bias_metric{group="gender"} (gauge)
- Follow the Prometheus naming convention: `snake_case`, a base-unit suffix (e.g., `_seconds`, `_bytes`), and `_total` for counters.
- Add `deployment`, `model`, `version`, and `environment` labels to every metric.
- Grafana dashboards must include: SLO burn-down, anomaly waterfall, resource overlay (CPU/RAM vs latency), and business KPI correlation panels.
- Alertmanager rules: fire only after the condition holds for 3 consecutive evaluations (≈2 minutes); route P1 (user impact) to PagerDuty and P2 to Slack.
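A sketch of the `/metrics` wiring and the collectors listed above, using `prometheus_client` and `prometheus_fastapi_instrumentator`; the label values in the usage comment are placeholders:

```python
from fastapi import FastAPI
from prometheus_client import Gauge, Histogram
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app)  # serves /metrics

COMMON_LABELS = ["deployment", "model", "version", "environment"]

MODEL_INFERENCE_LATENCY_SECONDS = Histogram(
    "model_inference_latency_seconds", "End-to-end inference latency", COMMON_LABELS
)
MODEL_ACCURACY = Gauge(
    "model_accuracy", "Accuracy computed asynchronously by the evaluation batch job", COMMON_LABELS
)
MODEL_DRIFT_SCORE = Gauge(
    "model_drift_score", "Kolmogorov-Smirnov distance versus the training baseline", COMMON_LABELS
)
MODEL_BIAS_METRIC = Gauge(
    "model_bias_metric", "Fairness metric per protected group", COMMON_LABELS + ["group"]
)

# Usage inside the prediction path:
# with MODEL_INFERENCE_LATENCY_SECONDS.labels("blue", "churn_v2", "2.1.0", "prod").time():
#     prediction = model.predict(features)
```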
OpenTelemetry Tracing
- Instrument prediction pipeline with spans: `receive_request → feature_vectorize → model_infer → post_process`.
- Propagate `X-Request-ID`; failing to propagate is a build breaker.
- Export traces to OTLP endpoint, then to Grafana Tempo or AWS X-Ray.
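A minimal tracing setup matching those span names, assuming the `opentelemetry-sdk` and OTLP exporter packages are installed; the collector endpoint and the stand-in feature/model logic are placeholders:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export spans to an OTLP collector, which forwards to Grafana Tempo or AWS X-Ray.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317")))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("model_service")


def predict_with_tracing(request_id: str, features: list[float]) -> float:
    """Span layout: receive_request -> feature_vectorize -> model_infer -> post_process."""
    with tracer.start_as_current_span("receive_request") as span:
        span.set_attribute("request.id", request_id)  # propagated X-Request-ID
        with tracer.start_as_current_span("feature_vectorize"):
            vector = [f * 2.0 for f in features]      # stand-in for real feature engineering
        with tracer.start_as_current_span("model_infer"):
            score = sum(vector) / len(vector)         # stand-in for model.predict(...)
        with tracer.start_as_current_span("post_process"):
            return round(score, 4)
```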
Kubernetes & CloudWatch
- Scale on load with a `HorizontalPodAutoscaler` driven by custom metrics (CPU > 70 % for 5 m or p95 latency > 250 ms); use `VerticalPodAutoscaler` only to right-size resource requests.
- Emit structured CloudWatch logs (JSON) with `trace_id` and `metric_set` fields.
- Store logs at least 90 days; index only the last 14 days to control cost.
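A stdlib-only sketch of the log shape (CloudWatch ingests whatever the container writes to stdout); the `trace_id` and `metric_set` fields come from the rule above, the formatter itself is illustrative:

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line for CloudWatch Logs."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "metric_set": getattr(record, "metric_set", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("model_service").info(
    "inference complete", extra={"trace_id": "abc123", "metric_set": "latency"}
)
```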
Testing
- Write pytest unit tests for every metric collector (≥ 95 % coverage); a collector-test sketch follows this list.
- Use `pytest-benchmark` to set an upper-bound regression guard (e.g., p95 latency +10 % threshold).
- Employ `great_expectations` or `deepchecks` in CI to validate training vs inference data drift.
- Chaos test alerting: simulate 502s, ensure Alertmanager fires within 60 s.
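A collector-test sketch using `prometheus_client`'s sample-inspection helper; the counter is defined inline so the test is self-contained:

```python
from prometheus_client import CollectorRegistry, Counter


def test_prediction_counter_increments() -> None:
    # Use an isolated registry so the test never touches the global default registry.
    registry = CollectorRegistry()
    predictions = Counter(
        "model_predictions", "Predictions served", ["model"], registry=registry
    )

    predictions.labels(model="churn_v2").inc()

    # prometheus_client exposes counters with a `_total` suffix.
    assert registry.get_sample_value(
        "model_predictions_total", {"model": "churn_v2"}
    ) == 1.0
```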
Performance & Resource Patterns
- Batch metrics pushes; use `pushgateway` only for short-lived jobs, never for online services.
- Async I/O (`async def`) for heavy network I/O; CPU-bound tasks offloaded to `ProcessPoolExecutor`.
- Track and cap memory with the `resource` module; abort if RSS exceeds 80 % of the limit (see the sketch below).
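A sketch of the RSS guard; `MEMORY_LIMIT_BYTES` is an assumed value that would come from the container spec in practice, and `ru_maxrss` reports kilobytes on Linux (bytes on macOS):

```python
import resource
import sys

MEMORY_LIMIT_BYTES = 2 * 1024**3  # assumed 2 GiB container limit


def check_memory_or_abort(limit_fraction: float = 0.8) -> None:
    """Abort the worker if peak resident memory exceeds the configured fraction of the limit."""
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # peak RSS, kilobytes on Linux
    if rss_kb * 1024 > limit_fraction * MEMORY_LIMIT_BYTES:
        sys.exit("RSS above 80% of the memory limit; aborting so the orchestrator can restart the pod")
```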
Security & Compliance
- Encrypt all traffic with TLS 1.3; refuse plaintext Prometheus scrapes.
- Sanitize PII before logging; hash identifiers with a keyed hash or a salt of at least 32 bytes (sketched after this list).
- Map every metric to a DPIA (data-protection impact assessment) entry.
- Conform to NIST AI RMF, ISO/IEC 42001, the EU AI Act, and IEEE 7003-2024.
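One way to pseudonymize identifiers before they reach logs — a sketch using an HMAC keyed with a secret of at least 32 bytes, which plays the role of the salt; key provisioning via an environment variable is an assumption:

```python
import hashlib
import hmac
import os

# Assumed to be provisioned by a secrets manager; must be at least 32 bytes.
PII_HASH_KEY = os.environ.get("PII_HASH_KEY", "").encode()


def pseudonymize(value: str) -> str:
    """Return a stable, non-reversible token for a PII field before it is logged."""
    if len(PII_HASH_KEY) < 32:
        raise RuntimeError("PII_HASH_KEY must be at least 32 bytes")
    return hmac.new(PII_HASH_KEY, value.encode(), hashlib.sha256).hexdigest()
```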
Naming Conventions
- Metrics: `<resource>_<action>_<unit>` (e.g., prediction_error_rate_ratio).
- Log files: `model-<model_name>-<yyyy-mm-dd>.log`.
- Dashboard titles: `📊 <Domain>: <Model> - <Environment>`.
Anomaly Detection & Retraining
- Schedule a nightly Spark/SQL job to compute population statistics; store them in the `metrics_baseline` table (a drift-score sketch follows this list).
- If `model_drift_score` > 0.1 (KS) for 3 consecutive hours, automatically create a GitHub issue tagged `drift`.
- Implement canary retraining pipeline; new model promoted only if A/B test lifts KPI ≥ 0.5 % and no metric regression.
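A sketch of the drift score and the automatic issue creation, assuming `scipy`, `numpy`, and `requests` are available; the repository slug and token handling are placeholders:

```python
import numpy as np
import requests
from scipy.stats import ks_2samp

DRIFT_THRESHOLD = 0.1  # mirrored from the versioned config map


def compute_drift_score(baseline: np.ndarray, current: np.ndarray) -> float:
    """Kolmogorov-Smirnov distance between baseline and live feature samples."""
    statistic, _p_value = ks_2samp(baseline, current)
    return float(statistic)


def open_drift_issue(model_name: str, score: float, token: str) -> None:
    # Placeholder owner/repo; in practice both come from configuration.
    requests.post(
        "https://api.github.com/repos/your-org/your-repo/issues",
        headers={"Authorization": f"Bearer {token}"},
        json={"title": f"Drift detected: {model_name} (KS={score:.3f})", "labels": ["drift"]},
        timeout=10,
    )
```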
CI/CD
- Fail the build if mypy, pylint (score < 9.5), unit tests, or the bandit security scan fails.
- Push Docker images with semantic tags (`<major>.<minor>.<patch>-<gitsha>`).
- Use ArgoCD with automated rollback on alert `deployment_rollback_required`.
Documentation
- Every metric collector must have a docstring with: "Description", "Labels", "Unit", "Source" (example after this list).
- Maintain a `metrics_catalog.md` artifact published to Confluence.
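An illustrative collector with a docstring following that template; the gauge itself reuses the naming and label rules above:

```python
from prometheus_client import Gauge


def register_drift_gauge() -> Gauge:
    """Register the concept-drift collector.

    Description: Kolmogorov-Smirnov distance between baseline and live feature distributions.
    Labels: deployment, model, version, environment.
    Unit: dimensionless KS distance in [0, 1].
    Source: nightly `metrics_baseline` job compared against streaming feature statistics.
    """
    return Gauge(
        "model_drift_score",
        "Concept-drift score (KS distance)",
        ["deployment", "model", "version", "environment"],
    )
```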
Common Pitfalls & Guard-rails
- ❌ Don’t aggregate micro-batch predictions; instrument at the individual call level.
- ❌ Don’t hard-code thresholds; store them in a versioned config map.
- ❌ Don’t ignore p99 latency; users feel the tail!
- ✅ Do version both code and model artefacts; include `model.sha256` in Prometheus labels.