Actionable coding and MLOps rules for building, monitoring, and iterating data-science feedback loops in production.
Your models degrade the moment they hit production. While you're debugging last week's accuracy drop, your competitors are building self-improving systems that get better with every prediction. The difference? Production-grade feedback loops that treat continuous learning as a core engineering discipline.
Traditional ML workflows follow a fatal pattern: train → deploy → pray. You push a model to production, watch accuracy slowly degrade, then scramble to retrain when performance finally tanks.
The core problem? You're treating models like static artifacts instead of living systems that learn from every interaction.
These Cursor Rules transform your ML systems into self-improving engines that:
- Catch drift before it kills performance, with automated statistical monitoring that alerts when your model's assumptions break down. No more discovering accuracy drops weeks later through manual dashboard checks.
- Turn every prediction into training data by capturing feedback signals (clicks, purchases, ratings) and automatically incorporating them into model updates. Your system gets smarter with each user interaction.
- Scale human oversight intelligently, using active learning to identify the 5% of predictions that need human review while automating the 95% your model handles confidently.
- Maintain audit trails that compliance teams actually want to see, with immutable logging of every model decision, feature drift, and retraining event (a minimal log-record sketch follows this list).
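A minimal sketch of one such audit record, assuming append-only JSON-lines output that a log shipper forwards to ELK (the field names and file path are illustrative):

```python
# Append-only, structured audit records for every model decision.
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("audit/predictions.jsonl")  # illustrative location; ship to ELK via a log shipper

def log_prediction_event(model_version: str, features: dict,
                         prediction: str, probability: float) -> str:
    """Append one immutable audit record and return its event id."""
    record = {
        "event_id": str(uuid.uuid4()),
        "event_type": "prediction",
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
        "probability": probability,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
    AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
    with AUDIT_LOG.open("a", encoding="utf-8") as fh:  # append only; records are never rewritten
        fh.write(json.dumps(record) + "\n")
    return record["event_id"]
```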
Instead of writing custom dashboards and alert scripts, you get:
```python
# Automatic drift detection with statistical tests
import pandas as pd
from scipy.stats import ks_2samp
from prometheus_client import Counter

# Define the metric once at import time; re-creating it per call would raise
# a duplicate-registration error in prometheus_client.
drift_counter = Counter('feature_drift_detected_total', 'Features flagged for drift')

def detect_feature_drift(reference_data: pd.DataFrame, current_data: pd.DataFrame) -> bool:
    """Two-sample KS test for covariate shift with Prometheus alerting."""
    for feature in reference_data.columns:
        ks_stat, p_value = ks_2samp(reference_data[feature], current_data[feature])
        if p_value < 0.01:
            drift_counter.inc()
            return True
    return False
```
Your models update themselves when performance degrades:
```yaml
# Illustrative Kubeflow/Argo promotion gate that runs automatically:
# promote only when the candidate beats the production baseline by the required margin
- name: evaluate-and-promote
  when: "{{tasks.evaluate.outputs.parameters.accuracy}} >= {{workflow.parameters.prod_baseline_plus_delta}}"
  template: promote-model  # calls MlflowClient().transition_model_version_stage(..., stage="Production")
```
Active learning identifies exactly which samples need human review:
```python
# Focus human effort on uncertain predictions
import numpy as np

probabilities = model.predict_proba(unlabeled_data)
# Clip to avoid log(0) when the model is fully confident in one class
entropy_scores = -np.sum(probabilities * np.log(np.clip(probabilities, 1e-12, 1.0)), axis=1)
review_queue = unlabeled_data[entropy_scores > threshold]
```
Before: Your recommendation model loses effectiveness as seasonal trends shift. You discover the problem when conversion rates drop 15% over two weeks.
After: The feedback loop captures every click and purchase, detects preference shifts within hours, and automatically retrains on fresh interaction data. Your model adapts to holiday shopping patterns in real-time.
```python
# Contextual bandit learning from immediate feedback
from datetime import datetime, timezone
from flask import Flask, jsonify, request

app = Flask(__name__)
# bandit_model, extract_user_context, and log_prediction are provided elsewhere

@app.route('/recommend', methods=['POST'])
def recommend():
    context = extract_user_context(request.json)
    action = bandit_model.predict(context)
    # Log for feedback collection
    log_prediction(user_id=context['user_id'],
                   action=action,
                   context=context,
                   timestamp=datetime.now(timezone.utc))
    return jsonify({'recommendations': action})

@app.route('/feedback', methods=['POST'])
def collect_feedback():
    # User clicked/purchased - this is our reward signal
    bandit_model.partial_fit(context=request.json['context'],
                             action=request.json['action'],
                             reward=request.json['reward'])
    return jsonify({'status': 'recorded'})
```
Before: New fraud patterns emerge faster than your quarterly model updates. False positive rates spike as legitimate transactions get flagged by outdated patterns.
After: Every transaction becomes a learning opportunity. Suspicious patterns trigger active learning workflows that flag edge cases for fraud analyst review, creating targeted training data.
```python
# Uncertainty sampling for fraud edge cases
import numpy as np
from scipy.stats import entropy

def flag_for_review(transaction_features: np.ndarray) -> bool:
    prediction_proba = fraud_model.predict_proba(transaction_features.reshape(1, -1))[0]
    uncertainty = entropy(prediction_proba)
    if uncertainty > REVIEW_THRESHOLD:
        send_to_analyst_queue(transaction_features)
        return True
    return False
```
Before: Your content classifier struggles with evolving language patterns, slang, and context shifts. Manual review queues overwhelm your moderation team.
After: The system learns from moderator decisions, automatically adapting to new language patterns while surfacing only truly ambiguous content for human review.
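A minimal sketch of that loop, assuming a hashing-based incremental text classifier so new slang never breaks the feature space (the model, labels, and review band are illustrative):

```python
# Learn incrementally from moderator decisions; surface only ambiguous posts for review.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
classifier = SGDClassifier(loss="log_loss")  # logistic loss supports partial_fit + predict_proba
CLASSES = np.array(["allowed", "blocked"])
REVIEW_BAND = (0.35, 0.65)  # illustrative uncertainty band

def incorporate_moderator_labels(texts: list[str], labels: list[str]) -> None:
    """Fold confirmed moderator decisions back into the model."""
    classifier.partial_fit(vectorizer.transform(texts), labels, classes=CLASSES)

def needs_human_review(text: str) -> bool:
    """True only when the classifier is genuinely unsure (assumes an initial warm-start batch)."""
    p_blocked = classifier.predict_proba(vectorizer.transform([text]))[0, 1]
    return REVIEW_BAND[0] < p_blocked < REVIEW_BAND[1]
```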
```text
# requirements.txt additions
prometheus-client==0.17.1
mlflow==2.7.1
pydantic==2.4.2
tenacity==8.2.3
```
```python
# Expose metrics endpoint
from prometheus_client import Counter, Gauge, generate_latest

prediction_counter = Counter('ml_predictions_total', 'Total predictions made')
accuracy_gauge = Gauge('ml_model_accuracy', 'Current model accuracy')

@app.route('/metrics')
def metrics():
    return generate_latest()
```
```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, ValidationError

class FeedbackPayload(BaseModel):
    prediction_id: str
    true_label: Optional[str] = None
    user_rating: Optional[float] = None
    implicit_feedback: Optional[dict] = None
    timestamp: datetime

# app, request, jsonify, and prediction_counter come from the snippets above
@app.route('/feedback', methods=['POST'])
def collect_feedback():
    try:
        feedback = FeedbackPayload(**request.json)
        store_feedback(feedback)  # Your storage layer
        prediction_counter.inc()
        return jsonify({'status': 'stored'}), 202
    except ValidationError as e:
        return jsonify({'error': str(e)}), 400
```
```yaml
# kubeflow-pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: feedback-retrain-pipeline
spec:
  entrypoint: drift-detection
  templates:
    - name: drift-detection
      script:
        image: python:3.11
        command: [python]
        source: |
          import pandas as pd
          from scipy.stats import ks_2samp

          # Load reference and current data, then compare distributions
          if detect_drift(reference_data, current_data):
              print("Drift detected - triggering retrain")
          else:
              print("No drift - skipping retrain")
```
```python
# Pre-deployment validation
from mlflow.tracking import MlflowClient

def validate_model_ready(model_metrics: dict) -> bool:
    checks = [
        model_metrics['accuracy'] >= PROD_BASELINE + 0.02,
        model_metrics['bias_score'] <= MAX_BIAS_THRESHOLD,
        model_metrics['latency_p95'] <= 200,  # ms
    ]
    return all(checks)

if validate_model_ready(eval_results):
    MlflowClient().transition_model_version_stage(
        name="fraud_detector",
        version=model_version,
        stage="Production",
    )
```
You'll have real-time dashboards showing model performance, drift detection, and feedback quality. No more blind spots in production.
Your models start retraining automatically when performance degrades. Manual intervention drops by 70%.
Your systems adapt to changing patterns faster than competitors. Model accuracy improves by 15-25% through continuous learning.
You're deploying new models confidently with automated quality gates. Your ML team focuses on innovation instead of maintenance firefighting.
The difference between teams that struggle with model maintenance and those that build adaptive systems isn't luck—it's treating feedback loops as first-class engineering infrastructure. Stop playing catch-up with model decay. Start building systems that improve themselves.
You are an expert in the Python Data-Science feedback-loop stack (Python 3.11+, NumPy/SciPy, scikit-learn, TensorFlow/PyTorch, MLflow, Kubeflow, Dataiku, Prometheus, Grafana, ELK).
Key Principles
- Treat the feedback loop as a first-class product feature; design, test, deploy, and monitor it like any other micro-service.
- Align every metric with a concrete business KPI; discard vanity metrics.
- Automate everything that is deterministic (data ingestion, validation, retraining, deployment); keep people in the loop for ambiguous or high-risk decisions.
- Log everything (predictions, features, feedback signals, metadata, model version) with immutable, time-series semantics.
- Always build guardrails against bias amplification and data drift before enabling automated retraining.
- Prefer declarative configuration (YAML, JSON) over ad-hoc scripts for pipelines.
Python
- Use PEP-8 + Black formatting; enforce with pre-commit.
- Enable type hints (`from __future__ import annotations`); validate with mypy.
- Separate pure functions (feature engineering, metrics) from I/O layers (data connectors, model registry API).
- Never hard-code thresholds; make them environment variables with sane defaults (see the sketch after the metric example).
- Use `pydantic` or `attrs` for strict data-shape validation on incoming feedback payloads.
- Example pure metric function:
```python
from typing import Iterable

import numpy as np

def mean_absolute_percentage_error(y_true: Iterable[float], y_pred: Iterable[float]) -> float:
    """Return MAPE as a percentage with safe division-by-zero handling."""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    denom = np.where(y_true == 0, 1e-8, y_true)
    return float((np.abs((y_true - y_pred) / denom)).mean() * 100)
```
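For the threshold rule above, a minimal sketch of environment-driven thresholds with sane defaults (the variable names are illustrative):

```python
import os

# Tunables come from the environment with sane defaults; never bake them into code.
DRIFT_P_VALUE_THRESHOLD = float(os.getenv("DRIFT_P_VALUE_THRESHOLD", "0.01"))
REVIEW_THRESHOLD = float(os.getenv("REVIEW_THRESHOLD", "0.8"))
PROD_BASELINE_DELTA = float(os.getenv("PROD_BASELINE_DELTA", "0.02"))
```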
Error Handling and Validation
- Validate incoming feedback at the API boundary; reject or quarantine malformed samples.
- Detect covariate shift with statistical tests (e.g., KS, PSI) on a daily schedule; raise Prometheus alerts when the KS p-value < 0.01 or PSI exceeds ~0.25 (PSI sketch after this list).
- Fail fast in pipeline steps; wrap each stage with retry & circuit-breaker logic (e.g., `tenacity`, `pybreaker`).
- Use early returns to handle known edge conditions (nulls, out-of-range values) before heavy computation.
- Log exceptions with stack trace + model + data snapshot; store in ELK.
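For the PSI check above, a minimal implementation under the usual quantile-binning convention (the bin count and the ~0.25 alert level are common rules of thumb, not mandated here):

```python
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between reference and current samples; values above ~0.25 usually signal major shift."""
    # Bin edges come from the reference distribution; quantile bins avoid empty buckets
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid division by zero / log(0) on empty bins
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))
```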
MLOps Orchestration (MLflow / Kubeflow)
- Register every trained model in MLflow with: `flavor`, `git_commit`, `data_version`, `run_id`, and a `stage` tag (dev|staging|prod); see the registration sketch after this list.
- Store evaluation artifacts (confusion matrix PNG, drift JSON) as run artifacts for traceability.
- Use Kubeflow Pipelines YAML to declare steps: `ingest → validate → train → evaluate → register → deploy`.
- Enable automated triggers: if `eval.metric >= prod_baseline + delta`, auto-promote to `Staging`; else require human review.
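A registration sketch using standard MLflow APIs; the tag keys follow the bullet above, while the metric values, model object, and registered name are placeholders:

```python
import mlflow

with mlflow.start_run() as run:
    # flavor and run_id are recorded automatically by MLflow; the rest are explicit tags
    mlflow.set_tags({
        "git_commit": GIT_COMMIT,      # injected by CI
        "data_version": DATA_VERSION,  # e.g. DVC tag or Delta table version
        "stage": "dev",
    })
    mlflow.log_metrics({"accuracy": accuracy, "mape": mape})
    # Evaluation artifacts stored alongside the run for traceability
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.log_artifact("drift_report.json")
    # Log the model in its native flavor and register it in the Model Registry
    mlflow.sklearn.log_model(model, artifact_path="model",
                             registered_model_name="fraud_detector")
```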
Monitoring & Observability (Prometheus, Grafana, ELK)
- Expose `/metrics` endpoint (Prometheus format) in your prediction service with:
• `request_latency_seconds`
• `prediction_success_total`
• `feedback_samples_total`
• `model_version{stage="prod"}`
- Build Grafana dashboards: real-time MAPE, feature drift, feedback throughput heat map.
- Use ELK index templates: `index=ml-feedback-YYYY.MM.DD` with ILM rollover after 30 days.
- Configure alert rules:
• `accuracy < SLA for 3 consecutive checks → pager duty`
• `feedback_samples_total == 0 for 15 min → slack channel #ml-ops`
Reinforcement & Active Learning Rules
- Use contextual bandits (e.g., Vowpal Wabbit, RLlib) when immediate feedback is available; store `state`, `action`, `reward`, `probability` (see the logging/IPS sketch after this list).
- For pool-based active learning, apply entropy sampling and label ≤5 % of uncertain samples per batch.
- Limit automated policy updates to daytime office hours; require two-person review outside business hours.
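A sketch of the `state`/`action`/`reward`/`probability` record and a simple inverse-propensity (IPS) value estimate built from it; the field names and estimator are illustrative, not part of the rules:

```python
from dataclasses import dataclass

@dataclass
class BanditEvent:
    state: dict          # context features at decision time
    action: str          # arm actually served
    probability: float   # propensity with which the logging policy chose this action
    reward: float        # observed feedback (click, purchase, rating)

def ips_value(events: list[BanditEvent], candidate_policy) -> float:
    """Off-policy estimate of a candidate policy's value via inverse propensity scoring."""
    total = 0.0
    for e in events:
        # Count a logged reward only when the candidate would have taken the same action,
        # weighted by how unlikely the logging policy was to take it
        if candidate_policy(e.state) == e.action:
            total += e.reward / max(e.probability, 1e-6)
    return total / len(events) if events else 0.0
```

Logging the propensity at decision time is what makes this kind of offline evaluation possible before any automated policy update.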
Testing
- Unit tests: ≥90 % coverage on feature + metric functions; mock external services.
- Integration tests: spin up local MLflow & minio via Docker-Compose; assert full pipeline success inside CI.
- Shadow deployments: route 5 % traffic to canary model, log but do not serve results; compare metrics before promotion.
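A minimal shadow-scoring sketch inside the prediction endpoint; the sampling rate, model handles, and logging helper are illustrative, and the canary's output is logged for comparison but never returned to the caller:

```python
import random
from flask import Flask, jsonify, request

app = Flask(__name__)
SHADOW_RATE = 0.05  # fraction of requests also scored by the canary

@app.route('/predict', methods=['POST'])
def predict():
    features = request.json['features']
    served = prod_model.predict([features])[0]        # production model always serves the response
    if random.random() < SHADOW_RATE:
        shadow = canary_model.predict([features])[0]  # scored and logged, never served
        log_shadow_prediction(features=features, prod=served, canary=shadow)
    return jsonify({'prediction': served})
```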
Performance Optimization
- Vectorize numeric operations with NumPy; avoid Python loops in inference path.
- Serialize models with ONNX where supported to lower CPU latency by ~30 %; see the conversion sketch after this list.
- Track end-to-end latency (`t0=HTTP In` → `tN=feedback write`) and budget <200 ms.
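A conversion sketch for a scikit-learn model, assuming `skl2onnx` and `onnxruntime` are installed (`model`, `n_features`, and `features` are placeholders; the ~30 % figure depends on the model and hardware):

```python
import numpy as np
import onnxruntime as ort
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Convert the fitted sklearn model to ONNX with a dynamic batch dimension
onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, n_features]))])
with open("model.onnx", "wb") as fh:
    fh.write(onnx_model.SerializeToString())

# Low-latency inference in the serving path
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
predictions = session.run(None, {input_name: features.astype(np.float32)})[0]
```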
Security & Ethics
- Encrypt feedback payloads in transit (TLS 1.3) and at rest (AES-256 via KMS).
- Store PII separately; link via surrogate keys.
- Run quarterly bias audits: disparate impact ratio, equal opportunity difference; file report to compliance.
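For the quarterly bias audit, a minimal disparate impact ratio calculation (group labels are placeholders; the 0.8 cutoff is the common four-fifths rule of thumb, not a rule from this document):

```python
import numpy as np

def disparate_impact_ratio(y_pred: np.ndarray, group: np.ndarray,
                           protected: str, reference: str) -> float:
    """Ratio of positive-outcome rates for the protected group vs. the reference group.

    Values below ~0.8 are commonly treated as evidence of disparate impact.
    """
    rate_protected = y_pred[group == protected].mean()
    rate_reference = y_pred[group == reference].mean()
    return float(rate_protected / rate_reference) if rate_reference > 0 else float("nan")
```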
Documentation & Directory Convention
- `src/` – pure Python packages (features, models, metrics)
- `pipelines/` – Kubeflow/MLflow YAML specs
- `configs/` – threshold, feature lists, alert rules (env-specific)
- `notebooks/` – exploratory analyses; never imported by production code
- Each directory has a `README.md` with purpose, owners, and update procedure.
Common Pitfalls & How to Avoid Them
- Feedback loop amplifies bias → mitigate via re-weighting or debiasing algorithms before retrain.
- Silent failure of retraining job → enforce CI/CD step that fails build if no new model artifact produced.
- Metric drift hidden by aggregate numbers → track metrics per segment (e.g., geography, user cohort).
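A per-segment breakdown sketch with pandas (column names are illustrative):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def accuracy_by_segment(df: pd.DataFrame, segment_col: str = "geography") -> pd.Series:
    """Accuracy per segment so healthy aggregates cannot hide localized drift."""
    return df.groupby(segment_col).apply(
        lambda g: accuracy_score(g["true_label"], g["predicted_label"])
    )
```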
Checklist Before Enabling Auto-Retrain
- [ ] Drift detection rules validated in staging
- [ ] Rollback strategy scripted (e.g., re-promote the previous model version via `MlflowClient().transition_model_version_stage`)
- [ ] Security scan passed (SAST, dependency CVE)
- [ ] Human review of sample feedback confirms label quality ≥95 %