Opinionated coding & ops rules for designing, building, and maintaining automated feedback-loop pipelines for AI systems (training → deployment → monitoring → retraining).
Your AI models are making thousands of predictions daily, but how many of those interactions make your next model better? If you're manually collecting feedback and running quarterly retraining cycles, you're leaving performance gains on the table and watching competitors ship faster, smarter systems.
Most production AI systems operate like closed loops: they serve predictions, collect logs, and hope someone will eventually analyze the data. Meanwhile, your models drift, edge cases accumulate, and user complaints pile up in support tickets that never reach your training pipeline.
The reality check: Your users are generating the exact training data you need to build better models, but most of that gold evaporates because you're not capturing it systematically.
These Cursor Rules transform your AI system from a prediction service into a continuously learning organism. Every user interaction, error case, and performance metric flows directly back into your training pipeline, creating a closed loop where each deployment makes the next one smarter.
Here's the concrete difference:
Before: User reports poor response quality → Support ticket → Manual investigation → Maybe retrain in 3 months → Deploy update → Hope it's better
After: Poor response detected → Automatic data capture → Human validation triggered → Model retrained within 24 hours → A/B tested deployment → Measurable improvement
The rules establish feedback as a first-class architectural concern, not an afterthought.
Instead of quarterly model updates, you'll ship improvements daily. The rules enforce automated pipelines that detect drift, trigger retraining, and deploy updates through feature-flagged canary releases.
Every prediction automatically logs model version, data lineage, and performance metrics. When something breaks, you have complete traceability from user complaint to training data to model weights.
No more shipping models because "it looks good in Jupyter." The rules require pre-defined success metrics and automated rollbacks when performance degrades.
Instead of burning ML engineer time on manual labeling, the system automatically routes edge cases to domain experts and integrates their feedback into training batches.
Without these rules: You notice accuracy degrading in dashboards weeks later, manually export data, retrain locally, test manually, and deploy after extensive review cycles.
With these rules:
```python
# Drift detection automatically triggers
from dataclasses import dataclass

@dataclass
class DriftDetectedError(Exception):
    metric_name: str
    current_value: float
    threshold: float

# Your feedback pipeline handles the rest
async def on_drift_detected(error: DriftDetectedError):
    await trigger_retraining_job(
        reason=f"Drift detected: {error.metric_name}",
        priority="high",
    )
    await notify_slack_channel(
        "#ml-alerts",
        f"Auto-retraining triggered: {error.metric_name} = {error.current_value}",
    )
```
Without these rules: Users report irrelevant retrieved documents, you manually inspect queries, update retrieval logic, and redeploy everything together.
With these rules:
```python
# Retriever and generator versioned independently
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class RetrievalFeedback:
    query: str
    retrieved_docs: List[str]
    user_rating: float
    timestamp: datetime

# Automatic feedback collection in FastAPI endpoint
@app.post("/feedback")
async def collect_feedback(feedback: RetrievalFeedback):
    await feedback_buffer.add(feedback)
    if feedback.user_rating < 2.0:
        await trigger_retriever_retraining(feedback.query)
```
Without these rules: Model throws exception, request fails, you investigate logs manually, and patch the specific case.
With these rules:
```python
# Hierarchical exception handling with automatic retry and learning
@retry(stop=3, jitter=True)
async def predict_with_feedback(request: PredictionRequest):
    try:
        result = await model.predict(request)
        await log_successful_prediction(request, result)
        return result
    except DataQualityError as e:
        await capture_data_quality_issue(request, e)
        await schedule_data_validation_improvement()
        raise
    except ModelError as e:
        await capture_model_failure(request, e)
        await trigger_model_debugging_session()
        raise
```
```bash
# Install the core stack
pip install torch torchmetrics fastapi mlflow hydra-core structlog

# Configure your project structure
mkdir -p src/{models,pipelines,monitoring}
touch src/feedback_config.yaml
```
```python
import time
import structlog

# Every service logs feedback-ready events
logger = structlog.get_logger()

async def serve_prediction(request):
    start = time.perf_counter()
    result = await model.predict(request)
    logger.info(
        "prediction_served",
        model_version="v1.2.3",
        trace_id=request.trace_id,
        latency_ms=(time.perf_counter() - start) * 1000,
        user_satisfaction=None,  # Will be populated by feedback
    )
    return result
```
```python
from typing import Any, Dict, Optional
from pydantic import BaseModel

# User feedback contract validated at the API boundary
class UserFeedback(BaseModel):
    prediction_id: str
    rating: float  # 1-5 scale
    correction: Optional[str] = None
    context: Dict[str, Any]
```
```python
from dataclasses import dataclass

@dataclass
class TrainerConfig:
    min_f1_score: float = 0.85
    feedback_window_hours: int = 24
    retrain_threshold: int = 100  # New feedback samples

def train_from_feedback(cfg: TrainerConfig):
    feedback_data = collect_recent_feedback(cfg.feedback_window_hours)
    if len(feedback_data) >= cfg.retrain_threshold:
        new_model = retrain_model(feedback_data)
        if evaluate_model(new_model).f1 >= cfg.min_f1_score:
            register_model_version(new_model)
```
```python
# LaunchDarkly integration for safe rollouts
import ldclient

ld = ldclient.get()  # assumes ldclient.set_config(...) ran at startup

@app.post("/predict")
async def predict_endpoint(request: PredictionRequest):
    model_version = ld.variation(
        "model_version",
        request.user_context,  # LaunchDarkly evaluation context for this user
        "stable",
    )
    model = load_model(model_version)
    return await model.predict(request.data)
```
While your competitors are still running ML like it's 2019—manual training, quarterly releases, reactive debugging—you'll be shipping AI systems that get smarter every day without human intervention.
These rules don't just improve your models; they fundamentally change how fast your team can innovate. When every user interaction improves your next deployment, you're not just building software—you're building learning systems that compound their capabilities over time.
The gap between teams using continuous feedback loops and those stuck in batch-mode ML grows exponentially. Start implementing these rules today, and in three months, you'll wonder how you ever shipped AI systems any other way.
## Technology Stack Declaration
You are an expert in the following technologies:
- Python 3.10+
- PyTorch / TorchMetrics / TorchData
- Retrieval-Augmented Generation (RAG) with FAISS or Vespa
- RNN/LSTM architectures
- FastAPI for serving
- MLflow + Hydra for experiment & config management
- Docker, Kubernetes, GitHub Actions (CI/CD)
- LaunchDarkly (feature flags)
- Slack & JIRA APIs for collaboration and automated notifications
- Prometheus + Grafana for monitoring
## Key Principles
- Feedback is a first-class citizen: every component must emit structured events that flow back into training datasets.
- Automate everything that can be automated; put humans where they add the most value (labeling, bias review, approvals).
- Design for rapid iteration: the time between feedback arrival and a new model in production must be <24 h.
- Prefer immutable artifacts (Docker images, model registries) and declarative configuration (Hydra YAML) to guarantee reproducibility.
- Measure before and after: every change requires a pre-defined success metric (e.g., latency p95, ROUGE-L, defect rate).
- Roll forward, not roll back: use feature flags and phased rollouts (canary, A/B) to mitigate risk while keeping momentum.
- Maintain full transparency: log data lineage, model version, hyper-params, and code commit SHA for every prediction.
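A minimal sketch of that transparency rule, assuming `structlog` is already configured for JSON output; the `GIT_SHA` environment variable, the `dataset_snapshot` field, and the helper name are illustrative, not prescribed by these rules:

```python
import os
import structlog

logger = structlog.get_logger(service="inference-api")

def log_prediction_lineage(trace_id: str, model_version: str, hyper_params: dict, dataset_snapshot: str) -> None:
    # One structured event per prediction: lineage, model version, hyper-params, commit SHA
    logger.info(
        "prediction_lineage",
        trace_id=trace_id,
        model_version=model_version,
        hyper_params=hyper_params,
        dataset_snapshot=dataset_snapshot,  # e.g. the training-data snapshot tag
        code_commit=os.environ.get("GIT_SHA", "unknown"),  # injected at image build time
    )
```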
## Python
- Mandatory type hints (`mypy --strict`) and `ruff` for linting.
- Use `dataclasses` or `pydantic.BaseModel` for all data contracts; disallow naked `dict`/`list` parameters.
- Follow `snake_case` for names; reserve `PascalCase` for classes, `SCREAMING_SNAKE` for constants.
- Keep modules <400 LOC. Idiom: `__init__.py` re-exports public API only.
- Never catch bare `Exception`; create hierarchical custom exceptions (`DataQualityError`, `DriftDetectedError`, …) as in the sketch after this list.
- Use context managers (`with`) for resource handling (GPU, file, DB session).
- Log with `structlog` in JSON mode; minimum fields: `timestamp`, `level`, `service`, `model_version`, `trace_id`.
- Unit-test with `pytest` and 100 % branch coverage on core feedback logic; snapshot expected metrics where feasible.
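A minimal sketch of the exception hierarchy referenced above; the base class `FeedbackPipelineError` and the constructor fields are assumptions (the intro shows an equivalent `@dataclass` form for `DriftDetectedError`, and either style works):

```python
class FeedbackPipelineError(Exception):
    """Base class for all pipeline errors; never raise or catch bare Exception."""

class DataQualityError(FeedbackPipelineError):
    def __init__(self, check_name: str, details: str) -> None:
        super().__init__(f"{check_name}: {details}")
        self.check_name = check_name
        self.details = details

class DriftDetectedError(FeedbackPipelineError):
    def __init__(self, metric_name: str, current_value: float, threshold: float) -> None:
        super().__init__(f"{metric_name}={current_value:.4f} breached threshold {threshold:.4f}")
        self.metric_name = metric_name
        self.current_value = current_value
        self.threshold = threshold

class ModelError(FeedbackPipelineError):
    """Raised when inference itself fails (bad weights, CUDA OOM, ...)."""
```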
## Error Handling and Validation
- Validate every external input (user text, sensor stream, Slack command) via Pydantic schemas before processing.
- Detect data drift with embedding-similarity checks (e.g., cosine similarity computed with `torch`/`torchmetrics`) or `evidently` drift reports, and raise `DriftDetectedError` to trigger retraining; one such check is sketched after this list.
- Place guards at function tops; use early returns to keep the happy path linear.
- Wrap training jobs in a `@retry(stop=3, jitter=True)` decorator to auto-recover from transient infra failures; one way to build this decorator is sketched after this list.
- Send critical errors to Slack `#ml-alerts` with full context, redacting PII via built-in scrubbers.
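One possible drift check for the rule above: compare production embeddings against a reference centroid and raise `DriftDetectedError` when mean cosine similarity drops below a threshold. The function name and the 0.85 default are illustrative:

```python
import torch
import torch.nn.functional as F

def check_embedding_drift(
    reference: torch.Tensor,   # [n_ref, dim] embeddings from the training distribution
    production: torch.Tensor,  # [n_prod, dim] embeddings from recent traffic
    threshold: float = 0.85,
) -> None:
    # Cosine similarity between each production embedding and the reference centroid
    centroid = F.normalize(reference.mean(dim=0, keepdim=True), dim=-1)
    sims = F.cosine_similarity(F.normalize(production, dim=-1), centroid)
    mean_sim = sims.mean().item()
    if mean_sim < threshold:
        # DriftDetectedError is the custom exception defined in these rules
        raise DriftDetectedError(
            metric_name="embedding_cosine_similarity",
            current_value=mean_sim,
            threshold=threshold,
        )
```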
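The `@retry(stop=3, jitter=True)` decorator used throughout is a project-level convention rather than a specific library's signature; a minimal sketch of backing it with `tenacity` (an assumed dependency):

```python
from tenacity import retry as tenacity_retry
from tenacity import stop_after_attempt, wait_none, wait_random_exponential

def retry(stop: int = 3, jitter: bool = True):
    """Thin wrapper so call sites keep the @retry(stop=3, jitter=True) shape used in these rules."""
    return tenacity_retry(
        stop=stop_after_attempt(stop),
        wait=wait_random_exponential(multiplier=1, max=30) if jitter else wait_none(),
        reraise=True,  # surface the original exception after the last attempt
    )
```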
## Framework-Specific Rules (PyTorch / RAG)
- Build models with functional API; avoid subclassing `nn.Module` when composable blocks suffice.
- All training scripts must expose a `TrainerConfig` Hydra schema and a `train(cfg: TrainerConfig)` entry point (see the training sketch after this list).
- Save checkpoints every `n` steps, but register only evaluation-passing checkpoints to MLflow (`f1 >= cfg.min_f1`).
- For RAG: separate retriever & generator artifacts; version them independently to allow partial updates.
- Integrate online learning by feeding labeled production interactions into a `ReplayBuffer` that re-trains daily.
- Implement gradient clipping (`0.5`) and mixed precision (`torch.cuda.amp`) for stability and speed.
- Serve models through FastAPI routes `/predict` and `/metrics`; expose Prometheus counters (`success_total`, `error_total`).
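A condensed sketch of a training entry point that follows the rules above: a Hydra `TrainerConfig` schema, gradient clipping at 0.5, mixed precision, and MLflow registration only when evaluation passes. `build_model`, `build_loader`, and `evaluate_f1` are placeholders:

```python
from dataclasses import dataclass

import hydra
import mlflow
import torch
from hydra.core.config_store import ConfigStore

@dataclass
class TrainerConfig:
    lr: float = 3e-4
    max_steps: int = 10_000
    min_f1: float = 0.85

cs = ConfigStore.instance()
cs.store(name="trainer_config", node=TrainerConfig)

@hydra.main(version_base=None, config_path=None, config_name="trainer_config")
def train(cfg: TrainerConfig) -> None:
    model = build_model()    # placeholder
    loader = build_loader()  # placeholder
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.lr)
    scaler = torch.cuda.amp.GradScaler()

    for step, (inputs, targets) in enumerate(loader):
        with torch.cuda.amp.autocast():  # mixed precision forward pass
            loss = model(inputs, targets)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)  # gradient clipping at 0.5
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        if step >= cfg.max_steps:
            break

    # Register only checkpoints that pass evaluation (f1 >= cfg.min_f1)
    f1 = evaluate_f1(model)  # placeholder
    if f1 >= cfg.min_f1:
        with mlflow.start_run():
            mlflow.log_metric("f1", f1)
            mlflow.pytorch.log_model(model, artifact_path="model", registered_model_name="feedback-model")

if __name__ == "__main__":
    train()
```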
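And a minimal serving sketch for the `/predict` and `/metrics` routes with the required counters, assuming `prometheus_client` as the metrics library; `run_inference` is a placeholder and `ModelError` stands in for the custom exception hierarchy:

```python
from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, generate_latest

app = FastAPI()
success_total = Counter("success_total", "Successful predictions")
error_total = Counter("error_total", "Failed predictions")

class ModelError(Exception):
    """Stand-in for the custom exception hierarchy defined elsewhere."""

@app.post("/predict")
async def predict(payload: dict):
    try:
        result = await run_inference(payload)  # placeholder for the real model call
    except ModelError:
        error_total.inc()
        raise
    success_total.inc()
    return result

@app.get("/metrics")
async def metrics() -> Response:
    # Expose Prometheus counters in the text exposition format
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```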
## CI/CD & Feature Flags
- GitHub Actions pipeline stages: lint → test → build docker → deploy to staging (canary 5 %) → smoke tests → promote → notify.
- Use LaunchDarkly flags `model_version`, `retriever_version`, and `new_feature_x`; default `off` in prod until ≥95 % confidence metrics reached.
- Canary policy: 5 % → 25 % → 50 % → 100 %, with automatic rollback if any SLA metric breaches 3× its stable baseline.
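One way to express that rollback gate inside the promotion job; the metric names and example numbers are illustrative, and the 3× factor comes from the canary policy above:

```python
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    error_rate: float
    latency_p95_ms: float

def should_rollback(canary: CanaryMetrics, baseline: CanaryMetrics, factor: float = 3.0) -> bool:
    # Roll back the canary if any SLA metric exceeds `factor` times its stable baseline
    return (
        canary.error_rate > factor * baseline.error_rate
        or canary.latency_p95_ms > factor * baseline.latency_p95_ms
    )

# Example gate inside the promotion job
if should_rollback(CanaryMetrics(0.04, 900.0), CanaryMetrics(0.01, 250.0)):
    print("SLA breach > 3x baseline: rolling back canary")
```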
## Testing
- Write data unit tests: schema validation, null-ratio thresholds, and outlier detection (flag samples with |Z-score| ≥ 3).
- Perform A/B tests using a fixed-horizon design; minimum sample size calculated via `statsmodels` power analysis.
- Run end-to-end smoke test in CI that posts a prompt to `/predict`, verifies JSON schema, and asserts latency <200 ms.
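A sketch of that smoke test using FastAPI's `TestClient` (in CI the same test would target the staging URL instead); the import path and response fields are assumptions:

```python
import time

from fastapi.testclient import TestClient

from src.serving.app import app  # hypothetical module path

def test_predict_smoke():
    client = TestClient(app)
    start = time.perf_counter()
    response = client.post("/predict", json={"prompt": "smoke test"})
    latency_ms = (time.perf_counter() - start) * 1000

    assert response.status_code == 200
    body = response.json()
    assert "prediction" in body and "model_version" in body  # assumed response schema
    assert latency_ms < 200, f"latency {latency_ms:.0f} ms exceeds 200 ms budget"
```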
## Performance Optimization
- Cache retrieval embeddings in GPU memory when <4 GB; otherwise fall back to a `faiss.IndexIVFPQ` RAM cache (see the index sketch after this list).
- Batch requests in inference (`max_batch_tokens=2k`) with dynamic padding.
- Prefer vectorized ops (`torch.einsum`) over loops; profile with `torch.autograd.profiler` each sprint.
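A sketch of the `faiss.IndexIVFPQ` RAM fallback mentioned above; the dimension, `nlist`, and PQ parameters are illustrative and would be tuned per corpus:

```python
import faiss
import numpy as np

d = 768                        # embedding dimension
nlist, m, nbits = 1024, 64, 8  # IVF cells, PQ sub-quantizers, bits per code

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

corpus_vectors = np.random.rand(50_000, d).astype("float32")  # stand-in for corpus embeddings
index.train(corpus_vectors)
index.add(corpus_vectors)

index.nprobe = 16              # recall/latency trade-off at query time
scores, ids = index.search(corpus_vectors[:4], k=10)
```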
## Security & Compliance
- Store personally identifiable information (PII) only in encrypted columns (AES-256) with key rotation every 90 d.
- Strip PII from logs via middleware before persistence (a redaction processor is sketched after this list).
- Conduct quarterly bias audits; maintain `bias_report.md` under version control.
- Enforce OAuth 2.0 for all internal service communication; tokens scoped to least privilege.
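A sketch of a log-redaction processor, assuming `structlog` is the logging layer; the regex covers only e-mail addresses and would be extended for other PII classes:

```python
import re

import structlog

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub_pii(logger, method_name, event_dict):
    # Redact e-mail addresses in every string field before the event is persisted
    for key, value in event_dict.items():
        if isinstance(value, str):
            event_dict[key] = EMAIL_RE.sub("<redacted-email>", value)
    return event_dict

structlog.configure(
    processors=[
        scrub_pii,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)
```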
## Documentation & Collaboration
- Every repo root must contain `FEEDBACK_LOOP.md` covering: data flow diagram, metrics dictionary, on-call rota.
- Weekly feedback triage meeting participants: PM, ML Eng, Data Scientist, UX Researcher, Support.
- Automate meeting notes export to Confluence via Slack slash-command `/export-notes`.