Comprehensive, actionable rules for building, running, and maintaining robust evaluation-metric pipelines for AI systems with Python and modern MLOps tooling.
You're building AI systems that matter. Your models are making real decisions, processing real data, affecting real users. But here's the uncomfortable truth: most AI evaluation approaches are broken from day one.
Traditional evaluation stops at accuracy and F1 scores. You train, validate, deploy, and hope everything works in production. Then reality hits: data drifts silently, accuracy degrades, and fairness gaps only surface after users are affected.
This reactive approach costs engineering teams weeks of firefighting, damages user trust, and creates technical debt that compounds over time.
These Cursor Rules establish a comprehensive evaluation system that treats metrics as first-class citizens in your AI development pipeline. Instead of bolting on evaluation as an afterthought, you get:
- Holistic Assessment from Day One
- Continuous Production Monitoring
- Enterprise-Grade Security and Compliance
```python
# Typical evaluation approach
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy}")
# Deploy and hope for the best
```
```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class EvaluationConfig:
    quantitative_metrics: List[str]
    fairness_groups: List[str]
    drift_threshold: float
    slo_targets: Dict[str, float]

def comprehensive_evaluate(model, test_data, config: EvaluationConfig):
    # compute_standard_metrics, assess_demographic_parity, etc. are project-level helpers
    results = {
        'quantitative': compute_standard_metrics(model, test_data),
        'fairness': assess_demographic_parity(model, test_data, config.fairness_groups),
        'robustness': test_adversarial_examples(model, test_data),
        'drift': detect_distribution_shift(test_data, reference_data)
    }
    # Automated alerts if SLOs violated
    check_slo_compliance(results, config.slo_targets)
    return results
```
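`check_slo_compliance` is referenced above but not spelled out. A minimal sketch, assuming `slo_targets` maps quantitative metric names to minimum acceptable values and that real alerting (PagerDuty, Slack, etc.) is wired in elsewhere:

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)

def check_slo_compliance(results: Dict[str, Any], slo_targets: Dict[str, float]) -> None:
    """Flag every quantitative metric that falls below its SLO target."""
    quantitative = results.get("quantitative", {})
    for metric_name, target in slo_targets.items():
        value = quantitative.get(metric_name)
        if value is not None and value < target:
            # Replace with your alerting integration of choice.
            logger.warning("SLO violated: %s=%.4f below target %.4f", metric_name, value, target)
```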
Challenge: Tracking model performance across dozens of experiments while ensuring fairness constraints are met.
Solution: Declarative evaluation pipelines that automatically compute comprehensive metrics for every training run:
```yaml
# pipelines/train_eval.yaml
evaluation:
  quantitative:
    - accuracy
    - precision
    - recall
    - f1_weighted
  fairness:
    groups: [gender, age_group, geography]
    thresholds:
      demographic_parity: 0.1
      equal_opportunity: 0.1
monitoring:
  drift_detection: true
  slo_tracking: true
```
Impact: Reduce evaluation setup time from hours to minutes, with automatic MLflow logging and fairness validation.
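One way to wire the YAML above into the `EvaluationConfig` dataclass and MLflow logging shown earlier; a sketch, where `load_eval_config`, `log_results_to_mlflow`, the defaulted `drift_threshold`, and the tag names are assumptions rather than part of the config schema above:

```python
# Sketch: parse the declarative config and log results to MLflow.
# Assumes the EvaluationConfig dataclass defined earlier.
import yaml
import mlflow

def load_eval_config(path: str) -> EvaluationConfig:
    with open(path) as fh:
        raw = yaml.safe_load(fh)
    return EvaluationConfig(
        quantitative_metrics=raw["evaluation"]["quantitative"],
        fairness_groups=raw["evaluation"]["fairness"]["groups"],
        drift_threshold=0.2,              # assumption: not present in the YAML above
        slo_targets=raw.get("slo_targets", {}),  # assumption: optional extra section
    )

def log_results_to_mlflow(results: dict, dataset_version: str, code_rev: str) -> None:
    with mlflow.start_run():
        mlflow.set_tags({"dataset_version": dataset_version, "code_rev": code_rev})
        for name, value in results.get("quantitative", {}).items():
            if isinstance(value, (int, float)):
                mlflow.log_metric(name, float(value))
```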
Challenge: Maintaining model quality in production across multiple deployed systems with varying data patterns.
Solution: Automated monitoring infrastructure with real-time alerting:
```python
# Prometheus metrics automatically exposed
@monitor_drift(threshold=0.2)
def batch_prediction(model, input_data):
    predictions = model.predict(input_data)
    # Automatic drift detection and alerting
    return predictions
```
Impact: Catch model degradation within hours instead of weeks, with clear attribution to specific data shifts.
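The `@monitor_drift` decorator above is not spelled out. One possible sketch, assuming a reference dataset captured at training time and a pluggable drift-score function (both hypothetical parameters that would be bound as module-level defaults in the usage above):

```python
from __future__ import annotations

import functools
import logging
from typing import Callable

import pandas as pd

logger = logging.getLogger(__name__)

def monitor_drift(threshold: float,
                  reference: pd.DataFrame | None = None,
                  drift_score: Callable[[pd.DataFrame, pd.DataFrame], float] | None = None):
    """Wrap a batch-prediction function and flag input drift before predicting."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(model, input_data, *args, **kwargs):
            if reference is not None and drift_score is not None:
                score = drift_score(reference, input_data)
                if score > threshold:
                    # Swap the log call for your alerting integration.
                    logger.warning("Input drift detected: score=%.3f > threshold %.3f", score, threshold)
            return func(model, input_data, *args, **kwargs)
        return wrapper
    return decorator
```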
Challenge: Evaluating novel model architectures across multiple dimensions while maintaining reproducibility.
Solution: Extensible metric framework with custom evaluation functions:
```python
# metrics/custom/perplexity.py
import numpy as np

def perplexity(model_probs: np.ndarray, targets: np.ndarray) -> float:
    """Custom perplexity metric with validation."""
    validate_probability_distribution(model_probs)
    cross_entropy = -np.mean(np.log(model_probs[np.arange(len(targets)), targets]))
    return float(np.exp(cross_entropy))
```
Impact: Standardize evaluation across research experiments while enabling novel metric development.
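`validate_probability_distribution` is assumed in the snippet above; a minimal sketch of what it might check:

```python
import numpy as np

def validate_probability_distribution(probs: np.ndarray, atol: float = 1e-6) -> None:
    """Check that each row of `probs` is a valid probability distribution."""
    if probs.ndim != 2:
        raise ValueError(f"Expected a 2-D (n_samples, n_classes) array, got shape {probs.shape}")
    if np.any(~np.isfinite(probs)):
        raise ValueError("Probabilities must be finite (no NaN/inf)")
    if np.any(probs < 0) or np.any(probs > 1):
        raise ValueError("Probabilities must lie in [0, 1]")
    if not np.allclose(probs.sum(axis=1), 1.0, atol=atol):
        raise ValueError("Each row must sum to 1")
```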
```bash
mkdir ai_evaluation_system
cd ai_evaluation_system

# Create the recommended directory structure
mkdir -p metrics/custom pipelines dashboards tests/fairness notebooks
```
```python
# metrics/classic.py
from __future__ import annotations

from typing import Dict

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def comprehensive_classification_metrics(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    y_prob: np.ndarray | None = None
) -> Dict[str, float]:
    """Compute all standard classification metrics with validation."""
    validate_classification_inputs(y_true, y_pred, y_prob)
    metrics = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision_macro': precision_score(y_true, y_pred, average='macro'),
        'recall_macro': recall_score(y_true, y_pred, average='macro'),
        'f1_macro': f1_score(y_true, y_pred, average='macro'),
    }
    if y_prob is not None:
        metrics['ece'] = expected_calibration_error(y_true, y_prob)
    return metrics
```
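`expected_calibration_error` is referenced above (and again in the Scikit-learn rules below); a sketch of a standard ECE implementation, assuming `y_prob` has shape `(n_samples, n_classes)`:

```python
# metrics/custom/ece.py -- illustrative location
import numpy as np

def expected_calibration_error(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10) -> float:
    """Standard ECE: bin predictions by confidence, average |accuracy - confidence| per bin."""
    confidences = y_prob.max(axis=1)
    predictions = y_prob.argmax(axis=1)
    correct = (predictions == y_true).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return float(ece)
```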
```python
# pipelines/monitoring.py
# Imports follow the legacy Evidently Dashboard API; newer versions use evidently.Report with presets.
from evidently.dashboard import Dashboard
from evidently.dashboard.tabs import DataDriftTab, NumTargetDriftTab

def setup_drift_monitoring(reference_data, current_data):
    dashboard = Dashboard(tabs=[DataDriftTab(), NumTargetDriftTab()])
    dashboard.calculate(reference_data, current_data)
    # Auto-alert if drift detected. drift_detected/get_metrics/send_drift_alert are
    # illustrative helpers -- extract the drift flag from the report JSON in your setup.
    if dashboard.drift_detected():
        send_drift_alert(dashboard.get_metrics())
```
```python
# Add to your model serving code
from flask import Flask, jsonify, request
from prometheus_client import Counter, Histogram

app = Flask(__name__)

PREDICTION_COUNTER = Counter('ai_predictions_total', 'Total predictions made')
ACCURACY_HISTOGRAM = Histogram('ai_accuracy_score', 'Model accuracy distribution')

@app.route('/predict', methods=['POST'])
def predict():
    prediction = model.predict(request.json)
    # Automatic metric collection
    PREDICTION_COUNTER.inc()
    if ground_truth_available():  # illustrative helper for delayed labels, if any
        accuracy = compute_accuracy(ground_truth, prediction)
        ACCURACY_HISTOGRAM.observe(accuracy)
    return jsonify(prediction.tolist())
```
These rules don't just measure your AI systems—they transform how you think about AI quality. You'll catch issues before they reach users, maintain consistent standards across teams, and build systems that actually deserve the trust you're asking for.
Your AI systems are too important to evaluate with yesterday's tools. Start measuring what matters, continuously and comprehensively.
You are an expert in Python, Jupyter, NumPy, Pandas, Scikit-learn, PyTorch, HuggingFace Evaluate, EvidentlyAI, MLflow, Prometheus/Grafana, Docker, and Kubernetes.
Key Principles
- Combine quantitative (accuracy, precision, recall, F1, BLEU, ROUGE, etc.) and qualitative (human ratings, UX surveys, societal impact) metrics for every model.
- Integrate fairness, robustness, transparency, and user-experience checks from day one; treat them as first-class metrics, not add-ons.
- Prefer reproducible, declarative pipelines driven by config files (YAML/JSON) and version-controlled with Data Version Control (DVC).
- Adopt continuous evaluation: calculate core metrics on every training run, nightly on fresh data, and in production via live monitoring.
- Update benchmarks/datasets quarterly; log timestamps and dataset versions for every metric to detect dataset drift.
- Make dashboards public within the company; hide PII and sensitive attributes via differential privacy when exporting.
Python
- Follow PEP-8 with 100-char lines; enforce via `ruff` and `black` in CI.
- Use type hints everywhere; run `mypy --strict`.
- Use `@dataclass(frozen=True)` for immutable metric configs.
- Group code: `metrics/`, `validation/`, `dashboards/`, `notebooks/`, `tests/`, `pipelines/`.
- Expose every metric as a pure function: `def f1_score(y_true: NDArray, y_pred: NDArray) -> float:`.
- Avoid class-based singletons; pass parameters explicitly. Use dependency injection for data stores.
Error Handling and Validation
- Validate inputs at the top of each metric: check shape, dtype, NaN, infinity. Example:
```python
if y_true.shape != y_pred.shape:
raise ValueError("y_true and y_pred must have identical shape")
```
- Apply `pydantic` or `attrs` validation on config files; fail fast in CI.
- Wrap external service calls (e.g., human-rating API) with retry/backoff; log failures to Sentry with dataset sample IDs.
- Detect data leakage by verifying that no train rows appear in eval sets via hash matching (see the sketch after this list).
- Guard against prompt injection in LLM evaluation by stripping system/user messages and using allow-listed templates.
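A sketch of the hash-matching leakage check from the list above, assuming both splits are pandas DataFrames with the same columns; `assert_no_leakage` is an illustrative name:

```python
import pandas as pd

def assert_no_leakage(train: pd.DataFrame, eval_df: pd.DataFrame) -> None:
    """Fail fast if any evaluation row is identical to a training row (hash matching)."""
    # Row-level 64-bit hashes: treat a hit as a signal to investigate, not absolute proof.
    train_hashes = set(pd.util.hash_pandas_object(train, index=False))
    eval_hashes = set(pd.util.hash_pandas_object(eval_df, index=False))
    overlap = train_hashes & eval_hashes
    if overlap:
        raise ValueError(f"Data leakage suspected: {len(overlap)} eval row hashes also appear in train")
```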
Framework-Specific Rules
HuggingFace Evaluate
- Use `evaluate.load(metric_name)` for standard metrics; pin version numbers: `evaluate==0.4.0`.
- For custom metrics, subclass `evaluate.Metric` (rather than the deprecated `datasets.Metric`) and implement `_info` and `_compute`. Store under `metrics/custom/` with a README.
- Log all computed metrics to MLflow with `step`, `dataset_version`, and `code_rev` tags.
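A sketch of the logging convention above; `log_metric_to_mlflow` is an illustrative helper, and the tag names simply mirror the rule:

```python
import evaluate
import mlflow

def log_metric_to_mlflow(metric_name: str, value: float, step: int,
                         dataset_version: str, code_rev: str) -> None:
    """Call inside an active mlflow.start_run() so tags attach to the right run."""
    mlflow.set_tags({"dataset_version": dataset_version, "code_rev": code_rev})
    mlflow.log_metric(metric_name, value, step=step)

# Usage with a standard HuggingFace Evaluate metric:
# with mlflow.start_run():
#     f1 = evaluate.load("f1")
#     score = f1.compute(predictions=preds, references=refs)["f1"]
#     log_metric_to_mlflow("f1", score, step=epoch, dataset_version="v3", code_rev=git_sha)
```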
Scikit-learn
- Use `sklearn.metrics` for classical models, but wrap calls in your own `metrics/sklearn_wrappers.py` to standardize signatures.
- For multi-label tasks, always specify `average` explicitly (`"micro"`, `"macro"`, or `"weighted"`).
- Use `calibration_curve` to monitor probability calibration; add Expected Calibration Error (ECE) as a custom metric.
PyTorch / Lightning
- When training deep models, attach a `MetricsCallback` that logs metrics each epoch and pushes to TensorBoard and MLflow (a sketch follows this section).
- Use `torchmetrics` for GPU-accelerated metric calcs; move tensors to CPU before converting to NumPy for persistence.
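A minimal sketch of such a `MetricsCallback`, assuming PyTorch Lightning and an active MLflow run; Lightning's own logger keeps handling TensorBoard:

```python
import mlflow
import pytorch_lightning as pl

class MetricsCallback(pl.Callback):
    """Push epoch-level metrics collected by the LightningModule to MLflow."""

    def on_validation_epoch_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule) -> None:
        for name, value in trainer.callback_metrics.items():
            # callback_metrics values are torch tensors; move to CPU before converting.
            scalar = float(value.detach().cpu()) if hasattr(value, "detach") else float(value)
            mlflow.log_metric(name, scalar, step=trainer.current_epoch)
```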
EvidentlyAI (Production Monitoring)
- Set up a weekly Evidently report comparing live data to reference data; threshold drift at PSI > 0.2 or KS-stat > 0.1 (a PSI sketch follows this section).
- Automatically open a Jira ticket when drift is detected; include plots and raw JSON.
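A sketch of the PSI score behind the 0.2 threshold above, computed per feature from a reference sample and a current sample; `population_stability_index` is an illustrative name:

```python
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray,
                               n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI = sum((cur% - ref%) * ln(cur% / ref%)) over bins defined on the reference sample."""
    edges = np.unique(np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1)))  # drop tied edges
    if len(edges) < 2:
        return 0.0  # constant feature: nothing to compare
    lo, hi = edges[0], edges[-1]
    ref_pct = np.histogram(np.clip(reference, lo, hi), bins=edges)[0] / len(reference) + eps
    cur_pct = np.histogram(np.clip(current, lo, hi), bins=edges)[0] / len(current) + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```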
Testing
- For every metric, create unit tests with synthetic edge-case inputs: all-zeros, all-ones, ties, extreme class imbalance.
- Use `pytest` and `pytest.mark.parametrize` to test metric invariants (e.g., F1 ≤ (precision + recall) / 2); see the sketch after this list.
- Integrate fairness tests: ensure demographic parity diff < 0.1 and equal opportunity diff < 0.1 across groups in `tests/fairness/`.
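A sketch of such invariant tests against the `f1_score` example further below; the `metrics.classic` import path is an assumption:

```python
# tests/test_invariants.py -- illustrative location
import numpy as np
import pytest
from sklearn.metrics import precision_score, recall_score

from metrics.classic import f1_score  # assumed import path for the example metric below

@pytest.mark.parametrize(
    "y_true, y_pred",
    [
        (np.zeros(8, dtype=int), np.zeros(8, dtype=int)),        # all zeros
        (np.ones(8, dtype=int), np.ones(8, dtype=int)),          # all ones
        (np.array([1, 0, 1, 0]), np.array([0, 1, 0, 1])),        # complete disagreement
        (np.array([1] + [0] * 99), np.zeros(100, dtype=int)),    # extreme class imbalance
    ],
)
def test_f1_is_bounded(y_true, y_pred):
    assert 0.0 <= f1_score(y_true, y_pred) <= 1.0

def test_f1_never_exceeds_arithmetic_mean_of_precision_and_recall():
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=200)
    y_pred = rng.integers(0, 2, size=200)
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    assert f1_score(y_true, y_pred) <= (precision + recall) / 2 + 1e-9
```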
Performance
- Vectorize operations with NumPy; avoid Python loops for per-item metrics.
- For huge datasets, compute metrics in chunks and stream results to avoid OOM; use `dask` or `polars` when >50M rows (see the chunked sketch after this list).
- Cache intermediate predictions in Parquet; compress with ZSTD.
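A sketch of the chunked pattern: accumulate confusion counts per chunk (in practice the chunks would come from Parquet row groups or Dask partitions) and derive the metric once at the end:

```python
import numpy as np

def streamed_binary_f1(y_true: np.ndarray, y_pred: np.ndarray, chunk_size: int = 1_000_000) -> float:
    """Accumulate confusion counts chunk by chunk; compute F1 once at the end."""
    tp = fp = fn = 0
    for start in range(0, len(y_true), chunk_size):
        t = y_true[start:start + chunk_size]
        p = y_pred[start:start + chunk_size]
        tp += int(np.sum((t == 1) & (p == 1)))
        fp += int(np.sum((t == 0) & (p == 1)))
        fn += int(np.sum((t == 1) & (p == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```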
Security & Privacy
- Mask PII before persisting to logs; use SHA-256 salted hashes for user IDs (see the sketch after this list).
- Apply differential privacy noise when exporting metrics outside secure perimeter.
- Enforce least-privilege IAM roles for metric storage buckets; enable server-side encryption (AES-256 or KMS).
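A sketch of the salted-hash masking rule above; the environment-variable name is an assumption, and the salt should come from a secrets manager rather than source code:

```python
import hashlib
import os

def mask_user_id(user_id: str, salt: bytes) -> str:
    """Return a salted SHA-256 digest so raw user IDs never reach logs or metric stores."""
    return hashlib.sha256(salt + user_id.encode("utf-8")).hexdigest()

# Example: salt loaded once at startup (assumed variable name).
# salt = os.environ["METRICS_HASH_SALT"].encode("utf-8")
# logger.info("prediction for user %s", mask_user_id(raw_id, salt))
```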
Continuous Monitoring
- Expose `/metrics` endpoint emitting Prometheus counters: `ai_eval_accuracy`, `ai_eval_latency_seconds`, `ai_drift_alerts_total`.
- Grafana dashboards must display trend lines, confidence intervals, and alert rules.
- Define Service Level Objectives (SLOs): 99% of batches processed within 5 min, <1% missing metric reports per day.
Example Metric Function
```python
from __future__ import annotations

import numpy as np

def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Compute F1 score (binary) with input validation."""
    if y_true.shape != y_pred.shape:
        raise ValueError(f"Shapes differ: y_true {y_true.shape} vs y_pred {y_pred.shape}")
    if not set(np.unique(y_true)).issubset({0, 1}):
        raise ValueError("y_true must be binary 0/1")
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```
File/Directory Convention
```
├── metrics
│   ├── __init__.py
│   ├── classic.py          # wraps scikit-learn metrics
│   ├── nlp.py              # BLEU, ROUGE, BERTScore
│   ├── fairness.py         # demographic parity, equal opp
│   └── custom
│       └── ece.py
├── pipelines
│   ├── train_eval.yaml     # declarative pipeline config
│   └── prod_monitor.yaml
├── dashboards
│   └── grafana.json        # exported board
├── tests
│   ├── test_classic.py
│   ├── test_fairness.py
│   └── fixtures.py
└── README.md
```
Adopt these rules to ensure every AI system is evaluated rigorously, ethically, and continuously, with clear, reproducible metrics and real-time operational visibility.