Comprehensive rules for building, deploying, and maintaining machine-learning solutions in large-scale enterprise environments.
Building production-ready machine learning systems in enterprise environments shouldn't mean drowning in configuration files, wrestling with deployment pipelines, or debugging mysterious model drift alerts at 2 AM. You have real business problems to solve, and your ML infrastructure should accelerate that work, not slow it down.
You're not building toy models anymore. Your ML systems need to handle terabytes of sensitive data, maintain 99.9% uptime, pass security audits, and integrate with legacy enterprise systems. The gap between "it works in my notebook" and "it's running reliably in production" has become a productivity black hole.
The daily friction adds up: configuration sprawl, brittle deployment pipelines, and drift alerts that page you at 2 AM.
These Cursor Rules eliminate the infrastructure overhead that's consuming your development cycles. They provide battle-tested patterns for building enterprise ML systems that are secure, scalable, and maintainable from day one.
Instead of reinventing the wheel for each project, you get consistent patterns for data validation, model deployment, monitoring, and security that follow enterprise standards. The rules enforce best practices automatically, so you can focus on the machine learning problems that matter to your business.
⚡ 60% Faster Time-to-Production
🔒 Enterprise Security by Default
📊 Proactive Model Health Monitoring
🚀 Zero-Configuration CI/CD
```python
# 45 minutes of manual configuration every deployment
model = load_model("my_model.pkl")
# TODO: Add data validation
# TODO: Set up monitoring
# TODO: Configure security
# TODO: Handle errors properly
# TODO: Add logging
# Deploy and pray it works...
```
```python
# Automatic validation, monitoring, security, and deployment
from models.churn_prediction import predict
from models.churn_prediction.schema import ChurnPredictionRequest, ChurnPredictionResponse

@serve_model(
    business_kpi="reduce_customer_churn_by_15pct",
    drift_threshold=0.2,
    auth_required=True,
)
def predict_churn(request: ChurnPredictionRequest) -> ChurnPredictionResponse:
    return predict(request)
```
```python
# Robust error handling with domain-specific exceptions
try:
    prediction = model.predict(features)
except DataValidationError as e:
    logger.error("Data validation failed", extra={"event": "validation_error", "details": e.details})
    raise HTTPException(status_code=422, detail="Invalid input data")
except ModelDriftError as e:
    logger.warning("Model drift detected", extra={"psi_score": e.psi_score})
    # Auto-trigger retraining pipeline
    trigger_retrain_pipeline()
```
```python
# Built-in drift detection with configurable alerts
@monitor_drift(reference_data="training_set_v1.2", threshold=0.2)
def batch_inference(input_data: pd.DataFrame) -> pd.DataFrame:
    psi_score = psi(input_data, reference_data)  # population stability index
    if psi_score > 0.2:
        alert_ops_team(f"Model drift detected - PSI: {psi_score}")
        initiate_model_retrain()
    return model.predict(input_data)
```
```bash
# Add to your Cursor Rules
curl -o .cursorrules https://raw.githubusercontent.com/your-repo/enterprise-ml-rules
```
```
your_ml_project/
├── models/
│   └── fraud_detection/
│       ├── __init__.py      # Exposes train(), predict()
│       ├── config.py        # Pydantic settings
│       ├── schema.py        # Input/output validation
│       ├── train.py         # Training pipeline
│       ├── infer.py         # Inference logic
│       └── tests/           # Comprehensive test suite
├── pipelines/
│   ├── training_pipeline.py
│   └── inference_pipeline.py
├── infrastructure/
│   ├── kubernetes/
│   └── docker/
└── README.md                # Auto-generated with business KPI
```
```python
# Define your business objective (enforced in every project)
# Objective: Reduce credit card fraud losses by 25% within Q2
import mlflow
import tensorflow as tf


class FraudDetectionModel:
    def __init__(self, config: FraudDetectionConfig):
        self.model = self._build_model(config)

    @tf.function(input_signature=[tf.TensorSpec([None, 32], tf.float32)])
    def predict(self, features: tf.Tensor) -> tf.Tensor:
        return self.model(features, training=False)

    def train(self, dataset: tf.data.Dataset) -> None:
        # Automatic MLflow tracking and model registry
        with mlflow.start_run():
            self.model.fit(dataset)
            mlflow.tensorflow.log_model(self.model, "fraud_model")
```
```yaml
# Auto-generated Kubernetes deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detection-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fraud-detection
  template:
    metadata:
      labels:
        app: fraud-detection
    spec:
      containers:
        - name: fraud-model
          image: your-registry/fraud-detection:v1.2.0
          env:
            - name: MODEL_PATH
              value: "gs://your-bucket/models/fraud/v1.2.0"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
```
The payoff shows up across three areas: development velocity, business impact, and team productivity.
Real-World Results: A Fortune 500 retail company using these patterns deployed 15 production ML models in 6 months (previously took 18 months for 3 models) while maintaining 99.95% uptime and zero security incidents.
Ready to transform your enterprise ML development? These rules eliminate the infrastructure overhead that's slowing down your team and provide the production-ready patterns you need to deliver business value faster.
Your ML models deserve better than duct-tape deployments and manual monitoring. Get the enterprise-grade foundation that scales with your ambitions.
You are an expert in Python, TensorFlow, PyTorch, Scikit-learn, MLflow, Kubeflow, Airflow, Docker, Kubernetes, Google Vertex AI, Amazon SageMaker, IBM watsonx, H2O AI Cloud, and enterprise data-governance tooling.
Technology Stack Declaration
- Primary language: Python 3.11+ with strict type hints (PEP 484).
- Deep-learning: TensorFlow > 2.11 for production, PyTorch 2.x for R&D, Scikit-learn 1.4 for classical ML.
- Packaging & runtime: Poetry for dependency locking; Docker OCI images executed on Kubernetes.
- Orchestration: Kubeflow Pipelines or Airflow; model registry & lifecycle via MLflow.
- Hosting: Vertex AI or SageMaker for managed infra; fallback to on-prem K8s.
- Observability: Prometheus/Grafana for infra, OpenTelemetry + structured JSON logs for apps.
Key Principles
- Align every ML initiative with a measurable business KPI; declare it in README.md as `# Objective`.
- Start small: pilot pipeline with ≤ 10% production load before global rollout.
- Data First: treat datasets as first-class citizens; every dataset version is immutable and auditable.
- Security Everywhere: encryption in transit (mTLS) & at rest (KMS); role-based IAM for pipelines and model endpoints.
- Automate everything (CI/CD, testing, deployment, retraining) to minimise human error.
- Fail fast: detect data/model drift within minutes; auto-rollback on SLA breach.
Python Rules
- Enforce `ruff --select ALL` and `mypy --strict` in CI.
- File naming: `snake_case.py`; package roots separated by domain (`data_ingest/`, `feature_store/`, `models/`).
- Module layout (per model):
```
├── __init__.py   # exposes train(), predict()
├── config.py     # Pydantic settings
├── schema.py     # Pydantic models for I/O
├── train.py
├── infer.py
└── tests/
```
- Functions ≤ 40 lines; prefer pure functions. Classes only for stateful components (e.g., tf.keras.Model).
- Use f-strings; no string concatenation.
- Mandatory docstring format: Google style with `Args`, `Returns`, `Raises`.
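- Example of the required format (a minimal sketch; the function name and types are illustrative):
```python
import numpy as np
import pandas as pd


def predict(features: pd.DataFrame) -> np.ndarray:
    """Score a batch of feature rows.

    Args:
        features: Validated feature matrix, one row per entity.

    Returns:
        One model score per input row.

    Raises:
        DataValidationError: If the input fails schema validation.
    """
    ...
```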
Error Handling & Validation
- Define domain-specific exceptions: `DataValidationError`, `ModelDriftError`, `ServiceUnavailableError`.
- Validate all external inputs with Pydantic; raise early.
- Wrap pipeline steps with retry logic (exponential backoff, jitter); see the retry sketch after the drift guard below.
- Log errors in structured JSON: `{ "level": "error", "event": "data_validation_failed", ... }`.
- Model drift guard:
```python
if psi(current_data, reference_data) > 0.2:
raise ModelDriftError("PSI exceeded threshold; trigger retrain")
```
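- Retry sketch (illustrative; the `retry` decorator name and its parameters are assumptions, not a fixed API):
```python
import functools
import random
import time


class ServiceUnavailableError(Exception):
    """Stub of the domain exception listed above."""


def retry(max_attempts: int = 5, base_delay: float = 0.5):
    """Retry a pipeline step with exponential backoff and jitter."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except ServiceUnavailableError:
                    if attempt == max_attempts:
                        raise
                    # Exponential backoff plus uniform jitter.
                    time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))
        return wrapper
    return decorator
```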
TensorFlow Rules (primary framework)
- Build models with the Functional API; avoid Sequential for anything non-trivial.
- Use `tf.keras.layers.Layer` subclasses only when necessary; keep them stateless when possible.
- Compile with explicit loss, optimizer, and metrics; never rely on defaults.
- Enable mixed-precision with `tf.keras.mixed_precision.set_global_policy("mixed_float16")` for GPU/TPU.
- Save models in the `SavedModel` format with signature definitions:
```python
@tf.function(input_signature=[tf.TensorSpec([None, 32], tf.float32)])
def serving_fn(x):
return model(x, training=False)
tf.saved_model.save(model, export_dir, signatures={"serving_default": serving_fn})
```
- Register every model version to MLflow with artifacts: model, requirements.txt, training_params.json.
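- Registration sketch (the run name, artifact files, and registered model name are placeholders):
```python
import mlflow

# `model` is the trained tf.keras model produced by train.py.
with mlflow.start_run(run_name="fraud-model-training"):
    mlflow.tensorflow.log_model(
        model, "fraud_model", registered_model_name="fraud_detection"
    )
    mlflow.log_artifact("requirements.txt")
    mlflow.log_artifact("training_params.json")
```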
PyTorch Notes
- Prefer `torch.compile()` for production inference; stick to `torch.no_grad()` blocks.
- Use Lightning or Accelerate for multi-GPU; respect deterministic flags during evaluation.
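- Inference sketch under these rules (the architecture and batch shape are placeholders):
```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
model.eval()
compiled = torch.compile(model)  # PyTorch 2.x graph compilation

with torch.no_grad():  # no autograd bookkeeping during inference
    scores = compiled(torch.randn(32, 128))
```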
Scikit-learn Notes
- Pipeline everything (imputer, scaler, estimator) via `sklearn.pipeline.Pipeline`.
- Persist models with `joblib.dump()` including pipeline; store feature names.
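- Pipeline sketch (the column list and estimator are illustrative; note the feature names persisted alongside the model):
```python
import joblib
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_names = ["user_age", "trans_amount"]
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("estimator", LogisticRegression(max_iter=1000)),
])
joblib.dump({"pipeline": pipe, "feature_names": feature_names}, "model.joblib")
```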
Testing
- Unit: pytest with 90%+ coverage; fixtures seeded via `numpy.random.default_rng(42)`.
- Data tests: Great Expectations suites run nightly.
- Integration: spin up ephemeral K8s namespace using KIND in CI for end-to-end tests.
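- Seeded-fixture sketch (the test body is illustrative):
```python
import numpy as np
import pytest


@pytest.fixture
def rng() -> np.random.Generator:
    # One seeded generator per test keeps runs reproducible.
    return np.random.default_rng(42)


def test_feature_matrix_shape(rng):
    features = rng.normal(size=(100, 8))
    assert features.shape == (100, 8)
```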
Performance Optimisation
- Profile with cProfile + SnakeViz before attempting GPU offload.
- Batch inference requests (≥ 32 examples) to maximise throughput.
- Use asynchronous FastAPI with `uvicorn --workers 4 --loop uvloop` for serving.
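- Serving sketch, launched with `uvicorn app:app --workers 4 --loop uvloop` (the request schema and response are placeholders):
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class PredictRequest(BaseModel):
    features: list[float]


@app.post("/predict")
async def predict(req: PredictRequest) -> dict:
    # The batched model call goes here; see the batching rule above.
    return {"score": 0.0}
```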
Security
- Store secrets in KMS; inject via environment variables at runtime.
- Enforce OAuth2/OIDC or AWS SigV4 on REST/gRPC endpoints; rate-limit to 100 req/s per key.
- Sign model artifacts with GPG; verify signature before loading.
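- Verification sketch using the `gpg` CLI (file paths are illustrative; loading aborts on a bad signature):
```python
import subprocess

import joblib

result = subprocess.run(
    ["gpg", "--verify", "fraud_model.pkl.sig", "fraud_model.pkl"],
    capture_output=True,
)
if result.returncode != 0:
    raise RuntimeError("Model artifact signature verification failed")
model = joblib.load("fraud_model.pkl")
```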
Deployment & MLOps
- The Git main branch is always deployable. Tag releases as `v<model>-<semver>`.
- CI (GitHub Actions): lint → unit tests → build Docker → push to registry.
- CD: Argo CD monitors registry tag; deploys Canary (5% traffic) → promote when `latency_p50 < baseline*1.1` and `accuracy_drop < 1%`.
- Retraining schedule defined in `pipeline.yaml`; triggered by data volume or drift alert.
Logging & Monitoring
- Emit Prometheus metrics: `model_inference_seconds`, `inference_requests_total{status="success"}`.
- Alert when error-rate > 2% or PSI > 0.2.
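- Instrumentation sketch with `prometheus_client` (the wrapper function is illustrative):
```python
from prometheus_client import Counter, Histogram

INFERENCE_SECONDS = Histogram(
    "model_inference_seconds", "Latency of one inference call"
)
REQUESTS_TOTAL = Counter(
    "inference_requests_total", "Inference requests by outcome", ["status"]
)


def timed_predict(model, features):
    # Observe latency, then count the outcome for alerting.
    with INFERENCE_SECONDS.time():
        prediction = model.predict(features)
    REQUESTS_TOTAL.labels(status="success").inc()
    return prediction
```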
Naming Conventions
- Datasets: `<domain>_<source>_<granularity>_<YYYYMMDD>.parquet`.
- Model registry path: `<business-unit>/<problem>/<model-name>/<semver>`.
- Feature store columns: prefix with domain (e.g., `user_age`, `trans_amount`).
Common Pitfalls & Guardrails
- DO NOT train and serve inside the same container image.
- NEVER commit secrets or raw data samples containing PII.
- AVOID writing custom loggers—use standard `logging` with JSON formatter.
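- JSON-formatter sketch using only the standard library (field names follow the logging rule above):
```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """A formatter, not a custom logger: emits one JSON object per record."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            "logger": record.name,
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
```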
Sample Happy-Path Workflow (simplified)
```bash
# 1. Create a new branch for model feature
$ git switch -c feat/churn-v2
# 2. Implement & push code; open PR → CI
# 3. Merge: CI builds image `ghcr.io/acme/churn:0.2.0`
# 4. Argo CD deploys Canary on Vertex AI
# 5. Canary passes SLOs → auto-promote to 100%
# 6. MLflow marks 0.2.0 as `Production` stage
```
Follow these rules to produce secure, maintainable, and business-aligned ML solutions that scale across enterprise environments.