Rigorous, end-to-end coding standards that guarantee experiment, data, and model reproducibility in Python-based ML projects.
You've been there: model results that can't be reproduced, experiments that work on your machine but fail in production, and that sinking feeling when stakeholders ask "can you run this again?" The AI reproducibility crisis isn't just an academic problem—it's costing you time, credibility, and sanity.
Every data scientist has faced these productivity killers, and they aren't edge cases; they're the norm in most ML teams. The hidden cost? Research suggests teams spend 40-60% of their time on reproducibility issues rather than on actual model development.
This Cursor Rules configuration transforms your Python ML workflow into a deterministic, audit-ready pipeline. Instead of hoping your experiments are reproducible, you'll guarantee them through automated tooling and rigorous standards.
What you get:
Instead of "works on my machine" syndrome, every environment is captured, versioned, and reproducible via Docker + Conda lock files.
Before: Hours debugging package conflicts and version mismatches
After: One command reproduces any environment exactly
Comprehensive seed management across TensorFlow, PyTorch, NumPy, and system randomness ensures identical outputs.
Before: "Why did my accuracy drop 2% when I re-ran training?" After: Bit-for-bit identical results across all runs
Teams share exact code, data versions, and environments through integrated DVC + MLflow tracking.
Before: Email chains sharing "the right version" of datasets and configs
After: Automated experiment sharing with full reproducibility metadata
Built-in experiment cards, metadata tracking, and artifact management satisfy audit requirements automatically.
Before: Weeks reconstructing training procedures for compliance reviews
After: Complete audit trail generated automatically for every experiment
The Old Way:
```bash
# Developer A trains model
python train.py --epochs 100 --lr 0.001
# Results: 94.2% accuracy

# Developer B tries to reproduce
python train.py --epochs 100 --lr 0.001
# Results: 93.8% accuracy - why the difference?
```
With Reproducibility Rules:
```python
# Automatic seed management in every script
set_global_seed(42)  # Called first, always

# MLflow tracks everything automatically
with mlflow.start_run():
    mlflow.log_params(asdict(config))
    mlflow.set_tag("data_version", dvc_data_hash)
    # Training code here
    mlflow.pytorch.log_model(model, "model")
```
Result: Developer B gets identical 94.2% accuracy, with full experiment lineage tracked.
The Old Way:
```python
# Development training
model = train_model(data)  # Works great locally

# Production deployment
model = load_model('model.pkl')  # Different behavior!
```
With Reproducibility Rules:
```python
# Every model includes environment snapshot
mlflow.pytorch.log_model(
    model,
    "model",
    conda_env="conda.yaml",  # Exact environment captured
    code_paths=["src/"],     # Full source code included
)

# Production uses identical environment:
# Docker image built from the same conda.yaml
```
Result: Production models behave identically to development versions.
The Old Way:
```python
# Multiple experiments with unclear differences
experiment_1 = run_training(lr=0.01)   # What data? What seed? What environment?
experiment_2 = run_training(lr=0.001)  # Can't compare meaningfully
```
With Reproducibility Rules:
```python
# Every experiment automatically tracked
@dataclass
class Config:
    learning_rate: float
    batch_size: int
    model_architecture: str

config = Config(learning_rate=0.01, batch_size=32, model_architecture="resnet50")

with mlflow.start_run():
    mlflow.log_params(asdict(config))
    mlflow.set_tag("git_commit", get_git_commit())
    mlflow.set_tag("data_version", get_dvc_data_hash())
    # Training automatically logged
```
Result: Perfect experiment comparison with full context and reproducibility metadata.
```bash
# Create project structure
mkdir my_ml_project && cd my_ml_project
mkdir -p src/{data_ingest,features,models,training,evaluation,utils}

# Initialize version control
git init
dvc init
```
```yaml
# environment.yml
name: ml-reproducible
channels:
  - conda-forge
dependencies:
  - python=3.11.5  # Pinned minor version
  - pip=23.2.1
  - pip:
      - -r requirements-lock.txt  # Generated via pip-compile --generate-hashes
```
```python
# src/utils/reproducibility.py
def set_global_seed(seed: int = 42):
    import os, random, numpy as np, torch, tensorflow as tf

    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    tf.random.set_seed(seed)
    torch.use_deterministic_algorithms(True)
    tf.config.experimental.enable_op_determinism()
```
```python
# src/training/train.py
from dataclasses import dataclass, asdict

import mlflow

from utils.reproducibility import set_global_seed


@dataclass
class TrainingConfig:
    learning_rate: float = 0.001
    batch_size: int = 32
    epochs: int = 100


def main():
    set_global_seed(42)  # Always first
    config = TrainingConfig()

    with mlflow.start_run():
        mlflow.log_params(asdict(config))
        mlflow.set_tag("data_version", get_dvc_data_hash())
        mlflow.set_tag("git_commit", get_git_commit())

        # Your training code here
        model = train_model(config)
        mlflow.pytorch.log_model(model, "model")
```
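The training script above calls `get_git_commit()` and `get_dvc_data_hash()` without defining them. A minimal sketch of what such helpers could look like, assuming Git and DVC are on the PATH and hashing `dvc.lock` is an acceptable stand-in for the data version (the module path `src/utils/versioning.py` is illustrative):

```python
# src/utils/versioning.py (hypothetical helpers; adapt to your repo layout)
import hashlib
import subprocess


def get_git_commit() -> str:
    """Return the SHA of the currently checked-out commit."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


def get_dvc_data_hash(lock_file: str = "dvc.lock") -> str:
    """Return a SHA-256 digest of dvc.lock as a proxy for the tracked data version."""
    with open(lock_file, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```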
```yaml
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: check-seed-usage
        name: Ensure set_global_seed() is called
        entry: python scripts/check_seed.py
        language: python
      - id: check-git-clean
        name: Ensure git status is clean
        entry: bash -c 'git diff --exit-code'
        language: system
```
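The `check-seed-usage` hook above points at `scripts/check_seed.py`, which is not shown. One possible sketch, assuming a plain text scan of the training scripts is sufficient (a stricter check could parse the AST):

```python
# scripts/check_seed.py (hypothetical pre-commit hook: fail if a training
# script never calls set_global_seed)
import pathlib
import sys


def main() -> int:
    missing = [
        str(path)
        for path in pathlib.Path("src/training").rglob("*.py")
        if "set_global_seed(" not in path.read_text(encoding="utf-8")
    ]
    if missing:
        print("set_global_seed() is not called in:", ", ".join(missing))
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```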
Teams using this reproducibility framework report results like this:
Real Example: A fintech ML team reduced their model validation cycle from 3 weeks to 3 days by eliminating reproducibility uncertainty. They now deploy models with confidence, knowing production results will match their experiments exactly.
```bash
# CI automatically validates every experiment can be reproduced
pytest tests/test_reproducibility.py::test_experiment_deterministic
```
```bash
# Track data transformations with DVC
dvc run -n preprocess \
    -d data/raw \
    -o data/processed \
    python src/data_ingest/preprocess.py
```
```dockerfile
# Dockerfile ensures identical environments
FROM mambaorg/micromamba:1.5.0
COPY environment.yml .
RUN micromamba env create -f environment.yml
```
Stop treating reproducibility as an afterthought. In regulated industries, audit-heavy environments, or any team larger than one person, reproducible ML isn't optional—it's the foundation of professional ML engineering.
This configuration gives you the tools to build that foundation right into your development workflow. You'll ship models faster, collaborate more effectively, and sleep better knowing your experiments are rock-solid.
Your next model deployment doesn't have to be a leap of faith. Make it a guarantee.
You are an expert in Python, TensorFlow, PyTorch, MLflow, DVC, Docker, Conda, and modern MLOps tooling.
Key Principles
- Reproducibility first: every commit, data snapshot, and experiment must be replayable on any machine.
- Automate everything: CI/CD pipelines (GitHub Actions) create, test, and publish artefacts.
- Treat data as code: version datasets, feature sets, and metadata alongside source.
- Determinism over speed: prefer slower deterministic ops to non-deterministic GPU kernels.
- Immutable artefacts: once an experiment is registered, its code, data hash, env hash, and seed are locked.
- Documentation is code: generate markdown or HTML reports for every run via MLflow autologging.
Python
- Use Python ≥ 3.11; pin minor version in `pyproject.toml` and `environment.yml`.
- Dependency pinning: `pip-compile --generate-hashes` or Conda lock files; never use floating versions (`>=`).
- Directory layout (lower-snake-case):
```text
src/
├── data_ingest/
├── features/
├── models/
├── training/
├── evaluation/
└── utils/
```
- Modules expose pure functions; side-effects live only in `__main__.py` or CLI entry points.
- Import order: stdlib → third-party → first-party, each group alphabetised.
- Seed control helper (always call first):
```python
def set_global_seed(seed: int = 42):
import os, random, numpy as np, torch, tensorflow as tf
os.environ["PYTHONHASHSEED"] = str(seed)
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
tf.random.set_seed(seed)
torch.use_deterministic_algorithms(True)
tf.config.experimental.enable_op_determinism()
```
Error Handling and Validation
- Validate external inputs (CLI args, API payloads) with `pydantic` models; fail fast (see the sketch after this list).
- Wrap training loops in `try/except` and always log exceptions to MLflow.
- Use custom exception hierarchy (`ReproducibilityError`, `DataVersionMismatch`, `SeedNotSetError`).
- Abort run if:
• `git status` is dirty
• `dvc status -c` reports differences
• required env vars are missing
- Early return pattern; avoid nested `if` chains.
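A minimal sketch of these rules in code, assuming `pydantic` is available and reusing the exception names above; the config fields, required env var, and the parsing of `dvc status -c` output are illustrative assumptions:

```python
# Hypothetical fail-fast validation and pre-run guards
import os
import subprocess

from pydantic import BaseModel, PositiveFloat, PositiveInt


class TrainArgs(BaseModel):
    learning_rate: PositiveFloat
    batch_size: PositiveInt
    epochs: PositiveInt


class ReproducibilityError(RuntimeError): ...
class DataVersionMismatch(ReproducibilityError): ...
class SeedNotSetError(ReproducibilityError): ...


def assert_run_is_clean(required_env_vars=("MLFLOW_TRACKING_URI",)) -> None:
    # Dirty working tree -> abort
    if subprocess.check_output(["git", "status", "--porcelain"], text=True).strip():
        raise ReproducibilityError("git status is dirty; commit or stash before running")

    # Data out of sync with the remote cache -> abort
    # (output parsing is an assumption; adjust to your DVC version)
    dvc_out = subprocess.run(["dvc", "status", "-c"], capture_output=True, text=True).stdout
    if any(marker in dvc_out for marker in ("new:", "modified:", "deleted:")):
        raise DataVersionMismatch("dvc status -c reports differences")

    # Missing required environment variables -> abort
    missing = [var for var in required_env_vars if var not in os.environ]
    if missing:
        raise ReproducibilityError(f"missing required env vars: {missing}")
```

Constructing `TrainArgs(**cli_args)` raises a `ValidationError` immediately on malformed input, which satisfies the fail-fast rule.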
TensorFlow & PyTorch
- Put models in inference mode during evaluation (`model.eval()` in PyTorch, `model(x, training=False)` in Keras) so Dropout is disabled and BatchNorm uses running statistics (see the sketch after this list).
- Register every model artefact via MLflow model registry (`mlflow.tensorflow.log_model`, `mlflow.pytorch.log_model`).
- Store hyperparameters in a typed `dataclass` then log with `mlflow.log_params(asdict(cfg))`.
- Use `torch.backends.cudnn.deterministic = True` and `benchmark = False`.
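A short sketch of the PyTorch side of these rules; the model and data loader are placeholders, and the cuDNN flags complement the `set_global_seed()` helper shown earlier:

```python
# Hypothetical deterministic evaluation helper
import torch


def evaluate(model: torch.nn.Module, loader) -> float:
    # Fixed cuDNN algorithms, no autotuning
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    model.eval()  # Dropout disabled, BatchNorm uses running statistics
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, targets in loader:
            preds = model(inputs).argmax(dim=1)
            correct += (preds == targets).sum().item()
            total += targets.numel()
    return correct / total
```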
MLflow Rules
- Experiment naming: `<team>/<project>/<dataset_hash>` (see the sketch after this list).
- Tag required metadata: `mlflow.set_tag("data_version", dvc_data_hash)`.
- Each run must attach:
• `conda.yaml` (captured automatically when the model is logged)
• `git_commit`
• `start_time` & `end_time`
- Prohibit manual UI edits; changes via API only.
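A sketch of a run that follows the naming and tagging rules above, assuming the `get_git_commit()` / `get_dvc_data_hash()` helpers sketched earlier and illustrative team/project names:

```python
# Hypothetical MLflow run setup
import time

import mlflow

team, project = "risk", "churn-model"   # illustrative names
dataset_hash = get_dvc_data_hash()      # assumed helper (see earlier sketch)

mlflow.set_experiment(f"{team}/{project}/{dataset_hash[:12]}")

with mlflow.start_run():
    mlflow.set_tag("data_version", dataset_hash)
    mlflow.set_tag("git_commit", get_git_commit())
    mlflow.set_tag("start_time", time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()))
    mlflow.log_artifact("environment.yml")  # attach the environment spec explicitly
    # ... training and model logging here ...
    mlflow.set_tag("end_time", time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()))
```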
DVC Rules
- One DVC stage per logical step: ingest → preprocess → train → evaluate.
- Use `.dvc` files, not YAML pipelines, for fine-grained locks.
- Lock large binaries in remote (S3/GCS); never commit to git.
- `dvc metrics` for tracking performance; integrate with CI to block regressions.
Docker & Environments
- Base image: `FROM mambaorg/micromamba:1.5.0` to ensure Conda reproducibility.
- Label image with `org.opencontainers.image.*` metadata and MLflow run ID.
- Always build with build-args `PYTHON_VERSION` and `CUDA_VERSION`.
- Prohibit `latest` tags; use semantic version matching git tags.
Testing
- Unit tests: pytest with 100 % seed coverage (`set_global_seed()` in `conftest.py`).
- Property tests: hypothesis for data transformers (idempotence, invariants).
- Integration test: CI job executes `dvc repro` end-to-end on a 1 % data sample.
- Snapshot test: saved model predictions compared via SHA-256 of the output tensor (see the sketch after this list).
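A minimal sketch of both pieces, assuming `set_global_seed()` is importable from `utils.reproducibility` and using a toy linear layer as a stand-in for the real model:

```python
# tests/conftest.py — hypothetical autouse fixture so every test is seeded
import pytest

from utils.reproducibility import set_global_seed


@pytest.fixture(autouse=True)
def _seed_everything():
    set_global_seed(42)


# tests/test_reproducibility.py — hypothetical snapshot-style determinism test:
# two seeded forward passes must yield byte-identical outputs
import hashlib

import torch


def _forward_digest() -> str:
    set_global_seed(42)
    model = torch.nn.Linear(16, 4)
    x = torch.randn(8, 16)
    with torch.no_grad():
        out = model(x)
    return hashlib.sha256(out.numpy().tobytes()).hexdigest()


def test_experiment_deterministic():
    assert _forward_digest() == _forward_digest()
```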
Performance & Scalability
- Use mixed precision only if deterministic support confirmed (`torch.backends.cuda.matmul.allow_tf32 = False`).
- Benchmark scripts log CPU/GPU specs (`nvidia-smi --query-gpu=name,driver_version`) and attach them as MLflow artifacts.
- Run `onnxruntime` with memory patterns and the CPU memory arena disabled via `SessionOptions` (`enable_mem_pattern = False`, `enable_cpu_mem_arena = False`) for repeatable behaviour (see the sketch after this list).
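A sketch of those settings using the public `torch` and `onnxruntime` options; the model path is a placeholder, and single-threaded ONNX Runtime execution trades speed for repeatability:

```python
# Hypothetical determinism-oriented performance settings
import onnxruntime as ort
import torch

# Disable TF32 so matmul results do not differ across GPU generations
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

# ONNX Runtime session with memory pattern and CPU arena disabled
opts = ort.SessionOptions()
opts.enable_mem_pattern = False
opts.enable_cpu_mem_arena = False
opts.intra_op_num_threads = 1  # removes reduction-order variance
session = ort.InferenceSession("model.onnx", sess_options=opts)
```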
Security & Compliance
- Strip PII from logs; use hashes or surrogate keys (see the sketch after this list).
- Store secrets in HashiCorp Vault; never embed in code or DVC files.
- All artefacts scanned by Trivy in CI before registry push.
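For the PII rule, a small sketch of a surrogate-key helper, assuming the salt is supplied via an environment variable (the variable name is illustrative):

```python
# Hypothetical surrogate-key helper: log a salted hash instead of raw PII
import hashlib
import hmac
import os


def surrogate_key(value: str) -> str:
    salt = os.environ.get("PII_HASH_SALT", "").encode("utf-8")
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


# logger.info("user=%s", surrogate_key(email))  # never log the raw identifier
```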
Documentation
- Auto-generate experiment cards (`mlflow run ... --entry-point gen_card`).
- `docs/` contains Jupyter notebooks; they are executed headlessly in CI with `papermill` to ensure they run without manual input (see the sketch after this list).
- Use Open Data License statements in README for each dataset.
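A sketch of the headless notebook execution step using `papermill`'s Python API; the output directory and parameter name are illustrative:

```python
# Hypothetical CI step: execute every docs notebook headlessly
import pathlib

import papermill as pm

out_dir = pathlib.Path("build/notebooks")
out_dir.mkdir(parents=True, exist_ok=True)

for nb in pathlib.Path("docs").glob("*.ipynb"):
    pm.execute_notebook(
        str(nb),
        str(out_dir / nb.name),
        parameters={"SAMPLE_FRACTION": 0.01},  # illustrative CI-only parameter
    )
```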
Common Pitfalls & Guardrails
- Forgetting to fix CUDA seed → add `pre-commit` hook checking for `set_global_seed()`.
- Drift between data and code → CI compares `dvc.lock` against latest remote.
- Non-deterministic augmentation (e.g., Albumentations `p=0.5`) → drive it from a fixed RNG, including DataLoader workers (see the sketch after this list).
- Divergent envs across OS → container is single source of truth; local conda only for dev.
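For the augmentation pitfall, a sketch of the standard seeded DataLoader setup; the `dataset` object is a placeholder for your augmented dataset:

```python
# Hypothetical deterministic DataLoader: per-worker seeds derived from a fixed generator
import random

import numpy as np
import torch
from torch.utils.data import DataLoader


def seed_worker(worker_id: int) -> None:
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


generator = torch.Generator()
generator.manual_seed(42)

loader = DataLoader(
    dataset,                # placeholder: your augmented dataset
    batch_size=32,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,
    generator=generator,
)
```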