Opinionated Rules for maintaining high-quality, consistent contribution guidelines in open-source Data-Science (DS) projects.
Stop struggling with inconsistent contribution workflows and unclear project standards. These Cursor Rules transform chaotic open-source data science projects into well-oiled collaboration machines where contributors know exactly what's expected and maintainers can focus on innovation instead of process management.
Every data science maintainer knows this pain: promising contributors disappear after their first confusing interaction, PRs sit in limbo because requirements weren't clear, and your codebase becomes a patchwork of different styles and quality levels. Meanwhile, you're spending more time managing contributions than advancing your actual research.
The typical open-source data science project suffers from exactly these problems: unclear expectations, inconsistent code quality, and review overhead that falls almost entirely on maintainers.
These rules establish a complete contribution framework that automates quality control while making participation welcoming and efficient. You get professional-grade governance without the overhead.
What makes this different: Instead of generic "please follow our guidelines" documentation, you get specific, executable standards with automated enforcement. Contributors know exactly what success looks like, and your CI pipeline catches issues before they reach human reviewers.
Automated checks catch formatting, type errors, and missing tests before PRs reach maintainers. Your review time shifts from catching basic issues to providing strategic feedback on implementation approaches.
Professional standards signal project maturity. Experienced developers recognize well-run projects and are more likely to invest significant time in contributions.
Strict type checking, automated formatting, and comprehensive testing requirements maintain consistency as your contributor base grows.
Auto-generated API docs from docstrings and notebook-based tutorials that execute in CI mean your documentation stays current without manual maintenance.
Before: "The model doesn't work" with no context, environment details, or reproduction steps. You spend hours trying to understand the issue.
After: Structured issue templates require environment details, reproduction scripts, and expected vs. actual behavior. Contributors include self-assessment of their debugging attempts. Issues become actionable immediately.
Before: Contributors submit code that breaks tests, lacks documentation, and doesn't follow project conventions. Multiple review cycles drain everyone's energy.
After: Pre-commit hooks catch formatting and basic errors locally. CI validates type hints, runs comprehensive tests, and checks performance regressions. PRs arrive ready for meaningful technical review.
# Before: Contributors submit code like this
def process_data(df):
    # No type hints, poor docstring, no validation
    df.fillna(0, inplace=True)
    return df.groupby('category').mean()


# After: Automated tooling enforces this standard
import pandas as pd


class DataValidationError(ValueError):
    """Project-specific exception for invalid input data."""


def process_data(df: pd.DataFrame) -> pd.DataFrame:
    """Calculate category-wise means with missing value handling.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with 'category' column and numeric data.

    Returns
    -------
    pd.DataFrame
        Grouped means by category.

    Raises
    ------
    DataValidationError
        If required columns are missing.

    Examples
    --------
    >>> df = pd.DataFrame({'category': ['A', 'B'], 'value': [1, 2]})
    >>> result = process_data(df)
    """
    if 'category' not in df.columns:
        raise DataValidationError("DataFrame must contain 'category' column")
    return df.fillna(0).groupby('category').mean()
Before: New contributors struggle to set up environments, understand coding standards, and figure out how to run tests. Many give up before making their first contribution.
After: Single-command environment setup with pyproject.toml, clear skill-level self-assessment in PR templates, and comprehensive local testing instructions. Contributors can be productive on day one.
Create these files in your repository root:
mkdir -p .github/ISSUE_TEMPLATE
touch .github/PULL_REQUEST_TEMPLATE.md CONTRIBUTING.md CODE_OF_CONDUCT.md
touch .pre-commit-config.yaml pyproject.toml
.pre-commit-config.yaml:
repos:
  - repo: https://github.com/psf/black
    rev: 23.7.0
    hooks:
      - id: black
        args: [--line-length=88]
  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort
        args: [--profile=black]
  - repo: https://github.com/charliermarsh/ruff-pre-commit
    rev: v0.0.287
    hooks:
      - id: ruff
Create templates that guide contributors toward success:
.github/PULL_REQUEST_TEMPLATE.md:
## Motivation
Why is this change needed? Link related issues.
## Self-Assessment
- [ ] My skill level with this technology: Beginner/Intermediate/Advanced
- [ ] Areas where I'd appreciate extra review: Performance/Architecture/Testing
- [ ] I've tested this locally and all checks pass
## Technical Approach
Describe your implementation approach and any trade-offs made.
## Checklist
- [ ] Tests added/updated and passing
- [ ] Documentation updated
- [ ] Type hints added
- [ ] Performance impact considered
Configure pytest with coverage requirements:
pyproject.toml:
[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = ["--strict-markers", "--cov=src", "--cov-fail-under=90"]
[tool.mypy]
strict = true
warn_return_any = true
warn_unused_configs = true
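With this configuration, a contributor-facing test might look like the sketch below (the import path `your_package.core` mirrors the earlier `process_data` example and is a placeholder):

# tests/test_process_data.py
import pandas as pd
import pytest

from your_package.core import DataValidationError, process_data  # placeholder import path


def test_process_data_returns_category_means() -> None:
    df = pd.DataFrame({"category": ["A", "A", "B"], "value": [1.0, 3.0, 5.0]})
    result = process_data(df)
    assert result.loc["A", "value"] == 2.0
    assert result.loc["B", "value"] == 5.0


def test_process_data_rejects_missing_category_column() -> None:
    with pytest.raises(DataValidationError):
        process_data(pd.DataFrame({"value": [1.0]}))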
Create GitHub Actions that enforce standards:
.github/workflows/ci.yml:
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install -e .[dev]
      - run: pre-commit run --all-files
      - run: mypy src/
      - run: pytest
Your project maintains consistent code quality as it scales. New contributors learn professional Python practices through your automated feedback. Documentation stays current because it's part of the development workflow, not an afterthought.
Clear expectations and helpful automation create a welcoming environment for contributors at all skill levels. Senior developers see professional project management and invest more time. Junior developers get structured learning opportunities and stick around to grow with your project.
The bottom line: These rules transform contribution management from a time-consuming bottleneck into an automated system that scales with your project's success. You spend less time on process and more time on the data science breakthroughs that matter.
Start implementing these standards today, and watch your project's contribution quality and community engagement transform within weeks.
You are an expert in Data-Science collaboration and engineering (Python 3.11+, Jupyter Lab, Pandas, Scikit-learn, PyTorch, Git, GitHub Actions).
Key Principles
- Keep the CONTRIBUTING.md concise, actionable, and beginner-friendly while enforcing professional standards.
- Prefer automation over manual checks (CI linters, test suites, pre-commit hooks).
- Enforce SMART goals for each issue/PR and document expected outcomes.
- Maintain an honest self-assessment culture: encourage contributors to state their skill level and areas of growth in PR descriptions.
- Foster transparent, timely, and respectful communication; reference a Code of Conduct (Contributor Covenant v2.1).
- Every change must improve at least one of: correctness, clarity, performance, test coverage, or documentation.
Python
- Follow PEP 8; let `black` (line length = 88) and `isort` (profile=black) format code automatically.
- Require type hints everywhere (`mypy --strict` must pass). Prefer `pandas.api.typing` and `typing.Protocol` for DS objects (see the Protocol sketch after this list).
- Use docstrings in NumPy style. Each public function/class must include: Short summary, Parameters, Returns, Raises, Examples.
- Disallow `from module import *`, mutable default arguments, and bare `except:` blocks.
- File layout per feature:
feature_name/
├─ __init__.py (re-export public API)
├─ core.py (pure functions/classes)
├─ io.py (data loading/saving)
├─ cli.py (optional Click entry points)
└─ tests/ (pytest test_*.py files)
- Notebook rules:
• Keep notebooks under `notebooks/`; no committed outputs (`jupyter nbstripout`).
• Pair each notebook with a `.py` script or markdown tutorial.
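A minimal `typing.Protocol` sketch for the type-hint rule above; `SupportsFitPredict` and `evaluate_accuracy` are illustrative names, not existing pandas or scikit-learn APIs:

from typing import Protocol

import pandas as pd


class SupportsFitPredict(Protocol):
    """Structural type for estimator-like objects."""

    def fit(self, X: pd.DataFrame, y: pd.Series) -> "SupportsFitPredict": ...

    def predict(self, X: pd.DataFrame) -> pd.Series: ...


def evaluate_accuracy(model: SupportsFitPredict, X: pd.DataFrame, y: pd.Series) -> float:
    """Return the fraction of predictions that exactly match the labels."""
    predictions = model.predict(X)
    return float((predictions == y).mean())

Any object exposing compatible `fit` and `predict` methods satisfies the protocol structurally, so callers stay decoupled from any one library.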
Error Handling & Validation
- Validate external inputs at boundaries (CLI args, API payloads) using `pydantic` models (a sketch follows this list).
- Early-return on invalid data; raise domain-specific exceptions (`DataValidationError`, `ModelNotFittedError`).
- Log errors with `structlog` in JSON format; avoid `print`.
- For PRs, CI must fail on unhandled exceptions uncovered by tests or static analysis (`ruff`, `pylint`).
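A minimal sketch of boundary validation and structured logging, assuming pydantic v2 and structlog; `TrainRequest` and its fields are hypothetical:

import structlog
from pydantic import BaseModel, ValidationError


class DataValidationError(ValueError):
    """Domain-specific exception raised at data boundaries."""


class TrainRequest(BaseModel):
    dataset_path: str
    test_size: float = 0.2


logger = structlog.get_logger()


def parse_request(payload: dict) -> TrainRequest:
    try:
        return TrainRequest.model_validate(payload)  # pydantic v2 API
    except ValidationError as exc:
        # Structured, JSON-friendly log entry instead of print().
        logger.error("invalid_train_request", errors=exc.errors())
        raise DataValidationError(str(exc)) from exc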
GitHub Workflow (Framework-Specific Rules)
- Branch naming: `type/scope-description` (e.g. `feat/data-loader`, `fix/model-eval`).
- Commit messages follow Conventional Commits: `type(scope): subject` + body + footer.
- Pull Requests
• Template includes: Motivation, Linked Issue, Approach, Screenshots/Artifacts, Self-Checklist.
• Minimum two approvals; at least one from a domain maintainer.
• Squash-merge with the PR title as the final commit message.
- Labels: `kind/bug`, `kind/feature`, `kind/docs`, `priority/high`, etc. New issues must be triaged within 48 h.
- GitHub Actions
• `ci.yml` runs lint, type-check, tests (Ubuntu, macOS, Windows).
• `docs.yml` deploys Sphinx site on push to main.
• `auto-assign.yml` assigns reviewers based on CODEOWNERS.
Testing
- Use `pytest>=7`; coverage target ≥ 90 %, and the build fails if coverage drops below it.
- Each bug-fix PR must add a regression test reproducing the prior failure.
- Property-based tests (`hypothesis`) for data-processing functions (example after this list).
- ML models: include unit test, integration test (end-to-end pipeline), and performance baseline test (±5% tolerance).
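A sketch of a property-based test with `hypothesis`, asserting an invariant of the earlier `process_data` example (the import path is a placeholder):

import pandas as pd
from hypothesis import given
from hypothesis import strategies as st

from your_package.core import process_data  # placeholder import path


@given(st.lists(st.one_of(st.none(), st.floats(min_value=-1e9, max_value=1e9)), min_size=1, max_size=50))
def test_group_means_never_contain_missing_values(values) -> None:
    # Missing values in the input must never leak into the aggregated output.
    df = pd.DataFrame({
        "category": ["A"] * len(values),
        "value": pd.Series(values, dtype="float64"),
    })
    result = process_data(df)
    assert not result.isna().any().any()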
Documentation
- Docs live in `docs/` (Sphinx + MyST). Build locally with `make docs` before PR.
- Every public module/function/class is auto-documented via `sphinx.ext.autodoc`.
- Tutorials and examples in `docs/tutorials/` using Jupyter notebooks executed by `nbsphinx` in CI.
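A minimal `docs/conf.py` sketch consistent with these rules (project name and theme are placeholders):

# docs/conf.py
project = "your-project"        # placeholder
extensions = [
    "sphinx.ext.autodoc",       # API reference from docstrings
    "sphinx.ext.napoleon",      # NumPy-style docstring parsing
    "myst_parser",              # Markdown pages via MyST
    "nbsphinx",                 # notebook tutorials
]
nbsphinx_execute = "always"     # execute tutorial notebooks during the docs build
html_theme = "furo"             # any Sphinx theme; furo is only an example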
Performance
- Use `pandas` vectorization before loops; fall back to `polars` or `numba` for heavy compute.
- Add benchmarks in `benchmarks/` using `pytest-benchmark` (keep `%%timeit` for quick notebook exploration). CI alerts if runtime increases by more than 10 %.
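A sketch using the `pytest-benchmark` plugin's `benchmark` fixture (the file name and workload are illustrative; the regression gate comes from comparing stored results in CI, e.g. with `pytest-benchmark compare`):

# benchmarks/test_groupby_bench.py
import numpy as np
import pandas as pd


def grouped_means(df: pd.DataFrame) -> pd.DataFrame:
    # Vectorized groupby rather than a Python-level loop over categories.
    return df.groupby("category").mean()


def test_grouped_means_benchmark(benchmark) -> None:
    rng = np.random.default_rng(seed=0)
    df = pd.DataFrame({
        "category": rng.integers(0, 10, size=100_000),
        "value": rng.random(100_000),
    })
    benchmark(grouped_means, df)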
Security
- Run `bandit -ll` in CI; fix findings or annotate with `# nosec` plus a justification (example after this list).
- Dependabot enabled; PRs auto-merge after tests if patch-level.
- Secrets scanning (`gitleaks`) required.
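When a bandit finding is a known false positive, suppress it inline with `# nosec` and a written justification, for example:

# Justification: artifacts are produced and consumed locally; no untrusted data is unpickled.
import pickle  # nosec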
Tooling & Automation
- Pre-commit config includes: black, isort, ruff, mypy, nbstripout, check-merge-conflict.
- Issue & PR templates in `.github/` and `.gitlab/` (if mirrored).
- Use GitHub Projects/Boards for task delegation; each card links an issue.
Common Pitfalls & How to Avoid Them
- Un-reproducible environments → commit `environment.yml` or `pyproject.toml` with strict versions; CI tests fresh clone.
- Large data files in repo → add to DVC / Git LFS and document download steps.
- Notebook merge conflicts → run `nbstripout` and keep outputs cleared.
Checklist for Maintainers
- [ ] Issue clearly scoped (SMART) & labeled.
- [ ] CONTRIBUTING.md, CODE_OF_CONDUCT.md, templates up-to-date.
- [ ] CI green on main.
- [ ] Dependencies audited monthly (`pip-audit`).
- [ ] Post-release retrospective logged in `docs/release_notes/`.