Comprehensive, actionable naming-convention rules for Python-centric data-science projects (scripts, notebooks, SQL, dashboards) with supporting guidelines for error handling, testing, performance and popular libraries.
Your data science codebase is growing fast. Models are multiplying, notebooks are proliferating, and your team is constantly context-switching between Python scripts, SQL queries, and dashboard code. Sound familiar?
Every data scientist has been there: opening a colleague's notebook only to find variables like df2_processed, final_model_v3, and temp_data_clean scattered throughout. Or worse—opening your own code from three months ago and spending 20 minutes deciphering what customer_data_transformed actually contains.
This isn't just about aesthetics. Poor naming conventions create real productivity drains: time lost deciphering ambiguous variables, bugs introduced by misidentified data, and integration bottlenecks between teammates.
These Cursor Rules establish a comprehensive naming convention system specifically designed for Python-centric data science workflows. They go beyond basic PEP 8 compliance to address the unique challenges of ML pipelines, notebook development, and multi-language data workflows.
What makes this different? Most naming guides give you abstract rules. This system provides concrete patterns for every component of your data science stack—from DataFrame columns to model variables to dashboard components.
Transform ambiguous transformation chains into clear, readable workflows:
# Before: What does this pipeline actually do?
df_temp = df.apply(lambda x: process_func(x))
model_v2 = train_model_updated(df_temp)
# After: Crystal clear intent and flow
customer_features_df = extract_customer_features(raw_data_df)
churn_prediction_model = train_gradient_boosting(customer_features_df)
Stop mental gear-shifting when moving between Python, SQL, and JavaScript dashboard code:
- Python: daily_revenue_df, calculate_churn_probability()
- SQL: daily_revenue table, churn_probability column
- JavaScript: dailyRevenue, calculateChurnProbability()
Eliminate the "what does this variable contain?" conversations:
# Immediately clear what each variable represents
sales_pipeline_df: pd.DataFrame # No ambiguity about data type
forecast_accuracy_score: float # Obvious metric purpose
outlier_detection_model: IsolationForest # Clear model type
The Challenge: You're building a customer segmentation model with feature engineering, multiple preprocessing steps, and model comparison. Without clear naming, tracking data through each stage becomes a debugging nightmare.
With These Rules:
# Data loading and initial processing
raw_customers_df = load_customer_data()
cleaned_customers_df = remove_duplicates(raw_customers_df)
# Feature engineering with clear intent
recency_features_df = calculate_recency_metrics(cleaned_customers_df)
frequency_features_df = calculate_frequency_metrics(cleaned_customers_df)
monetary_features_df = calculate_monetary_metrics(cleaned_customers_df)
combined_features_df = recency_features_df.join([frequency_features_df, monetary_features_df])
# Model training with descriptive names
customer_segments_kmeans = train_kmeans_clustering(combined_features_df)
customer_segments_dbscan = train_dbscan_clustering(combined_features_df)
# Performance tracking
kmeans_silhouette_score = evaluate_clustering(customer_segments_kmeans)
dbscan_silhouette_score = evaluate_clustering(customer_segments_dbscan)
Impact: Zero ambiguity about data flow, easy model comparison, and immediate understanding for team collaboration.
The Challenge: Moving experimental notebook code to production scripts often breaks because of inconsistent naming between environments.
With These Rules:
# Consistent naming from notebook to production
# notebook/exploration.ipynb
feature_importance_scores = model.feature_importances_
top_features_mask = feature_importance_scores > IMPORTANCE_THRESHOLD
# scripts/train_model.py
feature_importance_scores = model.feature_importances_ # Same variable name
top_features_mask = feature_importance_scores > IMPORTANCE_THRESHOLD # Same logic
Impact: Seamless notebook-to-production transitions, reduced refactoring time, and fewer deployment bugs.
The Challenge: Data scientists create models, frontend developers build dashboards, and the naming inconsistencies create integration bottlenecks.
With These Rules:
# Python backend (data scientist)
monthly_user_retention_rate = calculate_retention(users_df)
churn_risk_segments = ["high", "medium", "low"]
# JavaScript frontend (dashboard developer)
const monthlyUserRetentionRate = fetchRetentionMetrics();
const churnRiskSegments = ["high", "medium", "low"];
Impact: Faster integration cycles, fewer miscommunication bugs, and smoother handoffs between team roles.
# Install required linting tools
pip install pylint flake8 black
npm install --save-dev eslint prettier
# Register these as hooks in .pre-commit-config.yaml so they run on every
# commit; pass naming options (e.g. pylint --good-names=df,ax,fig) via the
# hook's args rather than echoing shell commands into the YAML file.
Create these directories with clear naming patterns:
project/
├── data_ingestion/ # Raw data loading modules
├── feature_engineering/ # Feature creation pipelines
├── models/ # Model training and evaluation
├── notebooks/ # Exploratory analysis (kebab-case files)
└── utils/ # Shared utility functions
For Pandas/Polars workflows:
# DataFrame naming: context + data type when multiple types exist
customers_df = load_customers()
customers_summary = customers_df.describe() # no _df suffix: only one data type in scope
# Pipeline functions: verb + object
filtered_customers_df = filter_active_customers(customers_df)
aggregated_sales_df = aggregate_daily_sales(transactions_df)
For scikit-learn models:
# Model naming: target_algorithm pattern
churn_random_forest = RandomForestClassifier()
price_gradient_boosting = GradientBoostingRegressor()
# Hyperparameter dictionaries
random_forest_params = {"n_estimators": 100, "max_depth": 10}
gradient_boosting_params = {"learning_rate": 0.1, "n_estimators": 200}
Add this to your pyproject.toml:
[tool.pylint.basic]
good-names = ["df", "ax", "fig", "i", "j", "k", "x", "y"]
bad-names-rgxs = "^[a-z]+[0-9]+$" # flag numbered names like df1, df2
[tool.black]
line-length = 88
While PEP 8 covers Python basics, these rules address data science-specific challenges: handling multi-format data, managing complex ML pipelines, and maintaining consistency across notebook and script environments.
The result? A data science codebase that scales with your team and projects, where every variable name tells a clear story about its purpose and data flow.
Ready to eliminate naming confusion from your data science workflows? Implement these rules and experience the difference that consistent, meaningful names make in your daily development cycle.
You are an expert in Python (Data-Science focus) + SQL + Jupyter + Bash + JavaScript (dashboards).
Key Principles
- Strive for self-documenting code: names should reveal intent without external comments.
- One concept → one spelling: never mix synonyms (e.g., customer vs. client).
- Separate WHAT from HOW: avoid implementation details in identifiers (e.g., df_normalised is better than df_applied_minmax_scaler).
- Keep a single natural language (English) across the codebase.
- Prefer brevity after clarity: mean_squared_error not mserr, but mse is fine when domain-standard.
- Variables ⇒ nouns, functions/methods ⇒ verbs, exceptions ⇒ noun + “Error”.
- Consistency over cleverness: follow the dominant convention of the file or project.
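A hypothetical before/after that puts these principles side by side (every name and helper here is invented for illustration):
# Before: mixes customer/client and leaks the HOW into the name
client_df_minmax_scaled = scale_features(customer_df)
# After: one spelling per concept, name states WHAT the data now is
normalised_customers = normalise_features(customers)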
Python
- Variables & functions: snake_case ➜ mean_temperature, load_data().
- Classes, Enums, dataclasses: PascalCase ➜ LinearRegressionModel.
- Constants/module-level globals: UPPER_SNAKE_CASE ➜ DEFAULT_CHUNK_SIZE = 10_000.
- Private attributes: single underscore prefix (_cache).
- Avoid type suffix/prefix (df_, list_): rely on type hints instead (customers: pd.DataFrame).
- Pandas columns: lower_snake_case; never rely on implicit index names (set Series.name explicitly).
- Notebook cell variables follow same rules; avoid single-letter (i, j) except in short comprehensions.
- File names: kebab-case for standalone scripts (train-model.py), snake_case for importable modules (data_loader.py); kebab-case files cannot be imported as modules. See the sketch below.
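A compact sketch applying the rules above in a single module (assuming pandas is available; all names are illustrative):
from dataclasses import dataclass, field

import pandas as pd

DEFAULT_CHUNK_SIZE = 10_000  # constant: UPPER_SNAKE_CASE

@dataclass
class LinearRegressionModel:  # class: PascalCase
    learning_rate: float = 0.01
    _cache: dict = field(default_factory=dict)  # private attribute: underscore prefix

def load_data(path: str) -> pd.DataFrame:  # function: snake_case verb
    # type hint instead of a df_ prefix
    customers: pd.DataFrame = pd.read_csv(path)
    # columns: lower_snake_case
    customers.columns = [col.strip().lower().replace(" ", "_") for col in customers.columns]
    return customers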
SQL
- Table names: plural snake_case nouns (customers, order_items).
- Column names: singular snake_case (order_id, created_at).
- Aliases: short, lowercase, no reserved words (c, o).
- CTE names: descriptive snake_case verbs or nouns (daily_sales, get_latest_orders).
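These rules in one place, as a Postgres-flavoured query held in a Python constant (the schema is hypothetical):
# Plural table names, singular columns, short aliases, descriptive CTE name
DAILY_SALES_QUERY = """
WITH daily_sales AS (
    SELECT o.created_at::date AS sale_date,
           SUM(oi.unit_price) AS revenue
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    GROUP BY 1
)
SELECT sale_date, revenue FROM daily_sales;
"""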
JavaScript (Dash/Plotly, Streamlit components)
- Variables, functions: camelCase (fetchMetrics).
- React components: PascalCase (SalesChart).
- Constants: UPPER_SNAKE_CASE (DEFAULT_COLOR).
Error Handling & Validation
- Custom exceptions end with Error (DataValidationError).
- Prefix guard functions with ensure_ or assert_ (ensure_non_empty_df).
- Early-return on invalid data; communicate via raised errors not sentinel values.
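A minimal sketch of these conventions:
import pandas as pd

class DataValidationError(Exception):
    """Raised when input data fails validation checks."""

def ensure_non_empty_df(df: pd.DataFrame, name: str = "dataframe") -> None:
    """Raise DataValidationError if the dataframe has no rows."""
    if df.empty:  # communicate via a raised error, not a sentinel return value
        raise DataValidationError(f"{name} is empty")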
Framework-Specific Rules
Pandas / Polars
- DataFrame variables end with _df only when multiple data types coexist in scope.
- Use verb + object for transformation pipelines (filter_outliers, aggregate_sales).
Scikit-learn / PyTorch
- Model variables: <target>_<model> (sales_rf, price_lgbm).
- Hyperparameter dicts: <model>_params.
Airflow
- DAG id: kebab-case, version suffix (train-model-v2).
- Task ids: verb-noun kebab-case (load-customers).
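A sketch of these ids in a DAG definition, assuming Airflow 2.4+ (the schedule and callable are placeholders):
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_customers() -> None:
    """Placeholder task body."""

with DAG(
    dag_id="train-model-v2",  # kebab-case DAG id with version suffix
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    load_customers_task = PythonOperator(
        task_id="load-customers",  # verb-noun kebab-case task id
        python_callable=load_customers,
    )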
Additional Sections
Testing
- Test modules prefixed with test_*.py.
- Test functions: test_<unit_of_work>_<expected_behaviour> (test_load_data_handles_empty).
- Fixtures: noun_fixture (customers_df_fixture).
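A sketch under pytest; pd.read_csv stands in for a hypothetical load_data under test:
# tests/test_data_loader.py
import pandas as pd
import pytest

@pytest.fixture
def customers_df_fixture() -> pd.DataFrame:
    """Small in-memory customers table for tests."""
    return pd.DataFrame({"customer_id": [1, 2], "revenue": [10.0, 20.0]})

def test_load_data_handles_empty(tmp_path) -> None:
    empty_csv = tmp_path / "customers.csv"
    empty_csv.write_text("customer_id,revenue\n")  # header only, no rows
    loaded_df = pd.read_csv(empty_csv)
    assert loaded_df.empty

def test_total_revenue_sums_rows(customers_df_fixture) -> None:
    assert customers_df_fixture["revenue"].sum() == 30.0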
Performance
- Timing variables: prefix with t_ (t_start, t_end) for quick grep.
- Cache keys: describe scope + hash (user_features_sha256).
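For example (feature_spec is a stand-in for whatever defines the cache scope):
import hashlib
import time

t_start = time.perf_counter()
# ... expensive computation ...
t_end = time.perf_counter()
elapsed_seconds = t_end - t_start

# Cache key: scope + content hash
feature_spec = "recency,frequency,monetary"
user_features_sha256 = hashlib.sha256(feature_spec.encode()).hexdigest()
cache_key = f"user_features_{user_features_sha256[:12]}"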
Security
- Secrets/env vars: UPPER_SNAKE_CASE suffixed with _KEY/_TOKEN/_PWD.
- Never embed secret values in variable names.
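For example (the environment variable names are hypothetical; names describe the kind of secret, never its value):
import os

WAREHOUSE_PWD = os.environ["WAREHOUSE_PWD"]
DASHBOARD_API_TOKEN = os.environ.get("DASHBOARD_API_TOKEN", "")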
Documentation
- Use docstrings: first sentence starts with verb ("Return cleaned dataframe.").
- Include naming rationale when deviating from rules.
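A docstring sketch (clean_sales and the order_id column are hypothetical):
import pandas as pd

def clean_sales(sales: pd.DataFrame) -> pd.DataFrame:
    """Return cleaned dataframe with duplicate orders removed.

    Note: keeps the domain-standard abbreviation mse rather than
    mean_squared_error (see naming rules).
    """
    return sales.drop_duplicates(subset="order_id")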
Directory Layout
- data_ingestion/, feature_engineering/, models/, notebooks/, utils/.
- Each dir contains __init__.py exporting public symbols listed in __all__ for clarity.
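A sketch of such an __init__.py (submodule and function names are hypothetical):
# feature_engineering/__init__.py
from .recency import calculate_recency_metrics
from .frequency import calculate_frequency_metrics

__all__ = ["calculate_recency_metrics", "calculate_frequency_metrics"]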
Linting & Tooling
- Enable pylint/flake8 rules: naming-convention (good-names, invalid-name).
- Pre-commit hook: lint SQL naming and capitalisation with sqlfluff.
- prettier + eslint for JS; enforce camelCase rule.
Edge Cases & Pitfalls
- No Hungarian notation (strCustomerName).
- Avoid shadowing built-ins (list, dict, sum).
- Temporary variables (tmp_, _scratch): acceptable while iterating, but delete before commit. A before/after sketch:
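# Before: Hungarian notation and a shadowed built-in
strCustomerName = "Acme"
list = ["high", "medium", "low"]  # shadows the built-in list

# After
customer_name = "Acme"
churn_risk_segments = ["high", "medium", "low"]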
Happy coding with clear names that survive refactors!