Comprehensive, actionable naming-convention rules for Python-centric data-science projects (scripts, notebooks, SQL, dashboards) with supporting guidelines for error handling, testing, performance and popular libraries.
Your data science codebase is growing fast. Models are multiplying, notebooks are proliferating, and your team is constantly context-switching between Python scripts, SQL queries, and dashboard code. Sound familiar?
Every data scientist has been there: opening a colleague's notebook only to find variables like df2_processed, final_model_v3, and temp_data_clean scattered throughout. Or worse—opening your own code from three months ago and spending 20 minutes deciphering what customer_data_transformed actually contains.
This isn't just about aesthetics. Poor naming conventions create real productivity drains: time lost deciphering ambiguous variables, bugs introduced by misidentified data, and integration bottlenecks between teammates.
These Cursor Rules establish a comprehensive naming convention system specifically designed for Python-centric data science workflows. They go beyond basic PEP 8 compliance to address the unique challenges of ML pipelines, notebook development, and multi-language data workflows.
What makes this different? Most naming guides give you abstract rules. This system provides concrete patterns for every component of your data science stack—from DataFrame columns to model variables to dashboard components.
Transform ambiguous transformation chains into clear, readable workflows:
# Before: What does this pipeline actually do?
df_temp = df.apply(lambda x: process_func(x))
model_v2 = train_model_updated(df_temp)
# After: Crystal clear intent and flow
customer_features_df = extract_customer_features(raw_data_df)
churn_prediction_model = train_gradient_boosting(customer_features_df)
Stop mental gear-shifting when moving between Python, SQL, and JavaScript dashboard code:
- Python: daily_revenue_df, calculate_churn_probability()
- SQL: daily_revenue table, churn_probability column
- JavaScript: dailyRevenue, calculateChurnProbability()
Eliminate the "what does this variable contain?" conversations:
# Immediately clear what each variable represents
sales_pipeline_df: pd.DataFrame # No ambiguity about data type
forecast_accuracy_score: float # Obvious metric purpose
outlier_detection_model: IsolationForest # Clear model type
The Challenge: You're building a customer segmentation model with feature engineering, multiple preprocessing steps, and model comparison. Without clear naming, tracking data through each stage becomes a debugging nightmare.
With These Rules:
# Data loading and initial processing
raw_customers_df = load_customer_data()
cleaned_customers_df = remove_duplicates(raw_customers_df)
# Feature engineering with clear intent
recency_features_df = calculate_recency_metrics(cleaned_customers_df)
frequency_features_df = calculate_frequency_metrics(cleaned_customers_df)
monetary_features_df = calculate_monetary_metrics(cleaned_customers_df)
combined_features_df = recency_features_df.join([frequency_features_df, monetary_features_df])
# Model training with descriptive names
customer_segments_kmeans = train_kmeans_clustering(combined_features_df)
customer_segments_dbscan = train_dbscan_clustering(combined_features_df)
# Performance tracking
kmeans_silhouette_score = evaluate_clustering(customer_segments_kmeans)
dbscan_silhouette_score = evaluate_clustering(customer_segments_dbscan)
Impact: Zero ambiguity about data flow, easy model comparison, and immediate understanding for team collaboration.
The Challenge: Moving experimental notebook code to production scripts often breaks because of inconsistent naming between environments.
With These Rules:
# Consistent naming from notebook to production
# notebook/exploration.ipynb
feature_importance_scores = model.feature_importances_
top_features_mask = feature_importance_scores > IMPORTANCE_THRESHOLD
# scripts/train_model.py
feature_importance_scores = model.feature_importances_ # Same variable name
top_features_mask = feature_importance_scores > IMPORTANCE_THRESHOLD # Same logic
Impact: Seamless notebook-to-production transitions, reduced refactoring time, and fewer deployment bugs.
The Challenge: Data scientists create models, frontend developers build dashboards, and the naming inconsistencies create integration bottlenecks.
With These Rules:
# Python backend (data scientist)
monthly_user_retention_rate = calculate_retention(users_df)
churn_risk_segments = ["high", "medium", "low"]
# JavaScript frontend (dashboard developer)
const monthlyUserRetentionRate = fetchRetentionMetrics();
const churnRiskSegments = ["high", "medium", "low"];
Impact: Faster integration cycles, fewer miscommunication bugs, and smoother handoffs between team roles.
# Install required linting tools
pip install pylint flake8 black
npm install --save-dev eslint prettier
# Register these as hooks in .pre-commit-config.yaml so they run on every
# commit; pass naming options (e.g. pylint --good-names=df,ax,fig) via the
# hook's args rather than echoing shell commands into the YAML file.
Create these directories with clear naming patterns:
project/
├── data_ingestion/ # Raw data loading modules
├── feature_engineering/ # Feature creation pipelines
├── models/ # Model training and evaluation
├── notebooks/ # Exploratory analysis (kebab-case files)
└── utils/ # Shared utility functions
For Pandas/Polars workflows:
# DataFrame naming: context + data type when multiple types exist
customers_df = load_customers()
customers_summary = customers_df.describe() # no _df suffix: only one data type in scope
# Pipeline functions: verb + object
filtered_customers_df = filter_active_customers(customers_df)
aggregated_sales_df = aggregate_daily_sales(transactions_df)
For scikit-learn models:
# Model naming: target_algorithm pattern
churn_random_forest = RandomForestClassifier()
price_gradient_boosting = GradientBoostingRegressor()
# Hyperparameter dictionaries
random_forest_params = {"n_estimators": 100, "max_depth": 10}
gradient_boosting_params = {"learning_rate": 0.1, "n_estimators": 200}
Add this to your pyproject.toml:
[tool.pylint.basic]
good-names = ["df", "ax", "fig", "i", "j", "k", "x", "y"]
bad-names-rgxs = "^[a-z]+[0-9]+$" # flag numbered names like df1, df2
[tool.black]
line-length = 88
While PEP 8 covers Python basics, these rules address data science-specific challenges: handling multi-format data, managing complex ML pipelines, and maintaining consistency across notebook and script environments.
The result? A data science codebase that scales with your team and projects, where every variable name tells a clear story about its purpose and data flow.
Ready to eliminate naming confusion from your data science workflows? Implement these rules and experience the difference that consistent, meaningful names make in your daily development cycle.
You are an expert in Python (Data-Science focus) + SQL + Jupyter + Bash + JavaScript (dashboards).
Key Principles
- Strive for self-documenting code: names should reveal intent without external comments.
- One concept → one spelling: never mix synonyms (e.g., customer vs. client).
- Separate WHAT from HOW: avoid implementation details in identifiers (e.g., df_normalised is better than df_applied_minmax_scaler).
- Keep a single natural language (English) across the codebase.
- Prefer brevity after clarity: mean_squared_error not mserr, but mse is fine when domain-standard.
- Variables ⇒ nouns, functions/methods ⇒ verbs, exceptions ⇒ noun + “Error”.
- Consistency over cleverness: follow the dominant convention of the file or project.
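A hypothetical before/after that puts these principles side by side (every name and helper here is invented for illustration):
# Before: mixes customer/client and leaks the HOW into the name
client_df_minmax_scaled = scale_features(customer_df)
# After: one spelling per concept, name states WHAT the data now is
normalised_customers = normalise_features(customers)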
Python
- Variables & functions: snake_case ➜ mean_temperature, load_data().
- Classes, Enums, dataclasses: PascalCase ➜ LinearRegressionModel.
- Constants/module-level globals: UPPER_SNAKE_CASE ➜ DEFAULT_CHUNK_SIZE = 10_000.
- Private attributes: single underscore prefix (_cache).
- Avoid type suffix/prefix (df_, list_): rely on type hints instead (customers: pd.DataFrame).
- Pandas columns: lower_snake_case; never rely on implicit index names (set Series.name explicitly).
- Notebook cell variables follow same rules; avoid single-letter (i, j) except in short comprehensions.
- File names: kebab-case for standalone scripts (train-model.py), snake_case for importable modules (data_loader.py); kebab-case files cannot be imported as modules. See the sketch below.
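A compact sketch applying the rules above in a single module (assuming pandas is available; all names are illustrative):
from dataclasses import dataclass, field

import pandas as pd

DEFAULT_CHUNK_SIZE = 10_000  # constant: UPPER_SNAKE_CASE

@dataclass
class LinearRegressionModel:  # class: PascalCase
    learning_rate: float = 0.01
    _cache: dict = field(default_factory=dict)  # private attribute: underscore prefix

def load_data(path: str) -> pd.DataFrame:  # function: snake_case verb
    # type hint instead of a df_ prefix
    customers: pd.DataFrame = pd.read_csv(path)
    # columns: lower_snake_case
    customers.columns = [col.strip().lower().replace(" ", "_") for col in customers.columns]
    return customers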
SQL
- Table names: plural snake_case nouns (customers, order_items).
- Column names: singular snake_case (order_id, created_at).
- Aliases: short, lowercase, no reserved words (c, o).
- CTE names: descriptive snake_case verbs or nouns (daily_sales, get_latest_orders).
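These rules in one place, as a Postgres-flavoured query held in a Python constant (the schema is hypothetical):
# Plural table names, singular columns, short aliases, descriptive CTE name
DAILY_SALES_QUERY = """
WITH daily_sales AS (
    SELECT o.created_at::date AS sale_date,
           SUM(oi.unit_price) AS revenue
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    GROUP BY 1
)
SELECT sale_date, revenue FROM daily_sales;
"""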
JavaScript (Dash/Plotly, Streamlit components)
- Variables, functions: camelCase (fetchMetrics).
- React components: PascalCase (SalesChart).
- Constants: UPPER_SNAKE_CASE (DEFAULT_COLOR).
Error Handling & Validation
- Custom exceptions end with Error (DataValidationError).
- Prefix guard functions with ensure_ or assert_ (ensure_non_empty_df).
- Early-return on invalid data; communicate via raised errors not sentinel values.
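A minimal sketch of these conventions:
import pandas as pd

class DataValidationError(Exception):
    """Raised when input data fails validation checks."""

def ensure_non_empty_df(df: pd.DataFrame, name: str = "dataframe") -> None:
    """Raise DataValidationError if the dataframe has no rows."""
    if df.empty:  # communicate via a raised error, not a sentinel return value
        raise DataValidationError(f"{name} is empty")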
Framework-Specific Rules
Pandas / Polars
- DataFrame variables end with _df only when multiple data types coexist in scope.
- Use verb + object for transformation pipelines (filter_outliers, aggregate_sales).
Scikit-learn / PyTorch
- Model variables: <target>_<model> (sales_rf, price_lgbm).
- Hyperparameter dicts: <model>_params.
Airflow
- DAG id: kebab-case, version suffix (train-model-v2).
- Task ids: verb-noun kebab-case (load-customers).
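A sketch of these ids in a DAG definition, assuming Airflow 2.4+ (the schedule and callable are placeholders):
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_customers() -> None:
    """Placeholder task body."""

with DAG(
    dag_id="train-model-v2",  # kebab-case DAG id with version suffix
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    load_customers_task = PythonOperator(
        task_id="load-customers",  # verb-noun kebab-case task id
        python_callable=load_customers,
    )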
Additional Sections
Testing
- Test modules prefixed with test_*.py.
- Test functions: test_<unit_of_work>_<expected_behaviour> (test_load_data_handles_empty).
- Fixtures: noun_fixture (customers_df_fixture).
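A sketch under pytest; pd.read_csv stands in for a hypothetical load_data under test:
# tests/test_data_loader.py
import pandas as pd
import pytest

@pytest.fixture
def customers_df_fixture() -> pd.DataFrame:
    """Small in-memory customers table for tests."""
    return pd.DataFrame({"customer_id": [1, 2], "revenue": [10.0, 20.0]})

def test_load_data_handles_empty(tmp_path) -> None:
    empty_csv = tmp_path / "customers.csv"
    empty_csv.write_text("customer_id,revenue\n")  # header only, no rows
    loaded_df = pd.read_csv(empty_csv)
    assert loaded_df.empty

def test_total_revenue_sums_rows(customers_df_fixture) -> None:
    assert customers_df_fixture["revenue"].sum() == 30.0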
Performance
- Timing variables: prefix with t_ (t_start, t_end) for quick grep.
- Cache keys: describe scope + hash (user_features_sha256).
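For example (feature_spec is a stand-in for whatever defines the cache scope):
import hashlib
import time

t_start = time.perf_counter()
# ... expensive computation ...
t_end = time.perf_counter()
elapsed_seconds = t_end - t_start

# Cache key: scope + content hash
feature_spec = "recency,frequency,monetary"
user_features_sha256 = hashlib.sha256(feature_spec.encode()).hexdigest()
cache_key = f"user_features_{user_features_sha256[:12]}"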
Security
- Secrets/env vars: UPPER_SNAKE_CASE suffixed with _KEY/_TOKEN/_PWD.
- Never embed secret values in variable names.
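For example (the environment variable names are hypothetical; names describe the kind of secret, never its value):
import os

WAREHOUSE_PWD = os.environ["WAREHOUSE_PWD"]
DASHBOARD_API_TOKEN = os.environ.get("DASHBOARD_API_TOKEN", "")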
Documentation
- Use docstrings: first sentence starts with verb ("Return cleaned dataframe.").
- Include naming rationale when deviating from rules.
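A docstring sketch (clean_sales and the order_id column are hypothetical):
import pandas as pd

def clean_sales(sales: pd.DataFrame) -> pd.DataFrame:
    """Return cleaned dataframe with duplicate orders removed.

    Note: keeps the domain-standard abbreviation mse rather than
    mean_squared_error (see naming rules).
    """
    return sales.drop_duplicates(subset="order_id")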
Directory Layout
- data_ingestion/, feature_engineering/, models/, notebooks/, utils/.
- Each dir contains __init__.py exporting public symbols listed in __all__ for clarity.
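A sketch of such an __init__.py (submodule and function names are hypothetical):
# feature_engineering/__init__.py
from .recency import calculate_recency_metrics
from .frequency import calculate_frequency_metrics

__all__ = ["calculate_recency_metrics", "calculate_frequency_metrics"]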
Linting & Tooling
- Enable pylint/flake8 rules: naming-convention (good-names, invalid-name).
- Pre-commit hook: lint SQL naming and capitalisation with sqlfluff.
- prettier + eslint for JS; enforce camelCase rule.
Edge Cases & Pitfalls
- No Hungarian notation (strCustomerName).
- Avoid shadowing built-ins (list, dict, sum).
- Temporary variables (tmp_, _scratch): acceptable while iterating, but delete before commit. A before/after sketch:
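# Before: Hungarian notation and a shadowed built-in
strCustomerName = "Acme"
list = ["high", "medium", "low"]  # shadows the built-in list

# After
customer_name = "Acme"
churn_risk_segments = ["high", "medium", "low"]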
Happy coding with clear names that survive refactors!