Comprehensive Rules for building, deploying, and maintaining an AI-assisted software-project estimation service.
You know the drill. Another sprint planning meeting, another round of estimation poker, another project that takes twice as long as predicted. While everyone else is still throwing story point cards around conference tables, you could be building an AI-driven estimation engine that learns from every missed deadline and delivers confidence intervals instead of wild guesses.
Traditional software estimation is broken by design. Planning poker gives you the illusion of precision while masking fundamental uncertainty. Historical velocity calculations assume your next sprint will be identical to your last. Management wants commitments, but you're making educated guesses with incomplete information.
Here's what's actually happening: you spend hours in estimation meetings producing numbers that are consistently wrong by 50-200%.
These Cursor Rules help you build an estimation service that doesn't just guess—it forecasts. By combining top-down, bottom-up, analogy-based, and parametric estimation methods with machine learning models trained on your actual delivery data, you get predictions with quantified confidence and automatic bias correction.
Instead of story points pulled from thin air, you get:
{
  "pointEstimate": 34,
  "confidenceInterval": {"low": 28, "high": 42},
  "methodologyWeights": {
    "topDown": 0.25,
    "bottomUp": 0.35,
    "analogy": 0.30,
    "parametric": 0.10
  },
  "riskBuffer": 8,
  "biasCorrection": -2.3
}
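A minimal Pydantic sketch of that payload; the model and field names here are illustrative, not part of the rules:
# Hypothetical schema mirroring the JSON above
from pydantic import BaseModel

class ConfidenceInterval(BaseModel):
    low: float
    high: float

class EstimationForecast(BaseModel):
    pointEstimate: float
    confidenceInterval: ConfidenceInterval
    methodologyWeights: dict[str, float]
    riskBuffer: float
    biasCorrection: float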
The system learns from every completed sprint, automatically retraining models to reduce estimation bias over time.
Eliminate Estimation Theater
Replace 2-hour planning poker sessions with 10-minute AI-assisted forecasts. Your ML models analyze thousands of similar tasks instantly, while your team focuses on breaking down complex requirements.
Quantify Uncertainty Instead of Hiding It
Stop pretending estimates are commitments. Surface confidence intervals and risk buffers that help stakeholders make informed decisions about scope and timelines.
Continuous Accuracy Improvement
Every completed sprint feeds back into the model. Estimation bias automatically decreases over time as the system learns your team's specific patterns and constraints.
Risk-Aware Planning
The system factors in technical debt, team velocity variance, and historical scope creep patterns to provide realistic buffers instead of optimistic best-case scenarios.
Before: Product manager presents 12 user stories. Team spends 90 minutes debating whether authentication should be 5 or 8 points. Half the team anchors on the first estimate, discussion gets circular, final estimates vary wildly based on who spoke loudest.
After: Upload story descriptions to /estimate endpoint. Get back predictions in seconds with confidence intervals. Team spends 15 minutes reviewing high-uncertainty items and discussing scope clarifications. Planning meeting focuses on breaking down complex stories, not arguing about numbers.
# Real API call during sprint planning
response = await client.post("/estimate", json={
    "stories": [
        {
            "title": "Implement OAuth2 authentication",
            "description": "Users should be able to log in with Google/GitHub",
            "acceptanceCriteria": ["Social login buttons", "Token management", "User profile sync"],
            "complexity": "MEDIUM",
            "domain": "AUTHENTICATION"
        }
    ],
    "teamContext": {
        "velocity": {"mean": 28, "std": 4.2},
        "sprintNumber": 12,
        "technicalDebtScore": 3.2
    }
})
Before: Scope creep hits during development. Original 5-point story balloons to 13 points. Team scrambles to re-estimate remaining backlog manually. Sprint commitment becomes meaningless.
After: System automatically re-estimates based on actual progress. Confidence intervals tighten as work progresses. Stakeholders get proactive alerts when sprint commitment is at risk.
Before: Sum up story points across epics, multiply by assumed velocity, present timeline to executives. Reality diverges from plan within 2 weeks.
After: Portfolio-level Monte Carlo simulation considers inter-team dependencies, resource constraints, and historical delivery variance. Present probability distributions instead of false precision.
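To make that concrete, here is a minimal Monte Carlo sketch in plain NumPy; the backlog size and velocity figures are illustrative, not taken from the rules:
# Toy portfolio forecast: sample sprint velocities and count sprints to burn down the backlog
import numpy as np

rng = np.random.default_rng(42)
backlog_points = 420                    # remaining scope across teams (illustrative)
velocity_mean, velocity_std = 28, 4.2   # per-sprint velocity from historical data

def sprints_to_finish(n_simulations: int = 10_000) -> np.ndarray:
    results = []
    for _ in range(n_simulations):
        remaining, sprints = backlog_points, 0
        while remaining > 0:
            remaining -= max(rng.normal(velocity_mean, velocity_std), 1.0)
            sprints += 1
        results.append(sprints)
    return np.array(results)

outcomes = sprints_to_finish()
p50, p85 = np.percentile(outcomes, [50, 85])
print(f"50% chance of finishing within {p50:.0f} sprints, 85% within {p85:.0f}")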
# Project structure following the rules
mkdir estimation-service && cd estimation-service
mkdir -p src/{estimation_api,estimation_core,estimation_ml,estimation_data}
mkdir -p tests/{unit,integration}
# Initialize with proper tooling
poetry init
poetry add fastapi uvicorn pydantic torch scikit-learn lightgbm
poetry add --group dev pytest black ruff mypy bandit
Build the hybrid estimation engine that combines multiple methodologies:
# src/estimation_core/hybrid_estimator.py
import asyncio
from dataclasses import dataclass
from typing import Dict

import numpy as np

from .methods import TopDownEstimator, BottomUpEstimator, AnalogyEstimator
from .models import EstimationRequest  # assumed location of the request schema


@dataclass
class EstimationResult:
    point_estimate: float
    confidence_low: float
    confidence_high: float
    methodology_weights: Dict[str, float]
    risk_buffer: float


class HybridEstimator:
    def __init__(self):
        self.top_down = TopDownEstimator()
        self.bottom_up = BottomUpEstimator()
        self.analogy = AnalogyEstimator()

    async def estimate(self, request: EstimationRequest) -> EstimationResult:
        # Get estimates from each method concurrently
        estimates = await asyncio.gather(
            self.top_down.estimate(request),
            self.bottom_up.estimate(request),
            self.analogy.estimate(request),
        )

        # Calculate weighted average based on each method's confidence
        weights = self._calculate_weights(estimates, request)
        point_estimate = sum(est.value * weight for est, weight in zip(estimates, weights))

        # Calculate confidence interval using ensemble variance
        confidence_interval = self._calculate_confidence(estimates, weights)

        return EstimationResult(
            point_estimate=point_estimate,
            confidence_low=confidence_interval[0],
            confidence_high=confidence_interval[1],
            methodology_weights=dict(zip(["topDown", "bottomUp", "analogy"], weights)),
            risk_buffer=self._calculate_risk_buffer(request),
        )
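The private helpers are not prescribed by the rules; one reasonable sketch uses inverse-variance weighting and a normal-approximation interval, assuming each method's estimate exposes `value` and `variance` attributes and the request carries a `critical_path_risk_score`:
    # Sketch of the helpers referenced above (inverse-variance weighting is an assumption)
    def _calculate_weights(self, estimates, request) -> list[float]:
        # Methods reporting lower variance get proportionally more weight
        inverse_variances = np.array([1.0 / max(est.variance, 1e-6) for est in estimates])
        return list(inverse_variances / inverse_variances.sum())

    def _calculate_confidence(self, estimates, weights) -> tuple[float, float]:
        values = np.array([est.value for est in estimates])
        mean = float(np.dot(weights, values))
        # Spread of the ensemble around the weighted mean, widened to a ~95% interval
        std = float(np.sqrt(np.dot(weights, (values - mean) ** 2)))
        return mean - 1.96 * std, mean + 1.96 * std

    def _calculate_risk_buffer(self, request) -> float:
        # Guardrail from the rules: buffer of at least (critical path risk score x 0.1)
        return max(request.critical_path_risk_score * 0.1, 0.0)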
Implement the learning system that improves estimates over time:
# src/estimation_ml/training_pipeline.py
import lightgbm as lgb
import numpy as np
import pandas as pd
import torch
from sklearn.ensemble import StackingRegressor
from sklearn.model_selection import train_test_split


class EstimationMLPipeline:
    def __init__(self):
        self.base_models = [
            ("lgb", lgb.LGBMRegressor(objective="regression", num_leaves=31)),
            ("nn", self._build_neural_network()),  # sklearn-compatible wrapper around a PyTorch MLP
        ]
        self.meta_model = StackingRegressor(
            estimators=self.base_models,
            final_estimator=lgb.LGBMRegressor(),
        )

    def train_nightly(self, training_data: pd.DataFrame):
        """Nightly retraining job that learns from completed sprints."""
        features = self._engineer_features(training_data)
        targets = training_data["actual_story_points"]

        # Split features and targets together so rows stay aligned
        X_train, X_val, y_train, y_val = train_test_split(features, targets, test_size=0.2)

        self.meta_model.fit(X_train, y_train)

        # Calculate bias correction on the held-out set
        predictions = self.meta_model.predict(X_val)
        bias = np.mean(predictions - y_val)

        # Save model with version and bias correction
        self._save_versioned_model(bias)
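At inference time the stored bias can be subtracted from raw predictions; the method below is an illustrative addition to the pipeline, not part of the rules:
    # Sketch: apply the stored correction when serving predictions (method name is illustrative)
    def predict_corrected(self, features: pd.DataFrame, bias_correction: float) -> np.ndarray:
        raw = self.meta_model.predict(features)
        # Subtracting the systematic bias cancels chronic over- or under-estimation
        return raw - bias_correction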
Wire up the production API with proper monitoring and error handling:
# src/estimation_api/main.py
from fastapi import FastAPI, Depends, HTTPException
from prometheus_client import Histogram, Counter
import structlog

from estimation_core.hybrid_estimator import HybridEstimator
from .dependencies import get_estimator, get_trace_id  # assumed module layout for DI helpers
from .schemas import EstimationRequest, EstimationResponse  # assumed module layout for schemas

app = FastAPI(title="Estimation Service")

# Metrics
estimation_latency = Histogram(
    "estimation_request_duration_seconds", "Time spent serving estimation requests"
)
estimation_errors = Counter("estimation_errors_total", "Total failed estimation requests")

logger = structlog.get_logger()


@app.post("/estimate", response_model=EstimationResponse)
async def estimate_stories(
    request: EstimationRequest,
    estimator: HybridEstimator = Depends(get_estimator),
    trace_id: str = Depends(get_trace_id),
):
    with estimation_latency.time():
        try:
            result = await estimator.estimate(request)

            # Store the request/result pair for future training
            await store_estimation_request(request, result, trace_id)

            logger.info(
                "Estimation completed",
                trace_id=trace_id,
                point_estimate=result.point_estimate,
                methodology_weights=result.methodology_weights,
            )
            return EstimationResponse(**result.__dict__)
        except Exception as e:
            estimation_errors.inc()
            logger.error("Estimation failed", trace_id=trace_id, error=str(e))
            raise HTTPException(status_code=503, detail="Estimation service unavailable")
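The `get_estimator` and `get_trace_id` dependencies are referenced but not shown; a minimal sketch (module path and header handling are assumptions) might look like this:
# src/estimation_api/dependencies.py (hypothetical module)
import uuid

from fastapi import Header

from estimation_core.hybrid_estimator import HybridEstimator

_estimator = HybridEstimator()  # built once at startup and reused across requests


def get_estimator() -> HybridEstimator:
    return _estimator


def get_trace_id(x_trace_id: str | None = Header(default=None)) -> str:
    # Honor an incoming x-trace-id header, otherwise mint one for this request
    return x_trace_id or uuid.uuid4().hex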
Set up production deployment with auto-scaling and monitoring:
# infrastructure/main.py (AWS CDK)
from aws_cdk import aws_cloudwatch as cloudwatch
from aws_cdk import aws_ecs_patterns as ecs_patterns

service = ecs_patterns.ApplicationLoadBalancedFargateService(
    self, "EstimationService",
    task_definition=task_def,   # Fargate task definition built elsewhere in the stack
    public_load_balancer=True,  # container logging is configured on the task definition
    desired_count=2,
)

# Auto-scaling based on CPU and a custom latency metric
scaling = service.service.auto_scale_task_count(max_capacity=10)
scaling.scale_on_cpu_utilization("CpuScaling", target_utilization_percent=60)
scaling.scale_on_metric(
    "EstimationLatency",
    # p95 latency published by the app as a custom CloudWatch metric (namespace/name are illustrative)
    metric=cloudwatch.Metric(
        namespace="EstimationService",
        metric_name="EstimationLatencyP95Ms",
        statistic="Average",
    ),
    scaling_steps=[
        {"upper": 100, "change": -1},   # scale in while p95 latency stays well under budget
        {"lower": 200, "change": +2},   # scale out when p95 latency breaches 200 ms
    ],
)
Estimation Accuracy: Teams report 40-60% reduction in estimation error after 3 months of model learning from actual delivery data.
Planning Efficiency: Sprint planning meetings shortened from 2+ hours to 30-45 minutes, with more time spent on valuable scope discussion rather than number debates.
Stakeholder Trust: Confidence intervals and risk buffers help manage expectations. When estimates say "80% chance of completing 28-34 points," stakeholders can make informed decisions about scope trade-offs.
Continuous Improvement: Automated bias detection catches systematic estimation errors. Teams that consistently under-estimate authentication work get automatic corrections applied to future similar tasks.
Portfolio Visibility: Release managers can run Monte Carlo simulations across multiple teams to forecast delivery dates with quantified uncertainty rather than false precision.
Risk Management: Built-in contingency buffer calculations based on project complexity and team velocity variance reduce scope creep impact by 30-50%.
The system pays for itself within the first quarter by reducing estimation overhead and improving delivery predictability. Your team stops guessing and starts forecasting with confidence intervals that actually mean something.
You are an expert in Agile software estimation, Python, FastAPI, PyTorch, scikit-learn, and AWS.
Key Principles
- Blend top-down, bottom-up, analogy, and parametric methods in every estimate; show the weight of each in output JSON.
- Estimates are forecasts, not commitments; the code must expose confidence intervals and contingency buffers.
- Prefer functional, declarative code; minimal shared state, pure functions for calculations, classes only for data models.
- All estimation logic must be reproducible: every prediction request stores its full input payload, model version, and hyper-parameters.
- Automate continuous improvement: nightly jobs retrain models on latest actual-vs-estimate deltas.
- Infrastructure as Code (IaC) is mandatory; use Terraform CDK with least-privilege IAM roles.
- Default to secure, privacy-preserving data handling; no PII in logs.
Python
- Use Python 3.11+. Enforce `ruff` + `black` (line length = 100) and `mypy --strict` in CI.
- Type hints are required; use `pydantic.BaseModel` for request/response schemas.
- Directory layout:
  src/
    estimation_api/    # FastAPI routers & DI
    estimation_core/   # pure estimation algorithms
    estimation_ml/     # ML pipelines, torch models
    estimation_data/   # feature builders, data access
  tests/
- Never catch bare `Exception`; trap specific exceptions (`ValueError`, `HTTPException`, etc.).
- Use `Enum` for categorical features (e.g., Domain, ComplexityBucket).
Error Handling and Validation
- Validate all incoming JSON with Pydantic models; reject unknown fields (`extra = "forbid"`); a schema sketch follows this list.
- Early-return on invalid data; respond with HTTP 422 and detailed validation errors.
- On prediction failure, return HTTP 503 with `retry_after_seconds`.
- Wrap ML inference in a circuit-breaker (e.g., `aiobreaker`) to avoid cascading failures.
- Add `x-trace-id` header to every request/response; propagate to logs (use structlog).
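A sketch of a request schema that follows these validation rules, using Enums for categorical features and Pydantic v2 syntax; the enum members and field names are illustrative:
# Sketch: strict request schema (unknown fields are rejected, which FastAPI surfaces as HTTP 422)
from enum import Enum

from pydantic import BaseModel, ConfigDict


class Domain(str, Enum):
    AUTHENTICATION = "AUTHENTICATION"
    PAYMENTS = "PAYMENTS"
    INFRASTRUCTURE = "INFRASTRUCTURE"


class ComplexityBucket(str, Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"


class StoryInput(BaseModel):
    model_config = ConfigDict(extra="forbid")

    title: str
    description: str
    acceptanceCriteria: list[str]
    complexity: ComplexityBucket
    domain: Domain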
FastAPI
- Use dependency injection to inject `EstimatorService` (facade around rule-based + ML ensemble).
- Define routes:
POST /estimate → returns {pointEstimate, ciLow, ciHigh, methodologyWeights, riskBuffer}
GET /healthz
GET /metrics → Prometheus metrics
- Enable CORS only for whitelisted front-end domains.
- Instrument with `prometheus_fastapi_instrumentator`; expose latency histogram buckets fine enough to resolve the sub-100 ms range (see the wiring sketch below).
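A minimal wiring sketch for the CORS and metrics rules; the whitelisted origin is a placeholder:
# Sketch: CORS whitelist plus Prometheus instrumentation (origin URL is a placeholder)
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://planning.example.com"],  # only whitelisted front-end domains
    allow_methods=["POST", "GET"],
    allow_headers=["*"],
)

# Exposes GET /metrics and records request latency histograms
Instrumentator().instrument(app).expose(app)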
PyTorch / scikit-learn
- Start with gradient-boosting baseline (LightGBM) plus a PyTorch MLP ensemble; use stacking for final prediction.
- Save models with explicit semantic versioning: MAJOR.MINOR.PATCH.jobYYYYMMDD.
- Feature engineering rules (see the sketch after this list):
• One-hot for categorical ≤ 15 levels; target encoding otherwise.
• Log-scale story-point counts to reduce skew.
• Calculate rolling velocity (mean & std over last 3 sprints) as numeric feature.
- Store artifacts in S3 → versioned bucket; checksum validated on load.
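A pandas sketch of those feature rules; column names are illustrative and target encoding for high-cardinality columns is omitted:
# Sketch: feature engineering for the estimation models
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # One-hot encode low-cardinality categoricals (<= 15 levels), e.g. domain
    out = pd.get_dummies(out, columns=["domain"], prefix="domain")

    # Log-scale story points to reduce right skew
    out["log_story_points"] = np.log1p(out["story_points"])

    # Rolling velocity (mean & std over the last 3 sprints), computed per team
    grouped = out.groupby("team_id")["sprint_velocity"]
    out["velocity_mean_3"] = grouped.transform(lambda s: s.rolling(3, min_periods=1).mean())
    out["velocity_std_3"] = grouped.transform(lambda s: s.rolling(3, min_periods=1).std())

    return out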
Testing
- 95 %+ branch coverage enforced by `pytest --cov`.
- Golden-set testing: freeze a sample of past projects; prediction MAPE must remain within ±2 % across releases (see the sketch after this list).
- Chaos tests simulate upstream latency spike; ensure timeout logic is respected.
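A golden-set test might look like the sketch below, where `estimator` is assumed to be a pytest fixture wrapping the released model, and the file path and baseline MAPE are illustrative:
# Sketch: golden-set regression test (fixture, path, and baseline value are illustrative)
import json

import numpy as np


def test_golden_set_mape_stays_stable(estimator):
    with open("tests/golden/past_projects.json") as fh:
        golden = json.load(fh)

    predictions = np.array([estimator.predict(item["features"]) for item in golden])
    actuals = np.array([item["actual_points"] for item in golden])

    mape = float(np.mean(np.abs((predictions - actuals) / actuals))) * 100
    baseline_mape = 18.4  # frozen when the golden set was last refreshed
    assert abs(mape - baseline_mape) <= 2.0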
Performance
- Async I/O everywhere (`async def`); ML inference runs in a dedicated threadpool (`concurrent.futures.ThreadPoolExecutor(max_workers=4)`), as sketched after this list.
- P99 latency target ≤ 200 ms; fail CI if the `locust` load test exceeds it.
- Cache identical requests for 10 minutes via `aiocache` (Redis backend).
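The threadpool rule can be satisfied with a small helper like the sketch below; the function name is illustrative:
# Sketch: run blocking ML inference off the event loop in a dedicated pool
import asyncio
from concurrent.futures import ThreadPoolExecutor

_inference_pool = ThreadPoolExecutor(max_workers=4)


async def predict_async(model, features):
    loop = asyncio.get_running_loop()
    # scikit-learn / LightGBM predict() is synchronous, so hand it to the dedicated pool
    return await loop.run_in_executor(_inference_pool, model.predict, features)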
Security
- All secrets in AWS Secrets Manager; never commit `.env` files.
- Enable JWT auth; scopes: `estimate:read`, `estimate:write`.
- Run `bandit -r src -ll` on each PR.
DevOps & Deployment
- Dockerfile must be slim (`python:3.11-slim`, multi-stage build, `poetry install --only main`).
- Helm chart includes HPA (CPU 60 %, min 2, max 10 pods) and PodDisruptionBudget (minAvailable = 1).
- Canary deploy via Argo Rollouts with auto-rollback on error-rate > 2 %.
Logging & Observability
- Use `structlog` JSON formatter; mandatory keys: timestamp, level, msg, trace_id, model_version.
- Emit custom metric `estimation_bias` (predicted-actual) for each closed project; alarm on 3-sprint moving-avg |bias| > 10 %.
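One way to emit that metric from the service, sketched with `prometheus_client`; the function name is illustrative:
# Sketch: record estimation bias when a project closes
from prometheus_client import Gauge

estimation_bias = Gauge(
    "estimation_bias",
    "Predicted minus actual story points for the most recently closed project",
)


def record_project_close(predicted_points: float, actual_points: float) -> None:
    estimation_bias.set(predicted_points - actual_points)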
Common Pitfalls & Guardrails
- Never allow absolute time estimates (hours/days) to leak into the API; convert to story points before persistence.
- Ensure contingency buffer ≥ (critical path risk score × 0.1).
- Denormalize high-variance features; check VIF < 5 to avoid multicollinearity.
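The VIF guardrail can be checked before training with `statsmodels` (an extra dependency assumed here; the helper name is illustrative):
# Sketch: flag multicollinear numeric features before training
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor


def high_vif_features(features: pd.DataFrame, threshold: float = 5.0) -> list[str]:
    vifs = [variance_inflation_factor(features.values, i) for i in range(features.shape[1])]
    return [column for column, vif in zip(features.columns, vifs) if vif >= threshold]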
Documentation
- Generate OpenAPI docs; must show example payloads for all scenarios (greenfield, legacy rewrite, spikes).
- Maintain `CHANGELOG.md`; include model metrics delta for every release.
- Each public function needs NumPy-style docstring with `Raises` section.
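For reference, a NumPy-style docstring with a `Raises` section on an illustrative helper:
# Sketch: docstring convention on a small, self-contained function
def blend_estimates(estimates: list[float], weights: list[float]) -> float:
    """Combine per-method estimates into a single weighted forecast.

    Parameters
    ----------
    estimates : list of float
        Point estimates from each methodology.
    weights : list of float
        Non-negative weights; must sum to 1.

    Returns
    -------
    float
        The weighted point estimate.

    Raises
    ------
    ValueError
        If the lengths differ or the weights do not sum to 1.
    """
    if len(estimates) != len(weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("estimates and weights must align and weights must sum to 1")
    return sum(e * w for e, w in zip(estimates, weights))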