Comprehensive Rules focused on building, operating, and governing AI/ML data pipelines with strong compliance, quality, security, and transparency guarantees.
Transform chaotic data pipelines into compliant, auditable AI systems that actually pass enterprise scrutiny.
You've built brilliant ML models. They work beautifully in development. Then enterprise security reviews your production pipeline and starts asking questions you can't answer about data provenance, access controls, and model fairness.
Suddenly your elegant data science becomes a compliance nightmare. Manual governance processes break down at scale, auditors demand documentation you don't have, and every model deployment becomes a legal review.
The problem isn't your ML skills—it's that most data teams bolt on governance as an afterthought.
These Cursor Rules transform your development workflow to build compliance, security, and explainability directly into every data pipeline and ML model. Instead of retrofitting governance onto existing systems, you'll architect transparency from the ground up.
What you get:
Before: Spend weeks creating compliance reports manually, hoping you didn't miss anything critical
After: Generate audit-ready documentation automatically from metadata captured during normal pipeline execution
Before: Governance slows down model deployment cycles and frustrates data teams
After: Ship AI features faster because compliance validation is built into your CI/CD pipeline
Before: Panic when auditors ask for proof your AI systems handle sensitive data properly
After: Demonstrate end-to-end data protection with automatically generated lineage diagrams and policy enforcement logs
Before: Scramble to add interpretability features when stakeholders question model decisions
After: Every model includes feature importance tracking, decision logs, and bias metrics from day one
```python
from airflow.decorators import task
from pyspark.sql import DataFrame

# `governed_pipeline` is the illustrative governance library used throughout
from governed_pipeline import PIIDetector, PolicyViolation, load_expectations


@task
def process_customer_data(raw_df: DataFrame) -> DataFrame:
    # Automatic PII detection and classification
    pii_detector = PIIDetector()
    classified_df = pii_detector.classify_columns(raw_df)

    # Policy enforcement before any processing
    if classified_df.has_restricted_data():
        raise PolicyViolation("Cannot process PII without explicit consent")

    # Automated data quality checks
    expectations = load_expectations("customer_data_v2")
    validated_df = expectations.validate_or_fail(classified_df)
    return validated_df.with_lineage_metadata()
```
Impact: Your feature engineering automatically respects data classification, validates quality, and maintains audit trails—no separate compliance step needed.
```python
import mlflow
from pandas import DataFrame, Series
from xgboost import XGBClassifier

# Governance helpers from the illustrative library introduced above
from governed_pipeline import (FairnessValidator, GovernanceError,
                               hash_dataset, mlflow_governed_run)


@mlflow_governed_run
def train_credit_model(features: DataFrame, target: Series):
    model = XGBClassifier()
    model.fit(features, target)

    # Automatic bias detection across protected attributes
    bias_checker = FairnessValidator(protected_attrs=['age', 'gender'])
    bias_metrics = bias_checker.evaluate(model, features, target)

    # Block model registration if the fairness threshold is breached:
    # a disparate-impact ratio below 0.8 fails the four-fifths rule
    if bias_metrics.disparate_impact < 0.8:
        raise GovernanceError(f"Model shows bias: {bias_metrics}")

    # Log governance metadata alongside the model
    # (log_governance_metadata is assumed to be added by the governance plugin)
    mlflow.log_governance_metadata({
        'bias_metrics': bias_metrics,
        'training_data_hash': hash_dataset(features),
        'feature_importance': model.feature_importances_,
    })
```
Impact: Every model deployment includes automated bias testing and explainability metadata, preventing discriminatory AI from reaching production.
```python
from datetime import timedelta

from airflow.decorators import dag, task


@dag(tags=['governed', 'domain:customer', 'sla:4h'])
def customer_analytics_pipeline():

    @task
    def validate_data_freshness(table_name: str):
        # Automatic SLA monitoring
        freshness = check_table_freshness(table_name)
        if freshness > timedelta(hours=4):
            trigger_sla_breach_alert(table_name, freshness)

    @task
    def apply_retention_policy(processed_data: DataFrame):
        # Automatic data lifecycle management
        retention_manager = RetentionPolicyManager()
        return retention_manager.apply_policy(processed_data)

    @task
    def publish_with_lineage(final_data: DataFrame):
        # Automatic metadata registration
        lineage_tracker = OpenLineageTracker()
        lineage_tracker.emit_dataset(
            dataset=final_data,
            classification="sensitive",
            retention_days=365,
        )
```
Impact: Your Airflow DAGs automatically enforce data retention, monitor SLAs, and emit lineage metadata without any manual governance overhead.
```bash
# Install the governance stack
pip install great-expectations "pydantic[email]" openlineage-python mlflow

# Initialize the governance config
cursor-rules init --template ai-governance
cursor-rules configure --compliance-frameworks gdpr,ccpa
```
```yaml
# .cursor-rules/data-classification.yml
policies:
  pii_detection:
    enabled: true
    confidence_threshold: 0.85
  retention_defaults:
    raw_data: 90_days
    processed_data: 1_year
    ml_models: 5_years
  access_controls:
    pii_data: ["data_scientists", "privacy_officers"]
    model_artifacts: ["ml_engineers", "model_reviewers"]
```
```python
# Add to existing data processing functions
from cursor_governance import governed_pipeline


@governed_pipeline(
    expectations="customer_data_quality",
    classification_required=True,
    lineage_tracking=True,
)
def existing_etl_function(data):
    # Your existing logic unchanged
    return processed_data
```
```yaml
# Add to CI/CD pipeline
- name: Validate Governance Compliance
  run: |
    great_expectations checkpoint run --fail-on-validation-failure
    python scripts/validate_model_fairness.py
    openlineage verify --required-metadata classification,retention
```
Ready to build AI systems that pass enterprise scrutiny on day one? These Cursor Rules eliminate the compliance scramble and turn governance into a competitive advantage. Your models will be more trustworthy, your deployments faster, and your audit reviews painless.
Start building governed AI systems that scale with confidence, not compliance theater.
You are an expert in Data Governance for AI, including Python, SQL, Spark, dbt, Airflow, Great Expectations, OpenMetadata/Apache Atlas, MLflow, Domo, and major cloud services (AWS Glue Catalog, GCP Dataplex, Azure Purview).
Key Principles
- Align every technical choice with clearly documented governance objectives & regulatory requirements (GDPR, CCPA, HIPAA, PCI-DSS).
- Treat data as a regulated asset: catalogue, classify, trace, and version every dataset and model artifact.
- Automate quality, security, and compliance checks; never rely on manual gates.
- Minimize data collection (data minimization) and prefer aggregated or synthetic data when practicable.
- Enforce least-privilege, role-based, attribute-based access control (RBAC/ABAC) at storage, query, and application layers.
- Make AI models explainable and auditable by design (model metadata, lineage, decision logs, feature importance).
- Fail fast on policy violations; surface actionable, user-friendly error details.
- Keep policy code (YAML/JSON) version-controlled, peer-reviewed, and promoted via CI/CD.
Python
- Use Python 3.11+ with type hints (PEP 484) and `mypy --strict` in CI.
- Mandatory `pydantic` (v2) validation for inbound/outbound data schemas.
- Write pipelines as pure, idempotent functions; avoid global state.
- Name variables with governed intent: `raw_customer_table`, `is_restricted`, `gdpr_erasure_job`.
- Do not catch bare `Exception`; catch domain-specific errors (`DataQualityError`, `PolicyViolation`).
- Wrap external I/O in a `retry_with_backoff()` decorator using exponential strategy capped at 5 attempts.
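The retry decorator named in the last rule is not defined in this document; a minimal sketch of what it could look like (the name, the defaults, and the retried exception types are assumptions):

```python
import functools
import time


def retry_with_backoff(max_attempts: int = 5, base_delay: float = 0.5):
    """Retry a callable with exponential backoff, capped at max_attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except (ConnectionError, TimeoutError):
                    if attempt == max_attempts:
                        raise
                    # Exponential delay: base, 2x base, 4x base, ...
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


@retry_with_backoff(max_attempts=3, base_delay=0.01)
def flaky_fetch(state={"calls": 0}):
    # Simulated external call that succeeds on the third attempt
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient failure")
    return "ok"
```

Catching only transient error types (never bare `Exception`) keeps the decorator consistent with the error-handling rule above.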
SQL
- Adopt the ANSI SQL:2016 standard; lint with an engine-specific sqlfluff dialect profile.
- SELECT list must be explicit (no `SELECT *`).
- Partition large tables by event_date; cluster by high-cardinality keys used in where/join.
- Embed data classification as table/view comments: `@class:PII`, `@retention:2y`.
Spark (PySpark)
- Enable `spark.sql.adaptive.enabled=true`; keep `spark.sql.legacy.*` compatibility flags disabled.
- Persist interim DataFrames only when reused ≥ 2×; otherwise let Spark recompute them lazily.
- All DataFrames must include `__ingestion_ts` and `__source_system` columns for lineage.
Error Handling & Validation
- Apply Great Expectations checkpoints at pipeline ingress and egress. Block downstream steps on failure.
- First lines of every task: 1) schema validation, 2) ACL enforcement, 3) sensitivity check (`is_pii()` helper).
- Emit structured events (OpenTelemetry) for each violation: `event_type=policy_violation`, `severity=high`.
- Early-return pattern: if validation fails, log, raise, exit; happy path last.
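The validation ordering and early-return pattern can be sketched as a plain function (the field names, role sets, and error classes here are illustrative, not a published API):

```python
class DataQualityError(Exception):
    """Raised when a record fails schema or quality validation."""


class PolicyViolation(Exception):
    """Raised when governed data is accessed without authorization."""


def process_record(record: dict, allowed_roles: set, caller_role: str) -> dict:
    # 1) Schema validation first -- fail fast on malformed input.
    if "customer_id" not in record or "email" not in record:
        raise DataQualityError(f"missing required fields: {record}")
    # 2) ACL enforcement before any data is touched.
    if caller_role not in allowed_roles:
        raise PolicyViolation(f"role {caller_role!r} may not read this dataset")
    # 3) Sensitivity check: mask PII unless explicitly permitted.
    if "@" in record["email"]:
        record = {**record, "email": "***masked***"}
    # Happy path last: the governed record is returned.
    return record
```

Each failure raises a domain-specific error, which is where the structured `policy_violation` event from the rule above would be emitted.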
Framework-Specific Rules
Apache Airflow
- Use TaskFlow API; each `@task` returns serializable objects only (no open DB connections).
- DAG definition lives in `dags/{domain}/{dag_id}.py`. Filename mirrors DAG id.
- Tag every DAG with `['governed', 'domain:<name>', 'sla:<x>h']`.
- Configure `sla_miss_callback` to trigger incident runbook and Slack alert.
- Store XCom only in encrypted backend; no PII in XCom.
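The `sla_miss_callback` wiring can be sketched as a plain function with Airflow's callback signature (the payload shape and the downstream Slack/runbook calls are assumptions, left as comments):

```python
def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Build a structured incident payload when a DAG misses its SLA."""
    payload = {
        "event_type": "sla_miss",
        "severity": "high",
        "dag_id": getattr(dag, "dag_id", str(dag)),
        "missed_tasks": task_list,
        "blocking_tasks": blocking_task_list,
    }
    # In production: post `payload` to the incident Slack channel and
    # open the matching runbook (see Incident Response below).
    return payload
```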
dbt
- Source YAML must declare `freshness`, `loaded_at_field`, and `meta.sensitivity`.
- `on_schema_change: fail` to force explicit review of column alterations.
- Use `exposures` to map downstream ML models.
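A schema file following these rules might look like this (source, table, and field names are illustrative):

```yaml
version: 2
sources:
  - name: crm
    tables:
      - name: customers
        loaded_at_field: _loaded_at
        freshness:
          warn_after: {count: 2, period: hour}
          error_after: {count: 4, period: hour}
        meta:
          sensitivity: pii
models:
  - name: dim_customers
    config:
      on_schema_change: fail
```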
Great Expectations
- Keep expectations in code-data pairs: expectation file next to the dataset directory.
- Set `mostly` ≥ 0.99 for critical fields and ≥ 0.95 for non-critical ones (i.e., tolerate at most 1% or 5% of failing rows).
- Version expectation suites; never mutate IDs—create a new version.
MLflow
- Log `mlflow.model.type` tag (`classification`, `regression`, etc.).
- Store training data hash & feature store version in run metadata.
- Register model only if attached validation suite passes and bias metrics are within thresholds.
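The registration gate in the last rule can be factored into a small pure function called before `mlflow.register_model` (the metric names and the 0.8 four-fifths threshold are illustrative):

```python
def may_register_model(validation_passed, bias_metrics, thresholds):
    """Return (allowed, reasons) for a model registration attempt."""
    reasons = []
    if not validation_passed:
        reasons.append("validation suite failed")
    for metric, limit in thresholds.items():
        value = bias_metrics.get(metric)
        if value is None:
            reasons.append(f"missing bias metric: {metric}")
        elif value < limit:
            # e.g. a disparate-impact ratio below 0.8 signals adverse impact
            reasons.append(f"{metric}={value:.2f} below threshold {limit}")
    return (not reasons, reasons)
```

Keeping the gate pure makes it trivially unit-testable, which supports the 100%-coverage rule for policy modules below.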
Domo / Metadata Stores
- Auto-register every new dataset or model artifact with lineage links.
- Deny deployments of unregistered entities.
Testing
- Unit: pytest with 100% statement coverage for policy modules; 80% overall.
- Integration: nightly DAG run in staging with synthetic PII to test masking/erasure.
- Contract: CI step runs `great_expectations checkpoint run --fail-on-validation-failure`.
Performance & Scalability
- Profile queries ≥ 1 min runtime; add `EXPLAIN ANALYZE` to pull plans into repository.
- Use incremental materializations; process only partitions with `event_date >= last_successful_run`.
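The incremental rule above amounts to computing the partition range since the last successful run; a minimal sketch:

```python
from datetime import date, timedelta


def partitions_to_process(last_successful_run: date, today: date) -> list:
    """Return the event_date partitions at or after the last successful run."""
    n = (today - last_successful_run).days
    return [last_successful_run + timedelta(days=i) for i in range(n + 1)]
```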
Security & Compliance
- Secrets managed by vault (AWS SecretsManager, HashiCorp Vault). Never commit credentials.
- Data in transit TLS 1.3; at rest AES-256-GCM.
- Enable column-level encryption for PII; rotate keys every 180 days.
- Run quarterly privacy impact assessments (PIA) and document in `/compliance/pia-YYYY-QX.md`.
Metadata & Lineage
- Capture end-to-end lineage via OpenLineage integration; publish to OpenMetadata UI.
- Include `dataset_owner`, `data_classification`, `retention_policy`, and `contact_slack` in metadata payloads.
- Auto-generate lineage diagrams in PR comments for modified pipelines.
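A metadata payload satisfying the required-fields rule might look like this (the payload shape and channel names are illustrative, not the OpenLineage wire format):

```python
def lineage_metadata(dataset_name, owner, classification, retention_policy, contact_slack):
    """Build the metadata payload required by the rules above."""
    return {
        "dataset": dataset_name,
        "facets": {
            "dataset_owner": owner,
            "data_classification": classification,
            "retention_policy": retention_policy,
            "contact_slack": contact_slack,
        },
    }
```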
Incident Response
- Maintain `runbooks/<incident>.md` with detection logic, severity levels, escalation path.
- Run game-days bi-annually to test data breach response; log outcomes.
Documentation
- Each directory contains a `README.md` with: dataset description, owner, sensitivity, glossary.
- Use diagram-as-code (Mermaid) for architecture diagrams; store in `/docs/architecture`.
Lifecycle & Retention
- Default retention: raw 90 days, refined 1 year, aggregates 5 years unless override in metadata.
- `gdpr_delete_user(user_id)` helper deletes across raw, refined, feature store, model logs.
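A sketch of the erasure helper's fan-out across stores (the store interface is an assumption; real implementations would target the raw zone, warehouse, feature store, and model logs):

```python
class InMemoryStore:
    """Stand-in for a real storage layer (raw zone, feature store, ...)."""

    def __init__(self, rows):
        self.rows = rows

    def delete_user(self, user_id):
        before = len(self.rows)
        self.rows = [r for r in self.rows if r["user_id"] != user_id]
        return before - len(self.rows)


def gdpr_delete_user(user_id, stores):
    """Erase a user from every governed store; return per-store delete counts."""
    report = {}
    for name, store in stores.items():
        report[name] = store.delete_user(user_id)
    # An auditor-facing erasure log would be emitted from `report` here.
    return report
```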
Governance Metrics
- Track and report: % datasets classified, % DAGs with quality gates, mean policy violation MTTR, model bias score trend.
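The coverage metrics can be computed directly from catalog metadata; a minimal sketch assuming a simple metadata shape:

```python
def governance_coverage(datasets, dags):
    """Percentage of classified datasets and of DAGs with quality gates."""
    classified = sum(1 for d in datasets if d.get("classification"))
    gated = sum(1 for d in dags if d.get("has_quality_gate"))
    return {
        "pct_datasets_classified": round(100 * classified / max(len(datasets), 1), 1),
        "pct_dags_with_quality_gates": round(100 * gated / max(len(dags), 1), 1),
    }
```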