Comprehensive Rules for collecting, processing, storing, and disposing of data in a transparent, privacy-preserving, and regulation-compliant manner.
You're tired of retrofitting privacy controls into existing pipelines. You're sick of compliance audits that reveal gaps you didn't know existed. And you're done with the endless cycle of "fixing" data governance after the fact.
What if your data pipelines were built ethically from day one? What if compliance wasn't a burden but a natural outcome of good engineering?
Most data teams are performing compliance theater. They bolt on privacy controls after building the pipeline, add governance tools that nobody uses, and create documentation that's outdated before the ink dries.
The result? You're constantly retrofitting controls, chasing audit findings, and re-documenting pipelines after the fact.
The core issue: Privacy and ethics are treated as afterthoughts, not engineering requirements.
These Cursor Rules transform data governance from a compliance burden into a development accelerator. Instead of retrofitting privacy controls, you build them into every function, every pipeline, every data transformation from the start.
Here's what changes:
Before: Build pipeline → Add privacy controls → Hope it works
After: Privacy controls are the pipeline architecture
Your code automatically tracks lineage, verifies consent, and anonymizes PII before it ever leaves the ingestion layer.
Instead of spending weeks tracking down data for deletion requests, your pipelines know exactly where every piece of PII lives and can execute deletions in under 24 hours.
```python
# Automatic lineage tracking built into every pipeline
@track_lineage
def process_user_data(pii_email: str, consent_record: ConsentRecord) -> str:
    if not consent_record.is_valid_for_purpose("marketing"):
        raise ConsentMissingError("No marketing consent for user")
    # Anonymization happens immediately
    anonymous_id = sha256_hash(pii_email)
    # Original PII never leaves this function
    return anonymous_id
```
Your CI/CD pipeline automatically runs fairness tests on every model, preventing biased algorithms from reaching users.
```python
# Built-in bias testing that fails builds
@test_fairness(protected_attributes=['age', 'gender'])
def train_recommendation_model(training_data: DataFrame):
    # Model training code here
    pass

# Fairness metrics are computed automatically;
# the build fails if the disparate-impact ratio falls below 0.8
```
When auditors ask for data processing records, you hand them a complete, automatically-generated audit trail instead of scrambling through documentation.
```python
# Every data operation generates audit events
with audit_context(purpose="customer_segmentation", legal_basis="legitimate_interest"):
    anonymized_data = anonymize_customer_data(raw_data)
    # Audit trail: what data, when, why, who - all automatic
```
Before: privacy checks get bolted onto the DAG after it ships.
After: privacy is the structure of the DAG itself.
```python
# Privacy is built into the DAG structure
@dag(
    dag_id="dag_marketing_segmentation_personal",
    tags=['personal-data', 'gdpr'],
)
def marketing_segmentation():
    consent_check = validate_consent_taskgroup()
    anonymize = anonymize_pii_task()
    segment = create_segments_task()
    register = register_in_collibra_task()

    consent_check >> anonymize >> segment >> register
```
Privacy verification, anonymization, and compliance registration happen automatically. No retrofitting required.
Before: weeks spent manually hunting a data subject's records across systems.
After: one deletion function propagates across the enterprise.
```python
# One function handles enterprise-wide deletion
@propagate_deletion(timeout_hours=24)
async def delete_user_data(user_id: str) -> None:
    # Lineage tracking knows every system holding this user's data,
    # deletion propagates automatically, downstream models retrain if
    # needed, and the audit trail records the entire process.
    ...
```
Before: biased models are discovered only after they reach users.
After: bias testing runs inside the training pipeline.
```python
# Bias testing is part of the training pipeline
@ensure_fairness(
    protected_attributes=['age', 'gender', 'ethnicity'],
    fairness_metrics=['demographic_parity', 'equalized_odds'],
)
def train_credit_model(training_data: DataFrame):
    # If the model fails its fairness tests, training stops:
    # no biased model reaches production, and the fairness
    # metrics are logged for audit.
    ...
```
```bash
# Install the ethical data stack
pip install pandas sqlalchemy apache-airflow cryptography pydantic
pip install aequitas fairlearn  # for bias testing
pip install ruff black mypy     # for code quality
```
```python
# Create your first privacy-compliant pipeline
from dataclasses import dataclass
from typing import List

from ethical_data import ConsentValidator, PIIAnonymizer, AuditLogger


@dataclass
class CustomerRecord:
    user_id: int
    pii_email: str
    age: int
    purchase_history: List[float]


def process_customer_data(records: List[CustomerRecord]):
    # Step 1: Verify consent (fails fast if missing)
    ConsentValidator.verify_batch(records, purpose="analytics")

    # Step 2: Immediate anonymization
    anonymizer = PIIAnonymizer()
    anonymous_records = anonymizer.anonymize_batch(records)

    # Step 3: Process anonymized data
    return analyze_purchase_patterns(anonymous_records)
```
```python
# Add to your Airflow DAG
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def create_ethical_pipeline() -> DAG:
    dag = DAG(
        'dag_customer_analytics_personal',
        tags=['personal-data', 'gdpr'],
        start_date=datetime(2024, 1, 1),
    )

    # Every DAG starts with consent validation
    validate_consent = PythonOperator(
        task_id='validate_consent',
        python_callable=verify_processing_consent,
        dag=dag,
    )

    # Then anonymization
    anonymize_data = PythonOperator(
        task_id='anonymize_pii',
        python_callable=anonymize_customer_pii,
        dag=dag,
    )

    # Finally, register in the governance catalog
    register_dataset = PythonOperator(
        task_id='register_collibra',
        python_callable=register_processed_dataset,
        dag=dag,
    )

    validate_consent >> anonymize_data >> register_dataset
    return dag
```
```python
# Add to your ML training pipeline
def test_model_fairness():
    """Runs automatically in CI/CD."""
    model = load_trained_model()
    test_data = load_test_data()

    # Test for multiple fairness criteria
    fairness_metrics = compute_fairness_metrics(
        model, test_data,
        protected_attributes=['age', 'gender'],
    )

    # Fail the build if any metric falls below the 0.8 threshold
    assert fairness_metrics['demographic_parity'] > 0.8
    assert fairness_metrics['equalized_odds'] > 0.8
```
Stop treating data governance like a necessary evil. These Cursor Rules make ethical development faster, not slower. They prevent problems instead of cleaning up messes. They turn compliance from a burden into a competitive advantage.
Your data pipelines can be both powerful and ethical. Your ML models can be both accurate and fair. Your compliance can be both thorough and automatic.
The question isn't whether you can afford to implement ethical data practices. It's whether you can afford not to.
Ready to build data pipelines that work ethically from day one? These rules give you everything you need to make privacy-first development your new normal.
You are an expert in Python 3.11+, SQL (ANSI & PostgreSQL), Pandas, SQLAlchemy, Apache Airflow, AWS S3/Glue/KMS, cryptography (PyCA), differential-privacy libraries, and enterprise data-governance platforms (Ataccama, Talend, Collibra).
Key Principles
- Informed consent is mandatory. Code must verify consent *before* data pull or processing begins.
- Full transparency: document data sources, fields, legal basis, retention, and sharing partners in machine-readable metadata (YAML/JSON).
- Data minimization: retrieve only required columns (explicit SELECT list); immediately delete/omit superfluous attributes.
- Privacy by design & default: encryption at rest + in transit, anonymization/pseudonymization as early as feasible.
- Fairness & bias mitigation: run automated bias tests on every model-training pipeline; log metrics (TPR/FPR parity, disparate impact).
- Robust accountability: store immutable lineage and processing logs (no PII) in Collibra catalog; assign owner for each dataset.
- Security first: protect keys in KMS/Secrets Manager; never hard-code secrets or dump PII to logs/metrics.
- Right to erasure: design pipelines so that per-subject deletions propagate to all downstream systems within 24 h.
- Auditability: every code path that touches personal data must emit a `DataProcessingEvent` to the central audit stream (see the sketch after this list).
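
A minimal sketch of the consent-first check and `DataProcessingEvent` emission described above; the event fields and the `emit_event()` sink are assumptions to be replaced with your actual audit-stream producer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


class ConsentMissingError(Exception):
    """Raised when no valid consent covers the requested purpose."""


@dataclass(frozen=True)
class DataProcessingEvent:
    dataset: str
    purpose: str
    legal_basis: str
    record_count: int
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def emit_event(event: DataProcessingEvent) -> None:
    # Placeholder sink: replace with your central audit-stream producer.
    # The event carries counts and metadata only -- never PII.
    print(event)


def process(dataset: str, records: list, consent_ok: bool, purpose: str) -> list:
    # Consent is verified before any processing begins.
    if not consent_ok:
        raise ConsentMissingError(f"No valid consent for purpose '{purpose}'")
    emit_event(DataProcessingEvent(dataset, purpose, "consent", len(records)))
    return records
```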
Python
- Enforce PEP 8 with Black; use Ruff for lint and security rules (e.g., the flake8-bugbear `B` rules plus custom PII-leak checks).
- Type-annotate **all** public functions; enable MyPy strict.
- Use dataclasses for structured records: `@dataclass class ConsentRecord:`.
- Prefix variables that contain PII with `pii_` (e.g., `pii_email`); ensures linters can flag leakage.
- Provide docstrings that include: *Data categories*, *Retention*, *Legal basis*.
- Use `secrets.token_hex()` for salt/keys; never `random`.
- Never print/write raw PII; redact with a `redact(value)` helper (sketched after this list).
- Prefer functional pipeline style (`map`, `pipe`) over in-place mutation.
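
A small sketch of the `pii_` naming, `redact()` helper, and docstring conventions above; the exact masking format and the salted-SHA-256 pseudonymization are assumptions, not a prescribed scheme.

```python
import hashlib
import secrets


def redact(value: str, keep: int = 2) -> str:
    """Mask a PII value for safe logging, e.g. 'al***'."""
    return (value[:keep] + "***") if value else "***"


def hash_email(pii_email: str, salt: str) -> str:
    """Pseudonymize an e-mail address.

    Data categories: contact data (e-mail).
    Retention: hash kept 24 months; the raw value is never stored.
    Legal basis: consent (analytics).
    """
    return hashlib.sha256((salt + pii_email.lower()).encode()).hexdigest()


salt = secrets.token_hex(16)          # cryptographically secure, never the random module
print(redact("alice@example.com"))    # 'al***' -- logs never see the raw address
print(hash_email("alice@example.com", salt))
```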
Error Handling & Validation
- Define custom exceptions: `ConsentMissingError`, `DataClassificationError`, `AnonymizationError`.
- Validate external payloads with Pydantic; reject unknown fields (`model_config = ConfigDict(extra="forbid")`); see the sketch after this list.
- Check consent & purpose limitation first; `raise ConsentMissingError` on failure—no further processing.
- Use early returns; avoid deep `if/else` nests.
- Log at `INFO/WARNING` sans PII; scrub exception messages.
- On anonymization failure, roll back the transaction and notify the DPO via webhook.
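
A sketch of the validation flow above using Pydantic v2; the payload fields and consent flag are illustrative assumptions, and the DPO webhook call is omitted.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class ConsentMissingError(Exception): ...
class AnonymizationError(Exception): ...


class IngestPayload(BaseModel):
    model_config = ConfigDict(extra="forbid")  # unknown fields are rejected outright
    user_id: int
    purpose: str
    consent_given: bool


def handle_payload(raw: dict) -> IngestPayload:
    try:
        payload = IngestPayload(**raw)
    except ValidationError as exc:
        # Scrub before logging: ValidationError messages can echo input values.
        raise ValueError(f"Invalid payload: {len(exc.errors())} error(s)") from None
    if not payload.consent_given:
        # Consent and purpose limitation are checked before anything else.
        raise ConsentMissingError(f"No consent for purpose '{payload.purpose}'")
    return payload  # early return, no nested if/else
```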
Apache Airflow
- DAG id format: `dag_<domain>_<purpose>_<privacy_level>` (e.g., `dag_marketing_segmentation_personal`).
- Always start with `validate_consent` TaskGroup ➜ `anonymize` ➜ business logic ➜ `register_in_collibra`.
- Tag DAGs with `['personal-data', 'gdpr']` or `['anonymous']` to enable selective deployment.
- Pass only record counts and hashes through XCom, never PII (see the sketch after this list).
- Store secrets (DB creds, KMS keys) in Airflow Secrets Backend (AWS Secrets Manager / HashiCorp Vault).
- Use `@task(retries=3, retry_delay=timedelta(minutes=2))` for network calls to governance APIs.
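
A sketch of the XCom and retry conventions above using the TaskFlow API; the task bodies are stubs, and the real anonymization and Collibra calls are assumptions.

```python
import hashlib
from datetime import timedelta

from airflow.decorators import task


@task
def anonymize_pii() -> dict:
    # Stub: pretend 1,000 records were anonymized in this batch.
    record_count = 1_000
    batch_hash = hashlib.sha256(b"batch-2024-01-01").hexdigest()
    # Only counts and hashes cross the XCom boundary -- never PII.
    return {"record_count": record_count, "batch_hash": batch_hash}


@task(retries=3, retry_delay=timedelta(minutes=2))
def register_in_collibra(summary: dict) -> None:
    # The call to the governance API goes here; it is retried on transient failures.
    print(f"registering dataset metadata: {summary}")
```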
Pandas / Data Processing
- Read with explicit dtypes and column list: `read_csv('file.csv', usecols=['user_id','email','ts'], dtype={'user_id':'int64'})`.
- Immediately anonymize: `df['pii_email_hash'] = sha256_hash(df.pop('pii_email'))`.
- Call `.drop(columns=pii_columns, errors='ignore')` once anonymization is complete.
- Use `.astype('category')` for enumerations to reduce footprint.
- For big data, process in chunks (`chunksize=1_000_000`) and write Parquet with `compression='zstd'`; the read/anonymize/write flow is sketched below.
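
A sketch combining the minimization, anonymization, and chunking rules above; the file name, column names, and salt handling are placeholders (the salt should come from KMS, not source code).

```python
import hashlib

import pandas as pd

SALT = "load-me-from-kms"  # assumption: fetched from KMS/Secrets Manager in real code
PII_COLUMNS = ["pii_email"]


def sha256_hash(series: pd.Series) -> pd.Series:
    return series.map(lambda v: hashlib.sha256((SALT + str(v)).encode()).hexdigest())


reader = pd.read_csv(
    "events.csv",
    usecols=["user_id", "pii_email", "ts"],   # explicit column list: early minimization
    dtype={"user_id": "int64"},
    chunksize=1_000_000,
)
for i, chunk in enumerate(reader):
    chunk["pii_email_hash"] = sha256_hash(chunk.pop("pii_email"))  # anonymize immediately
    chunk = chunk.drop(columns=PII_COLUMNS, errors="ignore")       # drop any PII that remains
    chunk.to_parquet(f"events_{i:04d}.parquet", compression="zstd")
```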
Collibra / Ataccama / Talend Integration
- After each successful DAG run, POST dataset metadata to the Collibra REST `/assets` endpoint: name, owner, purpose, privacyLevel (see the sketch after this list).
- Attach classification tags: `PERSONAL`, `SENSITIVE`, `ANONYMIZED`.
- Retrieve policy rules from Ataccama/Talend and enforce dynamically via Python decorators.
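
A sketch of the post-DAG registration step; the `/assets` path follows the rule above, but the payload shape, auth scheme, and environment variables are assumptions to be checked against your Collibra version's REST API.

```python
import os

import requests

COLLIBRA_URL = os.environ["COLLIBRA_URL"]      # e.g. https://collibra.example.com/rest/2.0
COLLIBRA_TOKEN = os.environ["COLLIBRA_TOKEN"]  # never hard-code credentials


def register_dataset(name: str, owner: str, purpose: str, privacy_level: str) -> None:
    response = requests.post(
        f"{COLLIBRA_URL}/assets",
        json={
            "name": name,
            "owner": owner,
            "purpose": purpose,
            "privacyLevel": privacy_level,     # PERSONAL / SENSITIVE / ANONYMIZED
        },
        headers={"Authorization": f"Bearer {COLLIBRA_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
```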
Security
- Encrypt files with AES-256-GCM; keys stored in KMS, rotated every 90 days.
- Use TLS 1.3 for all data-in-transit; verify certificates.
- Implement a `logging.Filter` that redacts values matching PII regexes before record emission (sketched after this list).
- Deny-by-default IAM; pipelines assume least-privilege roles.
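
A sketch of the redacting `logging.Filter`; the regex only covers e-mail addresses here and should be extended to the PII categories you actually handle.

```python
import logging
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")


class PIIRedactionFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Merge args into the message, then scrub anything that looks like an e-mail.
        record.msg = EMAIL_RE.sub("[REDACTED]", record.getMessage())
        record.args = ()
        return True  # keep the record, just scrubbed


logger = logging.getLogger("pipeline")
logger.addHandler(logging.StreamHandler())
logger.addFilter(PIIRedactionFilter())
logger.warning("retrying export for %s", "alice@example.com")  # address logged as '[REDACTED]'
```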
Testing
- Unit tests for every anonymizer function: assert length, irreversibility (rainbow-table attempt), and format.
- Property-based tests (Hypothesis) to ensure no input causes a PII leak in logs (see the sketch after this list).
- Integration tests spin up local Postgres + MinIO with fake KMS to execute full DAG.
- Include fairness tests (Aequitas/Fairlearn) in CI; fail if metrics exceed thresholds.
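
A sketch of anonymizer unit/property tests with Hypothesis; `sha256_hash()` and `redact()` are the simple helpers assumed earlier, not library functions.

```python
import hashlib

from hypothesis import given, strategies as st

SALT = "test-salt"


def sha256_hash(value: str) -> str:
    return hashlib.sha256((SALT + value).encode()).hexdigest()


def redact(value: str) -> str:
    return (value[:2] + "***") if value else "***"


@given(st.emails())
def test_hash_has_fixed_format_and_hides_input(pii_email: str):
    digest = sha256_hash(pii_email)
    assert len(digest) == 64            # fixed-length hex output
    assert pii_email not in digest      # the original value never appears


@given(st.emails())
def test_redacted_value_never_leaks_full_address(pii_email: str):
    assert pii_email not in redact(pii_email)
```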
Performance
- Use vectorized Pandas operations; avoid Python loops.
- Prefer `copy=False` params to limit memory.
- Stream large exports directly to S3 (`smart_open`) rather than to local disk (sketched below).
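
A sketch of streaming an export straight to S3 with `smart_open` instead of staging it on local disk; the bucket and key are placeholders.

```python
import pandas as pd
from smart_open import open as s3_open

df = pd.DataFrame({"user_hash": ["ab12", "cd34"], "segment": ["A", "B"]})

# Bytes stream straight to S3; nothing touches local disk.
with s3_open("s3://my-governed-bucket/exports/segments.parquet", "wb") as fout:
    df.to_parquet(fout, compression="zstd")
```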
Documentation & Compliance
- Maintain `data-processing-register.yaml` with entries: dataset, lawfulBasis, retentionPeriod, dpoContact (a CI validation sketch follows this list).
- Include Data Protection Impact Assessment (DPIA) markdown inside repo `/docs/dpia_<feature>.md`.
- Provide `CONSENT_SCHEMA.json` for external integrators to conform.
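
A sketch of a CI check that keeps `data-processing-register.yaml` complete; the required keys mirror the rule above, and the register being a list of entries is an assumption.

```python
import yaml

REQUIRED_KEYS = {"dataset", "lawfulBasis", "retentionPeriod", "dpoContact"}

with open("data-processing-register.yaml") as fh:
    register = yaml.safe_load(fh) or []

for entry in register:
    missing = REQUIRED_KEYS - entry.keys()
    assert not missing, f"Entry '{entry.get('dataset', '?')}' is missing {missing}"
```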
Common Pitfalls & Guards
- ❌ Loading *all* columns then discarding PII later – do early minimization.
- ❌ Concatenating user identifiers into log messages.
- ❌ Using timestamp + user_id as salt (predictable) – use cryptographically secure salt.
- ✅ Honor consent withdrawal: remove the subject's data and retrain affected models if needed.
By following these rules, every pipeline produced by this repository will be transparent, privacy-preserving, fair, and fully auditable.