Opinionated guidelines for designing, coding, and operating Python services that comply with the principle of Data Minimization under GDPR, CCPA, HIPAA and similar regulations.
Your backend services are collecting too much data. Every extra field is technical debt that becomes privacy debt the moment regulations knock on your door. These Cursor Rules transform your Python development workflow to build GDPR, CCPA, and HIPAA compliance directly into your code—not as an afterthought.
Most Python services start simple but gradually accumulate data bloat.
When privacy audits arrive, you're scrambling through codebases trying to map data flows, justify retention periods, and implement deletion mechanisms—all while maintaining uptime.
These rules embed data minimization principles directly into your development workflow. Instead of retrofitting privacy compliance, you build services that collect only necessary data by default.
Core Transformation: Every data structure requires explicit purpose documentation and retention limits. Unknown fields are rejected automatically. Personal data is masked by default and accessed only through explicit reveal methods.
```python
# Before: Accidental data collection
class UserProfile(BaseModel):
    email: str
    name: str
    metadata: dict  # Anything goes here

# After: Purpose-driven, minimal collection
class UserProfile(BaseModel):
    email: EmailStr
    username: constr(min_length=3, max_length=30)
    purpose: Literal["account_creation"] = Field(description="Purpose: user registration")

    class Config:
        extra = "forbid"  # Reject unknown fields

    class __meta__:
        retention_days = 1095  # 3 years, documented
```
Automatic Compliance: Your code rejects excessive data collection by default. New fields require explicit purpose documentation and DPO approval through CI checks.
Developer Productivity: No more privacy audit scrambles. Your schemas are self-documenting with purpose and retention built-in.
Risk Reduction: Masked logging, automatic retention enforcement, and consent-driven endpoints eliminate common privacy vulnerabilities.
Audit Readiness: Generate DPIA documentation automatically from your Pydantic models. Data maps update as code changes.
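A sketch of what that generation can look like; the `ALL_MODELS` registry and the output shape below are illustrative assumptions, not part of Pydantic or FastAPI:

```python
# Hedged sketch: build a data map / DPIA input from the purpose + __meta__.retention_days
# conventions used throughout these rules. ALL_MODELS is an assumed project-level registry.
import json
from pydantic import BaseModel

ALL_MODELS: list[type[BaseModel]] = []  # populate with your data-minimized models

def build_data_map(models: list[type[BaseModel]]) -> list[dict]:
    entries = []
    for model in models:
        meta = getattr(model, "__meta__", None)
        purpose_field = model.model_fields.get("purpose")
        entries.append({
            "model": model.__name__,
            "fields": sorted(model.model_fields),  # Pydantic v2 field registry
            "purpose": purpose_field.description if purpose_field else None,
            "retention_days": getattr(meta, "retention_days", None),
        })
    return entries

if __name__ == "__main__":
    print(json.dumps(build_data_map(ALL_MODELS), indent=2))  # feed into your DPIA template
```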
Traditional Approach: Add field, deploy, hope it's compliant
```python
# Risky: No purpose, no retention, allows anything
user_data = request.json()
user = User(**user_data)  # Accepts any extra fields
```
With Data Minimization Rules:
```python
class UserRegistration(BaseModel):
    email: EmailStr
    username: constr(min_length=3, max_length=30)
    purpose: Literal["registration"] = Field(description="Account creation")

    class __meta__:
        retention_days = 1095

    class Config:
        extra = "forbid"


@app.post("/users")
async def create_user(data: UserRegistration, consent=Depends(get_consent)):
    # Automatically rejects unknown fields
    # Requires consent check
    # Documents purpose and retention
    return await user_service.create(data)
```
Result: CI fails if you add new personal data fields without a DPO ticket reference and purpose documentation.
Traditional Approach: Sensitive data mixed with regular fields
```python
def process_payment(user_email: str, card_number: str):
    logger.info(f"Processing payment for {user_email} with card {card_number}")
    # Logs sensitive data directly
```
With Data Minimization Rules:
```python
class PaymentData(BaseModel):
    user_id: UUID  # Hashed identifier only
    card_token: SecretStr  # Encrypted, requires explicit reveal

    def log_safe(self) -> str:
        token_digest = hashlib.sha256(self.card_token.get_secret_value().encode()).hexdigest()[:8]
        return f"Payment for user {self.user_id} with token {token_digest}..."


def process_payment(payment: PaymentData):
    logger.info(payment.log_safe())  # Never logs raw sensitive data
    card_number = payment.card_token.get_secret_value()  # Explicit, auditable access required
```
Traditional Approach: Manual deletion scripts, hope someone remembers
```python
# Quarterly cleanup script someone forgets to run
def cleanup_old_users():
    old_users = db.query("SELECT * FROM users WHERE created_at < ?", some_date)
    # Manual, error-prone, inconsistent
```
With Data Minimization Rules:
```python
@cron_job("0 2 * * *")  # Daily automated cleanup
async def cleanup_expired_data():
    for model in ALL_MODELS:
        if hasattr(model, '__meta__'):
            cutoff = datetime.utcnow() - timedelta(days=model.__meta__.retention_days)
            deleted = await db.delete_where(model, created_at__lt=cutoff)
            logger.info(f"Purged {deleted} expired {model.__name__} records")
```
```text
# Add to requirements.txt
pydantic>=2.0
fastapi
python-multipart
cryptography
```
```python
# models/base.py
from pydantic import BaseModel, ConfigDict, Field
from typing import Literal
from datetime import timedelta

class DataMinimizedModel(BaseModel):
    model_config = ConfigDict(
        extra="forbid",
        validate_assignment=True,
        ser_json_timedelta="iso8601",
    )

    purpose: str = Field(..., description="Business purpose for this data")

    class __meta__:
        retention_days: int = 365  # Default 1 year
```
```python
# routes/users.py
@app.post("/users", dependencies=[Depends(get_consent)])
async def create_user(data: UserCreate):
    # Automatically validates minimal data collection
    # Requires consent
    # Documents purpose
    return await user_service.create(data)
```
```yaml
# .github/workflows/privacy-check.yml
- name: Check Schema Changes
  run: |
    python scripts/schema_diff.py
    # Fails if new personal data fields lack DPO approval
```
```python
# Add to startup
@app.on_event("startup")
async def setup_retention_jobs():
    scheduler.add_job(cleanup_expired_data, "cron", hour=2)
```
Immediate: New services automatically comply with data minimization principles. No more accidental data collection.
30 Days: Development velocity increases as privacy requirements are built into the development workflow rather than being external constraints.
90 Days: Privacy audits become documentation exercises rather than code archaeology projects. Your schemas generate compliance documentation automatically.
Long Term: Privacy debt elimination. Every data field has documented purpose, retention limits, and automated cleanup. Your services are audit-ready by default.
Quantified Benefits:
These rules don't just help you comply with regulations—they transform your development process to make privacy compliance as automatic as type checking. Your future self (and your legal team) will thank you.
You are an expert in: Python 3.11-3.12, FastAPI, Pydantic v2, SQL (PostgreSQL), AWS/GCP/Azure privacy tooling (Macie, DLP, IAM), HashiCorp Vault, data-masking/pseudonymisation libraries, CI/CD (GitHub Actions), IaC (Terraform).
Key Principles
- Collect, store, and process ONLY data that is adequate, relevant, and strictly necessary for the documented business purpose (GDPR Art. 5(1)(c)).
- Privacy by Design & Default: design every schema, endpoint, and job so the **minimal data flow** is the default behaviour.
- Data contracts are immutable once published; expanding a contract requires DPO approval and version bump.
- Retention limits are part of the schema, not of the environment. Code must not access expired data.
- Prefer anonymisation ➜ pseudonymisation ➜ raw personal data, in that order. Escalate justification with each step down.
- Every data-producing function must be idempotent and side-effect–free when run with masked data.
- Log events, **never** log raw personal data. Use tokenisation or hashing for identifiers (see the sketch after this list).
- Security failures are privacy failures: least privilege IAM, encrypted at rest & in transit.
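The logging principle above is easiest to follow when there is exactly one blessed helper for turning identifiers into log-safe tokens. A minimal sketch, assuming the keyed salt ("pepper") is provisioned separately from the data, for example from Vault into an environment variable:

```python
# Hedged sketch: pseudonymise identifiers before they reach logs.
# The PSEUDONYM_KEY name and its delivery mechanism are assumptions; store it apart from the data.
import hashlib
import hmac
import os

_PSEUDONYM_KEY = os.environ["PSEUDONYM_KEY"].encode()  # random secret, provisioned out of band

def pseudonymise(identifier: str) -> str:
    """Deterministic, non-reversible token suitable for log correlation."""
    return hmac.new(_PSEUDONYM_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

# Usage: logger.info("login for user %s", pseudonymise(user_email))
```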
Python
- Type-annotate everything. Enable `mypy --strict` and Pydantic **validate_assignment=True** to catch accidental field additions.
- Represent external data with `@dataclass(frozen=True)` or `BaseModel` subclasses; include a `__meta__ = Retention(days=<int>)` attribute (see the sketch after this list).
- Default all sensitive fields to `SecretStr | None`. Access via explicit `.reveal()` helper.
- Provide factory functions that accept **only** the whitelisted fields; reject `**kwargs`.
- Raise `DataExcessError(field_name)` for unknown or superfluous fields.
- Use `_masked` suffix for any variable containing transformed data (e.g., `email_masked`). Never reuse the original name.
- SQL: Use parameterised queries, column whitelist, and **SELECT only required columns**. Avoid `SELECT *`.
- When serialising, exclude sensitive fields explicitly (e.g., `.model_dump(exclude={"card_token"})` in Pydantic v2) or use custom JSON encoders to strip secrets.
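`Retention`, the `.reveal()` helper, `DataExcessError`, and whitelisted factories are project conventions rather than Pydantic features; a minimal sketch of one way to wire them up (names and shapes are assumptions to adapt):

```python
# Hedged sketch of the project-level helpers referenced above; none of these names are library APIs.
from dataclasses import dataclass
from pydantic import BaseModel, SecretStr

@dataclass(frozen=True)
class Retention:
    days: int

class RevealableSecret(SecretStr):
    def reveal(self) -> str:
        # Single, greppable access point for raw secret values.
        return self.get_secret_value()

class DataExcessError(ValueError):
    def __init__(self, fields):
        super().__init__(f"Unexpected personal data fields: {sorted(fields)}")
        self.fields = set(fields)

def from_payload(model_cls: type[BaseModel], payload: dict, allowed: frozenset[str]) -> BaseModel:
    # Factory that accepts only whitelisted fields and never forwards **payload blindly.
    if unknown := payload.keys() - allowed:
        raise DataExcessError(unknown)
    return model_cls(**{k: v for k, v in payload.items() if k in allowed})
```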
Error Handling & Validation
- Validate incoming payload size and field count before deeper parsing; return HTTP 413 if exceeded.
- Early-exit pattern:
```python
if unknown := payload.keys() - ALLOWED_FIELDS:
    raise DataExcessError(unknown)
```
- Central `@app.exception_handler(DataExcessError)` ➜ returns 422 with explanatory message, logs field names **masked**.
- Integrate automated DLP scanner in CI; PR fails if new field lacks documented purpose & retention.
- Instrument Consent/Preference checks as decorators; raise `ConsentMissingError` early.
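A minimal sketch of the central handlers, assuming the `DataExcessError` and `pseudonymise()` helpers sketched in earlier sections; `ConsentMissingError` and the response wording are likewise project-defined:

```python
# Hedged sketch: central FastAPI handlers. DataExcessError and pseudonymise() come from the
# earlier sketches; adapt the status codes and messages to your API conventions.
import logging

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

logger = logging.getLogger(__name__)
app = FastAPI()

class ConsentMissingError(Exception):
    pass

@app.exception_handler(DataExcessError)
async def data_excess_handler(request: Request, exc: DataExcessError) -> JSONResponse:
    # Field names are masked before they reach logs.
    logger.warning("Excess fields on %s: %s", request.url.path, [pseudonymise(f) for f in exc.fields])
    return JSONResponse(status_code=422, content={"detail": "Payload contains fields this endpoint does not accept."})

@app.exception_handler(ConsentMissingError)
async def consent_missing_handler(request: Request, exc: ConsentMissingError) -> JSONResponse:
    return JSONResponse(status_code=403, content={"detail": "Required consent is not on record."})
```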
FastAPI (Framework-Specific)
- Use `Depends(get_consent)` on every route handling personal data.
- Group routes by data category (e.g., `/users/pii`, `/users/analytics`) for easier policy mapping.
- Response models must exclude sensitive fields: return dedicated public models (e.g., `UserPublic`) or use `response_model_exclude` rather than reusing internal models.
- Forbid extra fields on every request model (`model_config = ConfigDict(extra="forbid")`); FastAPI has no global switch for this, so enforce it through a shared base model.
- Throttle endpoints that expose personal data: 10 req/min user-level.
- Attach "Purpose" and "Retention" tags in OpenAPI using `openapi_extra` → autogenerates DPA docs.
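One possible shape for the consent dependency and the OpenAPI purpose/retention tags; the consent store, header choice, and tag names are assumptions to adapt to your stack:

```python
# Hedged sketch: consent dependency plus purpose/retention metadata in OpenAPI.
# consent_store, UserCreate, user_service and ConsentMissingError are assumed project components.
from fastapi import Depends, FastAPI, Header

app = FastAPI()

async def get_consent(x_user_id: str = Header(...)) -> None:
    # Replace with a lookup against your consent/preference service.
    if not await consent_store.has_consent(x_user_id, purpose="registration"):
        raise ConsentMissingError(x_user_id)

@app.post(
    "/users",
    dependencies=[Depends(get_consent)],
    openapi_extra={"x-purpose": "account_creation", "x-retention-days": 1095},
)
async def create_user(data: UserCreate):
    return await user_service.create(data)
```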
Pydantic Rules
- `model_config = ConfigDict(extra="forbid", ser_json_timedelta="iso8601")` to reject unknowns.
- Include `purpose: Literal["billing", "support", ...]` in every model; docs parse this for DPIA.
- Provide `.anonymise()` method returning a new model with all direct identifiers removed/hashed.
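A sketch of the `.anonymise()` convention; which fields count as direct identifiers is a per-model decision, and the `pseudonymise()` helper from the Key Principles section is assumed:

```python
# Hedged sketch: .anonymise() returns a copy with direct identifiers pseudonymised.
from pydantic import BaseModel, ConfigDict, EmailStr

class UserRecord(BaseModel):
    model_config = ConfigDict(extra="forbid")
    email: EmailStr
    username: str
    country: str  # indirect attribute, kept as-is

    def anonymise(self) -> "UserRecord":
        return self.model_copy(update={
            "email": f"{pseudonymise(str(self.email))}@redacted.invalid",
            "username": pseudonymise(self.username),
        })
```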
Infrastructure & Tooling
- Terraform modules must tag every S3/Blob bucket with `purpose`, `retention_days`, `pii_level`.
- Enable AWS Macie / GCP DLP scans nightly; send findings to central SIEM.
- GitHub Actions job `check_schema_diff` compares current Pydantic models vs main; fails if new personal fields exist without `CHANGELOG.md` & DPO ticket reference.
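A sketch of what `scripts/schema_diff.py` can look like; the baseline file, the `ALL_MODELS` registry, and the approval workflow are project conventions rather than anything prescribed by a library:

```python
# Hedged sketch of scripts/schema_diff.py; adapt the baseline location and registry import.
import json
import sys
from pathlib import Path

from myapp.models import ALL_MODELS  # hypothetical registry of all Pydantic models

BASELINE = Path("privacy/approved_fields.json")  # updated only in DPO-approved PRs

def current_fields() -> dict[str, list[str]]:
    return {m.__name__: sorted(m.model_fields) for m in ALL_MODELS}

def main() -> int:
    approved = json.loads(BASELINE.read_text())
    new = sorted(
        f"{model}.{field}"
        for model, fields in current_fields().items()
        for field in fields
        if field not in set(approved.get(model, []))
    )
    if new:
        print(f"New personal data fields without DPO approval: {new}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```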
Testing
- Use Hypothesis strategies that generate edge-case payloads with excessive/unknown fields to assert `DataExcessError` (see the sketch after this list).
- Run quarterly synthetic data drills: seed masked data ➜ run production read-paths ➜ verify zero failures.
- Unit tests must include `pytest.mark.retention` asserting `delete_outdated_records()` removes entities older than their retention period.
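A sketch of the Hypothesis-driven excess-field test, reusing the early-exit whitelist check from the Error Handling section; the field names and helper are illustrative:

```python
# Hedged sketch: property-based test that unknown fields always raise DataExcessError.
import pytest
from hypothesis import given, strategies as st

ALLOWED_FIELDS = {"email", "username", "purpose"}

class DataExcessError(ValueError):
    pass

def check_payload(payload: dict) -> None:
    if unknown := payload.keys() - ALLOWED_FIELDS:
        raise DataExcessError(unknown)

@given(
    extra=st.dictionaries(
        keys=st.text(min_size=1).filter(lambda k: k not in ALLOWED_FIELDS),
        values=st.text(),
        min_size=1,
    )
)
def test_excess_fields_are_rejected(extra: dict) -> None:
    payload = {"email": "a@example.com", "username": "alice", "purpose": "registration", **extra}
    with pytest.raises(DataExcessError):
        check_payload(payload)
```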
Performance Patterns
- Retrieve only hashes/identifiers required for joins; avoid wide tables.
- Implement lazy loading for optional personal data (a `/details` endpoint separated from `/summary`; see the sketch after this list).
- Use columnar storage (e.g., Parquet) for anonymised aggregates; no row-level PII = faster analytics.
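For the lazy-loading split, one way this might look (`svc` and `get_consent` are assumed project helpers):

```python
# Hedged sketch: the summary response carries no direct identifiers, while /details
# sits behind the consent dependency and tighter throttling.
from uuid import UUID

from fastapi import Depends, FastAPI
from pydantic import BaseModel

app = FastAPI()

class UserSummary(BaseModel):
    user_id: UUID
    plan: str  # no PII here

class UserDetails(BaseModel):
    user_id: UUID
    email: str
    shipping_address: str

@app.get("/users/{user_id}/summary", response_model=UserSummary)
async def user_summary(user_id: UUID):
    return await svc.get_summary(user_id)

@app.get("/users/{user_id}/details", response_model=UserDetails, dependencies=[Depends(get_consent)])
async def user_details(user_id: UUID):
    return await svc.get_details(user_id)
```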
Security
- Enforce TLS 1.3; HSTS on public endpoints.
- Store encryption keys in Vault; rotate every 90 days.
- Default KMS with `pii_level=high` alias; deny decrypt if IAM context lacks `purpose` tag match.
- Masked/anonymised datasets must still be encrypted at rest.
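A minimal sketch of field-level encryption for those datasets using the `cryptography` package already in the requirements; key retrieval and rotation (`load_key_from_vault()`) are placeholders for your Vault integration:

```python
# Hedged sketch: symmetric field-level encryption; key management is elided and assumed to
# live in Vault with rotation per your policy.
from cryptography.fernet import Fernet

fernet = Fernet(load_key_from_vault())  # hypothetical Vault lookup

def encrypt_field(value: str) -> bytes:
    return fernet.encrypt(value.encode())

def decrypt_field(token: bytes) -> str:
    return fernet.decrypt(token).decode()
```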
Retention & Deletion Automation
- Implement `cleanup.py` cron job:
```python
from datetime import datetime, timedelta
from sqlalchemy import text

for model in ALL_MODELS:
    table = model.__tablename__  # table names come from the whitelisted model registry
    cutoff = datetime.utcnow() - timedelta(days=model.__meta__.retention_days)
    db.execute(text(f"DELETE FROM {table} WHERE created_at < :cutoff"), {"cutoff": cutoff})
```
- Log # rows purged; alert if deletion fails.
Common Pitfalls & How to Avoid
- "Shadow fields": engineers add `metadata` JSON column to sneak in extra info ➜ blocked by `JSON_SCHEMA_VALIDATOR`.
- Hashing emails with a predictable salt (e.g., the email domain) leaves them trivially reversible by brute force ➜ always use a cryptographically secure random salt stored separately from the data.
- Forgetting to delete backups ➜ apply same lifecycle policies to backup buckets.
Example Minimal Endpoint
```python
from typing import Literal

from fastapi import Depends
from pydantic import BaseModel, EmailStr, Field, constr
# app, Retention, UserPublic, get_consent and svc are project helpers defined elsewhere.

class UserCreate(BaseModel):
    email: EmailStr
    username: constr(min_length=3, max_length=30)
    purpose: Literal["registration"] = Field(..., description="Purpose: account creation")
    __meta__ = Retention(days=365 * 3)

@app.post("/users", status_code=201, response_model=UserPublic)
async def create_user(data: UserCreate, consent=Depends(get_consent)):
    user_id = await svc.create_user(data)
    return await svc.get_user_public(user_id)
```
Adoption Checklist
- [ ] Data map created with stakeholders, listing purpose & retention for each field.
- [ ] Pydantic models reviewed by DPO.
- [ ] CI guards in place (schema diff, DLP scan).
- [ ] Monitoring & alerting for excessive collection events.
- [ ] Quarterly minimisation audit scheduled.