Actionable rules for designing, building, and operating modern data lakes, lakehouses, and data warehouses across cloud platforms.
Modern data teams shouldn't be forced to choose between the raw flexibility of data lakes and the structured performance of data warehouses. Your production systems need both — and the lakehouse architecture delivers exactly that, combining schema-on-read flexibility with schema-on-write performance in a unified platform.
You've probably lived this nightmare: Your team maintains separate ETL pipelines for your data lake (schema-on-read for exploration) and your data warehouse (schema-on-write for reporting). Every new data source means double the work, double the governance overhead, and inevitable consistency issues between systems.
The hidden costs stack up fast.
These Cursor Rules implement a lakehouse architecture that eliminates the false choice between lakes and warehouses. You get the best of both worlds: raw data flexibility for exploration with enterprise-grade governance and performance for production workloads.
What you're building:
Instead of maintaining separate ETL processes for lakes and warehouses, process everything in-place. Teams report 50-70% reduction in data pipeline complexity and 30-40% lower cloud costs from eliminating duplicate storage.
Built-in schema evolution with Delta Lake means your streaming pipelines never break when upstream schemas change. One team eliminated 95% of their schema-related production incidents after implementing these patterns.
Structured Streaming with checkpointing delivers exactly-once semantics while maintaining sub-second query performance on TB-scale datasets. No more choosing between real-time and analytical workloads.
Every dataset automatically gets cataloged with ownership, SLAs, and quality metrics. RBAC/ABAC controls apply consistently across all access patterns, whether via SQL, Spark, or ML notebooks.
Before (Traditional Lake + Warehouse):
```
# Lake ingestion pipeline
raw_df = spark.read.json("s3://incoming/events/")
raw_df.write.parquet("s3://lake/bronze/events/")

# Separate warehouse ETL
transformed = transform_events(raw_df)  # Custom logic
transformed.write.jdbc("jdbc:redshift://...", "events_fact")
```
After (Lakehouse with These Rules):
```
from pyspark.sql import DataFrame

def ingest_with_quality_gates(
    source_path: str,
    target_table: str,
    key_cols: list[str]
) -> None:
    # Schema validation with auto-evolution
    df = (spark.read
          .option("mergeSchema", "true")
          .format("json")
          .load(source_path))

    # Built-in data quality checks
    quality_suite.validate(df)

    # Upsert with ACID guarantees
    upsert_to_delta(df, target_table, key_cols)

    # Auto-catalog with lineage
    register_in_catalog(target_table, source_path)
```
Impact: New data sources go from a 2-3 day implementation to 2-3 hours, with governance and quality built in automatically.
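The snippet above assumes a few helper functions (quality_suite.validate, upsert_to_delta, register_in_catalog). upsert_to_delta is shown in the rules below; a minimal, hypothetical sketch of register_in_catalog, using Delta table properties as a lightweight metadata store, could look like this (the property names and team name are placeholders):

```
from datetime import datetime, timezone

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def register_in_catalog(target_table: str, source_path: str) -> None:
    # Hypothetical helper: attach ownership and source lineage as table
    # properties so catalog tooling (Glue, Unity Catalog, ...) can surface them.
    registered_at = datetime.now(timezone.utc).isoformat()
    spark.sql(f"""
        ALTER TABLE {target_table} SET TBLPROPERTIES (
            'owner' = 'data-platform-team',
            'source_path' = '{source_path}',
            'registered_at' = '{registered_at}'
        )
    """)
```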
Before: Separate batch processing for features, real-time serving layer, complex synchronization logic.
After: Unified streaming pipeline serving both real-time predictions and batch training:
```
from pyspark.sql.functions import col, from_json

# Single pipeline for both batch and streaming
streaming_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", brokers)
    .option("subscribe", topic)
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*"))

# Transform once, serve everywhere
features_df = compute_features(streaming_df)

# Delta table serves both real-time and batch consumers
(features_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_path)
    .toTable("feature_store.user_features"))
```
Before: Manual lineage tracking, scattered audit logs, weeks of preparation for compliance reviews.
After: Automated compliance with full audit trails:
```
from pyspark.sql import DataFrame
from pyspark.sql.functions import current_timestamp

# Every transformation automatically tracked
@track_lineage
def transform_pii_data(df: DataFrame) -> DataFrame:
    return (df.withColumn("email_hash", hash_pii("email"))
              .drop("email")  # Auto-flagged for PII audit
              .withColumn("processed_at", current_timestamp()))

# Automatic retention policy enforcement
apply_retention_policy("customer_data", days=2555)  # 7 years GDPR
```
Add these rules to your project's `.cursor/rules` directory, then point your environment at your cloud account:

```
# AWS example
export AWS_PROFILE=your-data-profile
export SPARK_OPTS="--conf spark.sql.adaptive.enabled=true"
```

```
# This will auto-configure Delta Lake with optimal settings
spark.sql("CREATE TABLE customer_events USING DELTA LOCATION 's3://your-lake/tables/customer_events'")
```
Start with your highest-volume, most critical data pipeline.
Ready to eliminate the lake vs warehouse complexity? These Cursor Rules give you production-ready lakehouse patterns that scale from prototype to petabyte. Your data team will thank you for choosing architecture that grows with your needs instead of forcing artificial constraints.
Start with your most critical pipeline — the patterns are designed to prove value immediately while building toward comprehensive data platform transformation.
You are an expert in Python (PySpark), ANSI-SQL, Apache Spark, Delta Lake/Iceberg, Apache Kafka, AWS (S3, Glue, Redshift), GCP (BigQuery, Dataplex, BigLake), Azure (ADLS Gen2, Synapse), Terraform, and CI/CD (GitHub Actions).
Key Principles
- Treat data as a product: each dataset has an owner, contract, schema, SLA, and quality KPIs.
- Prefer ELT → transform in-lake; ingest raw, immutable data first ("bronze") then refine (see the sketch after this list).
- Separate storage (S3/ADLS/GCS) from compute; design for multi-cloud portability.
- Use schema-on-read in lakes, schema-on-write in warehouses; converge via lakehouse formats (Delta/Iceberg).
- Automate governance: cataloging, lineage, RBAC/ABAC, and lifecycle policies.
- Optimize for incremental processing, idempotency, and replayability (event sourcing).
- Security first: encryption in transit & at rest, least-privilege IAM, regular audits.
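To make the ELT / bronze-first principle concrete, here is a minimal sketch of a bronze-to-silver flow; bucket paths, column names, and the dedup key are placeholders:

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, input_file_name

spark = SparkSession.builder.getOrCreate()

# Bronze: land raw, immutable data exactly as received, plus audit columns
bronze = (spark.read.format("json").load("s3://your-lake/incoming/orders/")
          .withColumn("_ingested_at", current_timestamp())
          .withColumn("_source_file", input_file_name()))
bronze.write.format("delta").mode("append").save("s3://your-lake/bronze/orders/")

# Silver: refine in-lake (dedupe, cast, filter) without re-ingesting
silver = (spark.read.format("delta").load("s3://your-lake/bronze/orders/")
          .dropDuplicates(["order_id"])
          .withColumn("order_ts", col("order_ts").cast("timestamp"))
          .filter(col("order_id").isNotNull()))
silver.write.format("delta").mode("overwrite").save("s3://your-lake/silver/orders/")
```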
Python (PySpark)
- Write only PySpark DataFrame/Dataset code—no RDDs.
- Always enable type hints and mypy; use PEP-8 with Black & isort.
- Functions must be pure; no hidden side effects. Avoid global state.
- Prefer immutable variables: avoid re-assignment (Scala-style `val` semantics).
- Partition & file format parameters ("partition_by", "format") are function arguments, never constants.
- Use `@udf` only when vectorized alternatives (built-ins, pandas-udf) don’t exist.
- Example:
```
from delta.tables import DeltaTable
from pyspark.sql import DataFrame

def upsert_to_delta(df: DataFrame, table: str, key_cols: list[str]) -> None:
    # Merge (upsert) into an existing Delta table on the given key columns
    (DeltaTable.forName(spark, table)
        .alias("t")
        .merge(df.alias("s"), " AND ".join([f"t.{k}=s.{k}" for k in key_cols]))
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
```
SQL
- Use CTEs (`WITH`) for readability; nest ≤ 3 levels.
- Never `SELECT *`; list columns explicitly.
- Use snake_case for identifiers; suffix fact tables with `_fct`, dimension tables with `_dim`.
- Time zones: store timestamps in UTC; convert at presentation layer.
- Enforce data contracts with `NOT NULL` and `CHECK` constraints in warehouses.
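A short illustration of these conventions, wrapped in spark.sql to stay consistent with the PySpark examples (the schema, table, and column names are made up):

```
daily_orders = spark.sql("""
    WITH recent_orders AS (
        SELECT order_id, customer_id, order_ts, order_amount
        FROM sales.orders_fct
        WHERE order_ts >= date_sub(current_date(), 30)
    )
    SELECT c.customer_id,
           c.customer_segment,
           COUNT(o.order_id)   AS order_count,
           SUM(o.order_amount) AS total_amount
    FROM recent_orders o
    JOIN sales.customer_dim c
      ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.customer_segment
""")
```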
Error Handling & Validation
- Validate schema on ingestion with Great Expectations or Deequ; fail fast.
- Detect schema drift: compare incoming Avro/JSON schema to catalog; auto-version or reject.
- Apply data quality gates: uniqueness, referential integrity, distribution outliers.
- Use try/except around I/O; always log dataset, partition, and row counts.
- Push metrics to Prometheus/Grafana; alert on SLA violations.
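A minimal sketch of the I/O-with-logging rule; the path layout and partition column are assumptions, and the validation step would normally be a Great Expectations or Deequ suite rather than a bare row-count check:

```
import logging

from pyspark.sql import DataFrame, SparkSession

logger = logging.getLogger("lakehouse.ingest")
spark = SparkSession.builder.getOrCreate()

def load_partition(table: str, partition_date: str) -> DataFrame:
    try:
        df = (spark.read.format("delta")
              .load(f"s3://your-lake/silver/{table}/")
              .filter(f"event_date = '{partition_date}'"))
        row_count = df.count()
        logger.info("dataset=%s partition=%s rows=%d", table, partition_date, row_count)
        if row_count == 0:
            # Fail fast: an empty partition usually signals an upstream problem
            raise ValueError(f"No rows found for {table} on {partition_date}")
        return df
    except Exception:
        logger.exception("dataset=%s partition=%s load failed", table, partition_date)
        raise
```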
Apache Spark / Lakehouse Rules
- Storage formats: Delta by default, Parquet as fallback; Iceberg for multi-engine needs.
- Partitioning: be cautious with high-cardinality partition columns; aim 100 MB ≤ file ≤ 1 GB.
- Enable schema evolution but disable schema overwrite (`mergeSchema=true`, `overwriteSchema=false`).
- Use Structured Streaming + checkpointing to S3/ADLS for exactly-once sinks.
- Tune shuffle partitions: `spark.sql.shuffle.partitions = max(200, inRows/1e6)`.
- Prefer Auto Loader (Databricks) or Glue Streaming for incremental ingestion.
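For instance, a batch append that permits additive schema evolution but never overwrites the existing schema might look like this (the source table, partition column, and target path are placeholders):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events_df = spark.table("bronze_events")  # placeholder source table

# Append new data, allowing additive schema evolution (mergeSchema=true)
# without ever overwriting the existing table schema.
(events_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .partitionBy("event_date")
    .save("s3://your-lake/silver/events/"))
```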
Kafka Ingestion
- Topics are single-entity, append-only, Avro-encoded with schema registry.
- Use compacted topics for dimension changes; enforce idempotent producers.
- Consumer apps commit offsets after successful Delta/Iceberg write + checkpoint.
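A sketch of the Kafka-to-Delta path under these rules. It assumes the spark-sql-kafka and spark-avro packages are on the classpath and that payloads are plain Avro; with Confluent Schema Registry framing you would strip the wire-format header or use a registry-aware deserializer. Broker, topic, and schema are placeholders.

```
from pyspark.sql import SparkSession
from pyspark.sql.avro.functions import from_avro

spark = SparkSession.builder.getOrCreate()

# The Avro schema would normally be fetched from the schema registry;
# inlined here for brevity.
order_schema = """
{"type": "record", "name": "Order", "fields": [
  {"name": "order_id", "type": "string"},
  {"name": "amount", "type": "double"}
]}
"""

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker-1:9092")
       .option("subscribe", "orders")
       .option("startingOffsets", "latest")
       .load())

orders = (raw.select(from_avro(raw.value, order_schema).alias("order"))
             .select("order.*"))

# Offsets are tracked in the streaming checkpoint and advance only after a
# successful Delta write, which is what gives end-to-end exactly-once behavior.
(orders.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://your-lake/_checkpoints/orders/")
    .outputMode("append")
    .toTable("bronze_orders"))
```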
Testing
- Unit: pytest with `spark_session` fixture; coverage ≥ 90%.
- Contract tests: run Great Expectations suites in CI on sample data.
- Integration: spin up localstack or moto to mock cloud services; test end-to-end DAG with Airflow’s `DagBag`.
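A minimal pytest setup along these lines, with a session-scoped `spark_session` fixture and one example assertion:

```
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_session():
    # Local Spark session shared across the test session
    spark = (SparkSession.builder
             .master("local[2]")
             .appName("unit-tests")
             .getOrCreate())
    yield spark
    spark.stop()

def test_dedup_keeps_one_row_per_key(spark_session):
    df = spark_session.createDataFrame(
        [(1, "a"), (1, "a"), (2, "b")], ["id", "value"]
    )
    assert df.dropDuplicates(["id"]).count() == 2
```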
Performance Optimization
- Cluster-aware file sizes: `spark.sql.files.maxPartitionBytes = 134217728` (128 MB).
- Enable adaptive query execution (AQE) & dynamic partition pruning.
- Auto-optimize & vacuum Delta tables nightly; retain 30 days of history.
- Use AI-driven resource scaling (Databricks Serverless, AWS Glue Auto-Scaling).
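A sketch of a nightly maintenance job applying these settings. OPTIMIZE/ZORDER and VACUUM syntax is Delta-specific and depends on your Delta runtime version; the table name is a placeholder.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adaptive execution and file-size tuning (values from the rules above)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))  # 128 MB

# Nightly maintenance: compact small files, then purge old snapshots
spark.sql("OPTIMIZE silver_events ZORDER BY (event_date)")
spark.sql("VACUUM silver_events RETAIN 720 HOURS")  # 30 days of history
```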
Security
- Encrypt storage (`SSE-KMS`, `CMK`) + TLS 1.2 for all endpoints.
- Implement fine-grained access with Lake Formation, Unity Catalog, or BigLake ACLs.
- Enable row- & column-level security (RLS/CLS) in the warehouse.
- Rotate keys quarterly; run quarterly penetration tests; log IAM changes.
Governance & Catalog
- Every table registered in a central catalog (Glue, Unity, BigQuery Data Catalog).
- Add business metadata: owner, SLA, PII flag, retention policy.
- Use tags for GDPR/PHI classification; automate masking policies.
- Capture lineage via Spark-listener → OpenLineage → Marquez.
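One possible wiring: business metadata as Delta table properties, and lineage captured by the OpenLineage Spark listener shipping events to Marquez. The listener class and spark.openlineage.* config keys vary across OpenLineage versions, so treat them as assumptions to verify against your deployment; the table and team names are placeholders.

```
from pyspark.sql import SparkSession

# Lineage: attach the OpenLineage listener at session build time so Spark
# actions emit lineage events toward Marquez (config keys version-dependent).
spark = (SparkSession.builder
         .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
         .config("spark.openlineage.transport.type", "http")
         .config("spark.openlineage.transport.url", "http://marquez:5000")
         .config("spark.openlineage.namespace", "lakehouse")
         .getOrCreate())

# Business metadata: owner, SLA, PII flag, and retention as table properties
spark.sql("""
    ALTER TABLE silver_customers SET TBLPROPERTIES (
        'owner' = 'crm-team',
        'sla_hours' = '4',
        'pii' = 'true',
        'retention_days' = '2555'
    )
""")
```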
Deployment & IaC
- Define storage buckets, IAM, Glue Catalog DBs, Redshift/BigQuery/Synapse via Terraform; lock state in S3 + DynamoDB.
- CI/CD: GitHub Actions → lint, test, build Docker image, deploy DAGs to Airflow.
- Blue/green deployment for Spark jobs using separate job IDs.
Monitoring & Observability
- Log Spark events to cloud-native log services (CloudWatch, Stackdriver, Log Analytics).
- Emit OpenTelemetry traces for DAG runs.
- SLA dashboard: freshness, volume, row-count variance, schema changes.
Common Pitfalls & Mitigations
- Pitfall: Small files → Mitigation: `OPTIMIZE ZORDER BY` and compaction jobs.
- Pitfall: Unbounded schema drift → Mitigation: enforce contracts + auto versioning (see the drift-check sketch below).
- Pitfall: Data swamp (no governance) → Mitigation: mandatory catalog registration and quality gates.
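For the schema-drift pitfall, a minimal drift check might look like the following sketch, which compares an incoming DataFrame against the target table's current schema instead of a full catalog lookup:

```
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

def check_schema_drift(incoming: DataFrame, target_table: str) -> None:
    # Compare incoming columns against the registered schema; reject removals
    # and type changes, allow (and report) purely additive columns.
    current = {f.name: f.dataType for f in spark.table(target_table).schema.fields}
    incoming_fields = {f.name: f.dataType for f in incoming.schema.fields}

    missing = set(current) - set(incoming_fields)
    changed = {c for c in current if c in incoming_fields and current[c] != incoming_fields[c]}
    added = set(incoming_fields) - set(current)

    if missing or changed:
        raise ValueError(f"Schema drift rejected: missing={missing}, changed={changed}")
    if added:
        print(f"Additive schema change detected, will auto-version: {added}")
```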