Actionable rules for building, operating, and maintaining enterprise-grade data-warehousing solutions on Snowflake with dbt, Snowpark, Data Vault 2.0 and Terraform.
Tired of manually managing Snowflake environments, debugging performance issues at 2 AM, and explaining cost overruns to finance? This comprehensive ruleset transforms your Snowflake development workflow from reactive firefighting to proactive engineering excellence.
Every data engineer knows the frustration: Snowflake is incredibly powerful, but that power comes with complexity that can derail projects and budgets. You're dealing with runaway compute costs, queries that spill to disk at the worst moment, manual deployments that break in production, and modeling standards that drift from team to team.
Sound familiar? You're not alone. Most teams spend 60-70% of their time on operational overhead instead of building value.
This ruleset codifies the patterns used by teams running petabyte-scale Snowflake environments without breaking budgets or SLAs. It's not just best practices—it's battle-tested automation that eliminates entire categories of problems.
What This Ruleset Delivers: three before-and-after snapshots from a typical Snowflake workflow.
Before: Developer creates a new dimension table
-- Manual process, error-prone
CREATE TABLE finance_db.reporting.customer_dim AS
SELECT * FROM raw_data.customers; -- Oops, SELECT *
GRANT SELECT ON finance_db.reporting.customer_dim TO reporting_role;
Result: 3 hours to deploy, breaks in production, security review required
After: Using the ruleset
-- models/core/dim_customer.sql
{{ config(
materialized='table',
pre_hook="{{ grant_usage('ANALYST_ROLE') }}"
) }}
SELECT
{{ dbt_utils.generate_surrogate_key(['customer_id']) }} AS customer_key,
customer_id,
customer_name,
customer_type,
_loaded_at
FROM {{ ref('stg_customers') }}
WHERE customer_id IS NOT NULL
Result: 10 minutes to deploy, automatic testing, security built-in
Before: Query performance investigation
-- Finding slow queries manually
SELECT * FROM snowflake.account_usage.query_history
WHERE execution_time > 60000; -- Hunt through thousands of rows
Result: Hours of manual analysis, guesswork on fixes
After: Automated monitoring with the ruleset
-- Automatic performance monitoring
CREATE OR REPLACE VIEW core.query_performance_alerts AS
SELECT
query_id,
query_text,
execution_time,
bytes_spilled_to_local_storage,
CASE
WHEN bytes_spilled_to_local_storage > 0 THEN 'MEMORY_ISSUE'
WHEN execution_time > 300000 THEN 'PERFORMANCE_ISSUE'
ELSE 'SLOW_QUERY'
END AS alert_type
FROM snowflake.account_usage.query_history
WHERE execution_time > 60000
QUALIFY ROW_NUMBER() OVER (ORDER BY execution_time DESC) <= 10;
Result: Proactive alerts, clear remediation paths, 95% fewer performance issues
Before: Manual hub and satellite creation
-- Error-prone manual implementation
CREATE TABLE hub_customer (
customer_hash_key VARCHAR(64),
customer_id VARCHAR(50),
load_date TIMESTAMP_NTZ,
record_source VARCHAR(50)
);
Result: Inconsistent hashing, audit trail gaps, compliance issues
After: Standardized Data Vault patterns
-- Automated hub creation with proper hashing
{{ config(materialized='incremental', unique_key='hub_hash_key') }}
SELECT
{{ dbt_utils.generate_surrogate_key(['customer_id']) }} AS hub_hash_key,
customer_id,
CURRENT_TIMESTAMP() AS load_datetime,
'CRM_SYSTEM' AS record_source
FROM {{ ref('stg_customers') }}
WHERE customer_id IS NOT NULL
Result: Consistent implementation, full audit trail, compliance-ready
# Clone and configure your Snowflake workspace
git clone your-snowflake-project
cd your-snowflake-project
# Install dependencies
pip install snowflake-snowpark-python==1.* dbt-snowflake
terraform init
Add the ruleset to your project's .cursorrules file, then scaffold the infrastructure and models shown below.
# terraform/main.tf - Infrastructure as Code
module "warehouse" {
source = "./modules/warehouse"
warehouse_name = "transform_wh"
warehouse_size = "MEDIUM"
auto_suspend = 300
tags = {
project = "data_platform"
env = "production"
owner = "data_team"
}
}
-- models/staging/stg_orders.sql
{{ config(
materialized='incremental',
unique_key='order_id',
on_schema_change='sync_all_columns'
) }}
SELECT
order_id,
customer_id,
order_date,
order_amount,
_loaded_at
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
WHERE _loaded_at > (SELECT MAX(_loaded_at) FROM {{ this }})
{% endif %}
-- Create monitoring infrastructure
CREATE OR REPLACE PROCEDURE core.monitor_warehouse_usage()
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
-- warehouse_metering_history exposes credits_used per warehouse per hour
INSERT INTO core.warehouse_usage_log
SELECT
warehouse_name,
credits_used,
start_time,
CURRENT_TIMESTAMP()
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -1, CURRENT_TIMESTAMP());
RETURN 'Monitoring data loaded successfully';
END;
$$;
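To run this procedure on a schedule without an external orchestrator, it can be wrapped in a Snowflake task. A minimal sketch follows; the 06:00 UTC cron is an illustrative assumption, and it reuses the transform_wh warehouse from the Terraform example.
-- Sketch: call the monitoring procedure daily; schedule and warehouse are illustrative.
CREATE OR REPLACE TASK core.monitor_warehouse_usage_task
WAREHOUSE = transform_wh
SCHEDULE = 'USING CRON 0 6 * * * UTC'
AS
CALL core.monitor_warehouse_usage();
-- Tasks are created suspended; resume to activate.
ALTER TASK core.monitor_warehouse_usage_task RESUME;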
Every day you spend debugging performance issues, explaining cost overruns, or manually managing deployments is a day you're not shipping features that matter to your business.
This ruleset eliminates the operational overhead that's keeping your team from building the data platform your organization needs. You'll spend your time on architecture and analysis instead of firefighting and maintenance.
Ready to transform your Snowflake development experience? Copy these rules into your .cursorrules file and start building production-grade data warehouses with confidence.
Your future self (and your finance team) will thank you.
You are an expert in Snowflake SQL, dbt, Snowpark (Python/Java/Scala), Terraform, Data Vault 2.0, GitHub Actions, and cloud platforms (AWS | Azure | GCP).
Key Principles
- Layered architecture: RAW ➜ STAGING ➜ CURATED/CORE ➜ PRESENTATION; data flows strictly left-to-right.
- Scripts are idempotent and side-effect–free; each rerun produces identical data.
- Prefer set-based transformations over row-by-row logic; never use cursors unless absolutely required.
- Use explicit, fully-qualified object names (DB.SCHEMA.OBJECT) in all code and IaC.
- Enforce Role-Based Access Control (RBAC) using least-privilege and separation of duties (ingest, transform, consume); a minimal role sketch follows this list.
- Treat infrastructure and transformation logic as code; every change is version-controlled, peer-reviewed, and deployed via CI/CD.
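A minimal sketch of the separation-of-duties roles, reusing the role names from the Security section; the database and schema names are illustrative and the grants are not exhaustive.
-- Sketch: functional roles for ingest, transform, and consume duties.
CREATE ROLE IF NOT EXISTS load_app;
CREATE ROLE IF NOT EXISTS transform_app;
CREATE ROLE IF NOT EXISTS reporting_ro;
-- Ingest: write access to the raw layer only.
GRANT USAGE ON DATABASE prd_finance_db TO ROLE load_app;
GRANT USAGE, CREATE TABLE ON SCHEMA prd_finance_db.raw TO ROLE load_app;
-- Transform: read raw, build core.
GRANT USAGE ON DATABASE prd_finance_db TO ROLE transform_app;
GRANT SELECT ON ALL TABLES IN SCHEMA prd_finance_db.raw TO ROLE transform_app;
GRANT USAGE, CREATE TABLE, CREATE VIEW ON SCHEMA prd_finance_db.core TO ROLE transform_app;
-- Consume: read-only access to the presentation layer.
GRANT USAGE ON DATABASE prd_finance_db TO ROLE reporting_ro;
GRANT USAGE ON SCHEMA prd_finance_db.pres TO ROLE reporting_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA prd_finance_db.pres TO ROLE reporting_ro;
GRANT SELECT ON ALL VIEWS IN SCHEMA prd_finance_db.pres TO ROLE reporting_ro;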
Snowflake SQL Rules
- Formatting
• Uppercase SQL keywords, lowercase object names: `SELECT col1 FROM core.sales_fct`.
• One clause per line; trailing commas.
- Object Naming
• Databases: `<env>_<domain>_db` (e.g., `prd_finance_db`).
• Schemas: `raw`, `stg`, `core`, `pres` or `dv_hub`, `dv_sat`, `dv_link`.
• Tables/Views: snake_case singular nouns (`customer_dim`, `sales_fct`).
- Coding
• Never use `SELECT *`; always list columns to enable pruning.
• Qualify column references with table aliases to avoid ambiguity.
• Prefer `CREATE OR REPLACE` for idempotency.
• Separate DDL and DML into distinct files.
• Use `MERGE` with `QUALIFY ROW_NUMBER() OVER (PARTITION BY <key> ORDER BY <load timestamp> DESC) = 1` to de-duplicate the source before upserting; avoid `DELETE/INSERT` patterns (see the MERGE sketch after these rules).
• Use `ALTER MATERIALIZED VIEW … REFRESH` only when automatic refresh isn’t sufficient.
- Performance
• Define clustering keys on large (>100 GB) tables that are frequently filtered on the same columns.
• Micro-partitions are sized and managed automatically (roughly 16 MB compressed); monitor `bytes_spilled_to_local_storage` as an early warning signal for undersized warehouses.
• Split heavy transformations across transient warehouses sized for the workload; leverage `AUTO_SUSPEND = 60`.
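As referenced in the Coding rules above, a minimal upsert sketch combining `MERGE` with a de-duplicated source; the table and column names are illustrative.
-- Sketch: de-duplicate staged rows, then upsert into the core dimension.
MERGE INTO core.customer_dim AS tgt
USING (
    SELECT
        customer_id,
        customer_name,
        customer_type,
        _loaded_at
    FROM stg.customers
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY customer_id
        ORDER BY _loaded_at DESC
    ) = 1
) AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET
    customer_name = src.customer_name,
    customer_type = src.customer_type,
    _loaded_at = src._loaded_at
WHEN NOT MATCHED THEN INSERT (customer_id, customer_name, customer_type, _loaded_at)
VALUES (src.customer_id, src.customer_name, src.customer_type, src._loaded_at);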
Snowpark (Python) Rules
- Always pin the package version: `snowflake-snowpark-python==1.*` in `requirements.txt`.
- Use DataFrame API; avoid `collect()` unless data ≤ 10 MB.
- Utilize vectorized UDFs instead of row UDFs for large datasets.
- Push filters/projections early so they are executed in-database.
- In dbt Python models, return the Snowpark DataFrame from the `model(dbt, session)` function where possible.
Error Handling & Validation
- Encapsulate DML in stored procedures with `EXCEPTION WHEN` blocks; insert errors into `core.etl_error_log` (columns: job_id, step, error_code, error_message, payload, created_at); a sketch follows this list.
- Implement resource monitors per warehouse; set `AUTO_SUSPEND` and `AUTO_RESUME` plus `NOTIFY_USERS`.
- Establish row-count reconciliation between layers; load audits into `core.etl_audit_log`.
- Apply dbt tests (`unique`, `not_null`, `accepted_values`, `relationships`) on every merge.
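A sketch of the error-handling pattern referenced above, assuming `core.etl_error_log` and the staging tables already exist; procedure name and column lists are illustrative.
-- Sketch: wrap a load step so any failure is logged before being re-raised.
CREATE OR REPLACE PROCEDURE core.load_stg_orders(job_id VARCHAR)
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
    INSERT INTO stg.orders (order_id, customer_id, order_date, order_amount, _loaded_at)
    SELECT order_id, customer_id, order_date, order_amount, CURRENT_TIMESTAMP()
    FROM raw.orders;
    RETURN 'stg.orders loaded';
EXCEPTION
    WHEN OTHER THEN
        LET err_code := SQLCODE;
        LET err_msg := SQLERRM;
        INSERT INTO core.etl_error_log (job_id, step, error_code, error_message, payload, created_at)
        VALUES (:job_id, 'load_stg_orders', :err_code, :err_msg, NULL, CURRENT_TIMESTAMP());
        RAISE;  -- re-raise so the orchestrator still sees the failure
END;
$$;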
Framework-Specific Rules
Data Vault 2.0
- Hubs: natural business keys hashed with `SHA2(CONCAT_WS('|', key_cols), 256)`; column `hub_hash_key` as PK.
- Satellites: use `LOAD_DATETIME`, `LOAD_USER`, and a `HASH_DIFF` for change detection; PK is `(hub_hash_key, load_datetime)`.
- Links: hashed composite keys; PK `link_hash_key`.
- Never update satellite rows; insert new versions.
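Following the insert-only rule above, a sketch of a satellite as a dbt model; the source, attribute columns, and hashdiff composition are illustrative.
-- models/raw_vault/sat_customer_details.sql
-- Sketch: append-only satellite with hashdiff-based change detection.
{{ config(materialized='incremental', incremental_strategy='append') }}
WITH src AS (
    SELECT
        {{ dbt_utils.generate_surrogate_key(['customer_id']) }} AS hub_hash_key,
        SHA2(CONCAT_WS('|', customer_name, customer_type), 256) AS hash_diff,
        customer_name,
        customer_type,
        CURRENT_TIMESTAMP() AS load_datetime,
        CURRENT_USER() AS load_user,
        'CRM_SYSTEM' AS record_source
    FROM {{ ref('stg_customers') }}
    WHERE customer_id IS NOT NULL
)
SELECT
    s.hub_hash_key,
    s.hash_diff,
    s.customer_name,
    s.customer_type,
    s.load_datetime,
    s.load_user,
    s.record_source
FROM src AS s
{% if is_incremental() %}
-- Insert only rows whose attributes changed since the stored versions.
WHERE NOT EXISTS (
    SELECT 1
    FROM {{ this }} AS t
    WHERE t.hub_hash_key = s.hub_hash_key
      AND t.hash_diff = s.hash_diff
)
{% endif %}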
DBT
- `models/` mirrors warehouse layers. Example path: `models/core/fct_sales.sql`.
- Use `materialized='incremental'` with `unique_key` and `on_schema_change='sync_all_columns'`.
- Place a `pre_hook` to grant usage, e.g. `{{ grant_usage('ANALYST_ROLE') }}` (project macro).
- Auto-generate documentation with `dbt docs generate`; publish to internal portal nightly.
Terraform
- Provider block pins version: `source = "Snowflake-Labs/snowflake"` `version = "~> 0.60"`.
- Separate state files by environment; backend in S3 with SSE-KMS.
- Naming convention follows SQL rules; tags: `project`, `env`, `owner`.
- Module structure:
modules/
warehouse/
role/
database/
Additional Sections
Testing & CI/CD
- GitHub Actions: trigger on PR; run `dbt build --target prd --select state:modified+` and `terraform plan`.
- Break build on any dbt test failure or Terraform plan drift.
Performance Optimization
- Materialized views for aggregates accessed ≥ 500×/day; target a refresh lag of roughly 5 minutes.
- Enable automatic clustering; fall back to manual reclustering if daily pruning efficiency drops below 70% (see the clustering sketch after this list).
- Monitor Query Profiler; remediate 95th-percentile queries > 1 minute.
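A sketch of the clustering workflow referenced above; the fact table and key columns are illustrative.
-- Sketch: cluster a large, frequently filtered fact table and check pruning health.
ALTER TABLE core.sales_fct CLUSTER BY (order_date, customer_key);
-- Inspect clustering quality; shallow average depth indicates good pruning.
SELECT SYSTEM$CLUSTERING_INFORMATION('core.sales_fct', '(order_date, customer_key)');
-- Pause or resume background reclustering if maintenance credits outweigh the benefit.
ALTER TABLE core.sales_fct SUSPEND RECLUSTER;
ALTER TABLE core.sales_fct RESUME RECLUSTER;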
Security
- Enable Dynamic Data Masking for PII columns (`email`, `phone`); use masking policies at the column level (see the masking sketch after this list).
- Encrypt data in transit (TLS 1.2+) and at rest (Snowflake encrypts all data by default); restrict client connectivity with network policies.
- Separate roles: `SYSADMIN` (object creation), `LOAD_APP`, `TRANSFORM_APP`, `REPORTING_RO`.
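A sketch of a column-level masking policy for the PII rule above; the privileged role list and mask literal are illustrative.
-- Sketch: unmask PII only for privileged roles; everyone else sees a redacted value.
CREATE MASKING POLICY IF NOT EXISTS core.mask_email AS (val STRING)
RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() IN ('SYSADMIN', 'TRANSFORM_APP') THEN val
        ELSE '***MASKED***'
    END;
-- Attach the policy to the PII column.
ALTER TABLE core.customer_dim MODIFY COLUMN email SET MASKING POLICY core.mask_email;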
Cost Management
- Tag warehouses with object tags (`ALTER WAREHOUSE … SET TAG`); analyze `ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY` weekly for credit consumption (see the sketch after this list).
- Schedule nightly `SYSTEM$ABORT_SESSION` for idle sessions > 2 hrs.
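A sketch of the tagging and spend-control pattern referenced above; the tag name and credit quota are illustrative assumptions, and resource monitors require ACCOUNTADMIN.
-- Sketch: tag a warehouse for cost attribution and cap its monthly credits.
CREATE TAG IF NOT EXISTS core.cost_center;
ALTER WAREHOUSE transform_wh SET TAG core.cost_center = 'data_platform';

CREATE OR REPLACE RESOURCE MONITOR transform_wh_monitor
WITH CREDIT_QUOTA = 500
FREQUENCY = MONTHLY
START_TIMESTAMP = IMMEDIATELY
TRIGGERS
ON 80 PERCENT DO NOTIFY
ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE transform_wh SET RESOURCE_MONITOR = transform_wh_monitor;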
Observability
- Use Snowflake Usage Dashboard + DataDog integration.
- Store daily exports of `QUERY_HISTORY`, `METERING_DAILY_HISTORY` in `raw.monitoring` schema.
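A sketch of a daily credit-trend query over `METERING_DAILY_HISTORY` that can feed the usage dashboard or the exports above.
-- Sketch: daily credit consumption trend for the usage dashboard.
SELECT
    usage_date,
    SUM(credits_billed) AS credits_billed
FROM snowflake.account_usage.metering_daily_history
GROUP BY usage_date
ORDER BY usage_date DESC;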
Common Pitfalls & Mitigations
- Pitfall: Snowballing storage from transient tables. Mitigation: mark transient tables with `DATA_RETENTION_TIME_IN_DAYS = 0`.
- Pitfall: Cached results masking true query performance during benchmarks. Mitigation: `ALTER SESSION SET USE_CACHED_RESULT = FALSE` during benchmarking only.
- Pitfall: Forgotten warehouses running over weekend. Mitigation: `AUTO_SUSPEND = 300` and weekly audit script.
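A sketch of the weekly audit mentioned in the last pitfall, using `SHOW WAREHOUSES` plus `RESULT_SCAN`; the 300-second threshold follows the rule above.
-- Sketch: flag warehouses that never auto-suspend or suspend too slowly.
SHOW WAREHOUSES;
SELECT
    "name",
    "size",
    "auto_suspend",
    "auto_resume"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE "auto_suspend" IS NULL
   OR "auto_suspend" = 0
   OR "auto_suspend" > 300;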
Example End-to-End Flow (Pseudo-Code)
1. Terraform applies DB + schemas + roles.
2. Ingestion job lands CSV to `@raw.external_stage`.
3. dbt model `stg_orders` parses and loads to `stg.orders` (incremental).
4. dbt model `core.fct_orders` merges into fact table using surrogate keys.
5. Materialized view `pres.orders_summary_mv` aggregates daily metrics.
6. Looker connects to `pres` schema with `REPORTING_RO` role.
7. Resource monitor checks spend; alert to Slack.
Follow these rules to deliver secure, high-performance, and cost-efficient Snowflake data warehouse solutions ready for enterprise scale.