Comprehensive Rules for designing, implementing and operating real-time analytics data pipelines and dashboards on a modern cloud stack.
Stop wrestling with fragmented streaming architectures and monitoring blind spots. These Cursor Rules deliver a battle-tested blueprint for building production-grade real-time analytics platforms that consistently hit <1s end-to-end latency targets.
You've experienced the frustration: your streaming pipeline works perfectly in testing, then crashes under production load. Your team spends weeks debugging exactly-once semantics failures while business stakeholders question why critical metrics are still 5 minutes behind. Sound familiar?
Building reliable real-time analytics isn't just about choosing the right tools—it's about implementing proven patterns that prevent the cascading failures, state explosions, and performance bottlenecks that plague most streaming systems.
Common Pain Points These Rules Solve:
These rules provide a comprehensive development framework that transforms how you build streaming data platforms. Rather than learning each technology in isolation, you get proven integration patterns that work across Apache Kafka, Flink, Spark, and cloud streaming services.
What You Get:
Before: 3-4 weeks to set up a new streaming pipeline with proper monitoring and error handling
After: 2-3 days using pre-configured templates and proven patterns
Configure auto-scaling triggers based on consumer lag and CPU metrics. The rules include specific HPA configurations that prevent both under-provisioning (causing latency spikes) and over-provisioning (wasting resources).
Built-in schema registry patterns with backward compatibility enforcement mean you can evolve event structures without breaking existing consumers—eliminating those emergency weekend deployments.
Structured error handling with context-aware exception wrapping means fewer mystery failures and faster root cause identification when issues do occur.
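As a rough illustration, context-aware wrapping can be as small as the sketch below; the StreamProcessingError name matches the examples that follow, while its fields are an assumption rather than a fixed API:

# Hypothetical sketch of a context-aware exception wrapper; the fields are
# illustrative, not a fixed API.
from typing import Any


class StreamProcessingError(Exception):
    """Raised when an event cannot be processed; carries context for debugging."""

    def __init__(self, message: str, event: Any, stage: str = "unknown") -> None:
        super().__init__(message)
        self.event = event   # the offending event, so it can be routed to a DLQ
        self.stage = stage   # pipeline stage where the failure occurred

    def __str__(self) -> str:
        return f"{super().__str__()} (stage={self.stage}, event={self.event!r})"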
The Challenge: Process millions of user events per second with <500ms latency for fraud detection
Implementation with Rules:
# Automatic schema validation and routing
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ClickEvent:
    user_id: str
    event_time: datetime
    action: str
    session_id: str


def validate_and_route(event: ClickEvent) -> ClickEvent:
    if not event.user_id or not event.session_id:
        # Auto-routes to analytics.clickstream.invalid.v1 DLQ
        raise StreamProcessingError("Missing required fields", event)
    return event
The rules automatically configure:
Versioned topic naming (analytics.clickstream.created.v1) and automatic DLQ routing for invalid events.

The Challenge: Build executive dashboards showing business KPIs updated every 5 seconds
Implementation with Rules:
# Pre-configured Spark Structured Streaming
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType, StringType, StructField, StructType, TimestampType

transaction_schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("product_category", StringType()),
    StructField("amount", DecimalType(10, 2)),
])

revenue_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", kafka_config.brokers)
    .option("subscribe", "sales.transaction.completed.v1")
    .load()
    # Deserialize the Kafka value payload into typed columns before aggregating
    .select(F.from_json(F.col("value").cast("string"), transaction_schema).alias("event"))
    .select("event.*")
    .withWatermark("event_time", "30 seconds")
    .groupBy(F.window("event_time", "5 seconds"), "product_category")
    .agg(F.sum("amount").alias("revenue"))
    .writeStream
    .outputMode("update")
    .trigger(processingTime="5 seconds")
    .foreachBatch(write_to_clickhouse)
    .start()
)
Automatic integration includes:
The Challenge: Integrate AWS Kinesis data with Azure-hosted analytics infrastructure
Implementation with Rules:
# Terraform configuration auto-generated
resource "aws_kinesis_stream" "analytics_stream" {
  name            = "analytics-orders-ingestion"
  shard_count     = var.shard_count
  encryption_type = "KMS"
  kms_key_id      = aws_kms_key.analytics.arn
}
The rules provide:
# Clone the analytics platform template
git clone https://github.com/your-org/realtime-analytics-template
cd realtime-analytics-template
# Run one-click development environment
make dev-setup # Starts Kafka, Flink, ClickHouse via docker-compose
# src/analytics_platform/config.py
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    kafka_brokers: str = "localhost:9092"
    checkpoint_interval: int = 60  # seconds
    max_lateness: int = 300        # seconds
    parallelism: int = 4
# .github/workflows/deploy.yml
- name: Deploy Pipeline
  run: |
    helm upgrade analytics-pipeline charts/flink-pipeline \
      --set image.tag=${{ github.sha }} \
      --set resources.requests.memory=2Gi
Access pre-configured Grafana dashboards at http://localhost:3000/d/realtime-analytics to monitor:
"We went from 3-week pipeline development cycles to 3-day iterations. The observability patterns alone saved us dozens of hours of debugging." - Senior Data Engineer, FinTech Startup
"Our executive team now trusts real-time metrics because we eliminated the random 10-minute delays that plagued our old system." - VP of Engineering, E-commerce Platform
These Cursor Rules aren't just configuration—they're your blueprint for building analytics platforms that scale with your business and keep your team focused on delivering insights instead of fighting infrastructure.
Ready to transform your real-time analytics development? Implement these rules and experience the difference production-grade patterns make in your streaming architecture.
You are an expert in Apache Kafka, Apache Flink, Apache Spark Structured Streaming, AWS Kinesis, Azure Stream Analytics, ClickHouse, Python (PyFlink/PySpark), SQL, Kubernetes, Docker, Terraform, and Grafana.
Key Principles
- Start from clearly defined, quantifiable business objectives and KPIs; every stream has a consumer with a decision-making purpose.
- Streaming-first architecture: design pipelines that treat data as unbounded, handle batch as a low-priority special case.
- Aim for <1 s end-to-end latency. Budget latency per stage (ingest ≤ 100 ms, processing ≤ 300 ms, persistence ≤ 200 ms, visualization ≤ 300 ms).
- Exactly-once semantics wherever possible; fall back to at-least-once with idempotent sinks.
- Schemas are contracts: use schema registry, enforce backward compatibility, include event version field.
- Prefer functional, immutable code. Make every operator deterministic and side-effect-free.
- Observability is non-negotiable: emit structured logs, metrics and distributed traces for every service.
- Infrastructure-as-Code; reproducible environments via Docker & Terraform; one-click dev setup.
- Security by default: TLS everywhere, mTLS inside the cluster, sealed secrets with automated rotation.
- Build small, independently deployable services; deploy via GitOps.
Python (PyFlink / PySpark)
- Use Python 3.11+. Always add type hints & mypy strict mode.
- Package layout: src/<service_name>/{__init__.py, pipelines/, transforms/, utils/, config.py, types.py}
- Use dataclasses or pydantic models for immutable event schemas when preprocessing outside the JVM.
- All UDFs must be pure; avoid external network calls inside UDFs.
- Follow PEP 8; line length ≤ 100 chars; use black & isort in CI.
- Use logging.getLogger(__name__) with structlog JSON renderer.
- Prefer SQL API or Table API over low-level DataStream API for readability.
- Never call collect() on an unbounded stream; use take(n) in tests only.
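A minimal sketch of the logging rule above, pairing logging.getLogger with structlog's JSON renderer; the processor chain is an assumption to adapt:

# Sketch: structlog configured to emit JSON records through the stdlib logger.
import logging

import structlog

logging.basicConfig(level=logging.INFO, format="%(message)s")

structlog.configure(
    processors=[
        structlog.processors.add_log_level,       # include the level in each record
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),      # render the final record as JSON
    ],
    logger_factory=structlog.stdlib.LoggerFactory(),
)

logger = structlog.get_logger(__name__)
logger.info("pipeline_started", pipeline="clickstream", parallelism=4)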
Error Handling and Validation
- Validate input schema and mandatory fields at the very first operator; route bad events to a <topic>.DLQ.
- Use try/except only around code that can fail; wrap in custom StreamProcessingError with context.
- Early returns > nested conditionals; keep the happy path last.
- Implement global exception handler that increments prometheus counter error_total{stage="<stage>"} and sends PagerDuty alert if error_rate > 0.01% over 5 min.
- Use circuit breakers/timeouts on external sinks (REST, DB).
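As a sketch of the counter rule above, reusing the StreamProcessingError wrapper from earlier; the alert itself would live in Prometheus/Alertmanager rather than in application code:

# Sketch: count errors per stage so Prometheus alert rules can fire on error_rate.
from typing import Any, Callable

from prometheus_client import Counter

ERRORS_TOTAL = Counter("error_total", "Stream processing errors", ["stage"])


def process_with_guard(event: Any, transform: Callable[[Any], Any], stage: str) -> Any:
    """Run one transform; on failure, count the error and re-raise with context."""
    try:
        return transform(event)
    except Exception as exc:
        ERRORS_TOTAL.labels(stage=stage).inc()
        raise StreamProcessingError(str(exc), event, stage=stage) from exc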
Apache Kafka Rules
- Topic naming: <domain>.<entity>.<action>.<version> (e.g., analytics.order.created.v1).
- Default partitions = (#consuming nodes × 2); replication.factor = 3.
- Enable log.message.format.version = latest; min.insync.replicas ≥ 2.
- Producers: enable.idempotence=true, acks=all, linger.ms=5, compression.type=zstd.
- Consumers: use cooperative-sticky assignor; commit offsets only after sink success.
- Store schemas in Confluent Schema Registry; require compatibility=BACKWARD.
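A hedged sketch of the producer settings above using confluent-kafka-python; broker address, topic, and payload are placeholders:

# Sketch: idempotent, fully-acked producer with zstd compression.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # placeholder
    "enable.idempotence": True,
    "acks": "all",
    "linger.ms": 5,
    "compression.type": "zstd",
})

producer.produce(
    topic="analytics.order.created.v1",
    key="order-123",
    value=b'{"order_id": "order-123", "amount": 42.50}',
)
producer.flush()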
Apache Flink Rules
- Always use event-time + watermarks; allowLateness ≤ 5 min unless SLA dictates more.
- Checkpoint interval ≤ 60 s; externalized checkpoints on; exactlyOnce.
- Restart strategy: failure-rate (maxFailuresPerInterval=3, failureRateInterval=5 min, delay=30 s).
- Use keyed state; set TTL to 2× maximum window size to avoid leaks.
- Prefer RocksDB state backend for > 5 GB state; enable incremental checkpoints.
- Use Table/SQL API for windowed aggregations; convert to DataStream only for custom joins.
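Those Flink defaults expressed in the PyFlink Table API, as a sketch; the connector properties, table, and topic names are placeholders, and the config keys should be checked against your Flink version:

# Sketch: event-time watermark plus exactly-once checkpointing via the Table API.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
config = t_env.get_config().get_configuration()
config.set_string("execution.checkpointing.interval", "60 s")
config.set_string("execution.checkpointing.mode", "EXACTLY_ONCE")

t_env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        amount DECIMAL(10, 2),
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'analytics.order.created.v1',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'analytics-orders',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")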
Spark Structured Streaming Rules
- Output modes: use append for pure inserts, update for aggregates, avoid complete.
- Trigger: processingTime = "5 seconds" for near-realtime; leverage continuous processing where supported.
- CheckpointLocation in encrypted S3/GCS; enable the RocksDB state store via spark.sql.streaming.stateStore.providerClass (see the sketch after this list).
- Watermark: eventTimeCol, delayThreshold = expected_max_lateness.
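A sketch of the state-store setting above; the provider class name is the one documented for Spark 3.2+, and the checkpointLocation is supplied per query on writeStream:

# Sketch: Spark session with the RocksDB state store provider enabled.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("realtime-analytics")
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    )
    .getOrCreate()
)

# Each streaming query then points its checkpoint at encrypted object storage, e.g.
# .option("checkpointLocation", "s3a://analytics-checkpoints/revenue/")  # placeholder bucket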
AWS Kinesis / Azure Stream Analytics
- Shard naming: analytics-<source>-<stage>.
- Enable enhanced fan-out; batch writes with PutRecords in ≥ 500 KB batches.
- Use Lambda / Azure Functions for lightweight filtering before hitting Flink/Spark.
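On the ingestion side, a hedged boto3 sketch of batched writes; the region, stream name, and payloads are placeholders, and enhanced fan-out is configured per consumer rather than shown here:

# Sketch: batch PutRecords calls instead of one put_record per event.
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # placeholder region

records = [
    {
        "Data": json.dumps({"order_id": i, "amount": 10.0 * i}).encode("utf-8"),
        "PartitionKey": f"order-{i}",
    }
    for i in range(100)
]

response = kinesis.put_records(
    StreamName="analytics-orders-ingestion",  # matches the Terraform stream above
    Records=records,
)
print("failed records:", response["FailedRecordCount"])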
Data Modeling & Storage (ClickHouse / SingleStore)
- Use ReplacingMergeTree with ORDER BY (event_time, id) and an explicit PARTITION BY clause to speed up deduplication.
- Partition on toYYYYMMDD(event_time); order by event_time, dimension_id.
- Keep column types minimal (LowCardinality, Decimal64) and compress with CODEC(ZSTD).
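A sketch of such a table issued through clickhouse-driver; database, table, and column names are illustrative:

# Sketch: deduplicating, date-partitioned ClickHouse table per the rules above.
from clickhouse_driver import Client

client = Client(host="localhost")  # placeholder host

client.execute("CREATE DATABASE IF NOT EXISTS analytics")
client.execute("""
    CREATE TABLE IF NOT EXISTS analytics.revenue_events
    (
        event_time   DateTime64(3)           CODEC(ZSTD),
        id           String                  CODEC(ZSTD),
        dimension_id LowCardinality(String),
        amount       Decimal64(2)            CODEC(ZSTD)
    )
    ENGINE = ReplacingMergeTree
    PARTITION BY toYYYYMMDD(event_time)
    ORDER BY (event_time, id)
""")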
Testing
- Unit: pytest + hypothesis property-based tests for every transformation.
- Integration: testcontainers-kafka/flink; run docker-compose up -d in CI.
- Performance: Gatling/Locust to load test 2× expected peak throughput.
- Data Quality: great_expectations suites executed against staging ClickHouse nightly.
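To illustrate the property-based style, a minimal pytest + hypothesis test; the transformation under test is a made-up example, not part of the rules:

# Sketch: property-based test for a pure transformation with pytest + hypothesis.
from hypothesis import given, strategies as st


def normalize_amount_cents(amount_cents: int) -> float:
    """Example pure transform: convert integer cents into a dollar amount."""
    return round(amount_cents / 100, 2)


@given(st.integers(min_value=0, max_value=10**9))
def test_normalize_amount_is_non_negative_and_reversible(amount_cents: int) -> None:
    dollars = normalize_amount_cents(amount_cents)
    assert dollars >= 0
    assert int(round(dollars * 100)) == amount_cents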
Performance Patterns
- Enable Kafka zstd compression; Flink taskmanager.network.memory.floating-buffers-per-gate ≥ 16.
- Back-pressure monitoring: use Flink backPressureStats & Spark metrics; auto-scale with Kubernetes HPA on lag & CPU.
- Co-locate compute in same AZ as brokers to reduce RTT.
Security
- Encrypt data-in-transit with TLS 1.3; rotate certificates every 90 days via cert-manager.
- Secrets in Kubernetes sealed-secrets; no secrets in env vars.
- RBAC: least privilege IAM roles per pipeline stage; enable fine-grained ACLs on Kafka topics.
- PII handling: tokenize sensitive fields at the producer side, store key mapping in an HSM.
Observability
- Metrics: export Prometheus format; include per-topic bytes_in_total, consumer_lag, flink_checkpoint_duration.
- Logs: structured JSON; 30-day retention; error events forwarded to SIEM.
- Traces: OpenTelemetry auto-instrumentation; sample 0.1% of throughput.
- Dashboards: Grafana folder “Realtime” with pre-built panels: E2E latency, throughput heatmap, error rate.
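A hedged sketch of the 0.1% sampling rule with the OpenTelemetry Python SDK; exporter wiring and auto-instrumentation are omitted:

# Sketch: probability sampler at 0.1% of traces; exporters and instrumentation
# are attached separately (e.g., via auto-instrumentation).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.001)))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("analytics.pipeline")
with tracer.start_as_current_span("validate_event"):
    pass  # processing work happens here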
CI/CD
- GitHub Actions → build → pytest → mypy → docker build → trivy scan → Helm chart → ArgoCD sync.
- Canary deploy with 10% traffic mirroring; promote after SLA stable for 30 min.
Directory Conventions
.
├── infra/ # Terraform, Helm charts
├── services/
│ └── click-stream/
│ ├── Dockerfile
│ ├── charts/
│ └── src/
│ ├── pipelines/
│ ├── transforms/
│ └── utils/
└── tests/
Common Pitfalls & Guards
- Avoid long-running UDFs → introduce async I/O pattern.
- Do not join unkeyed streams; repartition first.
- Never set Kafka auto.offset.reset=earliest in production consumers; use explicit seeks for replays.
- Keep state size in check; monitor RocksDB SST file count.