Opinionated rules for designing, implementing, testing, and evolving high-performance, schema-driven data-serialization layers.
Tired of debugging corrupted payloads at 3 AM? Fed up with breaking changes that cascade through your entire microservice mesh? Your data serialization layer shouldn't be the bottleneck that kills your system's performance or the fragile link that breaks with every schema change.
Most Python developers treat serialization as an afterthought: slap JSON everywhere and hope for the best. In production, the real problem is schema chaos and format fragmentation. Every service speaks a different data dialect, compatibility is a prayer, and performance tuning is manual guesswork.
These Cursor Rules transform your serialization layer into a high-performance, schema-driven foundation that handles evolution gracefully and fails fast when things go wrong.
What you get:
```python
# Before: manual JSON handling, no validation
import json

def process_user_data(json_str: str):
    data = json.loads(json_str)   # Hope it's valid
    return UserModel(**data)      # Runtime explosion waiting to happen
```

```python
# After: schema-driven with automatic validation
from dataclasses import dataclass
from datetime import datetime

@dataclass
class UserProfile:
    user_id: int
    email: str
    created_at: datetime

    @classmethod
    def from_protobuf(cls, pb_data: bytes) -> "UserProfile":
        # Generated converter with schema validation
        return proto_to_model(pb_data, cls)
```
Before: Your user service changes a field name, breaking 8 downstream services over the weekend.
After: Schema evolution tests catch the breaking change in CI. Your Protocol Buffer definition uses field numbers, allowing safe renames:
```protobuf
message UserProfile {
  int64 user_id = 1;
  string email_address = 2;    // Renamed from 'email'
  int64 created_timestamp = 3;
  reserved 4 to 10;            // Reserved for future fields
}
```
Before: JSON serialization adds 200ms latency to your market data pipeline, missing profitable trades.
After: MessagePack reduces payload size by 40% and serialization time by 70%:
```python
# Automatic format selection based on use case
from datetime import datetime
from decimal import Decimal

@serialize_with(format='msgpack', compress_threshold=1024)
class MarketTick:
    symbol: str
    price: Decimal
    volume: int
    timestamp: datetime
```
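The `@serialize_with` decorator above is project shorthand, not a public library API. A minimal sketch of the idea, using the real `msgpack` package and stdlib `gzip` (the decorator name and its behaviour are assumptions), could look like this:

```python
# Hypothetical sketch: attach a MessagePack serializer to a dataclass and
# gzip-compress payloads once they cross a size threshold.
import gzip
from dataclasses import asdict, dataclass, is_dataclass

import msgpack


def serialize_with(format: str = "msgpack", compress_threshold: int = 1024):
    def wrap(cls):
        if not is_dataclass(cls):
            cls = dataclass(cls)

        def to_bytes(self) -> bytes:
            # default=str covers Decimal/datetime; a real codec would use ext types
            payload = msgpack.packb(asdict(self), default=str)
            if len(payload) > compress_threshold:
                payload = gzip.compress(payload)
            return payload

        cls.to_bytes = to_bytes
        return cls
    return wrap
```

With something like this, `MarketTick(...).to_bytes()` returns a MessagePack payload, gzipped once it crosses the threshold; how compression is signalled to the reader is left out of the sketch.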
Before: Your mobile app struggles with 50KB JSON responses on slow networks.
After: CBOR encoding cuts payload size to 15KB with the same data:
```python
# Schema-enforced CBOR with automatic compression
from typing import List

def serialize_feed_data(posts: List[Post]) -> bytes:
    return cbor_encode(
        posts,
        schema=PostFeedSchema,
        canonical=True,  # Consistent ordering for caching
        compress=True,   # Automatic compression for large payloads
    )
```
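`cbor_encode` is likewise a project helper rather than part of `cbor2`. Using the real `cbor2` API plus stdlib `zlib`, a hedged version of the same idea (schema validation omitted) might be:

```python
# Sketch using the cbor2 package directly; PostFeedSchema-style validation
# would happen before encoding and is omitted here.
import zlib

import cbor2

COMPRESS_THRESHOLD = 1024  # bytes; tune for your payloads


def serialize_feed_data(posts: list[dict]) -> bytes:
    # canonical=True gives deterministic key ordering, useful for caching/signing
    payload = cbor2.dumps(posts, canonical=True)
    if len(payload) > COMPRESS_THRESHOLD:
        payload = zlib.compress(payload)
    return payload
```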
Eliminate Schema Drift: Your schemas live in Git, not scattered across documentation. Code generation ensures perfect synchronization between producers and consumers.
Catch Breaking Changes Early: Automated compatibility tests run against historical schemas on every commit. No more production surprises.
Optimize Performance Automatically: Rules automatically choose binary formats for high-throughput paths and reserve JSON for human-readable configs.
Debug with Confidence: Structured error handling maps serialization failures to specific schema violations with actionable error messages.
```text
# Project structure that scales
src/
  your_project/
    schemas/     # *.proto, *.avsc, *.cddl files
    generated/   # Auto-generated code (never edit)
    models/      # Pydantic models mirroring schemas
    converters/  # Schema ↔ model transformations
```
```protobuf
// schemas/user.proto
syntax = "proto3";

message User {
  int64 id = 1;
  string email = 2;
  optional string display_name = 3;  // Forward compatibility
  int64 created_at = 4;
}
```
```python
# models/user.py - Generated automatically
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # Immutable by default
class User:
    id: int
    email: str
    display_name: Optional[str]
    created_at: int

    def to_protobuf(self) -> bytes:
        # Generated converter with validation
        return serialize_to_protobuf(self, UserProto)
```
```python
# tests/test_compatibility.py
def test_backward_compatibility():
    """Ensure old clients can read new data."""
    new_user = User(id=1, email="test@example.com", display_name="Test", created_at=0)
    old_schema_data = serialize_with_old_schema(new_user)

    # Should deserialize without errors
    result = deserialize_with_current_schema(old_schema_data)
    assert result.id == 1
```
```python
# Automatic format selection and benchmarking
@benchmark_serialization(target_p99_latency_us=200)
def process_high_frequency_data(data: MarketData) -> bytes:
    # Automatically uses MessagePack for this use case
    return serialize(data, format='auto')
```
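`@benchmark_serialization` is also illustrative. A rough sketch that records call latencies and asserts a p99 budget (all names here are hypothetical; a real harness would use `pytest-benchmark`) could be:

```python
# Hypothetical sketch: time a serialization function and fail if its observed
# p99 latency exceeds the target budget.
import functools
import statistics
import time


def benchmark_serialization(target_p99_latency_us: float, samples: int = 100):
    def wrap(fn):
        timings_us: list[float] = []

        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter_ns()
            result = fn(*args, **kwargs)
            timings_us.append((time.perf_counter_ns() - start) / 1_000)
            if len(timings_us) >= samples:
                p99 = statistics.quantiles(timings_us, n=100)[98]
                assert p99 <= target_p99_latency_us, f"p99 {p99:.1f}µs over budget"
                timings_us.clear()
            return result

        return inner
    return wrap
```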
Week 1: Your serialization errors become actionable with schema validation. No more guessing why deserialization failed.
Week 2: Binary format adoption cuts your API response times by 40-60% for large payloads.
Month 1: Schema evolution becomes painless. You're shipping breaking changes safely with automatic compatibility validation.
Month 3: Your serialization layer handles millions of requests/hour without performance degradation. Your team focuses on business logic, not data format debugging.
Copy these rules into your .cursorrules file and transform your next schema change into a smooth, validated deployment instead of a weekend emergency.
Your future self—and your on-call schedule—will thank you.
```bash
# Install and start using immediately
curl -o .cursorrules https://your-cursor-rules-source.com/data-serialization
cursor .  # Your serialization problems are now solved
```
You are an expert in high-performance, schema-driven Data Serialization across Python, Protocol Buffers, Apache Avro, MessagePack, CBOR, JSON, Rust Serde, Swift Codable, and Clojure Nippy.
Key Principles
- Make the schema the single source of truth; code is generated from it, never the reverse.
- Design for forward + backward compatibility; always plan for schema evolution.
- Prefer compact, binary formats (Protobuf, Avro, MessagePack, CBOR) in latency-sensitive or bandwidth-constrained paths; reserve JSON/YAML only for human-facing configs & logs.
- Separate transport from serialization; your API layer should not leak internal wire formats.
- Fail fast: validate data against the schema at the edges and return explicit errors early (see the sketch after this list).
- Automate: regenerate code, run compatibility tests, and benchmark payload sizes on every CI run.
- Encrypt and compress by default on untrusted networks or when persisting at rest.
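A minimal fail-fast edge validation in pydantic v2 (model and field names are illustrative):

```python
# Validate at the service boundary and convert library errors into an explicit
# failure instead of letting malformed data flow downstream.
from pydantic import BaseModel, ValidationError


class UserIn(BaseModel):
    user_id: int
    email: str


def parse_user(raw: bytes) -> UserIn:
    try:
        # model_validate_json parses and validates in one step (pydantic v2)
        return UserIn.model_validate_json(raw)
    except ValidationError as exc:
        # Fail fast with an explicit, typed error at the edge
        raise ValueError(f"invalid user payload: {exc.error_count()} schema violations") from exc
```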
Python
- Use `dataclasses` + `pydantic` (v2) models with type hints (`from __future__ import annotations`) as the canonical in-memory representation.
- Canonical field order is lexicographical; avoid relying on dict insertion order.
- Never mutate deserialized objects directly; use pydantic's `model_copy(update=...)` or `dataclasses.replace()` to preserve immutability.
- Use `mypy --strict` and `ruff` to enforce typing & style; treat all warnings as errors in CI.
- Package layout:
  src/
    project_name/
      schemas/      # *.proto, *.avsc, *.msgpack, *.cddl
      generated/    # auto-generated code, do not edit
      models/       # pydantic/dataclass mirrors of schema
      converters/   # helpers: model ↔ wire format
  tests/
- Use `mmap` + `memoryview` for zero-copy where possible when working with large byte buffers.
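A minimal zero-copy read with `mmap` + `memoryview`; the length-prefixed layout is an assumption for the sketch:

```python
# Slice a length-prefixed record out of a large file without copying the bytes.
import mmap
import struct


def read_first_record(path: str) -> memoryview:
    fh = open(path, "rb")
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    fh.close()  # the read-only mapping stays valid after the file is closed
    view = memoryview(mm)
    # Assumed layout: 4-byte big-endian length prefix followed by the record body
    (length,) = struct.unpack_from(">I", view, 0)
    return view[4:4 + length]  # a slice of the view, no bytes copied
```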
Error Handling and Validation
- Validate incoming bytes with checksum (e.g., CRC32C) before deserialization.
- Guard every deserialization call with a try/except that maps library-specific exceptions to custom `SerializationError`, `SchemaMismatchError`, and `DataIntegrityError` (see the sketch after this list).
- Reject unknown required fields immediately; ignore unknown optional fields to preserve forward compatibility.
- Version fields explicitly: `version` (integer) as the very first field for JSON; rely on proto/avro internal schema ids otherwise.
- Log the hex digest of the first 64 bytes on fatal errors, never the raw payload.
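Sketch of the guard described above, using stdlib `zlib.crc32` as a stand-in for CRC32C (which needs a third-party package such as `google-crc32c`) and `msgpack` as the example codec:

```python
# Check integrity first, then map library-specific failures onto project errors.
import hashlib
import zlib

import msgpack


class SerializationError(Exception): ...
class SchemaMismatchError(SerializationError): ...
class DataIntegrityError(SerializationError): ...


def safe_unpack(payload: bytes, expected_crc: int) -> dict:
    if zlib.crc32(payload) != expected_crc:
        # Log only the digest of the payload head, never the raw bytes
        digest = hashlib.sha256(payload[:64]).hexdigest()
        raise DataIntegrityError(f"checksum mismatch, head digest {digest}")
    try:
        return msgpack.unpackb(payload)
    except (msgpack.exceptions.ExtraData, msgpack.exceptions.UnpackException, ValueError) as exc:
        raise SerializationError("malformed MessagePack payload") from exc
```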
Protocol Buffers (proto3)
- Always compile with `--python_out=generated/ --mypy_out=generated/` (the `--mypy_out` flag comes from the `mypy-protobuf` plugin).
- Use `optional` instead of `singular` to allow field presence tracking.
- Reserve field numbers you delete to avoid accidental reuse: `reserved 4 to 10;`.
- Keep field numbers in the 1–15 range for frequently transmitted fields; those tags encode in a single byte (varint optimisation).
- Prefer `bytes` over `string` for opaque blobs; base64 only at API boundaries.
- Service definitions must include explicit deadlines in comments; enforce via gRPC deadlines.
Apache Avro
- Store `.avsc` in Git; never embed inline JSON strings in code.
- Use `logicalType` for dates/timestamps to avoid epoch/zone drift (see the example after this list).
- Control schema evolution with compatibility type `BACKWARD_TRANSITIVE` in Schema Registry.
- Bump the `schema_id` only after CI compatibility tests pass against all historical schemas in `schemas/history/`.
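Example of exercising a `timestamp-millis` logical type from Python with the `fastavro` package (record shape is illustrative):

```python
# Serialize a record whose created_at field uses the timestamp-millis logical
# type; fastavro converts datetime objects for logicalType fields.
import io
from datetime import datetime, timezone

import fastavro

SCHEMA = fastavro.parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    ],
})


def encode_user(user_id: int) -> bytes:
    buf = io.BytesIO()
    record = {"id": user_id, "created_at": datetime.now(timezone.utc)}
    fastavro.schemaless_writer(buf, SCHEMA, record)
    return buf.getvalue()
```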
MessagePack
- Encode maps with string keys for readability in debugging tools.
- Enforce maximum payload size via `msgpack.Unpacker(max_buffer_size=...)`.
- Use ext types for custom objects: application-defined type codes are 0–127; negative codes (−1 to −128) are reserved by the MessagePack spec (e.g., −1 for timestamps).
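Short ext-type round trip with the real `msgpack` API; type code 1 is an arbitrary application choice:

```python
# Round-trip a Decimal through an application-defined ext type (code 1).
from decimal import Decimal

import msgpack

DECIMAL_EXT = 1  # application-defined codes live in 0-127


def default(obj):
    if isinstance(obj, Decimal):
        return msgpack.ExtType(DECIMAL_EXT, str(obj).encode())
    raise TypeError(f"cannot serialize {type(obj)!r}")


def ext_hook(code, data):
    if code == DECIMAL_EXT:
        return Decimal(data.decode())
    return msgpack.ExtType(code, data)


packed = msgpack.packb({"price": Decimal("19.99")}, default=default)
unpacked = msgpack.unpackb(packed, ext_hook=ext_hook)  # {'price': Decimal('19.99')}
```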
CBOR
- Use CDDL to define schemas; validate with `cbor-tool validate` during CI.
- Enable canonical ordering (`cbor2.dumps(..., canonical=True)`) when producing signatures.
JSON
- Only for human-readable configs & ad-hoc integrations.
- Enforce schemas with `jsonschema` and `additionalProperties: false` (see the sketch after this list).
- Always gzip payloads larger than 1 KB.
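Minimal `jsonschema` check with `additionalProperties: false` (schema content is illustrative):

```python
# Reject config documents that carry unknown keys instead of silently accepting them.
import jsonschema

CONFIG_SCHEMA = {
    "type": "object",
    "properties": {
        "endpoint": {"type": "string"},
        "timeout_s": {"type": "number"},
    },
    "required": ["endpoint"],
    "additionalProperties": False,
}

jsonschema.validate({"endpoint": "https://api.internal", "timeout_s": 2.5}, CONFIG_SCHEMA)
# jsonschema.validate({"endpoint": "x", "retries": 3}, CONFIG_SCHEMA)  # -> ValidationError
```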
Serde (Rust)
- Derive `Serialize`, `Deserialize`, `PartialEq`, `Eq`, `Clone`, `Debug` consistently.
- Use `#[serde(with = "...")]` helpers such as `chrono::serde::ts_milliseconds` for timestamp fields.
Swift Codable
- Prefer `Codable` structs; mark breaking fields with `@available(*, deprecated)` instead of removal.
- Use `JSONDecoder().dataDecodingStrategy = .deferredToData` for binary blobs.
Nippy (Clojure)
- Compress large payloads with `:compressor :lz4`.
- Validate magic header bytes `0x71,0x75` before unpacking.
Testing
- Maintain golden samples in `tests/fixtures/`; compare byte-for-byte against generated output (see the sketch after this list).
- For every schema change run:
1. Backward compatibility tests (old readers ↔ new writers).
2. Forward compatibility tests (new readers ↔ old writers).
- Benchmark: `pytest-benchmark` (Python), `hyperfine` (Rust) with target p99 latency < 200 µs for 1 KB payload.
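Pytest-style sketch of the golden-sample and compatibility checks; fixture names and the helper functions are assumptions:

```python
# Golden samples pin the wire format; compatibility tests replay historical
# payloads through the current reader.
from pathlib import Path

import pytest

FIXTURES = Path("tests/fixtures")


def test_golden_sample_is_stable():
    expected = (FIXTURES / "user_v3.bin").read_bytes()
    assert serialize_current_user_fixture() == expected  # byte-for-byte


@pytest.mark.parametrize("sample", sorted(FIXTURES.glob("user_v*.bin")))
def test_backward_compatibility(sample: Path):
    # New reader must accept every historical payload without error
    assert deserialize_with_current_schema(sample.read_bytes()) is not None
```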
Performance
- Pool serializer instances (`protobuf::Arena`, `avro::DatumWriter`) to avoid allocations.
- Zero-copy I/O: write directly to sockets/`aiofiles` via memoryview.
- Profile with `perf` or `py-spy`; avoid reflection in hot paths.
- Compress only when the payload exceeds a size threshold (in bytes); tune the threshold via A/B testing (see the sketch below).
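Trivial threshold gate; the 1 KiB default is only a starting point to tune:

```python
# Compress only when the saving is likely to outweigh the CPU cost.
import zlib

COMPRESS_THRESHOLD_BYTES = 1024


def maybe_compress(payload: bytes) -> tuple[bytes, bool]:
    if len(payload) <= COMPRESS_THRESHOLD_BYTES:
        return payload, False
    return zlib.compress(payload, level=6), True
```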
Security
- Always combine TLS in transit with encryption at rest.
- Use authenticated encryption (AES-GCM) if the format’s built-in encryption is missing.
- Verify untrusted payloads against an allow-list of schema IDs.
- Cap decompression output (e.g., `zlib.decompressobj().decompress(data, max_length)`) to mitigate zip bombs.
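Sketch covering authenticated encryption and a bounded decompression step, using the `cryptography` package's AES-GCM and stdlib `zlib`; key management is out of scope:

```python
# Authenticated encryption for payloads at rest plus a bounded decompression step.
import os
import zlib

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

MAX_DECOMPRESSED = 10 * 1024 * 1024  # refuse anything that inflates past 10 MiB


def encrypt_payload(key: bytes, payload: bytes, schema_id: bytes) -> bytes:
    nonce = os.urandom(12)
    # The schema id rides along as authenticated associated data
    return nonce + AESGCM(key).encrypt(nonce, payload, schema_id)


def bounded_decompress(data: bytes) -> bytes:
    d = zlib.decompressobj()
    out = d.decompress(data, MAX_DECOMPRESSED)
    if d.unconsumed_tail:
        raise ValueError("decompressed payload exceeds limit (possible zip bomb)")
    return out
```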
Documentation
- Auto-publish rendered schemas to `/docs/schema/latest` via CI.
- Include migration guides per version in `docs/migrations/`.