Comprehensive Rules for designing, implementing, and maintaining high-quality data-augmentation pipelines across vision, NLP, and audio tasks.
You're tired of manually collecting thousands more training samples. Your models plateau because your dataset doesn't capture real-world variations. Class imbalances kill performance on edge cases you care about most.
Here's the reality: top-performing ML teams don't just collect more data—they systematically generate high-quality synthetic variations that expose their models to scenarios they'll encounter in production.
Your carefully curated dataset represents maybe 5% of the variations your model will see in production, and traditional collect-more-data approaches can't close that gap.
Meanwhile, you're burning through cloud compute budget training on the same static examples repeatedly, wondering why your F1 scores won't budge on underrepresented classes.
These Cursor Rules implement production-grade data augmentation systems that generate realistic training variations on-demand, directly on GPU, without exploding storage costs or introducing label noise.
What you get:
```python
# Before: 10K static images eating disk space
train_dataset = ImageFolder('train/', transform=basic_resize)

# After: Infinite variations generated on-demand
VISION_TRAIN_AUG = nn.Sequential(
    kornia.augmentation.RandomResizedCrop((224, 224), scale=(0.8, 1.0)),
    kornia.augmentation.ColorJitter(0.2, 0.2, 0.2, 0.1),
    kornia.augmentation.RandomErasing(p=0.2),
)
```
Real example: Medical imaging dataset with 90/10 class split sees minority class recall jump from 0.42 to 0.67 after targeted augmentation.
GPU-native transforms through Kornia eliminate CPU→GPU transfer bottlenecks that kill training throughput.
Built-in safeguards ensure rotated stop signs still look like stop signs, not yield signs.
Before: Manually collecting edge cases
```python
# Spending weeks collecting images of objects at weird angles
# Storage costs exploding with duplicate-ish samples
# Model fails on rotated/occluded objects in production
```
After: Systematic augmentation targeting failure modes
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DetectionAugCfg:
    rotation_range: tuple[float, float] = (-15, 15)
    occlusion_prob: float = 0.3
    lighting_variance: float = 0.2

# Generates realistic variations that stress-test object boundaries
DETECTION_AUG = create_detection_pipeline(DetectionAugCfg())
```
Before: Models fail on slightly different phrasings
```python
# Production text doesn't match training corpus style
# Brittle to typos, synonyms, sentence restructuring
# Manual data collection can't cover linguistic variations
```
After: Contextual augmentation preserving semantics
```python
from nlpaug.flow import Sequential
from nlpaug.augmenter.word import SynonymAug, ContextualWordEmbsAug

TEXT_AUG = Sequential([
    SynonymAug(aug_max=3, stopwords=['critical', 'domain', 'terms']),
    ContextualWordEmbsAug(model_path='bert-base-uncased', aug_p=0.1),
])
# Handles paraphrases and synonyms while preserving domain vocabulary
```
Before: Clean studio recordings vs. noisy production audio
```python
# Models trained on pristine audio fail with background noise
# Can't collect samples in every acoustic environment
# SNR variations kill transcription accuracy
```
After: Realistic noise injection with quality controls
```python
from audiomentations import Compose, AddBackgroundNoise, TimeStretch, PitchShift

AUDIO_AUG = Compose([
    AddBackgroundNoise(sounds_path="./noise_samples", min_snr_db=10),
    TimeStretch(min_rate=0.8, max_rate=1.2),
    PitchShift(min_semitones=-2, max_semitones=2),
])
# Maintains ≥ 10 dB SNR while exposing the model to real-world conditions
```
```bash
pip install torch torchvision kornia albumentations nlpaug textattack audiomentations
```
```python
import torch
import torch.nn as nn
import kornia.augmentation as K

# GPU-native augmentation: batches stay in CUDA memory
VISION_TRAIN_AUG = nn.Sequential(
    K.RandomResizedCrop((224, 224), scale=(0.8, 1.0)),
    K.RandomHorizontalFlip(p=0.5),
    K.ColorJitter(0.2, 0.2, 0.2, 0.1),
    K.RandomErasing(p=0.2, scale=(0.01, 0.05)),
)

# Compile for production deployment
compiled_aug = torch.jit.script(VISION_TRAIN_AUG)
torch.jit.save(compiled_aug, 'vision_aug_v1.pt')
```
```python
from collections import Counter
from torch.utils.data import WeightedRandomSampler

def get_weighted_sampler(dataset, target_column='label'):
    """Prioritize minority classes during augmentation."""
    class_counts = Counter(dataset[target_column])
    weights = {cls: 1.0 / count for cls, count in class_counts.items()}
    sample_weights = [weights[label] for label in dataset[target_column]]
    return WeightedRandomSampler(sample_weights, len(sample_weights))
```
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AugmentationConfig:
    seed: int = 42
    version: str = "flip_rotate_jitter_v2"
    magnitude: float = 0.2

def create_reproducible_pipeline(config: AugmentationConfig) -> nn.Module:
    torch.manual_seed(config.seed)  # pipeline creation with a fixed seed
    pipeline = nn.Sequential(
        K.RandomHorizontalFlip(p=0.5),
        K.ColorJitter(config.magnitude, config.magnitude, config.magnitude, 0.1),
    )
    return torch.jit.script(pipeline)
```
```python
import time

def benchmark_augmentation(dataloader, num_batches=100):
    """Ensure the <15% augmentation overhead target holds."""
    start_time = time.time()
    total_samples = 0
    for i, (inputs, _labels) in enumerate(dataloader):
        if i >= num_batches:
            break
        total_samples += len(inputs)  # counts samples loaded + augmented
    throughput = total_samples / (time.time() - start_time)
    print(f"Augmentation throughput: {throughput:.2f} samples/sec")
```
Your training pipeline becomes a force multiplier: every original sample generates dozens of realistic variations that stress-test your model's decision boundaries exactly where it needs strengthening.
Ready to transform your training data from a static liability into a dynamic asset that scales with your model's needs? These rules give you the production-grade augmentation infrastructure that top ML teams use to ship robust models faster.
You are an expert in Python, NumPy, Pandas, PyTorch, TorchVision, Kornia, Albumentations, TensorFlow tf.data, NLPAug, TextAttack, Audiomentations, and CUDA/TPU acceleration.
Key Principles
- Apply augmentations exclusively to the training split; never mutate validation or test data.
- Target class imbalance: prioritise minority classes when sampling for augmentation.
- Combine simple geometric/statistical transforms with domain-specific or learned policies (AutoAugment/RandAugment).
- Keep augmentations realistic; transformations must preserve label semantics.
- One transformation ≠ one dataset write-back: build on-the-fly, hardware-accelerated pipelines to avoid storage bloat.
- Reproducibility first: fix random seeds at pipeline entry, expose seed as hyper-parameter.
- Version every augmentation recipe; name with a verb-noun pattern (e.g., flip-rotate-color_jitter_v2), as in the sketch below.
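A minimal sketch of these principles; `Recipe` and `build_train_transforms` are illustrative names, not library APIs:
```python
from dataclasses import dataclass

import torch
import torchvision.transforms as T

@dataclass(frozen=True)
class Recipe:
    name: str = "flip-rotate-color_jitter_v2"  # versioned verb-noun name
    seed: int = 42                             # exposed as a hyper-parameter

def build_train_transforms(recipe: Recipe) -> T.Compose:
    torch.manual_seed(recipe.seed)  # fix randomness at pipeline entry
    return T.Compose([
        T.RandomHorizontalFlip(p=0.5),
        T.RandomRotation(degrees=15),
        T.ColorJitter(0.2, 0.2, 0.2),
        T.ToTensor(),
    ])

# Augment only the training split; val/test keep deterministic preprocessing.
train_tf = build_train_transforms(Recipe())
eval_tf = T.Compose([T.ToTensor()])
```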
Python
- Use type-annotated pure functions for transforms; signature: `Callable[[T], T]` where `T` is `np.ndarray`, `Tensor`, or `str` (see the sketch after this list).
- Group transforms in immutable tuples: `AUGMENT_VISION_BASIC: tuple[Callable[[Tensor], Tensor], ...]`.
- Compose pipelines with functional utilities (`functools.partial`, `torchvision.transforms.Compose`, `albumentations.Compose`).
- Use dataclass configs for parameter bundles:
```python
@dataclass(frozen=True)
class RandCropCfg:
    size: tuple[int, int] = (224, 224)
    scale: tuple[float, float] = (0.8, 1.0)
```
- Never create global state inside transforms; pass PRNG or torch.Generator explicitly.
- Persist random params in `sample["meta"]` when needed for paired tasks (e.g., image ↔ mask).
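A sketch of these conventions; `random_erase` here is a hand-rolled illustration (not Kornia's implementation) showing the explicit-generator pattern:
```python
from typing import Callable

import torch
from torch import Tensor

# Type alias for a pure tensor transform.
Transform = Callable[[Tensor], Tensor]

def random_erase(x: Tensor, gen: torch.Generator, p: float = 0.2) -> Tensor:
    """Zero out a random patch; all randomness comes from the caller's generator."""
    if torch.rand(1, generator=gen).item() >= p:
        return x
    h, w = x.shape[-2:]
    eh, ew = max(1, h // 8), max(1, w // 8)
    top = int(torch.randint(0, h - eh + 1, (1,), generator=gen))
    left = int(torch.randint(0, w - ew + 1, (1,), generator=gen))
    out = x.clone()  # pure function: never mutate the input in place
    out[..., top:top + eh, left:left + ew] = 0.0
    return out

gen = torch.Generator().manual_seed(42)  # PRNG passed explicitly, no globals
AUGMENT_VISION_BASIC: tuple[Transform, ...] = (
    lambda x: random_erase(x, gen),
)
```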
Error Handling and Validation
- Validate input dtype, shape, and label consistency at transform entry; raise `ValueError` early (see the sketch after this list).
- Clip numeric ranges after transforms (e.g., enforce 0–255 uint8 or 0–1 float32).
- Detect and log NaNs or infs immediately; abort batch not entire epoch.
- For NLP, ensure token indices stay within vocab size after augmentation; fallback to original token on failure.
- Guarantee paired data integrity (image+mask, audio+transcript) through shared random state.
- Provide privacy guardrails: strip/obfuscate PII tokens before text augmentation; hash sensitive metadata not needed for training.
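A minimal sketch of entry and exit validation for image tensors (function names are illustrative):
```python
import torch
from torch import Tensor

def validate_input(x: Tensor) -> Tensor:
    """Entry-point checks, run before any stochastic transform."""
    if x.dtype not in (torch.uint8, torch.float32):
        raise ValueError(f"unexpected dtype {x.dtype}")
    if x.ndim != 3:
        raise ValueError(f"expected a CHW image, got shape {tuple(x.shape)}")
    return x

def validate_output(x: Tensor) -> Tensor:
    """Post-transform checks: catch NaN/inf (abort the batch) and clip range."""
    if not torch.isfinite(x).all():
        raise ValueError("non-finite values after augmentation")
    return x.clamp(0.0, 1.0)  # enforce the 0-1 float32 range
```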
PyTorch / TorchVision / Kornia Rules
- Use `torchvision.transforms` for CPU, switch to `kornia.augmentation.*` for GPU/TPU to stay inside graph.
- Wrap augmentations in `nn.Sequential` to leverage TorchScript; for randomly applied subsets, use `kornia.augmentation.AugmentationSequential` with `random_apply=...` (see the sketch after this list).
- Cache compiled TorchScript modules; ship to production as `.pt` artefacts.
- For AutoAugment/RandAugment:
- Treat `num_ops` and `magnitude` as tunable hyper-parameters recorded in experiment tracker.
- Export learned policies to YAML for cross-project reuse.
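A sketch of the pattern; note that `random_apply` belongs to `kornia.augmentation.AugmentationSequential`, and whether a given Kornia op scripts cleanly depends on the Kornia version:
```python
import torch
import torch.nn as nn
import kornia.augmentation as K

# Randomly apply 2 of the 3 ops per batch.
TRAIN_AUG = K.AugmentationSequential(
    K.RandomHorizontalFlip(p=0.5),
    K.ColorJitter(0.2, 0.2, 0.2, 0.1),
    K.RandomErasing(p=0.2),
    random_apply=2,
)

# Cache a TorchScript artefact of the plain nn.Sequential variant.
scripted = torch.jit.script(nn.Sequential(K.RandomHorizontalFlip(p=0.5)))
torch.jit.save(scripted, "flip_aug_v1.pt")
```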
TensorFlow tf.data Rules
- Map deterministic preprocessing first, then `tf.image`/`tensorflow_addons` stochastic ops.
- Call `dataset = dataset.shuffle(..., seed=SEED, reshuffle_each_iteration=True)` before augmentation.
- Use `tf.data.experimental.prefetch_to_device('/gpu:0')` and `tf.data.AUTOTUNE` for parallelism (see the sketch below).
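A sketch of that ordering; the synthetic source dataset stands in for your real `(image, label)` pipeline:
```python
import tensorflow as tf

SEED = 42
AUTOTUNE = tf.data.AUTOTUNE

# Synthetic stand-in for your (image, label) source dataset.
raw_dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform((100, 64, 64, 3)), tf.zeros(100, tf.int32))
)

def preprocess(image, label):      # deterministic ops first
    return tf.image.resize(image, (224, 224)), label

def stochastic_aug(image, label):  # stochastic ops after the shuffle
    image = tf.image.random_flip_left_right(image, seed=SEED)
    image = tf.image.random_brightness(image, max_delta=0.2, seed=SEED)
    return image, label

dataset = (
    raw_dataset
    .map(preprocess, num_parallel_calls=AUTOTUNE)
    .shuffle(1_000, seed=SEED, reshuffle_each_iteration=True)
    .map(stochastic_aug, num_parallel_calls=AUTOTUNE)
    .apply(tf.data.experimental.prefetch_to_device('/gpu:0'))
)
```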
NLP Framework Rules (NLPAug / TextAttack)
- Chain character, word, and contextual augmenters; cap replacement ratio at ≤0.3 to preserve semantics.
- Back-translation: cache results in SQLite/LMDB with key = hash(sentence) (see the sketch after this list).
- Maintain stop-word list & domain glossary to avoid altering critical tokens (e.g., medical terms).
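A minimal SQLite-backed cache sketch; `translate` stands in for any sentence-to-sentence callable (e.g., a wrapped nlpaug `BackTranslationAug`):
```python
import hashlib
import sqlite3

db = sqlite3.connect("backtranslation_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, text TEXT)")

def cached_back_translate(sentence: str, translate) -> str:
    """Look up the cache before running the expensive translation model."""
    key = hashlib.sha256(sentence.encode("utf-8")).hexdigest()
    row = db.execute("SELECT text FROM cache WHERE key = ?", (key,)).fetchone()
    if row:
        return row[0]  # cache hit
    result = translate(sentence)
    db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (key, result))
    db.commit()
    return result
```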
Audio Rules (Audiomentations)
- Sample-rate normalise first, then augment.
- Ensure SNR (signal-to-noise ratio) ≥ 10 dB after noise injection; automatically reject samples that fall below it (see the sketch after this list).
- For time-stretch/pitch-shift, adjust annotation timestamps accordingly.
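A sketch of the SNR floor with rejection, intended for the length-preserving noise-injection stage (helper names are illustrative):
```python
import numpy as np

def snr_db(clean: np.ndarray, noisy: np.ndarray) -> float:
    """SNR of a length-preserving noise injection, in dB."""
    noise = noisy - clean
    return 10 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))

def add_noise_with_snr_floor(clean, add_noise, min_snr_db=10.0, max_tries=5):
    """Re-sample the stochastic noise op until the SNR floor holds, else reject."""
    for _ in range(max_tries):
        noisy = add_noise(clean)
        if snr_db(clean, noisy) >= min_snr_db:
            return noisy
    return clean  # reject the augmentation, keep the original sample

# Usage with audiomentations (assumed 16 kHz mono float32 input):
# noise_op = AddBackgroundNoise(sounds_path="./noise_samples", p=1.0)
# safe = add_noise_with_snr_floor(clean, lambda x: noise_op(samples=x, sample_rate=16_000))
```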
Testing
- Golden sample test: run the pipeline with a fixed seed and checksum the output tensors (see the sketch after this list).
- Sanity check per-transform with `pytest.param` covering min/max parameter extremes.
- Implement regression test that trains tiny model (e.g., 100 samples, 1 epoch); assert ≥ baseline accuracy.
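A sketch of the golden-sample and extreme-parameter tests, assuming `VISION_TRAIN_AUG` lives at the module path shown and `GOLDEN_SHA256` was recorded from a trusted run:
```python
import hashlib

import pytest
import torch
import kornia.augmentation as K

from augmentations.vision.basic_aug import VISION_TRAIN_AUG  # assumed module path

GOLDEN_SHA256 = "<recorded from a trusted run>"

def test_pipeline_golden_sample():
    torch.manual_seed(42)
    out = VISION_TRAIN_AUG(torch.rand(1, 3, 224, 224))
    digest = hashlib.sha256(out.detach().numpy().tobytes()).hexdigest()
    assert digest == GOLDEN_SHA256

@pytest.mark.parametrize("magnitude", [0.0, 1.0])  # parameter extremes
def test_color_jitter_extremes(magnitude):
    aug = K.ColorJitter(magnitude, magnitude, magnitude, min(magnitude, 0.5))
    out = aug(torch.rand(2, 3, 32, 32))
    assert torch.isfinite(out).all()
```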
Performance Optimisation
- Prefer on-GPU augmentations (Kornia, tf.image, torch.compile-enabled) to minimise PCIe transfer.
- For CPU-heavy vision pipelines, use `opencv` + `albumentations` in multi-processing workers: `num_workers = min(8, os.cpu_count())` (see the sketch after this list).
- Benchmark end-to-end throughput (samples/s) with and without augmentation; target <15% overhead.
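A sketch of the CPU-worker setup; the toy dataset stands in for your real one:
```python
import os

import albumentations as A
import numpy as np
from torch.utils.data import DataLoader, Dataset

CPU_AUG = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.5),
])

class ToyImages(Dataset):  # stand-in for your real dataset
    def __len__(self):
        return 1024
    def __getitem__(self, idx):
        img = np.zeros((224, 224, 3), dtype=np.uint8)
        return CPU_AUG(image=img)["image"], 0  # augment inside the worker

loader = DataLoader(
    ToyImages(),
    batch_size=64,
    num_workers=min(8, os.cpu_count()),  # per the rule above
    pin_memory=True,                     # cheaper host-to-GPU copies
)
```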
Security & Privacy
- Apply differential privacy noise only after augmentation to prevent pattern leakage.
- Maintain an audit log with hash(original) → hash(augmented) mapping (see the sketch below).
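A minimal audit-log sketch using content hashes (the file name and function are illustrative):
```python
import hashlib
import json

def log_augmentation(original: bytes, augmented: bytes, logfile="audit.jsonl"):
    """Append a hash(original) -> hash(augmented) record; raw data never leaves."""
    record = {
        "original": hashlib.sha256(original).hexdigest(),
        "augmented": hashlib.sha256(augmented).hexdigest(),
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")
```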
Naming & Directory Conventions
- Folder lowercase-kebab: `augmentations/vision`, `augmentations/nlp`.
- Module names end with `_aug.py`, e.g., `rand_augment_aug.py`.
- Export config YAMLs under `configs/augmentation/` with same base name as module.
Example Vision Pipeline (PyTorch)
```python
import torch.nn as nn
import kornia.augmentation

VISION_TRAIN_AUG = nn.Sequential(
    kornia.augmentation.RandomResizedCrop((224, 224), scale=(0.8, 1.0)),
    kornia.augmentation.RandomHorizontalFlip(p=0.5),
    kornia.augmentation.ColorJitter(0.2, 0.2, 0.2, 0.1),
    kornia.augmentation.RandomErasing(p=0.2, scale=(0.01, 0.05)),
)
```
Example NLP Pipeline
```python
from nlpaug.flow import Sequential
from nlpaug.augmenter.word import SynonymAug, ContextualWordEmbsAug
TEXT_AUG = Sequential([
    SynonymAug(aug_max=3),
    ContextualWordEmbsAug(model_path='bert-base-uncased', action='substitute', aug_p=0.1),
])
```
Checklist Before Commit
- [ ] Pipeline unit tests green
- [ ] Updated YAML config version
- [ ] Random seed logged
- [ ] Augmentation disabled on val/test
- [ ] Performance benchmark < 15% overhead