Comprehensive Rules for designing, testing, and maintaining high-quality prompts for Large Language Models (LLMs)
The difference between a working LLM integration and a production-ready one isn't the model—it's the prompts. You've probably spent hours debugging inconsistent outputs, handling edge cases, and dealing with users who've figured out how to break your carefully crafted prompts. These Cursor Rules transform your prompt engineering from guesswork into a systematic, reliable process.
Every developer working with LLMs faces the same frustrations:
Context Window Chaos: Your prompts work perfectly until they don't—usually when you hit token limits or need to handle dynamic content that breaks your carefully structured instructions.
Output Format Roulette: One day your LLM returns perfect JSON, the next it's wrapped in markdown code blocks, and sometimes it decides to write a paragraph explaining why it can't help instead of following your schema.
The Jailbreak Problem: Users discover creative ways to manipulate your prompts, bypass safety measures, or extract your system instructions—turning your helpful AI assistant into a security liability.
Version Control Nightmare: You've got prompts scattered across codebases, tweaked by different developers, with no systematic way to test changes or roll back when something breaks.
Cost Explosion: Without token awareness and systematic optimization, your LLM costs scale unpredictably—especially when prompts start hitting retry loops or context limits.
These rules establish a complete framework for building, testing, and maintaining prompts that work reliably in production. Instead of treating prompts as strings in your code, you'll manage them like the critical infrastructure they are.
Structured Prompt Architecture: Every prompt follows a consistent six-section format—system role, context, task instructions, output format, examples, and user input. This eliminates ambiguity and makes prompts debuggable.
Built-in Safety and Validation: Defensive prompting patterns handle adversarial inputs, malformed responses, and edge cases before they reach your application logic.
Framework-Specific Optimizations: Specialized patterns for ReAct reasoning, RAG implementations, and AutoPrompt optimization that go beyond basic completion.
Here's what changes when you implement systematic prompt engineering:
# Your current reality
response = openai.chat.completions.create(
messages=[{"role": "user", "content": f"Analyze this: {user_input}"}],
model="gpt-4"
)
# Pray it works, debug when it doesn't
# With structured prompts
prompt = load_prompt("analyze-sentiment.prompt.md", {
"user_input": sanitize_input(user_input),
"output_schema": SENTIMENT_SCHEMA
})
response = execute_with_validation(prompt, expected_schema=SENTIMENT_SCHEMA)
# Guaranteed format, built-in error handling
Prompt Development Speed: Instead of iterating through trial-and-error, you follow proven patterns. New prompts start from templates that handle common edge cases—cutting initial development time by 60%.
Debugging Time Reduction: Structured sections and consistent formatting mean you can quickly identify whether issues are in your task instructions, examples, or output formatting. No more guessing which part of your prompt caused the problem.
Production Reliability: Built-in validation and error handling patterns catch malformed outputs before they break your application. Your retry logic is systematic, not reactive.
When you need your LLM to perform research or complex analysis:
### System
You are a research analyst. Use the ReAct framework: Thought → Action → Observation → Answer.
### Task
Research the given topic using available tools. Maximum 5 iterations.
### Format
Thought: [your reasoning]
Action: [tool_name: parameters]
Observation: [tool response]
... (repeat until complete)
Answer: [final conclusion]
### User Input
<user_input>{{query}}</user_input>
Your application gets predictable, parseable outputs with clear reasoning chains—no more mysterious conclusions or incomplete analysis.
For knowledge-intensive applications where citations matter:
### Context
Relevant documents (≤3, ≤150 tokens each):
<documents>
{{retrieved_docs}}
</documents>
### Task
Answer using ONLY the provided documents. Include [source{n}] citations.
### Output Format
{
"answer": "string with [source1] citations",
"confidence": "high|medium|low",
"sources_used": ["source1", "source2"]
}
### Validation
IF insufficient information THEN respond: {"answer": "INSUFFICIENT_INFORMATION"}
Your RAG system becomes auditable and reliable—users can verify claims, and you can debug retrieval quality.
Built-in token budgeting prevents runaway costs:
# Automatic token management
if estimate_tokens(prompt + context) > MODEL_LIMITS["gpt-4"] * 0.75:
context = truncate_with_priority(context, keep_recent=True)
# Temperature adjustment for retries
config = {
"temperature": 0.7, # Default for creativity
"temperature_retry": 0.2, # Deterministic for fixes
"max_retries": 2
}
Create your first structured prompt file:
mkdir prompts/{system,examples,formats}
touch prompts/analyze-code.prompt.md
### System
You are a senior code reviewer. Version: 1.0.0
### Context
Programming language: {{language}}
Code complexity: {{complexity_level}}
### Task
1. Analyze code quality step by step
2. Identify specific issues with line numbers
3. Suggest concrete improvements
### Output Format (JSON)
{
"quality_score": 1-10,
"issues": [{"line": number, "type": "string", "description": "string"}],
"suggestions": ["string"],
"overall_assessment": "string"
}
### Example
// include:examples/code-review-python
### User Input
<user_input>
{{code_block}}
</user_input>
def execute_code_review(code, language):
prompt = load_prompt("analyze-code.prompt.md", {
"code_block": sanitize_code_input(code),
"language": language,
"complexity_level": estimate_complexity(code)
})
try:
response = client.chat.completions.create(
messages=[{"role": "user", "content": prompt}],
model="gpt-4",
temperature=0.3 # Deterministic for code analysis
)
result = validate_json_response(response.content, CODE_REVIEW_SCHEMA)
return result
except ValidationError as e:
# Automatic retry with correction prompt
return retry_with_schema_correction(prompt, e, max_retries=2)
# tests/test_prompts.py
def test_code_review_prompt():
test_cases = load_test_cases("code-review-tests.json")
for case in test_cases:
result = execute_code_review(case["input"], case["language"])
assert result["quality_score"] >= 1 and result["quality_score"] <= 10
assert len(result["issues"]) >= case["expected_min_issues"]
assert validate_json_schema(result, CODE_REVIEW_SCHEMA)
Consistency Improvement: Structured prompts with examples reduce output variance by ~70%. Your application behavior becomes predictable across different inputs and edge cases.
Development Velocity: Template-driven prompt creation and reusable components let you build new LLM features 3x faster. Instead of starting from scratch, you compose proven patterns.
Error Rate Reduction: Built-in validation and retry logic with schema correction reduces production errors by ~85%. Your LLM integrations become as reliable as your other API calls.
Cost Optimization: Token-aware prompting and temperature management typically reduce API costs by 30-40% while maintaining output quality.
Security Posture: Systematic input sanitization and adversarial testing patterns protect against prompt injection and jailbreaking attempts.
The difference between experimental LLM features and production-ready AI capabilities isn't the underlying model—it's the systematic approach to prompt engineering. These rules give you that system, turning your LLM integrations from unpredictable experiments into reliable, scalable infrastructure.
Stop debugging LLM outputs. Start engineering them.
You are an expert in Prompt Engineering for Large Language Models (LLMs) using OpenAI GPT-4, Claude, PaLM, and Hugging Face models.
Technology Stack Declaration
- LLM APIs: OpenAI, Anthropic, Cohere, Hugging Face Inference
- Tooling: Python/TypeScript SDKs, LangChain, LlamaIndex, ZenML
- Prompt Frameworks: AutoPrompt, ReAct, RAG, Chain-of-Thought, Few-Shot & Zero-Shot
- Deployment Targets: Chatbots, Retrieval-augmented pipelines, Agentic workflows
Key Principles
- Precision over prose: write minimal, unambiguous instructions.
- Separate concerns: keep system, user, and example messages distinct with clear delimiters (```---``` / XML-style tags).
- Think-then-respond: explicitly request step-by-step reasoning where factual accuracy matters.
- Show, don’t tell: include 1-3 high-quality examples (few-shot) when format or style is non-obvious.
- Deterministic outputs: define schemas (JSON, YAML, CSV) and validation rules inside the prompt.
- Fail fast, iterate: test, measure, and refine prompts continuously, versioning every change.
- Safety first: assume adversarial inputs—sandbox user text, instruct refusal for disallowed content.
- Token-budget awareness: stay below 75 % of model context to allow room for responses and retries.
English Prompt Syntax & Conventions
- Voice: imperative (“Generate…”, “Return…”). Avoid passive.
- Delimiters: wrap dynamic user input in <user_input></user_input> or triple-backticks.
- Sections order: 1) System role, 2) Context / Knowledge, 3) Task instructions, 4) Output format, 5) Examples, 6) User input.
- Naming: prompt files use kebab-case ending with .prompt.md (e.g., summarize-article.prompt.md).
- Reusable blocks: store `system/`, `examples/`, and `formats/` folders; compose via import statements (// include:examples/news-summary).
- Length constraints: prefer ≤ 25 sentences or 2 000 tokens for base prompt; larger context via RAG.
- Jargon: introduce domain terms only after providing a brief definition.
- Numbered steps: force chain-of-thought by prefacing with “Let’s work this out step by step:” and requesting “### Answer” after reasoning.
- JSON output: always finish with `{"result": …}`; escape newlines with \n.
Error Handling & Validation
- Early detection: include `IF the answer is unsure OR data is missing THEN respond with "INSUFFICIENT_INFORMATION"`.
- Guardrails: provide explicit refusal style—single sentence apology + single sentence refusal.
- Retry logic: client code retries up to 2× with temperature ↓0.2 and adds “You previously made an error: <error>” section.
- Schema validation: post-parse LLM output; on failure, send corrective prompt: “Output did not match schema X. Fix only the formatting.”
- Adversarial testing: run prompts against a malicious corpus weekly; log jailbreak success rate < 0.5 % target.
Framework-Specific Rules
AutoPrompt
- Use gradient-free search to discover optimal trigger tokens; freeze once BLEU/accuracy plateaus.
- Store generated triggers in `autoprompt/` with metadata (model, date, task).
ReAct
- Enforce `Thought:` → `Action:` → `Observation:` cycle; max 5 iterations.
- Terminate with `Answer:` segment; client halts on first `Answer:`.
RAG
- Retrieval step must return ≤ 3 snippets, each ≤ 150 tokens.
- Cite sources: append `[source{n}]` tokens and provide URL list under “References”.
Additional Sections
Testing & Evaluation
- Automated eval suite: accuracy, brevity, toxicity, cost; run on CI for every prompt change.
- Use OpenAI GPT-4 as judge model with rubric scoring 1-5; gate merge at ≥ 4 average.
Performance & Cost
- Default temperature 0.7; lower to 0.2 for deterministic needs.
- Track tokens per request; alert when > 90 % of cost budget.
Security & Compliance
- Strip PII from logs; hash user IDs.
- For restricted content, prepend policy excerpt and “You must refuse if …”.
Versioning & Documentation
- Semantic versioning: MAJOR.MINOR.PATCH in prompt header comment.
- Changelog stored in `CHANGELOG.md`; summary auto-generated on merge.
Example Prompt Skeleton
```
### System
You are an expert financial analyst. Follow regulations. Version:1.2.0
### Context
Relevant market data: <data_block>
### Task
1. Analyze trends step-by-step.
2. Output JSON using the schema.
### Output Format (JSON)
{
"summary": string,
"risk_level": "low" | "medium" | "high",
"citations": string[]
}
### Example
<example omitted for brevity>
### User Input
<user_input>
```
Adopt these rules to ensure clarity, safety, and high-fidelity outputs across all LLM integrations.