Actionable coding & process rules for building automated, resilient incident-management solutions, aligned with ITIL/NIST and modern DevOps tooling.
When your production system goes down at 2 AM, you need runbooks that execute themselves—not documentation that sits in a wiki while you scramble to remember which service to restart first. These Cursor Rules transform incident management from reactive firefighting into proactive, automated response workflows that actually work under pressure.
Every minute of downtime costs your business money, yet traditional incident management approaches often create more problems than they solve.
The result? Your MTTR stays high, your on-call engineers burn out, and your customers lose trust in your platform's reliability.
These rules establish incident management as a disciplined engineering practice. Instead of hoping someone remembers the right commands during an outage, you codify every response into executable runbooks that automatically detect, contain, and resolve incidents.
Key Components:
# What usually happens during an incident
1. Alert fires (maybe)
2. Someone eventually notices
3. Scramble to find the right person
4. Debug the issue from scratch
5. Apply a fix that might work
6. Hope it doesn't happen again
kind: IncidentRunbook
id: db-connection-pool-exhaustion
title: Resolve Database Connection Pool Exhaustion
severity: S1
services: [user-service, payment-service]
triggers:
  - promql: "pg_stat_activity_count > 95"
mitigation-steps:
  - action: restart-connection-pool
    script: ./scripts/restart_pool.py
    timeout: 30s
  - action: validate-connections
    script: ./scripts/check_db_health.py
validation-tests:
  - endpoint: /health/database
    expected-status: 200
    timeout: 10s
Faster Response Times:
Reduced Cognitive Load:
Improved Team Reliability:
# Before: Manual debugging and fixes
def handle_timeout_spike():
    # Someone notices alerts (eventually)
    # Logs into multiple dashboards
    # Manually restarts services
    # Hopes the fix works
    pass

# After: Automated detection and response
import backoff
import requests

from incident_management import IncidentContext, IncidentError

@backoff.on_exception(backoff.expo, requests.RequestException, max_time=32)
def mitigate_api_gateway_timeout():
    """Auto-restart overloaded gateway instances."""
    incident = IncidentContext.current()

    # Scale up gateway instances
    scale_service("api-gateway", replicas=6)

    # Validate response times
    if not validate_response_times(threshold_ms=500):
        raise IncidentError("Mitigation failed, escalating to Tier 2")

    update_incident_status(incident.id, "resolved")
    post_slack_update("✅ API gateway timeout resolved automatically")
# Runbook executes automatically when triggered
kind: IncidentRunbook
id: db-pool-exhaustion-web-app
title: Clear Database Connection Pool Leak
severity: S2
services: [web-app, background-jobs]
triggers:
  - promql: 'pg_stat_activity_count{job="postgres"} > 90'
  - log-query: 'ERROR.*connection pool exhausted'
mitigation-steps:
  - action: identify-long-running-queries
    script: ./scripts/find_blocking_queries.py
    timeout: 15s
  - action: terminate-problematic-connections
    script: ./scripts/kill_long_queries.py
    timeout: 30s
  - action: restart-connection-pool
    script: ./scripts/restart_pool.py
    timeout: 45s
validation-tests:
  - query: "SELECT count(*) FROM pg_stat_activity"
    expected-max: 50
rollback-steps:
  - action: restore-connection-limits
    script: ./scripts/restore_pool_config.py
# Integrated communication that updates automatically
class IncidentCommunicator:
    def __init__(self, incident_id: str):
        self.incident = get_incident(incident_id)
        self.slack_channel = f"#inc-{incident_id}"

    @track_communication_delivery
    def broadcast_status_update(self, status: IncidentStatus):
        """Update all stakeholders with current incident status"""
        # Update internal Slack channel
        self.post_slack_update(
            channel=self.slack_channel,
            message=self.format_status_message(status)
        )

        # Update public status page (PII-filtered)
        self.update_status_page(
            title=redact_pii(self.incident.title),
            status=status.public_description
        )

        # Notify escalation contacts if needed
        if status.requires_escalation:
            self.page_escalation_contacts()
# Add to your .cursorrules file
curl -o .cursorrules https://raw.githubusercontent.com/your-repo/cursor-rules/main/incident-management.yaml
# runbooks/api-health-check.yaml
kind: IncidentRunbook
id: api-health-degraded
title: Resolve API Health Check Failures
severity: S2
services: [api-service]
triggers:
  - promql: 'up{job="api-service"} < 0.8'
mitigation-steps:
  - action: restart-unhealthy-instances
    script: ./scripts/restart_api_instances.py
  - action: verify-health-endpoints
    script: ./scripts/check_api_health.py
validation-tests:
  - endpoint: /health
    expected-status: 200
owners: [platform-team-oncall]
# scripts/restart_api_instances.py
from incident_management import IncidentContext, track_action

@track_action("restart-unhealthy-instances")
def restart_unhealthy_api_instances():
    """Identify and restart failing API service instances"""
    incident = IncidentContext.current()
    unhealthy_instances = get_unhealthy_instances("api-service")

    for instance in unhealthy_instances:
        restart_instance(instance.id)

    # Validate all instances are healthy
    if not all_instances_healthy("api-service", timeout=60):
        raise IncidentError("Instance restart failed, escalating")

    log_action_success(
        incident_id=incident.id,
        action="restart-unhealthy-instances",
        instances_restarted=len(unhealthy_instances)
    )
# config/integrations.py
INCIDENT_MANAGEMENT_CONFIG = {
    "pagerduty": {
        "integration_key": env.PAGERDUTY_INTEGRATION_KEY,
        "escalation_policies": {
            "S0": "critical-escalation-policy",
            "S1": "high-priority-policy",
            "S2": "standard-policy",
        },
    },
    "slack": {
        "bot_token": env.SLACK_BOT_TOKEN,
        "incident_channel_prefix": "#inc-",
        "stakeholder_groups": ["@platform-team", "@leadership"],
    },
    "monitoring": {
        "prometheus_url": env.PROMETHEUS_URL,
        "grafana_url": env.GRAFANA_URL,
    },
}
# Before vs After Metrics
response_times:
  mtta_before: "15+ minutes"
  mtta_after: "< 2 minutes"
  mttr_reduction: "60-80%"
team_efficiency:
  context_switching_reduction: "70%"
  false_positive_reduction: "20% → 5%"
  on_call_stress_reduction: "significant"
process_improvements:
  runbook_execution_consistency: "100%"
  incident_documentation_completeness: "95%+"
  post_incident_action_item_completion: "90%+"
Stop letting incidents control your team's productivity. These rules turn incident management into a competitive advantage—your systems become more reliable, your team becomes more confident, and your customers experience fewer disruptions. The next time production breaks, your automated runbooks will already be fixing it while your competitors are still figuring out who to call.
You are an expert in: Incident-management automation using YAML runbooks, Python 3.11+ scripting, Terraform, GitHub Actions, PagerDuty, Slack APIs, Jira Service Management, ServiceNow, and modern observability stacks (Prometheus, Grafana, Datadog).
Key Principles
- Detect early, respond fast, learn always—optimize MTTA/MTTR.
- Keep humans informed: automate status pages & stakeholder comms.
- Prefer automation over manual steps; every manual action becomes a coded runbook task.
- Maintain a single, tiered escalation matrix (Tier 1 ⇢ Tier N) referenced by all tools.
- Treat runbooks as code: version-controlled, peer-reviewed, CI-validated.
- Capture every incident in a structured, queryable format (YAML/JSON) for trend analysis.
- Post-incident reviews are mandatory; action items tracked in backlog until closed.
YAML (Runbooks, Configs)
- Root key must be kind: IncidentRunbook | EscalationPolicy | Postmortem.
- Use kebab-case for keys (e.g., detection-rules, mitigation-steps).
- Required top-level fields for Runbook:
- id (UUIDv4)
- title (short imperative phrase)
- severity (S0…S4) per ITIL
- services (array of impacted service IDs)
- triggers (regex | PromQL | log-query)
- mitigation-steps (ordered list)
- rollback-steps (ordered list, optional)
- validation-tests (automated checks to confirm recovery)
- owners (array of on-call schedule IDs)
- Validate YAML via JSON-Schema before merge; fail CI if unknown keys.
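The CI gate above can be sketched in a few lines of Python. This is a hypothetical hand-rolled check (a real pipeline would typically run `jsonschema` against a full JSON-Schema document); the key names mirror the required runbook fields listed above, and `validate_runbook` is an illustrative name.

```python
# Minimal CI gate for runbook YAML: required fields present, no unknown keys.
REQUIRED_KEYS = {"kind", "id", "title", "severity", "services",
                 "triggers", "mitigation-steps", "validation-tests", "owners"}
OPTIONAL_KEYS = {"rollback-steps"}

def validate_runbook(doc: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the doc passes."""
    errors: list[str] = []
    missing = REQUIRED_KEYS - doc.keys()
    if missing:
        errors.append(f"missing required keys: {sorted(missing)}")
    unknown = doc.keys() - REQUIRED_KEYS - OPTIONAL_KEYS
    if unknown:
        errors.append(f"unknown keys (fail CI): {sorted(unknown)}")
    if doc.get("kind") not in {"IncidentRunbook", "EscalationPolicy", "Postmortem"}:
        errors.append(f"invalid kind: {doc.get('kind')!r}")
    if doc.get("severity") not in {"S0", "S1", "S2", "S3", "S4"}:
        errors.append(f"invalid severity: {doc.get('severity')!r}")
    return errors
```

Wiring this into CI means parsing each `runbooks/*.yaml` file and failing the build on any non-empty error list.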
Python (Automation Scripts)
- Use Python 3.11+ with type hints; enforce via mypy --strict.
- All external calls (PagerDuty, Slack, Jira) wrapped in retryable @backoff.on_exception with exponential backoff ≤ 32 s.
- Never raise bare Exception; define custom IncidentError hierarchy:
class IncidentError(Exception): ...
- Functions ≤ 60 LOC, cyclomatic complexity ≤ 10 (radon).
- Log JSON (structlog) with keys: ts, level, incident_id, function, msg.
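A minimal sketch of the two conventions above: a custom `IncidentError` hierarchy instead of bare exceptions, and JSON log lines carrying the required keys. The subclass names are illustrative, and the logger is a stdlib stand-in for structlog.

```python
import json
import time

# Hypothetical hierarchy: subclass per failure domain so callers can
# catch narrowly instead of catching bare Exception.
class IncidentError(Exception): ...
class MitigationError(IncidentError): ...
class EscalationError(IncidentError): ...
class CommsDeliveryError(IncidentError): ...

def log_json(level: str, incident_id: str, function: str, msg: str) -> str:
    """Emit one structured log line with the required keys.

    In the real setup structlog renders this; a stdlib sketch here.
    """
    line = json.dumps({
        "ts": time.time(),
        "level": level,
        "incident_id": incident_id,
        "function": function,
        "msg": msg,
    })
    print(line)
    return line
```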
Error Handling & Validation
- Fail fast: validate required input at function start; raise IncidentError early.
- Auto-escalate if script exit code ≠ 0 or SLA breach imminent:
- Step 1: page current on-call via PagerDuty Events API v2.
- Step 2: create Slack incident channel #inc-<id> and post context.
- Validate comms delivery (Slack OK 200, PagerDuty status=success); otherwise fallback to SMS.
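The escalation path above can be sketched as follows. The `send` and `sms_fallback` callables are injected so the flow can be exercised without network access; their names are illustrative. The Events API v2 endpoint and the `{"status": "success"}` acceptance response are real PagerDuty behavior.

```python
from typing import Callable

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def escalate(incident_id: str, summary: str, routing_key: str,
             send: Callable[[str, dict], dict],
             sms_fallback: Callable[[str], None]) -> str:
    """Page the current on-call; return the delivery channel actually used."""
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": incident_id,
        "payload": {"summary": summary, "source": "runbook-engine",
                    "severity": "critical"},
    }
    resp = send(PAGERDUTY_EVENTS_URL, event)
    # PagerDuty Events v2 returns {"status": "success", ...} on acceptance;
    # anything else means delivery is unconfirmed, so fall back to SMS.
    if resp.get("status") == "success":
        return "pagerduty"
    sms_fallback(f"[{incident_id}] page failed, SMS fallback: {summary}")
    return "sms"
```

In production, `send` would be a retry-wrapped `requests.post` and `sms_fallback` a Twilio-style sender; creating the `#inc-<id>` Slack channel would follow the same confirm-or-fallback pattern.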
Framework-Specific Rules
ITIL/NIST Mapping
- Severity S0 (Critical) ↔ NIST Impact High; must trigger exec-level comms within 15 min.
- Use ITIL phases as git branches: detection/, containment/, eradication/, recovery/, lessons-learned/—merge PR only when phase tasks complete.
PagerDuty
- One service per deployable artifact; escalation_policy.id must match runbook owners.
- Auto-suppress duplicate alerts within 5 min via dedup_key composed of service + hash(alert_payload).
ServiceNow / Jira Service Management
- Incidents auto-created with fields:
short_description = <title>
severity = <severity>
cmdb_ci = <service>
affected_user_count derived from monitoring.
- Sync state changes (in_progress, resolved) back via webhooks.
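The auto-creation mapping above is a straight field translation; a sketch, with `to_servicenow_fields` as a hypothetical name and the first impacted service taken as the `cmdb_ci` for illustration.

```python
def to_servicenow_fields(runbook: dict, affected_user_count: int) -> dict:
    """Map a triggered runbook to the auto-created incident record fields."""
    return {
        "short_description": runbook["title"],
        "severity": runbook["severity"],
        "cmdb_ci": runbook["services"][0],
        # Derived from monitoring at trigger time, passed in by the caller.
        "affected_user_count": affected_user_count,
    }
```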
Slack
- Incident channel naming: #inc-<yyyyMMdd>-<increment> (e.g., #inc-20241005-3).
- Use Slack bookmarks to pin runbook link, status page, Google Doc for PIR.
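The `#inc-<yyyyMMdd>-<increment>` convention can be generated by scanning today's existing channels for the next free increment. A sketch; `incident_channel_name` is an illustrative helper name.

```python
from datetime import date

def incident_channel_name(today: date, existing: list[str]) -> str:
    """Next channel name for today, e.g. #inc-20241005-3 after -1 and -2."""
    prefix = f"#inc-{today.strftime('%Y%m%d')}-"
    taken = [int(name.removeprefix(prefix))
             for name in existing if name.startswith(prefix)]
    return f"{prefix}{max(taken, default=0) + 1}"
```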
Additional Sections
Testing
- Weekly chaos game-day simulating at least one S0 and one S2 scenario; track metrics.
- Use pytest-assume to validate full runbook completion without manual pausing.
Performance
- MTTA target ≤ 2 min (95th), MTTR target ≤ 30 min (90th).
- Measure alert noise: false-positive rate ≤ 5 % of total alerts monthly.
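The percentile targets above can be checked with a simple nearest-rank calculation over MTTA/MTTR samples. A sketch; `slo_ok` and the threshold wiring are illustrative.

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile (e.g. 95th) of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def slo_ok(mtta_seconds: list[float], mttr_seconds: list[float]) -> bool:
    """MTTA p95 <= 2 min and MTTR p90 <= 30 min."""
    return (percentile(mtta_seconds, 95) <= 120
            and percentile(mttr_seconds, 90) <= 1800)
```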
Security
- Encrypt incident data at rest (KMS) and in transit (TLS 1.2+).
- Strip PII before posting to public status pages; use redact_pii(text) helper.
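A possible shape for the `redact_pii(text)` helper: regex scrubbing of emails, IPv4 addresses, and phone-like numbers before text reaches the public status page. This is a hypothetical sketch; a production helper would cover more PII classes (names, tokens, account IDs) and likely use a dedicated library.

```python
import re

# Order matters: replace IPs before the phone pattern, which would
# otherwise match the digit runs inside a dotted address.
_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<ip>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<phone>"),
]

def redact_pii(text: str) -> str:
    """Replace PII-looking substrings with placeholder tokens."""
    for pattern, placeholder in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```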
Post-Incident Review (PIR)
- Template stored in templates/postmortem.md; must be filled within 48 h.
- Action items labeled team/incident-ai in Jira; auto-remind assignee weekly until closed.