Opinionated rules for designing, implementing, testing, and operating HTTP health checks (liveness, readiness, startup) in Node.js/TypeScript microservices running on Kubernetes and observed with Prometheus/Grafana.
Your Node.js microservices are failing silently, and you're finding out from angry customers instead of monitoring dashboards. These Cursor Rules transform your health checking from reactive firefighting to proactive system observability.
You've probably been there: your service passes its basic "hello world" health check while the database connection pool is exhausted, Redis is timing out, and users are getting 500 errors. Standard health checks are theater—they tell you the process is running, not whether it can actually serve traffic.
Here's what's wrong with most Node.js health check implementations:
Everything gets crammed into a single /health endpoint, so the probe can't distinguish "restart this process" from "stop routing traffic to it," and a passing check says nothing about whether your dependencies are actually reachable.

These Cursor Rules implement the three-probe pattern that actually works in production Kubernetes environments:
```text
// What you get: properly separated concerns
GET /health/liveness   // "Is my process alive?"     (K8s decides whether to restart)
GET /health/readiness  // "Can I serve traffic?"     (K8s decides whether to route traffic)
GET /health/startup    // "Am I done initializing?"  (K8s waits before running other probes)
```
Each probe tests exactly what it needs to test, completes in under 100ms, and returns actionable status information that both Kubernetes and your monitoring stack can act on immediately.
Eliminate Production Debugging Sessions
Accelerate Deployment Confidence
Reduce Mean Time to Resolution
Before: Your service returns 200 OK on /health while throwing database timeout errors on actual requests. You discover the issue when users report problems, then spend 20 minutes figuring out it's a connection pool issue.
After: Your readiness probe fails when the database indicator can't complete SELECT 1 within 100ms. Kubernetes stops routing traffic immediately, Prometheus fires an alert, and you're investigating before users are affected.
```ts
// Readiness probe catches this immediately
export const dbIndicator: HealthIndicator = {
  name: 'postgres',
  async check() {
    try {
      await pgPool.query('SELECT 1');
      return { status: 'UP' };
    } catch (e) {
      return { status: 'DOWN', details: { error: 'Connection pool exhausted' } };
    }
  },
};
```
Before: Your application starts responding slowly because Redis is timing out, but health checks pass. Performance degrades gradually until someone notices the dashboard.
After: Your readiness probe includes Redis connectivity. When Redis starts timing out, the service becomes unready, traffic shifts to healthy instances, and you get an immediate alert with specific failure details.
Before: Your pipeline deploys successfully, but the new version can't connect to a required service. The deployment "succeeds" but serves errors until you manually investigate.
After: Your pipeline's health-check job runs immediately after container start:
```yaml
- name: Verify service health before deployment
  run: |
    docker compose up -d
    npx wait-on http://localhost:3000/health/readiness -t 10000
    # Deployment only proceeds if readiness passes
```
Install the dependencies the generated code and tests rely on:

```bash
npm install express @types/express
npm install -D jest supertest @types/jest @types/supertest
```
The rules generate this file structure automatically:
```text
src/health/
├── router.ts        # Express routes for all health endpoints
├── indicators.ts    # Individual health check functions
├── types.ts         # TypeScript interfaces
└── __tests__/       # Jest test suite
```
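The interfaces in `types.ts` are small; the sketch below shows one plausible shape, with `HealthReport` mirroring the JSON response documented further down (status, uptimeSeconds, timestamp, checks):

```ts
// types.ts - a minimal sketch of the shared interfaces
export type HealthStatus = 'UP' | 'DOWN';

export interface HealthIndicator {
  name: string;
  check(): Promise<{ status: HealthStatus; details?: unknown }>;
}

export interface HealthReport {
  status: HealthStatus;
  uptimeSeconds: number;
  timestamp: string; // ISO-8601
  checks: Record<string, HealthStatus>;
}
```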
```ts
// indicators.ts - Cursor Rules generate this pattern
export const dbIndicator: HealthIndicator = {
  name: 'postgres',
  async check() {
    try {
      await pgPool.query('SELECT 1');
      return { status: 'UP' };
    } catch (e) {
      return { status: 'DOWN', details: { error: (e as Error).message } };
    }
  },
};

export const redisIndicator: HealthIndicator = {
  name: 'redis',
  async check() {
    try {
      await redisClient.ping();
      return { status: 'UP' };
    } catch (e) {
      return { status: 'DOWN', details: { error: 'Redis unreachable' } };
    }
  },
};
```
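A matching `router.ts` could look like the sketch below. It is illustrative rather than generated output: it assumes the indicators above, aggregates them with `Promise.allSettled`, returns HTTP 500 when any dependency is DOWN, and sets the `Cache-Control: no-store` header the rules call for. The startup endpoint follows the same pattern and is omitted for brevity.

```ts
// router.ts - illustrative aggregator, assuming the dbIndicator/redisIndicator exports above
import { Router, type Response } from 'express';
import { dbIndicator, redisIndicator } from './indicators';
import type { HealthIndicator } from './types';

const dependencyIndicators: HealthIndicator[] = [dbIndicator, redisIndicator];

async function runChecks(indicators: HealthIndicator[]) {
  const settled = await Promise.allSettled(indicators.map((i) => i.check()));
  const checks: Record<string, 'UP' | 'DOWN'> = {};
  settled.forEach((result, idx) => {
    checks[indicators[idx].name] =
      result.status === 'fulfilled' && result.value.status === 'UP' ? 'UP' : 'DOWN';
  });
  const status: 'UP' | 'DOWN' = Object.values(checks).every((s) => s === 'UP') ? 'UP' : 'DOWN';
  return { status, checks };
}

function respond(res: Response, status: 'UP' | 'DOWN', checks: Record<string, 'UP' | 'DOWN'>) {
  res
    .set('Cache-Control', 'no-store')
    .status(status === 'UP' ? 200 : 500)
    .json({
      status,
      uptimeSeconds: process.uptime(),
      timestamp: new Date().toISOString(),
      checks,
    });
}

export const healthRouter = Router();

// Liveness: process-level only, never touches downstream dependencies.
healthRouter.get('/health/liveness', (_req, res) => respond(res, 'UP', {}));

// Readiness: every downstream dependency must report UP.
healthRouter.get('/health/readiness', async (_req, res) => {
  const { status, checks } = await runChecks(dependencyIndicators);
  respond(res, status, checks);
});
```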
```yaml
# The rules provide this exact configuration
livenessProbe:
  httpGet:
    path: /health/liveness
    port: 3000
  initialDelaySeconds: 10
  timeoutSeconds: 1
  periodSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/readiness
    port: 3000
  initialDelaySeconds: 5
  timeoutSeconds: 1
  periodSeconds: 5
```
```yaml
# GitHub Actions job that prevents unhealthy deployments
jobs:
  health-check:
    runs-on: ubuntu-latest
    steps:
      - name: Start service and verify health
        run: |
          docker compose up -d
          npx wait-on http://localhost:3000/health/readiness -t 10000
  deploy:
    needs: health-check   # Only deploy if health passes
    runs-on: ubuntu-latest
    steps:
      # deployment steps here
```
Week 1: Stop getting surprised by "healthy" services that can't serve traffic. Your readiness probes catch dependency failures before users do.
Week 2: Deployment confidence increases dramatically. Failed deployments get caught in CI/CD instead of production, reducing rollback frequency by 60-80%.
Month 1: Mean time to resolution drops significantly because health responses tell you exactly which dependency is failing. No more "service is down, let me check everything" debugging sessions.
Ongoing: Your Prometheus/Grafana dashboards actually correlate health status with user-facing metrics. You start predicting issues instead of just reacting to them.
The most significant change? You'll stop learning about service issues from user reports and start catching them from monitoring alerts. That alone makes these rules worth implementing immediately.
These Cursor Rules implement battle-tested health check patterns used by major Node.js applications running at scale. Stop debugging production health issues—start preventing them.
You are an expert in Node.js (TypeScript), Express, Docker, Kubernetes, Prometheus, Grafana, GitHub Actions, and AWS.
Key Principles
- Every deployable unit MUST expose deterministic, idempotent health-check endpoints.
- Separate concerns: liveness ≠ readiness ≠ startup. Never multiplex them into a single probe.
- Health checks MUST complete in <100 ms and perform no heavy computation.
- Return machine-parsable JSON with a single word status ("UP" | "DOWN").
- Fail fast & loudly: unhealthy results MUST return HTTP 500 and trigger alerting.
- Never leak secrets or PII in health payloads—surface only minimal diagnostics.
- Do not introduce additional dependencies in health-check code paths (e.g. templating engines).
- All checks run in parallel; any DOWN result makes the overall status DOWN, and the slowest check bounds total latency.
- Add health verification as the very first step of every CI/CD stage.
JavaScript / TypeScript Rules
- Use TypeScript 5.x with strict mode; `tsconfig.json` sets `"strict": true` and `"target": "ES2022"`.
- File layout (example):
src/
└─ health/
├─ router.ts // Express Router exposing /health/*
├─ indicators.ts // Individual indicator functions
├─ types.ts // `HealthIndicator` & `HealthReport` interfaces
└─ __tests__/ // Jest tests
- Naming conventions:
• Main combined endpoint: GET /health (overall)
• Liveness probe: GET /health/liveness (process-level)
• Readiness probe: GET /health/readiness (down-stream deps)
• Startup probe: GET /health/startup (lengthy init tasks)
- Sample indicator implementation (typescript):
```ts
// indicators.ts
export interface HealthIndicator {
name: string;
check(): Promise<{ status: 'UP' | 'DOWN'; details?: unknown }>;
}
export const dbIndicator: HealthIndicator = {
name: 'postgres',
async check() {
try {
await pgPool.query('SELECT 1');
return { status: 'UP' };
} catch (e) {
return { status: 'DOWN', details: { error: (e as Error).message } };
}
},
};
```
- Aggregator (router.ts) runs all indicators with `Promise.allSettled`; if any indicator reports `DOWN`, the overall status is `DOWN` and the response is HTTP 500.
- Response shape:
```json
{
"status": "UP",
"uptimeSeconds": 1234.56,
"timestamp": "2024-03-17T15:04:05.123Z",
"checks": {
"postgres": "UP",
"redis": "UP"
}
}
```
- Export `healthRouter` and mount at root of Express app: `app.use('/', healthRouter)`.
- Always set `Cache-Control: no-store` header on health endpoints.
Error Handling & Validation
- Perform parameterless GET; fail if any indicator rejects or times out > 1 s.
- Indicators MUST wrap their logic in try/catch and return `{ status: 'DOWN' }`—never throw.
- Use `asyncHandler` middleware to propagate unhandled failures to global error handler.
- Instrument timeouts with `AbortController` to avoid hanging probes (see the sketch after this list).
- Log unhealthy details at WARN level; emit no details in HTTP body when `NODE_ENV === 'production'`.
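For the timeout bullet above, a wrapper along these lines keeps a hanging dependency from stalling the probe. It is a sketch: `withTimeout` is an illustrative name, and it relies on `AbortSignal.timeout` (Node 17.3+).

```ts
// Sketch: wrap an indicator so it reports DOWN instead of hanging past a deadline.
import type { HealthIndicator } from './types';

export function withTimeout(indicator: HealthIndicator, ms = 1000): HealthIndicator {
  return {
    name: indicator.name,
    async check() {
      const signal = AbortSignal.timeout(ms); // fires 'abort' after ms milliseconds
      const timedOut = new Promise<{ status: 'DOWN'; details: unknown }>((resolve) => {
        signal.addEventListener('abort', () =>
          resolve({ status: 'DOWN', details: { error: 'TIMEOUT' } }),
        );
      });
      // Whichever settles first wins; a slow underlying check is simply ignored.
      return Promise.race([indicator.check(), timedOut]);
    },
  };
}
```

Indicators can then be registered as, for example, `withTimeout(dbIndicator, 100)` to enforce the 100 ms budget.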
Framework-Specific Rules
Express
- Use `express.Router()`; avoid app-level route pollution.
- Do not register body-parsing middleware on health routes; they are GET only.
Kubernetes
- Example probe spec:
```yaml
livenessProbe:
httpGet:
path: /health/liveness
port: 3000
initialDelaySeconds: 10
timeoutSeconds: 1
periodSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/readiness
port: 3000
initialDelaySeconds: 5
timeoutSeconds: 1
periodSeconds: 5
startupProbe:
httpGet:
path: /health/startup
port: 3000
failureThreshold: 30
periodSeconds: 5
```
GitHub Actions (CI)
- Insert a reusable job named `health-check` before testing & deployment:
```yaml
- name: Verify local health endpoint
run: |
docker compose up -d
npx wait-on http://localhost:3000/health -t 10000
```
Additional Sections
Testing
- Use Jest + Supertest: simulate an unhealthy dependency with test doubles and assert HTTP 500 (see the sketch below).
- Every indicator MUST have a failure-mode unit test.
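A failure-mode test per the bullets above might look like this sketch; the mock paths and `healthRouter` import assume the `src/health/` layout shown earlier.

```ts
// __tests__/readiness.test.ts - sketch of a failure-mode test
import request from 'supertest';
import express from 'express';

// Force the postgres indicator to report DOWN; module path assumes the layout above.
jest.mock('../indicators', () => ({
  dbIndicator: {
    name: 'postgres',
    check: jest.fn().mockResolvedValue({ status: 'DOWN', details: { error: 'DB_UNREACHABLE' } }),
  },
  redisIndicator: {
    name: 'redis',
    check: jest.fn().mockResolvedValue({ status: 'UP' }),
  },
}));

import { healthRouter } from '../router';

describe('GET /health/readiness', () => {
  it('returns 500 and marks postgres DOWN when the DB check fails', async () => {
    const app = express();
    app.use('/', healthRouter);

    const res = await request(app).get('/health/readiness');

    expect(res.status).toBe(500);
    expect(res.body.checks.postgres).toBe('DOWN');
  });
});
```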
Performance
- Run indicators in parallel via `Promise.allSettled` to keep total latency low.
- Warn if aggregate duration > 50 ms in production; include `X-Health-Time` header.
Security
- Protect `/health` with network-layer controls (e.g., only internal LB), not application auth.
- Strip stack traces; return generic reason codes ("DB_UNREACHABLE", "CACHE_TIMEOUT").
Monitoring
- Expose `/metrics` Prometheus endpoint alongside `/health`.
- Record a `health_status` gauge (1 for UP, 0 for DOWN) with a label per indicator (see the sketch below).
- Create Grafana dashboard panel visualizing readiness status over time.
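One way to record that gauge with prom-client (assuming prom-client already backs the `/metrics` endpoint; file and function names are illustrative):

```ts
// metrics.ts - sketch of the health_status gauge wired from indicator results
import { Gauge } from 'prom-client';

export const healthStatusGauge = new Gauge({
  name: 'health_status',
  help: '1 if the indicator is UP, 0 if DOWN',
  labelNames: ['indicator'] as const,
});

// Call after each readiness aggregation, e.g. from the router handler.
export function recordHealth(checks: Record<string, 'UP' | 'DOWN'>): void {
  for (const [name, status] of Object.entries(checks)) {
    healthStatusGauge.set({ indicator: name }, status === 'UP' ? 1 : 0);
  }
}
```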
Alerting & Remediation
- Configure PrometheusRule:
```yaml
- alert: ServiceDown
expr: health_status == 0
for: 1m
labels:
severity: critical
annotations:
summary: "{{ $labels.service }} is unhealthy"
```
- Use K8s deployment `maxUnavailable: 0` to ensure zero downtime during rollout.
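The corresponding Deployment strategy might look like this sketch: with `maxUnavailable: 0`, a new pod must pass its readiness probe before an old one is removed.

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take a pod down before its replacement is Ready
      maxSurge: 1
```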
Common Pitfalls & Anti-patterns
- DO NOT connect to third-party services with rate limits (e.g., payment gateways) in health checks.
- DO NOT execute migrations or heavy computations.
- DO NOT reuse main business logic code paths that load templates or large configs.