Comprehensive, production-ready rules for implementing and validating graceful shutdown in TypeScript/Node.js microservices running on Docker, Kubernetes, AWS ECS, or an Istio service mesh.
Tired of watching your metrics spike with 5xx errors during deployments? Your users shouldn't pay the price for your scaling events, rolling updates, or infrastructure maintenance. This comprehensive ruleset transforms unreliable service shutdowns into bulletproof graceful termination.
Every time your microservice receives a SIGTERM, whether from a Kubernetes rolling update, an ECS task replacement, or an autoscaling event, you face a critical moment. Without proper graceful shutdown, in-flight requests are dropped, clients see 5xx spikes and reset connections, and queue messages end up abandoned or processed twice.
These aren't edge cases; they're daily occurrences in production environments that directly impact user experience and system reliability.
This ruleset implements a deterministic, time-bounded shutdown process that never drops active work. Instead of letting the platform kill your process abruptly, you take control:
```ts
// Stop accepting new work immediately
server.close();

// Wait for active requests to complete
while (activeRequests > 0 && timeRemaining > 0) {
  await delay(100);
}

// Clean up resources within timeout bounds
await Promise.allSettled([
  database.close(),
  messageQueue.disconnect(),
  cache.shutdown()
]);
```
The system gracefully transitions from "accepting work" to "draining work" to "fully terminated" with complete observability at each stage.
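As a rough sketch (illustrative only; the implementation below uses a simple `isShuttingDown` flag rather than an explicit phase enum), those transitions can be modeled directly so that the readiness probe and request tracker read from a single source of truth:

```ts
// Illustrative sketch: the three phases described above as an explicit state.
type ShutdownPhase = 'accepting' | 'draining' | 'terminated';

let phase: ShutdownPhase = 'accepting';

export function beginDrain(): void {
  // Triggered by SIGTERM: stop taking new work, keep finishing existing work.
  if (phase === 'accepting') phase = 'draining';
}

export function markTerminated(): void {
  // All in-flight work and cleanup finished; safe to exit the process.
  phase = 'terminated';
}

// Readiness ("/readyz") and the request tracker both key off the current phase.
export const isAcceptingWork = (): boolean => phase === 'accepting';
```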
Before implementing these rules:
```bash
# Deployment causes 30-second error spike
kubectl apply -f deployment.yaml
# Watch errors in monitoring dashboard
# Manual intervention required to verify completion
```
After implementation:
```bash
# Smooth deployment with zero user impact
kubectl apply -f deployment.yaml
# Readiness probes automatically drain traffic
# Shutdown completes within 30-second window
# All metrics remain green
```
Your service handles 200 concurrent requests when SIGTERM arrives:
```ts
// Track active work with middleware
export const requestTracker = (req: Request, res: Response, next: NextFunction) => {
  if (isShuttingDown) {
    return res.status(503).json({ error: 'Service shutting down' });
  }
  activeRequests++;
  res.once('finish', () => activeRequests--);
  next();
};

// Drain logic waits for completion
export async function close(signal: NodeJS.Signals): Promise<void> {
  isShuttingDown = true;
  server.close(); // Stop accepting new connections

  // Wait for active requests, bounded by the shutdown timeout
  const deadline = Date.now() + SHUTDOWN_TIMEOUT;
  while (activeRequests > 0 && Date.now() < deadline) {
    await delay(100);
  }

  await database.close();
  logger.info(`Shutdown complete: ${activeRequests} requests still in flight at exit`);
}
```
With Istio sidecar coordination:
```yaml
# Kubernetes pod spec
spec:
  terminationGracePeriodSeconds: 40
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "wget -qO- http://localhost:3000/readyz && sleep 5"]
```
The preStop hook ensures your application starts draining before the Envoy sidecar, preventing connection resets.
Create src/shutdown.ts:
```ts
import { createServer, Server } from 'http';
import { once } from 'events';
import { Request, Response, NextFunction } from 'express';

// `app`, `logger`, `metrics`, `database`, `redis`, and `messageQueue` are the
// service's own instances, imported from elsewhere in the codebase.

let server: Server;
let isShuttingDown = false;
let activeRequests = 0;

const SHUTDOWN_TIMEOUT = parseInt(process.env.SHUTDOWN_TIMEOUT || '30000', 10);

export async function init(): Promise<Server> {
  server = createServer(app);
  server.keepAliveTimeout = 65000; // > load balancer idle timeout
  return server;
}

export async function close(signal: NodeJS.Signals): Promise<void> {
  if (isShuttingDown) return; // Idempotent
  isShuttingDown = true;
  logger.warn({ signal }, 'SHUTDOWN: Starting graceful shutdown');

  // Stop accepting new connections
  server.close();

  // Create timeout controller
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), SHUTDOWN_TIMEOUT);

  try {
    // Wait for active requests to complete, or for the deadline to pass
    await Promise.race([
      waitForDrain(controller.signal),
      once(controller.signal, 'abort')
    ]);

    // Clean up resources
    await Promise.allSettled([
      database.close(),
      redis.quit(),
      messageQueue.disconnect()
    ]);

    logger.info('SHUTDOWN: Graceful shutdown completed');
  } finally {
    clearTimeout(timeoutId);
  }
}

async function waitForDrain(signal: AbortSignal): Promise<void> {
  while (activeRequests > 0 && !signal.aborted) {
    await new Promise(resolve => setTimeout(resolve, 100));
  }
}

export const requestTracker = (req: Request, res: Response, next: NextFunction) => {
  if (isShuttingDown) {
    return res.status(503).json({
      error: 'Service unavailable - shutting down',
      retryAfter: 5
    });
  }
  activeRequests++;
  res.once('finish', () => {
    activeRequests--;
    metrics.gauge('shutdown.active_requests', activeRequests);
  });
  next();
};

// Mount first in middleware stack
app.use(requestTracker);

app.get('/healthz', (req, res) => {
  // Always healthy until process exits
  res.status(200).json({ status: 'healthy' });
});

app.get('/readyz', (req, res) => {
  if (isShuttingDown) {
    return res.status(503).json({
      status: 'not ready',
      reason: 'shutting down'
    });
  }
  res.status(200).json({ status: 'ready' });
});
```
In src/main.ts:
```ts
import { init, close } from './shutdown';

// `logger` is the service's own logger instance, imported from elsewhere.

async function main() {
  const server = await init();

  const signals: NodeJS.Signals[] = ['SIGTERM', 'SIGINT', 'SIGQUIT'];
  signals.forEach(signal => {
    process.once(signal, async () => {
      logger.warn({ signal }, 'Signal received - starting graceful shutdown');
      await close(signal);
      process.exit(0);
    });
  });

  const port = Number(process.env.PORT) || 3000;
  server.listen(port, () => {
    logger.info(`Server listening on port ${port}`);
  });
}

main().catch(err => {
  logger.error(err, 'Failed to start server');
  process.exit(1);
});
```
Docker:
```dockerfile
# Use proper signal handling
STOPSIGNAL SIGTERM

# Allow sufficient time for graceful shutdown:
# docker run with --stop-timeout=35
```
Kubernetes:
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 40
      containers:
        - name: app
          env:
            - name: SHUTDOWN_TIMEOUT
              value: "30000"
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
          readinessProbe:
            httpGet:
              path: /readyz
              port: 3000
            periodSeconds: 1
```
AWS ECS (task definition):
```json
{
  "containerDefinitions": [{
    "stopTimeout": 35,
    "environment": [
      { "name": "SHUTDOWN_TIMEOUT", "value": "30000" }
    ]
  }]
}
```
```ts
// test/shutdown.test.ts
describe('Graceful Shutdown', () => {
  it('completes active requests before shutdown', async () => {
    const server = await init();
    server.listen(3000);

    // Start a long-running request
    const requestPromise = fetch('http://localhost:3000/slow-endpoint');

    // Send SIGTERM after 100 ms
    setTimeout(() => process.kill(process.pid, 'SIGTERM'), 100);

    // The request should still complete successfully
    const response = await requestPromise;
    expect(response.status).toBe(200);
  });

  it('rejects new requests during shutdown', async () => {
    // Trigger shutdown
    process.kill(process.pid, 'SIGTERM');

    // Wait for shutdown to start
    await delay(50);

    // New requests should be rejected
    const response = await fetch('http://localhost:3000/test');
    expect(response.status).toBe(503);
  });
});
```
```bash
# Add to CI pipeline
npm run test:chaos-shutdown

# Kubernetes chaos testing
kubectl apply -f - <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-kill-chaos
spec:
  engineState: 'active'
  chaosServiceAccount: litmus
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
EOF
```
Track these key indicators to validate implementation success:
```ts
// Metrics to monitor
metrics.timer('shutdown.duration_ms');
metrics.gauge('shutdown.active_requests_at_start');
metrics.counter('shutdown.result', { status: 'success|timeout|error' });
metrics.histogram('shutdown.request_drain_time_ms');
```
Success criteria:
- Zero 5xx responses during rolling deployments and scale-in events.
- `shutdown.duration_ms` stays within the configured SHUTDOWN_TIMEOUT window.
- `shutdown.result` reports `success` (never `timeout` or `error`) for routine terminations.
This ruleset transforms unreliable shutdowns into a competitive advantage. Your deployments become invisible to users, your on-call load decreases, and your team ships features with confidence knowing the infrastructure won't drop user requests.
Stop treating graceful shutdown as an afterthought—make it a cornerstone of your service reliability strategy.
You are an expert in TypeScript, Node.js (Express/Koa/Fastify), Docker, Kubernetes, AWS ECS, and Istio.
Key Principles
- Never drop an in-flight request: stop accepting new work first, then wait for all active tasks to finish or time-out.
- All shutdown paths must be deterministic, idempotent, and bounded by a configurable deadline (default 30 s, override via env var SHUTDOWN_TIMEOUT).
- Use OS signals (SIGTERM/SIGINT) as the single source of truth; do NOT rely on process.on('exit').
- Prefer async/await with promise aggregation over callbacks to simplify drain logic.
- Expose real-time liveness/readiness endpoints that reflect shutting-down state immediately.
- Never block the event-loop; long-running cleanup must be awaited, not spin-waited (see the sketch after this list).
- Unit-test and chaos-test shutdown paths with the same rigor as happy-path logic.
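The snippets in this document call a `delay` helper without defining it. A minimal sketch under that assumption, together with a deadline-bounded drain loop (`drainWithDeadline` is illustrative, not part of the ruleset):

```ts
// Assumed helper: an awaited, non-blocking sleep (never a busy-wait).
export const delay = (ms: number): Promise<void> =>
  new Promise(resolve => setTimeout(resolve, ms));

// Deadline-bounded drain: polls a counter but always yields to the event loop
// and always returns once the configured deadline has passed.
export async function drainWithDeadline(
  getActive: () => number,
  timeoutMs: number
): Promise<'drained' | 'timeout'> {
  const deadline = Date.now() + timeoutMs;
  while (getActive() > 0) {
    if (Date.now() >= deadline) return 'timeout';
    await delay(100);
  }
  return 'drained';
}
```

Usage inside close() would look like `await drainWithDeadline(() => activeRequests, SHUTDOWN_TIMEOUT)`.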
TypeScript / Node.js Rules
- All services export a single async init() and close() pair:
```ts
export async function init(): Promise<http.Server> { /* start */ }
export async function close(signal: NodeJS.Signals): Promise<void> { /* drain */ }
```
- Register signal listeners in ./src/main.ts ONLY:
```ts
const server = await init();
const signals: NodeJS.Signals[] = ['SIGTERM','SIGINT','SIGQUIT'];
signals.forEach(sig =>
  process.once(sig, async () => {
    logger.warn({sig}, 'Signal received – starting graceful shutdown');
    await close(sig);
    process.exit(0);
  })
);
```
- Stop accepting new connections immediately:
`server.close();` for HTTP, `server.tryShutdown();` for gRPC (@grpc/grpc-js), `channel.close();` for AMQP (amqplib).
- Track in-flight requests with a RequestCounter middleware:
```ts
let active = 0;
export const track = (req: Request, res: Response, next: NextFunction) => {
  if (shuttingDown) return res.status(503).end(); // reject new work
  active++;
  res.once('finish', () => active--);
  next();
};
```
- Expose `/healthz` and `/readyz`:
• /readyz returns 503 once SIGTERM received.
• /healthz continues to return 200 until process exits.
- Use AbortController for timeout enforced drains:
```ts
const ac = new AbortController();
setTimeout(() => ac.abort(), timeoutMs);
await Promise.race([gracefulTasks(), once(ac.signal,'abort')]);
```
- Code style: 2-space indent, semicolons mandatory, `camelCase` for vars, `PascalCase` for types, `kebab-case` for file names.
Error Handling & Validation
- Perform defensive checks at top of close(): if already shutting down, return immediately.
- Wrap each cleanup task in try/catch; aggregate with Promise.allSettled and log individual failures.
- Surface non-fatal cleanup errors to observability stack but do not block exit beyond timeout.
- Emit metrics: `shutdown_in_flight_requests`, `shutdown_duration_ms`, `shutdown_result{status="success|timeout|error"}`.
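A sketch of how the last three rules combine; the `database`, `redis`, `messageQueue`, `logger`, and `metrics` handles are assumed to exist elsewhere in the service:

```ts
// Sketch: run every cleanup task, log each failure individually, and emit the
// result metric, without letting any single failed task block process exit.
async function cleanupResources(): Promise<void> {
  const tasks: Array<[string, () => Promise<unknown>]> = [
    ['database', () => database.close()],
    ['redis', () => redis.quit()],
    ['messageQueue', () => messageQueue.disconnect()],
  ];

  const results = await Promise.allSettled(tasks.map(([, run]) => run()));

  results.forEach((result, i) => {
    if (result.status === 'rejected') {
      // Non-fatal: surfaced to observability, but never blocks exit.
      logger.error({ task: tasks[i][0], err: result.reason }, 'SHUTDOWN: cleanup task failed');
    }
  });

  const failed = results.some(r => r.status === 'rejected');
  metrics.counter('shutdown_result', { status: failed ? 'error' : 'success' });
}
```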
Framework-Specific Rules
Express / Koa / Fastify
- Mount `track` middleware first.
- For Fastify set `options.forceCloseConnections: true` (see the Fastify sketch after this list).
- Always use `server.keepAliveTimeout = 65000` (> LB idle timeout) to avoid mid-flight drops.
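A minimal Fastify sketch of the two settings above, assuming the same `isShuttingDown` flag used throughout this document:

```ts
import Fastify from 'fastify';

// Per the rule above: forceCloseConnections ensures app.close() is not held
// open indefinitely by lingering keep-alive connections.
const app = Fastify({ forceCloseConnections: true });

// keepAliveTimeout on the underlying Node server must exceed the LB idle timeout.
app.server.keepAliveTimeout = 65000;

app.addHook('onRequest', async (request, reply) => {
  if (isShuttingDown) {
    // Reject new work during the drain; returning the reply stops the hook chain.
    return reply.code(503).send({ error: 'Service shutting down' });
  }
});

// In close(): `await app.close()` stops the listener and runs onClose hooks.
```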
Docker
- Use `STOPSIGNAL SIGTERM` in Dockerfile.
- Avoid `docker stop --time 0`; default 10 s is too short—set to 35 s.
Kubernetes
- Set `terminationGracePeriodSeconds: 40` (>= SHUTDOWN_TIMEOUT + safety 5 s).
- Add:
```yaml
lifecycle:
  preStop:
    exec: { command: ["/bin/sh","-c","wget -qO- http://localhost:3000/readyz && sleep 5"] }
```
- Enable `readinessProbe` that flips to failed on SIGTERM to drain endpoints.
AWS ECS / Fargate
- Listen for `ContainerInstanceDraining` event; publish to SNS if shutdown exceeds 90 s.
- Use the `stopTimeout` container-definition parameter (or the agent-level `ECS_CONTAINER_STOP_TIMEOUT` setting on EC2 container instances) to extend the default 30 s stop timeout.
Istio
- Inject sidecar with `terminationDrainDuration: 45s` matching pod grace period.
- Ensure application flips readiness before pilot-agent starts draining.
Additional Sections
Testing & Validation
- Integration test: `npm run test:shutdown` sends SIGTERM during load and asserts zero 5xx responses.
- Chaos test weekly with Kubernetes `pod-kill` fault using Litmus or Chaos-Mesh.
- Add GitHub Action that runs `docker kill --signal=SIGTERM $CONTAINER_ID` in CI.
Performance
- Keep SHUTDOWN_TIMEOUT < LB idle timeout to avoid 502 errors.
- Tune DB pool idle timeout to 2 × shutdown timeout to prevent premature pool disposal.
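A sketch of encoding both constraints in configuration, assuming node-postgres (`pg`) for the pool and the values from the Reference Timeout Matrix below:

```ts
import { Pool } from 'pg';

const SHUTDOWN_TIMEOUT = parseInt(process.env.SHUTDOWN_TIMEOUT || '30000', 10);
const LB_IDLE_TIMEOUT_MS = 60000; // LB idle timeout from the reference matrix

// Fail fast at startup if the ordering that prevents 502s is violated.
if (SHUTDOWN_TIMEOUT >= LB_IDLE_TIMEOUT_MS) {
  throw new Error('SHUTDOWN_TIMEOUT must stay below the load balancer idle timeout');
}

// Pool idle timeout at 2 × shutdown timeout so connections are not disposed
// while requests are still draining.
export const pool = new Pool({
  idleTimeoutMillis: SHUTDOWN_TIMEOUT * 2,
});
```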
Observability & Logging
- All shutdown logs prefixed with `SHUTDOWN:` and include correlation id.
- Emit OpenTelemetry span named `service.shutdown` with attributes: reason, start, end, status.
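A minimal sketch of that span with `@opentelemetry/api` (the attribute key is illustrative; only the span name is prescribed above):

```ts
import { trace, SpanStatusCode } from '@opentelemetry/api';

// Sketch: wrap the whole close() routine in one span so shutdown duration,
// trigger, and outcome show up in traces alongside the SHUTDOWN: logs.
export async function closeWithSpan(signal: NodeJS.Signals): Promise<void> {
  const span = trace.getTracer('service').startSpan('service.shutdown', {
    attributes: { 'shutdown.reason': signal },
  });
  try {
    await close(signal);
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (err) {
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw err;
  } finally {
    span.end(); // the span's start/end timestamps capture the duration
  }
}
```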
Security
- Do not leak secrets in shutdown logs.
- Ensure TLS listeners call `server.close()` before key material is freed.
Common Pitfalls
- FORGETTING to close message-queue consumers → duplicate processing (see the sketch after this list).
- Registering multiple SIGTERM handlers → process hangs.
- Neglecting to update readiness probe → load balancer continues sending traffic.
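For the first pitfall, a hedged sketch with amqplib (the consumer tag and channel handle are assumed to be tracked by the service):

```ts
import type { Channel } from 'amqplib';

// Sketch: stop consuming before closing so no message is delivered to a
// consumer that is about to disappear, which would otherwise be redelivered
// and processed twice elsewhere.
export async function closeConsumer(channel: Channel, consumerTag: string): Promise<void> {
  await channel.cancel(consumerTag); // no new deliveries after this point
  // ...wait for in-flight message handlers to ack/nack before continuing...
  await channel.close();
}
```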
Reference Timeout Matrix
- LB idle: 60 s
- Istio drain: 45 s
- Pod grace: 40 s
- App shutdown: 30 s (configurable)
- preStop hook sleep: 5 s (buffer)
Use this rule set verbatim in every new micro-service repository.