Comprehensive rules for designing, instrumenting, and operating highly-observable microservices with OpenTelemetry, Prometheus/Grafana, Last9 and other modern tooling.
Your microservices are running, users are (hopefully) happy, but when something breaks at 3 AM, you're debugging with kubectl logs and prayer. Sound familiar?
Here's what happens when observability is an afterthought:
The 3 AM Scenario: Your checkout service is throwing 500s. You're jumping between five different tools—Kubernetes dashboard, application logs, Prometheus, maybe some APM tool—trying to piece together what went wrong. By the time you trace the issue to a downstream payment API timeout, you've lost 20 minutes and probably some customers.
The Deployment Roulette: You push a new feature. Everything looks fine initially, but latency is creeping up. Without proper correlation between traces, metrics, and logs, you're playing detective instead of engineering.
The Alert Fatigue: Your Slack channel gets flooded with monitoring alerts, but half are false positives and the other half don't give you enough context to act quickly.
These rules transform your Go microservices into self-documenting, automatically observable systems that tell you exactly what's happening—before you need to ask.
Here's the difference:
Before: Manual instrumentation scattered throughout your codebase, inconsistent metric naming, traces that don't connect, and logs that don't correlate.
After: Every function automatically propagates context, every database query is traced, every error is properly recorded with structured attributes, and all your observability signals flow into a unified analytics plane.
The rules enforce OpenTelemetry-first development where observability isn't bolted on—it's built in from day one.
Instead of discovering issues through user complaints, you get intelligent alerts the moment SLO burn rates exceed thresholds. The rules configure SLO-driven alerting with error budgets, so you know about problems before your users do.
When everything breaks, you get distributed traces that show the exact request path, the specific database query that timed out, and correlated logs with structured context—all in one view. No more jumping between tools.
AIOps integration through Last9 correlates related alerts and applies intelligent noise reduction. Instead of 50 alerts for one cascading failure, you get one actionable alert with full context.
Automated synthetic probes and continuous SLO monitoring mean you know immediately if a deployment affects user experience. The rules include golden signal tracking (latency, traffic, errors, saturation) with automatic baseline detection.
Before These Rules:
# Your typical debugging session
kubectl logs -f checkout-service-abc123 | grep ERROR
# Switch to Grafana, try to correlate timestamps
# Check different Prometheus metrics manually
# Maybe spin up Jaeger and hope you can find the right trace
# Spend 30 minutes just gathering context
After Implementation:
// Every function starts its own span and propagates context
func ProcessPayment(ctx context.Context, paymentReq PaymentRequest) error {
    // Start a child span with semantic attributes; ctx now carries the new span
    // and should be passed to all downstream calls
    ctx, span := otel.Tracer("checkout-service").Start(ctx, "ProcessPayment")
    defer span.End()
    span.SetAttributes(
        attribute.String("payment.method", paymentReq.Method),
        attribute.Float64("payment.amount", paymentReq.Amount),
    )
    // Automatic correlation across all signals: the trace ID ties logs to this span
    logger.Info("processing payment",
        zap.String("trace_id", span.SpanContext().TraceID().String()),
        zap.String("payment_id", paymentReq.ID))
    return nil
}
Result: One trace ID gives you the complete request journey across all services, with correlated logs and metrics automatically linked.
Before: Reactive scaling based on CPU/memory when users are already experiencing slowdowns.
After: The rules include predictive scaling via Last9's autopilot using p99 latency and queue depth metrics. Your services scale before users notice performance degradation.
// Automatic golden signal metrics for every HTTP endpoint
var (
    httpDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_server_duration_seconds",
        Help:    "HTTP request duration in seconds",
        Buckets: prometheus.DefBuckets,
    }, []string{"method", "endpoint", "status"})
)
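To put the histogram to work, a small middleware can observe each request. This is an illustrative sketch, not part of the rules themselves: instrument, statusRecorder and the route string are hypothetical names, and the net/http, strconv and time imports are assumed:
// Hypothetical middleware recording the golden-signal histogram defined above.
func instrument(route string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        next(rec, r)
        httpDuration.WithLabelValues(r.Method, route, strconv.Itoa(rec.status)).
            Observe(time.Since(start).Seconds())
    }
}
// statusRecorder captures the response code for the "status" label.
type statusRecorder struct {
    http.ResponseWriter
    status int
}
func (s *statusRecorder) WriteHeader(code int) {
    s.status = code
    s.ResponseWriter.WriteHeader(code)
}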
Before: Generic error counters that don't help you understand business impact.
After: Structured error tracking with business context:
// Business-aware error metrics
func (s *CheckoutService) ProcessOrder(ctx context.Context, order Order) error {
if err := s.validateOrder(order); err != nil {
// Automatic error recording with business context
span := trace.SpanFromContext(ctx)
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
// Structured business metrics
s.metrics.ErrorTotal.WithLabelValues("validation", order.CustomerTier).Inc()
return fmt.Errorf("order validation failed: %w", err)
}
}
This gives you error rates broken down by customer tier, error type, and automatically correlated with traces.
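The s.metrics.ErrorTotal field above assumes a small metrics struct injected into the service rather than package-level globals (per the Go rules later in this document). A minimal sketch, with illustrative names:
// Metrics is injected into services instead of using package-level globals.
type Metrics struct {
    ErrorTotal *prometheus.CounterVec
}
func NewMetrics(reg prometheus.Registerer) *Metrics {
    return &Metrics{
        ErrorTotal: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
            Name: "checkout_errors_total",
            Help: "Errors by type and customer tier",
        }, []string{"type", "customer_tier"}),
    }
}
Keeping the labels to bounded sets (error type, customer tier) also respects the cardinality rules below.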
// internal/observability/telemetry.go
func InitTelemetry(serviceName, version string) (func(), error) {
    // Global tracer provider with OTLP export
    exp, err := otlptrace.New(context.Background(),
        otlptracegrpc.NewClient(
            otlptracegrpc.WithEndpoint("your-otel-collector:4317"),
            otlptracegrpc.WithInsecure(),
        ))
    if err != nil {
        return nil, fmt.Errorf("create OTLP exporter: %w", err)
    }
    tp := tracesdk.NewTracerProvider(
        tracesdk.WithBatcher(exp),
        tracesdk.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String(serviceName),
            semconv.ServiceVersionKey.String(version),
        )),
        tracesdk.WithSampler(tracesdk.ParentBased(
            tracesdk.TraceIDRatioBased(0.05))), // 5% sampling in prod
    )
    otel.SetTracerProvider(tp)
    // W3C TraceContext + Baggage propagation (see the OpenTelemetry rules below)
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{}, propagation.Baggage{}))
    return func() { _ = tp.Shutdown(context.Background()) }, nil
}
// HTTP server with automatic instrumentation
func NewServer() *http.Server {
    mux := http.NewServeMux()
    mux.HandleFunc("/checkout", handleCheckout)
    mux.HandleFunc("/healthz", handleHealth)
    // Note: the Prometheus rules below expect /metrics on a dedicated :9090 listener;
    // it is kept on the main mux here only to keep the example short.
    mux.Handle("/metrics", promhttp.Handler())
    // Automatic tracing for all HTTP endpoints
    instrumentedMux := otelhttp.NewHandler(mux, "checkout-service")
    return &http.Server{
        Addr:    ":8080",
        Handler: instrumentedMux,
    }
}
// Automatic database tracing (using github.com/XSAM/otelsql)
db, err := otelsql.Open("postgres", dsn,
    otelsql.WithAttributes(
        semconv.DBSystemPostgreSQL,
    ),
)
if err != nil {
    return err
}
// Also export database/sql connection-pool stats as metrics
if err := otelsql.RegisterDBStatsMetrics(db,
    otelsql.WithAttributes(semconv.DBSystemPostgreSQL)); err != nil {
    return err
}
// Finer-grained span options (per-row spans, ping spans, etc.) are available via
// otelsql options; check the library version you use for the exact names.
# kubernetes/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: checkout-service
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: checkout-service
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090
    - name: http
      port: 8080
      targetPort: 8080
# prometheus/rules.yaml
groups:
  - name: checkout-service
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over 5 minutes
        expr: |
          sum(rate(http_requests_total{job="checkout-service",status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="checkout-service"}[5m])) > 0.01
        for: 2m
        annotations:
          runbook_url: "https://runbooks.company.com/checkout-high-errors"
          summary: "Checkout service error rate above 1%"
These rules don't just add monitoring—they fundamentally change how you develop microservices. When observability is built into every function, every database query, and every external call, you're not debugging systems anymore. You're optimizing well-understood, fully-instrumented applications.
The unified analytics approach means correlation happens automatically. When a trace shows high latency, you immediately see the correlated logs and metrics without manual detective work.
Most importantly, the AIOps integration learns your system's patterns and reduces alert noise to only actionable items. No more 3 AM pages for temporary blips that self-resolve.
Your microservices become self-documenting, self-monitoring systems that tell you exactly what they're doing and why—before you need to ask.
You are an expert in cloud-native monitoring, observability and SRE for Go-based microservices running on Kubernetes, using OpenTelemetry, Prometheus, Grafana, Last9, Jaeger and Datadog.
Key Principles
- Treat observability as a first-class, testable requirement; write code that is traceable, measurable and debuggable by default.
- Prefer open, vendor-neutral standards (OpenTelemetry) to avoid lock-in.
- Centralize all signals (metrics, logs, traces) into a single analytics plane (e.g., Last9, Datadog) for unified correlation.
- Fail fast, alert early: design SLO-driven alerting with error budgets and actionable runbooks (see the burn-rate alert sketch after this list).
- Automate everything: health checks, dashboards, alert routing, incident response, post-mortems.
- Continuously review instrumentation coverage and adapt dashboards/SLOs as services evolve.
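To make the SLO-driven alerting principle concrete, here is a simplified single-window sketch of a fast-burn alert for a 99.9% availability SLO. The metric name matches the examples above, while the SLO target, window and 14.4x burn-rate threshold are assumptions to tune per service:
# prometheus/slo-rules.yaml (illustrative; follows the common 14.4x fast-burn pattern)
groups:
  - name: checkout-service-slo
    rules:
      - alert: CheckoutSLOFastBurn
        # Error budget for a 99.9% SLO is being consumed 14.4x too fast over 1h
        expr: |
          (
            sum(rate(http_requests_total{job="checkout-service",status=~"5.."}[1h]))
              /
            sum(rate(http_requests_total{job="checkout-service"}[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
        annotations:
          runbook_url: "https://runbooks.company.com/checkout-slo-burn"
          summary: "Checkout service is burning its error budget 14.4x too fast"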
Go (Instrumentation & Coding Rules)
- Always accept and propagate context.Context as the first argument (ctx) in every public function/method.
- Retrieve the current span via trace.SpanFromContext(ctx) (or start child spans with otel.Tracer(name).Start(ctx, ...)) and add semantic attributes (http.method, db.system, messaging.system, etc.).
- Use structured, leveled logging via zap or zerolog; encode logs as JSON with RFC3339 timestamps (see the zap sketch after this list).
- Wrap errors with %w and record them with span.RecordError(err); set span.SetStatus(codes.Error, err.Error()).
- Avoid global variables for metrics; instead create a metrics struct injected via dependency injection to ease testing.
- Export Prometheus metrics using promauto and name them with <subsystem>_<operation>_<unit> (e.g., cart_checkout_duration_seconds).
- Guard metric cardinality: never use unbounded label values (e.g., user_id) in a time-series.
- Instrument HTTP clients/servers with otelhttp; gRPC with otelgrpc; database/sql with otelsql; messaging with otelcontrib instrumentation.
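A minimal zap configuration that satisfies the JSON + RFC3339 logging rule might look like this sketch (the service field name is illustrative):
// NewLogger builds a JSON, leveled zap logger with RFC3339 timestamps.
func NewLogger(serviceName string) (*zap.Logger, error) {
    cfg := zap.NewProductionConfig() // JSON encoder, info level by default
    cfg.EncoderConfig.EncodeTime = zapcore.RFC3339TimeEncoder
    logger, err := cfg.Build()
    if err != nil {
        return nil, err
    }
    // Attach static service metadata to every log line
    return logger.With(zap.String("service", serviceName)), nil
}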
Error Handling and Validation
- Validate input up-front; return rich error types that implement Is/As for comparability.
- For transient failures, wrap with otel attribute retry=true and use backoff libraries (e.g., go-retryablehttp).
- Expose /healthz (liveness) and /readyz (readiness); include checks for DB, cache, downstream APIs (a readiness sketch follows this list).
- Surface business errors via metrics (counter error_total{type="validation"}).
- Treat panic as fatal: recover in main only, log stack trace, emit Last9 panic metric, then terminate to let orchestrator restart.
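As an illustration of the readiness rule, a /readyz handler that checks the database could be sketched like this; the *sql.DB handle and the 2-second timeout are assumptions:
// handleReadyz reports ready only when critical dependencies respond.
func handleReadyz(db *sql.DB) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
        defer cancel()
        if err := db.PingContext(ctx); err != nil {
            http.Error(w, "database not reachable", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    }
}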
OpenTelemetry (Framework-Specific Rules)
- Always initialise one global TracerProvider with BatchSpanProcessor + OTLP exporter.
- Configure resource attributes: service.name, service.version, deployment.environment.
- Export traces to Jaeger OTLP or Datadog; metrics to Prometheus; logs to stdout JSON → fluentbit → Last9.
- Sample: use ParentBased(TraceIDRatioBased(0.05)) for production; 1.0 in staging (an environment-driven sampler sketch follows this list).
- Propagation: use W3C TraceContext + Baggage via otel.GetTextMapPropagator().
- Auto-instrument Kubernetes jobs via OpenTelemetry Operator CRD where possible.
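One way to implement the per-environment sampling rule, sketched under the assumption that the environment name comes from deployment config:
// samplerFor picks a trace sampler based on the deployment environment.
func samplerFor(environment string) tracesdk.Sampler {
    if environment == "production" {
        // Keep 5% of new traces, but always follow the parent's sampling decision
        return tracesdk.ParentBased(tracesdk.TraceIDRatioBased(0.05))
    }
    // Staging and local development: record everything
    return tracesdk.AlwaysSample()
}
Wire it in with tracesdk.WithSampler(samplerFor(env)) when building the TracerProvider.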
Prometheus & Grafana
- Use a dedicated /metrics endpoint served on :9090; protect with network policy or mTLS.
- Record rules: 95th percentile latency per endpoint => histogram_quantile(0.95, sum by (le, endpoint) (rate(http_server_duration_seconds_bucket[5m]))) (see the recording-rule sketch after this list).
- Alert rules: rate(http_requests_total{status=~"5.."}[1m]) > 0 triggers page; annotate with runbook_url.
- Dashboards: one service overview (RED metrics), one dependency map, one SLO burn-down.
- Keep panel count < 20 per dashboard; use consistent units & legend format.
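That latency expression can be precomputed as a recording rule; a sketch with an illustrative rule name following the level:metric:operation convention:
# prometheus/recording-rules.yaml (rule name is illustrative)
groups:
  - name: checkout-service-latency
    rules:
      - record: endpoint:http_server_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (le, endpoint) (rate(http_server_duration_seconds_bucket[5m])))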
Last9 & AIOps
- Pipe Prometheus remote-write to Last9 for intelligent alert correlation (a remote_write sketch follows this list).
- Enable Last9 control-center to auto-generate SLO suggestions based on historic data.
- Configure AIOps noise reduction: min-time-between-alerts = 10m, merge_duplicates = true.
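On the Prometheus side, the remote-write pipe is a few lines of standard config; the Last9 endpoint URL and credentials below are placeholders, not real values:
# prometheus/prometheus.yml (endpoint and credentials are placeholders)
remote_write:
  - url: "https://<your-last9-endpoint>/api/v1/write"
    basic_auth:
      username: "<last9-username>"
      password_file: /etc/prometheus/last9-password
    queue_config:
      max_samples_per_send: 5000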
Testing
- Unit tests: assert span attributes with the SDK's in-memory span recorder (go.opentelemetry.io/otel/sdk/trace/tracetest); a sketch follows this list.
- Integration tests: spin up docker-compose (postgres, jaeger, prometheus) and assert metrics existence via promql test harness.
- Synthetic probes: implement /probe endpoints and schedule with Blackbox Exporter; treat failures as P0.
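A minimal sketch of such a unit test, assuming the ProcessPayment function shown earlier and the SDK's tracetest span recorder:
func TestProcessPaymentSpanAttributes(t *testing.T) {
    // Record spans in memory instead of exporting them
    recorder := tracetest.NewSpanRecorder()
    tp := tracesdk.NewTracerProvider(tracesdk.WithSpanProcessor(recorder))
    otel.SetTracerProvider(tp)

    err := ProcessPayment(context.Background(), PaymentRequest{ID: "p-1", Method: "card", Amount: 42})
    if err != nil {
        t.Fatalf("ProcessPayment: %v", err)
    }

    spans := recorder.Ended()
    if len(spans) == 0 {
        t.Fatal("expected at least one span")
    }
    // Assert the semantic attribute set by ProcessPayment
    found := false
    for _, attr := range spans[0].Attributes() {
        if attr.Key == "payment.method" && attr.Value.AsString() == "card" {
            found = true
        }
    }
    if !found {
        t.Error("payment.method attribute missing")
    }
}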
Performance & Capacity
- Record Go runtime metrics (go_gc_duration_seconds, go_memstats_alloc_bytes) and collect eBPF CPU flamegraphs for top pods (a collector sketch follows this list).
- Enable predictive scaling via Last9 autopilot using p99 latency + queue depth.
- Cap label cardinality: max 10 unique values per label per pod.
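The runtime metrics named above are emitted by client_golang's Go collector; registering it on a dedicated registry looks roughly like this (newRegistry is an illustrative name):
// newRegistry returns a Prometheus registry exposing Go runtime and process metrics.
func newRegistry() *prometheus.Registry {
    reg := prometheus.NewRegistry()
    // go_gc_duration_seconds, go_memstats_alloc_bytes, go_goroutines, ...
    reg.MustRegister(collectors.NewGoCollector())
    // process_cpu_seconds_total, process_resident_memory_bytes, ...
    reg.MustRegister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))
    return reg
}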
Security
- Scrub PII from logs before export; use otel attribute data.sensitivity="low|high".
- Encrypt OTLP traffic with mTLS using cert-manager on Kubernetes.
- Restrict Grafana read/write via RBAC; enforce SSO.
Common Pitfalls
- Forgetting to close spans on error paths → always defer span.End().
- Missing context propagation in goroutines → use context.WithCancel/WithTimeout and pass ctx (see the goroutine sketch after this list).
- Alert fatigue: keep < 10 paging alerts per service; everything else → ticket.
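A small sketch of the goroutine rule, assuming a hypothetical background call; the function name and the 500ms timeout are illustrative:
// fetchRecommendations runs background work without losing the caller's context.
func fetchRecommendations(ctx context.Context, userSegment string) {
    ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
    go func() {
        defer cancel()
        // The span in ctx propagates, so this work appears under the same trace
        _, span := otel.Tracer("checkout-service").Start(ctx, "fetchRecommendations")
        defer span.End()
        // ... call the recommendations service with ctx ...
    }()
}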
Directory Structure Example
- /cmd/service/main.go ─ entrypoint, tracer setup, HTTP server
- /internal/metrics/ ─ metric instruments & registration
- /internal/observability/ ─ otel setup, exporters, propagators, logging
- /internal/handlers/ ─ HTTP handlers, instrumented
- /deployment/k8s/ ─ Helm charts (Prometheus annotations, OTEL collector sidecar)