Observability Revamp

Implemented end-to-end tracing, SLOs, and error budgets to align engineering with business reliability goals.

OpenTelemetry, Grafana, Prometheus, SLOs

Problem

Multiple tools, no unified view; incidents required ad-hoc digging across logs and metrics.

Standardized instrumentation across services, propagated trace context between them, defined SLOs per service, and set error budgets against those SLOs.
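As an illustration of how a per-service SLO turns into an error budget, here is a minimal sketch. It assumes a request-based availability SLO; the `SLO` class, the `checkout` service name, and the 99.9% target are hypothetical and not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A request-based availability objective for one service (hypothetical shape)."""
    service: str   # illustrative service name
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def error_budget(slo: SLO, total_requests: int) -> float:
    """Requests allowed to fail in the window without breaching the SLO."""
    if not 0.0 < slo.target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (1.0 - slo.target) * total_requests


def budget_remaining(slo: SLO, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is breached."""
    return 1.0 - failed_requests / error_budget(slo, total_requests)


if __name__ == "__main__":
    slo = SLO(service="checkout", target=0.999)
    remaining = budget_remaining(slo, total_requests=1_000_000, failed_requests=250)
    print(f"{slo.service}: {remaining:.0%} of error budget remaining")
```

With a budget expressed this way, a team can agree in advance on what happens when the remaining fraction gets low, for example pausing risky releases until it recovers.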

OpenTelemetry SDKs for instrumentation, Tempo for traces, Loki for logs, and Prometheus for metrics; dashboards and alerts mapped to customer journeys.
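The link between traces and logs rests on a shared trace ID carried from service to service. The sketch below shows the idea using only the standard library and the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`). In practice the OpenTelemetry SDK propagators handle this; the helper names and the JSON log shape here are assumptions for illustration, not the project's code.

```python
import json
import re
import secrets

# W3C Trace Context header: 2-hex version, 32-hex trace id, 16-hex parent id, 2-hex flags.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str):
    """Return the header fields as a dict, or None if the header is not valid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version "ff" and all-zero trace or parent ids.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {"version": version, "trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace when no valid context arrived with the request."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Header to send downstream: same trace id, fresh span id as the new parent."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return new_traceparent()
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{ctx['flags']}"


def log_line(message: str, traceparent: str) -> str:
    """Emit a JSON log line stamped with the trace id (hypothetical log schema)."""
    ctx = parse_traceparent(traceparent) or {}
    return json.dumps({"msg": message, "trace_id": ctx.get("trace_id")})


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(child_traceparent(incoming))
    print(log_line("payment authorized", incoming))
```

Because every log line carries the same `trace_id` as the spans, a log entry found during an incident can be followed straight to the trace it belongs to, and the reverse.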

Trade-off: up-front instrumentation effort and some runtime overhead, in exchange for substantial gains in MTTR and incident prevention.

MTTR down 54%, proactive detection up; culture shift to reliability as a product.