What are the three pillars of observability?

Metrics (numeric time-series like latency and error rate), logs (timestamped event records), and traces (the path of a request across services). Together they give breadth, detail and causality.

What does o11y mean?

o11y is a numeronym for observability: the letter o, a count of the 11 letters in between, then y. It's common shorthand in the SRE and platform-engineering community.
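The counting rule is easy to check mechanically. The sketch below is illustrative only; the function name `numeronym` and the choice to leave very short words unchanged are assumptions made for this example.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + number of middle letters + last letter."""
    if len(word) <= 3:
        # Assumption for this sketch: abbreviating such a short word
        # would not make it any shorter, so return it unchanged.
        return word
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("observability"))  # o11y
print(numeronym("kubernetes"))     # k8s
```

The same rule produces other familiar shorthands such as i18n for internationalization.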

What is observability? Metrics, logs, traces & the o11y stack

What is observability?

Observability is the ability to understand a system's internal state from the data it emits — metrics, logs and traces. Unlike traditional monitoring, which checks known failure modes, observability lets engineers ask new questions of a running system and debug problems they didn't predict.

Observability vs. monitoring

Traditional monitoring answers questions you already knew to ask — "is CPU over 90%?", "is the site up?". Observability goes further: it lets you ask new questions of a live system and debug failures nobody predicted, without shipping new code to investigate. Monitoring tells you that something broke; observability helps you understand why.
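To make the contrast concrete, here is a minimal sketch. It assumes each request is recorded as a "wide" event carrying several fields; the event data, field names and both function names are hypothetical. The first function is a monitoring-style check decided in advance, the second lets you slice by any field you pick at investigation time.

```python
from collections import defaultdict

# Hypothetical request events; every field name and value is made up.
EVENTS = [
    {"route": "/checkout", "region": "eu", "app_version": "2.4.1", "status": 500},
    {"route": "/checkout", "region": "eu", "app_version": "2.4.0", "status": 200},
    {"route": "/checkout", "region": "us", "app_version": "2.4.1", "status": 500},
    {"route": "/search", "region": "us", "app_version": "2.4.0", "status": 200},
    {"route": "/search", "region": "eu", "app_version": "2.4.0", "status": 200},
    {"route": "/checkout", "region": "us", "app_version": "2.4.0", "status": 200},
]


def error_rate_alert(events, threshold=0.25):
    """Monitoring: a question fixed in advance ("is the error rate too high?")."""
    errors = sum(1 for e in events if e["status"] >= 500)
    return errors / len(events) > threshold


def error_rate_by(events, field):
    """Observability-style query: group the error rate by any field, chosen ad hoc."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for e in events:
        totals[e[field]] += 1
        if e["status"] >= 500:
            errors[e[field]] += 1
    return {key: errors[key] / totals[key] for key in totals}


print(error_rate_alert(EVENTS))              # the alert says something broke
print(error_rate_by(EVENTS, "region"))       # errors are spread evenly by region
print(error_rate_by(EVENTS, "app_version"))  # errors concentrate in one version
```

In this toy data the alert fires, grouping by region shows nothing unusual, and grouping by app_version points at a single release. No new code had to be shipped to ask the second and third questions.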

The three pillars: metrics, logs & traces

Metrics — numeric time-series (latency, error rate, throughput, saturation). Cheap to store, great for dashboards and alerting.
Logs — timestamped records of discrete events. The detail you reach for once a metric tells you something is wrong.
Traces — the journey of a single request as it hops across services, so you can see where time and errors accumulate.
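As a rough illustration of how the three signals relate, the sketch below emits all of them for one simulated request using only the Python standard library. Everything here (the `Span` class, `handle_request`, the in-memory stores) is hypothetical and stands in for what a real instrumentation library would do.

```python
import json
import time
import uuid
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

REQUEST_COUNT = Counter()  # metric: request count per (route, status)
LOG_LINES = []             # logs: one JSON record per discrete event
SPANS = []                 # traces: finished spans, linked by trace_id


@dataclass
class Span:
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.monotonic)
    duration_s: float = 0.0

    def finish(self):
        # Record how long the span took and store it.
        self.duration_s = time.monotonic() - self.start
        SPANS.append(self)


def log(level, message, trace_id, **fields):
    # Including the trace_id lets a log line be joined to its trace later.
    record = {"ts": time.time(), "level": level, "msg": message,
              "trace_id": trace_id, **fields}
    LOG_LINES.append(json.dumps(record))


def handle_request(route):
    trace_id = uuid.uuid4().hex
    root = Span("handle_request", trace_id)
    child = Span("query_database", trace_id, parent_id=root.span_id)
    child.finish()  # the simulated downstream call does no real work
    status = 200
    log("info", "request served", trace_id, route=route, status=status)
    REQUEST_COUNT[(route, status)] += 1
    root.finish()
    return trace_id


trace_id = handle_request("/checkout")
print(REQUEST_COUNT)
print(LOG_LINES[-1])
print([(s.name, s.parent_id) for s in SPANS if s.trace_id == trace_id])
```

The shared trace_id is the design choice that matters: it is what lets you jump from a spike in a metric to the matching log lines and then to the spans that show where the time went.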

Increasingly, teams add profiling (which attributes resource usage down to the line of code) and events as further signals, and correlate them with the three pillars.

The modern observability stack

The open ecosystem most teams build on:

OpenTelemetry (OTel) — the vendor-neutral standard for generating and exporting telemetry. The connective tissue of the whole stack.
Prometheus — the de facto standard for collecting and storing metrics, queried with its own language, PromQL.
Grafana — visualisation and dashboards (plus Loki for logs, Tempo for traces).
ClickHouse — the columnar store increasingly powering high-volume telemetry backends.
eBPF — kernel-level instrumentation that needs no application code changes and typically adds little overhead.
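To give a feel for what flows between these tools, the sketch below hand-renders a counter in the plain-text format that Prometheus scrapes. It is illustrative only: the function names and the sample metric are assumptions, escaping of the help text is omitted, and in practice a client library or OpenTelemetry exporter would produce this output for you.

```python
def escape_label_value(value: str) -> str:
    # Label values escape backslash, double quote and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render a counter as text; samples is a list of (labels dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Sorting labels is a choice made here for stable output.
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        selector = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


print(render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "status": "200"}, 3),
     ({"method": "GET", "status": "500"}, 1)],
))
```

A Prometheus server would scrape text like this from an HTTP endpoint on a schedule, and a dashboard could then chart a PromQL expression such as rate(http_requests_total[5m]).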

Who works on observability?

It's the daily work of site reliability engineers (SREs) and platform engineers, and a core skill for backend and infrastructure teams. They own service level objectives (SLOs), instrumentation, incident response and the telemetry pipeline, and they keep systems fast, reliable and debuggable as they scale.