
What is observability?

Observability is the ability to understand what's happening inside a system from the outside — using the data it emits. Often shortened to o11y (o + 11 letters + y), it's how modern engineering teams keep complex, distributed software reliable.

Observability vs. monitoring

Traditional monitoring answers questions you already knew to ask — "is CPU over 90%?", "is the site up?". Observability goes further: it lets you ask new questions of a live system and debug failures nobody predicted, without shipping new code to investigate. Monitoring tells you that something broke; observability helps you understand why.
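To make the distinction concrete, here is a minimal sketch in Python. It assumes each request emits one structured event. The field names (route, region, app_version) and the 25% threshold are hypothetical, chosen only for illustration. The first function is a monitoring-style check decided in advance. The second slices the same events by whatever field you pick while debugging.

```python
from collections import Counter

# Hypothetical structured events, one per request (illustrative field names).
EVENTS = [
    {"route": "/checkout", "status": 500, "region": "eu-west", "app_version": "2.4.1"},
    {"route": "/checkout", "status": 200, "region": "us-east", "app_version": "2.4.0"},
    {"route": "/search", "status": 200, "region": "eu-west", "app_version": "2.4.1"},
    {"route": "/checkout", "status": 500, "region": "us-east", "app_version": "2.4.1"},
    {"route": "/search", "status": 200, "region": "us-east", "app_version": "2.4.0"},
    {"route": "/checkout", "status": 200, "region": "eu-west", "app_version": "2.4.0"},
]


def error_rate_alert(events, threshold=0.25):
    """Monitoring: a question decided in advance ("is the error rate too high?")."""
    errors = sum(1 for e in events if e["status"] >= 500)
    return errors / len(events) > threshold


def errors_by(events, field):
    """Observability: group the failing events by any field, chosen at debug time."""
    return Counter(e[field] for e in events if e["status"] >= 500)
```

With this sample data the alert fires, which tells you that something broke. Calling errors_by(EVENTS, "region") splits the failures evenly across regions, while errors_by(EVENTS, "app_version") puts both on a single version. That second query is a question nobody had to anticipate when the alert was written.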

The three pillars: metrics, logs & traces

  • Metrics — numeric time-series (latency, error rate, throughput, saturation). Cheap to store, great for dashboards and alerting.
  • Logs — timestamped records of discrete events. The detail you reach for once a metric tells you something is wrong.
  • Traces — the journey of a single request as it hops across services, so you can see where time and errors accumulate.

Increasingly these are joined by profiling (down to the line of code) and events, all correlated together.
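As a rough illustration of how the three signals relate, the sketch below records all of them for a single request using only the Python standard library. The in-memory stores and the handle_request function are hypothetical stand-ins. A real service would normally use an instrumentation library such as OpenTelemetry and export the data rather than keep it in lists.

```python
import json
import time
import uuid
from collections import defaultdict

REQUEST_COUNT = defaultdict(int)  # metric: counter keyed by (route, status)
LOG_LINES = []                    # logs: one JSON record per discrete event
SPANS = []                        # traces: timed spans that share a trace_id


def handle_request(route):
    """Handle one (simulated) request and emit a metric, a log line and a span."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()

    status = 200  # placeholder for the real request handling

    duration_ms = (time.perf_counter() - start) * 1000.0

    # Metric: cheap aggregate, suitable for dashboards and alerting.
    REQUEST_COUNT[(route, status)] += 1

    # Log: detailed record of this one event, tagged with the trace_id.
    LOG_LINES.append(json.dumps({
        "ts": time.time(),
        "level": "info",
        "msg": "request handled",
        "route": route,
        "status": status,
        "trace_id": trace_id,
    }))

    # Trace: a span describing where the time went for this request.
    SPANS.append({
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "name": f"GET {route}",
        "duration_ms": duration_ms,
    })
    return trace_id
```

The shared trace_id is what makes correlation possible: starting from a spike in the counter, you can find the matching log lines and then the spans for the same request.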

The modern observability stack

The open ecosystem most teams build on:

  • OpenTelemetry (OTel): the vendor-neutral standard for generating, collecting and exporting telemetry. The connective tissue of the whole stack.
  • Prometheus: the de facto open-source metrics system, combining a time-series database with its own query language (PromQL).
  • Grafana: visualisation and dashboards, often paired with Grafana Labs' Loki for logs and Tempo for traces.
  • ClickHouse: the columnar database increasingly powering high-volume telemetry backends.
  • eBPF: kernel-level instrumentation that needs no application code changes and typically adds little overhead.

Who works on observability?

It's the daily work of Site Reliability Engineers (SRE) and platform engineers, and a core skill for backend and infrastructure teams. They own SLOs, instrumentation, incident response and the telemetry pipeline — keeping systems fast, reliable and debuggable as they scale.

About Observability Jobs

Observability Jobs is a curated board for exactly this niche — currently 394+ open roles across OpenTelemetry, Prometheus, Grafana, SRE, platform and reliability engineering. We pull listings straight from companies' public ATS feeds, filter out the noise, and refresh every few hours. No recruiters, no reposts — just the teams actually building and running the monitoring stack.