Defined Term practice updated 2026-06-05 (UTC)

Observability

The ability to understand a system’s internal state from its external outputs: classically metrics, traces, and logs, extended here to the service-topology (the live dependency graph). A pillar of platform-ops and the substrate that site-reliability-engineering investigates.

A recurring design lesson from the sources

No single telemetry source is complete; fuse several. netflix-service-topology makes this explicit, merging three sources because each has a blind spot:

Source	Strength	Blind spot
eBPF network flow logs	broad, kernel-level coverage	no application context
IPC metrics (instrumented)	rich app-level endpoints	misses uninstrumented services
Distributed traces	real request paths	sampled → incomplete

The merged graph is more complete than any single source, because each source fills in edges or context that the others miss. This is a concrete instance of the platform-ops through-line that the value is in integrating signals, not collecting them (cf. the kubernetes-integration-tax, where Cilium metrics are invisible to Prometheus without the right glue).
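
The fusion idea in the table above can be illustrated with a minimal sketch. It is not the implementation described in netflix-service-topology; the source keys (`ebpf_flows`, `ipc_metrics`, `traces`), the service names, and both function names are hypothetical, and each source is assumed to have already been reduced to (caller, callee) edges. The sketch unions the edges and records which sources observed each one, so that blind spots become visible.

```python
from collections import defaultdict


def merge_topology(sources):
    """Union (caller, callee) edges from several sources, keeping provenance.

    `sources` maps a hypothetical source name to a list of edges.
    Returns a dict: edge -> set of source names that observed it.
    """
    merged = defaultdict(set)
    for name, edges in sources.items():
        for caller, callee in edges:
            merged[(caller, callee)].add(name)
    return dict(merged)


def only_seen_by(merged, source):
    """Edges that exactly one given source observed (everyone else's blind spot)."""
    return sorted(edge for edge, seen in merged.items() if seen == {source})


# Illustrative data only: service names are made up.
sources = {
    # Kernel-level flows see every connection, including uninstrumented services.
    "ebpf_flows": [("web", "api"), ("api", "db"), ("api", "legacy")],
    # Instrumented IPC metrics miss the uninstrumented "legacy" service.
    "ipc_metrics": [("web", "api"), ("api", "db")],
    # Sampled traces happened to capture only one request path.
    "traces": [("web", "api")],
}

merged = merge_topology(sources)
```

Here `only_seen_by(merged, "ebpf_flows")` returns the `("api", "legacy")` edge, which neither the instrumented metrics nor the sampled traces observed. Keeping provenance per edge, rather than a bare union, is a design choice in this sketch: it lets a consumer weigh an edge by how many independent sources confirm it.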

Used by

site-reliability-engineering reads these signals to locate failures; in particular aiops agents (google-sre-agentic-ai) reason over observability + topology to investigate incidents.
