Observability
The ability to understand a system’s internal state from its external outputs: classically metrics, traces, and logs, extended here to the service-topology (the live dependency graph). A pillar of platform-ops and the substrate that site-reliability-engineering investigations run on.
A recurring design lesson from the sources
No single telemetry source is complete; fuse several. netflix-service-topology makes this explicit, merging three sources because each has a blind spot:
| Source | Strength | Blind spot |
|---|---|---|
| eBPF network flow logs | broad, kernel-level coverage | no application context |
| IPC metrics (instrumented) | rich app-level endpoints | misses uninstrumented services |
| Distributed traces | real request paths | sampled → incomplete |
The merged graph beats any single source. This is a concrete instance of the platform-ops through-line: the value lies in integrating signals, not merely collecting them (cf. the kubernetes-integration-tax, where Cilium metrics are invisible to Prometheus without the right glue).
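To make the fusion idea concrete, here is a minimal sketch of merging edge lists from the three source types into one dependency graph while recording which source observed each edge. It is an illustration only, not the method described in netflix-service-topology: the function name `merge_topology`, the source labels, and the service names are all hypothetical, and each source is assumed to have already been reduced to `(caller, callee)` pairs.

```python
from collections import defaultdict


def merge_topology(sources):
    """Union edges from several telemetry sources.

    `sources` maps a source label to a list of (caller, callee) pairs.
    Returns a dict mapping each edge to the set of labels that saw it.
    """
    merged = defaultdict(set)
    for source_name, edges in sources.items():
        for caller, callee in edges:
            merged[(caller, callee)].add(source_name)
    return dict(merged)


# Hypothetical sample data: each source sees a different subset of edges.
flow_logs = [("web", "api"), ("api", "db"), ("api", "legacy-batch")]  # broad, no app context
ipc_metrics = [("web", "api"), ("api", "db")]  # misses the uninstrumented service
traces = [("web", "api")]  # sampled, so incomplete

graph = merge_topology({"flow": flow_logs, "ipc": ipc_metrics, "trace": traces})

for edge, seen_by in sorted(graph.items()):
    print(edge, sorted(seen_by))
```

In this toy data the edge to `legacy-batch` appears only in the flow logs, so a graph built from IPC metrics or traces alone would miss it. Keeping the provenance set per edge also lets a consumer judge how well corroborated each dependency is.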
Used by
site-reliability-engineering reads these signals to locate failures; in particular, aiops agents (google-sre-agentic-ai) reason over observability signals and topology together to investigate incidents.