Defined Term · practice · updated 2026-06-05 (UTC)

Observability

The ability to understand a system’s internal state from its external outputs: classically metrics, traces, and logs, extended here to the service-topology (the live dependency graph). A pillar of platform-ops, and the substrate that site-reliability-engineering investigates.

A recurring design lesson from the sources

No single telemetry source is complete; fuse several. netflix-service-topology makes this explicit, merging three sources because each has a blind spot:

Source | Strength | Blind spot
eBPF network flow logs | broad, kernel-level coverage | no application context
IPC metrics (instrumented) | rich app-level endpoints | misses uninstrumented services
Distributed traces | real request paths | sampled, so incomplete

The merged graph is more complete than any single source. This is a concrete instance of the platform-ops through-line that the value lies in integrating signals, not merely collecting them (cf. the kubernetes-integration-tax, where Cilium metrics are invisible to Prometheus without the right glue).
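
The fusion idea can be illustrated with a minimal Python sketch. It is not the method described in netflix-service-topology; it only shows the general shape of unioning edges from several sources while keeping provenance. The function name merge_topology, the service names, the endpoints, and the (caller, callee, endpoint) tuple shape are all hypothetical.

```python
from collections import defaultdict


def merge_topology(sources):
    """Union dependency edges from several telemetry sources.

    `sources` maps a source name to a list of (caller, callee, endpoint)
    tuples; endpoint is None when the source has no application context.
    Returns {(caller, callee): {"sources": set, "endpoints": set}}.
    """
    merged = defaultdict(lambda: {"sources": set(), "endpoints": set()})
    for source_name, edges in sources.items():
        for caller, callee, endpoint in edges:
            entry = merged[(caller, callee)]
            entry["sources"].add(source_name)  # provenance: who saw this edge
            if endpoint is not None:
                entry["endpoints"].add(endpoint)  # app-level context, if any
    return dict(merged)


# Hypothetical sample data mirroring the three blind spots in the table.
sources = {
    # Sees every connection, but carries no endpoint information.
    "flow_logs": [
        ("checkout", "payments", None),
        ("checkout", "legacy-db", None),
        ("payments", "ledger", None),
    ],
    # Only the instrumented service reports, with endpoint detail.
    "ipc_metrics": [("checkout", "payments", "/charge")],
    # Sampled requests: real paths, but not all of them.
    "traces": [
        ("checkout", "payments", "/charge"),
        ("payments", "ledger", "/post"),
    ],
}

merged = merge_topology(sources)
for edge, info in sorted(merged.items()):
    print(edge, sorted(info["sources"]), sorted(info["endpoints"]))
```

In this toy data the checkout to legacy-db edge is visible only through flow logs and has no endpoint context, while the other two edges gain endpoint detail from IPC metrics or traces. Keeping the set of contributing sources per edge is one simple way to show how much corroboration each edge has.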

Used by

site-reliability-engineering reads these signals to locate failures; in particular, aiops agents (google-sre-agentic-ai) reason over observability signals and topology to investigate incidents.