Platform Ops (umbrella)
The practice of running and operating cloud-native distributed systems in production — the work that begins after software is written and deployed. This wiki’s umbrella concept, spanning three intertwined sub-disciplines that the founding sources each illuminate:
- site-reliability-engineering — the reliability/operations discipline and its incident-response practice (google-sre-agentic-ai).
- platform-engineering — building and integrating the internal platform that application teams run on (kubernetes-integration-tax).
- observability — knowing what the system is actually doing, including its service-topology (netflix-service-topology).
Cutting across all three: aiops — AI/agents applied to operations.
The through-line
The founding sources converge on one theme: in production, the hard problem is the seams, not the components. Individual tools, services, and telemetry sources each work in isolation; the operational cost — and the value of platform-ops work — lies in making them cohere into a system you can understand and trust under failure. See synthesis.