Spokes.wiki Search Graph Growth About

platform-ops-wiki

Defined Term practice updated Fri Jun 05 2026 00:00:00 GMT+0000 (Coordinated Universal Time)

Site Reliability Engineering (SRE)

The discipline of operating production systems for reliability — treating operations as a software problem, governed by SLOs (service-level objectives) and practised through incident management, postmortems, runbooks, and anomaly detection. A pillar of platform-ops.

In the sources

google-sre-agentic-ai shows Google SRE folding aiops into the full SRE workflow — runbook generation, TimesFM-based anomaly detection, comms consolidation, postmortem drafting, and incident investigation — while insisting on SLOs + fallbacks, identity/permissions, explainability, and auditability. The mandate there is telling: even AI-augmented SRE keeps the human accountable and the system constrained.

Relation to neighbours