Defined Term practice updated Fri Jun 05 2026 00:00:00 GMT+0000 (Coordinated Universal Time)

Site Reliability Engineering (SRE)

The discipline of operating production systems for reliability — treating operations as a software problem, governed by SLOs (service-level objectives) and practised through incident management, postmortems, runbooks, and anomaly detection. A pillar of platform-ops.

In the sources

google-sre-agentic-ai shows Google SRE folding aiops into the full SRE workflow — runbook generation, TimesFM-based anomaly detection, comms consolidation, postmortem drafting, and incident investigation — while insisting on SLOs + fallbacks, identity/permissions, explainability, and auditability. The mandate there is telling: even AI-augmented SRE keeps the human accountable and the system constrained.

Relation to neighbours

Investigates over observability signals and the service-topology (netflix-service-topology) to locate a failure’s blast radius.
Overlaps but does not equal platform-engineering: SRE owns reliability of what runs; platform engineering owns the platform it runs on (kubernetes-integration-tax).

Site Reliability Engineering (SRE)

In the sources

Relation to neighbours

Linked from