Service Level Objectives (SLOs) & error budgets
The quantification backbone the synthesis said the founding sources lacked (“hard numbers… would strengthen the thesis”). From Google’s SRE book, the canonical primary source for site-reliability-engineering. The three-term hierarchy:
- SLI (indicator) — “a carefully defined quantitative measure of some aspect of the level of service”: request latency, error rate, throughput, availability.
- SLO (objective) — a target for an SLI (SLI ≤ target, or bounded). Canonical example: “99% of Get RPC calls complete in < 100 ms (averaged over 1 minute).”
- SLA (agreement) — a contract with users carrying consequences for missing the SLO.
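To make the SLI/SLO split concrete, here is a minimal Python sketch of the canonical latency example. The SLI is computed as the fraction of requests in a window that finish under the threshold, and the SLO is a target that fraction must meet. The names `LatencySLO`, `latency_sli`, and `meets_slo` are hypothetical, and the sample window is made-up data, not taken from the SRE book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical container for a latency objective."""
    threshold_ms: float  # a request counts as "good" if it completes faster than this
    target: float        # required fraction of good requests, e.g. 0.99 for 99%


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """SLI: fraction of requests in the window that finished under the threshold."""
    if not latencies_ms:
        raise ValueError("no requests in window; the SLI is undefined")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms: list[float], slo: LatencySLO) -> bool:
    """SLO check: the measured SLI must reach the target."""
    return latency_sli(latencies_ms, slo.threshold_ms) >= slo.target


# Assumed one-minute window: 995 fast calls and 5 slow ones.
slo = LatencySLO(threshold_ms=100.0, target=0.99)
window = [20.0] * 995 + [250.0] * 5

print(latency_sli(window, slo.threshold_ms))  # 0.995
print(meets_slo(window, slo))                 # True
```

An SLA would sit outside this code: it is the contractual consequence attached to `meets_slo` returning False over an agreed period.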
The error budget — the load-bearing idea
An SLO of 99.9% availability implies a 0.1% error budget: a permitted rate of failure, tracked daily/weekly. Its function is to gate feature velocity against reliability — while budget remains, teams ship freely; once it’s spent, releases stop and the team works on reliability until the budget recovers. This turns “reliability vs. speed” from an argument into a measured control loop, and gives the spoke its missing numbers axis: toil, MTTR, and incident impact all become budget math rather than anecdote.
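The budget arithmetic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the request-count model, the 30-day window, and the names `BudgetWindow`, `releases_allowed`, and `allowed_downtime_minutes` are all hypothetical. `Fraction` is used so that the 0.1% figure stays exact instead of picking up floating-point noise.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class BudgetWindow:
    """Hypothetical request-based view of one SLO measurement window."""
    slo_target: Fraction   # e.g. Fraction("0.999") for 99.9%
    total_requests: int
    failed_requests: int

    def allowed_failures(self) -> Fraction:
        """Error budget expressed as a number of requests allowed to fail."""
        return (1 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> Fraction:
        """Unspent share of the budget; negative once the budget is overspent."""
        allowed = self.allowed_failures()
        if allowed <= 0:
            raise ValueError("no error budget: need traffic and a target below 100%")
        return 1 - Fraction(self.failed_requests) / allowed


def releases_allowed(window: BudgetWindow) -> bool:
    """Assumed gate policy: ship only while some budget is left."""
    return window.budget_remaining() > 0


def allowed_downtime_minutes(slo_target: Fraction, window_days: int) -> Fraction:
    """Time-based view of the same budget, for an availability SLO."""
    return (1 - slo_target) * window_days * 24 * 60


target = Fraction("0.999")
healthy = BudgetWindow(target, total_requests=1_000_000, failed_requests=400)
exhausted = BudgetWindow(target, total_requests=1_000_000, failed_requests=1_200)

print(healthy.allowed_failures())               # 1000
print(float(healthy.budget_remaining()))        # 0.6
print(releases_allowed(healthy))                # True
print(releases_allowed(exhausted))              # False
print(float(allowed_downtime_minutes(target, 30)))  # 43.2
```

The gate here is deliberately binary to mirror the "ship freely, then stop" description above; a real policy might add burn-rate thresholds or partial freezes, which this sketch does not attempt.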
Why it anchors the spoke
SLOs are the mechanism that makes site-reliability-engineering’s “reliability as a software problem” concrete. They sit directly atop observability — you can’t define an SLI you can’t measure, so SLOs are the consumer of the metrics/traces the observability pillar produces, and the input to the aiops loop (google-sre-agentic-ai’s agents operate “under SLOs + fallbacks”). They also reframe the AIOps reliability paradox: an agent acting on operations should itself be governed by an error budget. The SLI→SLO→error-budget chain is the quantitative spine the whole “seams, not components” thesis can hang numbers on.
Related
site-reliability-engineering · observability · distributed-tracing · aiops · platform-ops · google-sre-agentic-ai · dora-metrics