How Google SRE is using agentic AI to improve operations
Google Cloud blog on Google SRE applying agentic AI across the SRE workflow to
augment, not replace, humans. A leading example of aiops — agents applied to
site-reliability-engineering. (Routed to this spoke 2026-06-05; previously parked
in the hub _inbox under platform-ops-sre.)
Where agents are applied
- Reliability design — auto-generating / improving runbooks.
- Anomaly detection — TimesFM forecasting vs. static thresholds.
- Incident management — consolidating comms, drafting postmortems.
- Incident investigation — hypothesis → mitigation from observability + topology.
- Risk & insights — mining past incidents for patterns.
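To make the anomaly-detection bullet concrete, here is a minimal sketch of why a forecast-based check can catch what a static threshold misses. It does not call TimesFM: a seasonal-naive forecast (mean of past values at the same phase of the cycle) stands in for the model, and every function name, parameter, and data point below is hypothetical.

```python
from statistics import mean, stdev


def seasonal_naive_forecast(history, period):
    """Stand-in for a real forecaster (NOT TimesFM).

    Predicts the next point as the mean of past values observed at the
    same phase of the cycle, and returns their spread as an error scale.
    """
    phase = len(history) % period          # phase of the point being predicted
    same_phase = history[phase::period]    # earlier values at that phase
    spread = stdev(same_phase) if len(same_phase) > 1 else 0.0
    return mean(same_phase), spread


def is_anomalous_static(value, threshold):
    """Classic alert: fire only when the value exceeds a fixed limit."""
    return value > threshold


def is_anomalous_forecast(value, history, period, k=3.0, min_band=1.0):
    """Forecast-based alert: fire when the value leaves the expected band."""
    expected, spread = seasonal_naive_forecast(history, period)
    band = max(k * spread, min_band)       # never let the band collapse to zero
    return abs(value - expected) > band


# Hypothetical metric with a 4-step cycle: low, mid, high, mid.
history = [10, 50, 90, 50, 11, 51, 91, 49, 9, 49, 89, 51]

# The next point falls in the "low" phase (expected around 10), but 60 arrives.
print(is_anomalous_static(60, threshold=100))        # False: under the fixed limit
print(is_anomalous_forecast(60, history, period=4))  # True: far from the forecast

# Conversely, a normal peak of 90 trips a tight static limit of 80,
# while the forecast-based check treats it as expected.
print(is_anomalous_static(90, threshold=80))              # True: false alarm
print(is_anomalous_forecast(90, history[:10], period=4))  # False
```

The point of the sketch is the comparison against an expected value per time step rather than a single fixed limit; a learned forecaster would replace `seasonal_naive_forecast` without changing the alerting logic.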
Stack
Gemini + Gemini Enterprise Agent Platform, ADK, MCP servers, BigQuery / vector DBs.
Design mandates
Strong guardrails: identity and permissions, SLOs with fallbacks, transparency and explainability rather than black-box behavior, and auditability. The thesis is that agents in ops must be constrained and accountable, not autonomous black boxes.
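As an illustration of how these mandates might combine in code, the sketch below wraps every agent action in a permission check, a recorded rationale, a fallback path, and an audit entry. It is an assumption about shape only, not the blog's implementation; `GuardedExecutor`, its fields, and the agent and action names are all hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class GuardedExecutor:
    """Hypothetical wrapper: no agent action runs without a permission
    check, a stated rationale, a fallback, and an audit record."""

    permissions: dict                      # agent id -> set of allowed actions
    audit_log: list = field(default_factory=list)

    def run(self, agent_id, action, rationale, execute, fallback):
        allowed = action in self.permissions.get(agent_id, set())
        record = {
            "ts": time.time(),
            "agent": agent_id,             # identity
            "action": action,
            "rationale": rationale,        # explainability: why the agent acted
            "allowed": allowed,
        }
        if not allowed:
            # Permission denied: never execute, hand off to the fallback.
            record["outcome"] = "denied"
            self.audit_log.append(record)
            return fallback()
        try:
            result = execute()
            record["outcome"] = "executed"
        except Exception as exc:
            # Action failed: record the failure and degrade to the fallback.
            record["outcome"] = f"failed: {exc}"
            result = fallback()
        self.audit_log.append(record)      # auditability: every attempt is logged
        return result


executor = GuardedExecutor(permissions={"triage-agent": {"restart_pod"}})

ok = executor.run(
    "triage-agent", "restart_pod", "pod is crash-looping",
    execute=lambda: "restarted",
    fallback=lambda: "escalated to on-call",
)
blocked = executor.run(
    "triage-agent", "drop_table", "free up disk space",
    execute=lambda: "dropped",
    fallback=lambda: "escalated to on-call",
)
print(ok)       # restarted
print(blocked)  # escalated to on-call
```

Routing every action through one choke point is what makes the agent accountable here: a denied or failed action still leaves a record, and the fallback keeps a human in the loop.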
Why it routed here
Subject is SRE / production operations practice (AIOps, incident response,
reliability) — an application of agents to ops, not agent-building tooling, so it
is distinct from agentic-tooling-wiki (which owns ADK/MCP as builder tools).
Cross-spoke adjacency: the ADK/MCP/Gemini stack is documented as builder tooling
in agentic-tooling-wiki. Pairs with netflix-service-topology (the observability
and topology layer that agents investigate against) and kubernetes-integration-tax.
See synthesis.