Blog Posting / Tech Article. Updated 2026-06-05 (UTC).

How Google SRE is using agentic AI to improve operations

Google Cloud blog post on how Google SRE applies agentic AI across the SRE workflow to augment, not replace, humans. A leading example of AIOps: agents applied to site reliability engineering. (Routed to this spoke 2026-06-05; previously parked in the hub _inbox under platform-ops-sre.)

Where agents are applied

Reliability design — auto-generating / improving runbooks.
Anomaly detection — TimesFM forecasting vs. static thresholds.
Incident management — consolidating comms, drafting postmortems.
Incident investigation — hypothesis → mitigation from observability + topology.
Risk & insights — mining past incidents for patterns.
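
To make the "forecasting vs. static thresholds" contrast concrete, here is a minimal sketch. It does not use TimesFM: a seasonal-naive forecast stands in for the model, and the function names, the `season` length, the band width `k`, and the sample data are all assumptions made for illustration only.

```python
from statistics import mean, stdev


def static_threshold_alerts(series, limit):
    """Return indices whose value exceeds one fixed limit."""
    return [i for i, value in enumerate(series) if value > limit]


def forecast_band_alerts(series, season, k=3.0, min_sigma=1.0):
    """Return indices whose value leaves a band around a forecast.

    The forecast is seasonal-naive (a stand-in for a real model such as
    TimesFM): the mean of earlier values at the same position in the cycle.
    Points with fewer than two earlier same-phase values are skipped.
    """
    alerts = []
    for i, value in enumerate(series):
        history = [series[j] for j in range(i % season, i, season)]
        if len(history) < 2:
            continue
        predicted = mean(history)
        # Floor the spread so a perfectly flat history still has a band.
        sigma = max(stdev(history), min_sigma)
        if abs(value - predicted) > k * sigma:
            alerts.append(i)
    return alerts


# Hypothetical metric with a repeating 4-step cycle; index 12 should be
# a trough (10) but reads 45.
series = [10, 50, 90, 50] * 4
series[12] = 45

print(static_threshold_alerts(series, limit=80))  # flags the normal peaks
print(forecast_band_alerts(series, season=4))     # flags the odd trough
```

On this toy data the fixed limit fires on every expected peak and misses the abnormal trough, while the forecast band flags only the trough. That is the general argument for forecast-based detection on seasonal signals.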

Stack

Gemini + Gemini Enterprise Agent Platform, ADK, MCP servers, BigQuery / vector DBs.

Design mandates

Strong guardrails: identity/permissions, SLOs + fallbacks, transparency/explainability over black-box, and auditability. The thesis is that agents in ops must be constrained and accountable, not autonomous black boxes.
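
As a rough illustration of what "constrained and accountable" could look like in code, the sketch below wraps every agent action in a permission check, records the agent's stated rationale, and falls back to human escalation when the action is not allowed. It is not taken from the blog post or from ADK; `GuardedAgent`, its fields, and the action names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class GuardedAgent:
    """Hypothetical wrapper: allow-listed actions, audit trail, fallback."""

    agent_id: str
    allowed_actions: frozenset
    audit_log: list = field(default_factory=list)

    def act(self, action, target, rationale, handler):
        """Run handler(target) only if the action is allow-listed.

        Every attempt, allowed or not, is appended to the audit log
        together with the rationale the agent gave for it.
        """
        allowed = action in self.allowed_actions
        # Fallback: disallowed actions are handed to a human, not executed.
        outcome = handler(target) if allowed else "escalated_to_human"
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": self.agent_id,
            "action": action,
            "target": target,
            "rationale": rationale,
            "allowed": allowed,
            "outcome": outcome,
        })
        return outcome


agent = GuardedAgent("triage-bot", frozenset({"read_metrics"}))
read = agent.act("read_metrics", "checkout", "latency alert fired",
                 lambda target: f"metrics:{target}")
restart = agent.act("restart_service", "checkout", "suspected memory leak",
                    lambda target: f"restarted:{target}")
print(read, restart, len(agent.audit_log))
```

The design choice worth noting is that the denial path is logged with the same detail as the success path, so a reviewer can see what the agent wanted to do and why, not only what it did.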

Why it routed here

The subject is SRE / production operations practice (AIOps, incident response, reliability): an application of agents to ops, not agent-building tooling. That makes it distinct from agentic-tooling-wiki, which owns ADK/MCP as builder tools. Cross-spoke adjacency: the ADK/MCP/Gemini stack is documented as builder tooling in agentic-tooling-wiki. Pairs with netflix-service-topology (the observability + topology layer that agents investigate over) and kubernetes-integration-tax. See synthesis.
