platform-ops-wiki

Synthesis — Platform Ops

The evolving thesis of this wiki. Sits above the schema.org pages. Records the current best understanding, open questions, and explicitly flagged contradictions.

Current understanding

This spoke was spun out (2026-06-05) from three sources that arrived separately yet converge on one operational discipline: platform-ops — running cloud-native distributed systems in production. They map cleanly onto three pillars:

observability / service-topology. netflix-service-topology builds a live dependency graph of thousands of microservices by fusing three telemetry sources — eBPF flow logs, IPC metrics, distributed traces — because each has a blind spot (no app context / misses uninstrumented services / sampled). The merged graph beats any single source. distributed-tracing (OpenTelemetry) is the per-request view of the same dependency structure the topology shows in aggregate (a topology ≈ traces summed over time): the literal record of a request crossing service seams, the third signal beside metrics and logs, and the source of the latency SLIs that SLOs are defined on.
platform-engineering. kubernetes-integration-tax argues the dominant cost of production kubernetes is the “integration tax”: making ~20–30 CNCF tools (Prometheus, Cilium, cert-manager, GitOps) work together. The failures are at the seams (Cilium metrics invisible to Prometheus without a ServiceMonitor), not in the tools themselves. The cure is the internal-developer-platform (project-as-a-service, Belastingdienst): pay the integration cost once, centrally, then expose it as a self-service, golden-path product (“make the right way the easiest way”) so teams don’t each re-pay the tax. Mechanically it’s a gitops-shaped reconcile loop applied to project provisioning (one YAML → namespaces/RBAC/quota via an operator), so the “seams, not components” thesis gains its scaling resolution: productize the seams. Half of that work is social (enablement-over-support, Communities of Practice across 99+ teams); platform-as-a-product is an org pattern, not only tooling. Open tension: a golden path can harden into a golden cage if defaults lag team needs.
site-reliability-engineering + aiops. google-sre-agentic-ai applies agentic AI across the SRE workflow (runbooks, anomaly detection, postmortems, incident investigation over “observability + topology”) under strict mandates: SLOs + fallbacks, identity/permissions, explainability, auditability. The quantified backbone is service-level-objectives (Google SRE): the SLI → SLO → error-budget chain turns “reliability vs. velocity” into a measured control loop (spend the budget → stop shipping → fix reliability), so toil/MTTR/impact become budget math, not anecdote. It also reframes the AIOps reliability paradox — an ops agent should itself run under an error budget.
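
The SLI → SLO → error-budget chain above comes down to a few lines of arithmetic. The sketch below is a minimal illustration for a request-based SLI, not Google's tooling: the class name SloWindow, its fields, and the may_ship gate are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """One SLO evaluation window for a request-based SLI (hypothetical model)."""
    slo_target: float     # e.g. 0.999 means 99.9% of requests must be good
    total_requests: int   # requests observed in the window
    bad_requests: int     # requests that violated the SLI

    def allowed_bad(self) -> float:
        # The error budget: how many requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        # Fraction of the budget still unspent; negative once it is blown.
        allowed = self.allowed_bad()
        if allowed == 0:
            return 0.0 if self.bad_requests == 0 else float("-inf")
        return 1.0 - self.bad_requests / allowed

    def may_ship(self) -> bool:
        # Assumed policy: keep shipping only while some budget remains.
        return self.budget_remaining() > 0.0
```

With a 99.9% target over 1,000,000 requests the budget is roughly 1,000 bad requests, so 400 failures leave about 60% of the budget and 1,200 failures blow it. The same gate could in principle be pointed at an ops agent's own actions, which is the "agent under an error budget" idea noted above.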

The unifying thread: in production, the hard problem is the seams, not the components. Each source is a different face of the same lesson — individual services, tools, and telemetry streams each work in isolation; the operational value (and cost) lies in integrating them into a system you can understand and trust under failure. Netflix integrates telemetry sources; the CNCF piece integrates platform tools; Google SRE integrates investigation signals + judgement (now agent-assisted). The service-topology is where these meet: it is the integrated map that site-reliability-engineering reads and that aiops agents reason over.
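
To make "the merged graph beats any single source" concrete, here is a toy sketch of fusing edge sets from several telemetry sources. It is a plain set union with provenance tracking, not Netflix's actual pipeline; the function name merge_topology, the source labels, and the service names are all made up for illustration.

```python
from collections import defaultdict

Edge = tuple[str, str]  # (caller, callee)


def merge_topology(sources: dict[str, set[Edge]]) -> dict[Edge, set[str]]:
    """Union the edges from every source, remembering which sources saw each."""
    merged: dict[Edge, set[str]] = defaultdict(set)
    for source_name, edges in sources.items():
        for edge in edges:
            merged[edge].add(source_name)
    return dict(merged)


# Illustrative data only: each source misses something the others see.
observed = {
    "ebpf_flows": {("checkout", "payments"), ("checkout", "legacy-db")},
    "ipc_metrics": {("checkout", "payments"), ("payments", "ledger")},
    "traces": {("checkout", "payments")},
}

graph = merge_topology(observed)

# Edges reported by exactly one source are the blind spots of the others.
single_source = {edge for edge, seen_by in graph.items() if len(seen_by) == 1}
```

Keeping the provenance set per edge is a deliberate choice in this sketch: it lets a reader of the graph tell a well-corroborated dependency from one that only a single signal has observed.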

The pillars close into a loop, and the loop is now measured. gitops (OpenGitOps/CNCF: declarative, versioned-immutable, pulled, continuously reconciled) is the deployment face of the “seams, not components” thesis: make the integrated desired state the versioned artifact and let an agent (Argo CD/Flux) reconcile drift, instead of humans imperatively wiring ~30 tools. That joins the pillars into a cycle: observability (distributed-tracing) → SLIs/SLOs (service-level-objectives) → reconcile/operate (gitops/aiops). Two measurement systems sit across it: service-level-objectives quantifies the running service’s reliability (error budgets gate whether to ship), and dora-metrics (Google’s DORA / State of DevOps) quantifies the delivery pipeline as throughput (deploy frequency, change lead time) + stability (change fail rate, failed-deployment recovery time). They interlock: a blown error budget should surface as DORA stability pressure, and an aiops agent that applies fixes is itself a “deployer” with its own Change Fail Rate. DORA’s headline finding, “speed and stability are not tradeoffs,” is the evidence base under the integrate-the-seams thesis: better integration of the delivery system buys both at once. (Terminology note: DORA retired “MTTR” in favor of Failed Deployment Recovery Time, scoped to deployments rather than all incidents.) The standing caveat across all of this is qualitative-only sourcing: the frameworks are named, but real MTTR/toil deltas from this spoke’s own topology/AIOps work remain to be measured.
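
The "reconcile drift" step can be sketched as a diff between desired and actual state followed by the actions that close the gap. This is an in-memory toy under stated assumptions, not the Argo CD or Flux API: plan_reconcile, reconcile, and the resource names in the usage note are hypothetical.

```python
def plan_reconcile(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """Return the (action, resource_name) steps that move actual toward desired."""
    steps = []
    for name, spec in desired.items():
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != spec:
            steps.append(("update", name))
    for name in actual:
        if name not in desired:
            steps.append(("delete", name))
    return sorted(steps)


def reconcile(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """Apply one reconcile pass to an in-memory 'cluster' and return the plan."""
    plan = plan_reconcile(desired, actual)
    for action, name in plan:
        if action == "delete":
            del actual[name]
        else:
            # create and update both converge on the declared spec
            actual[name] = dict(desired[name])
    return plan
```

For example, with a desired state of {"ns/team-a": {"quota": "4cpu"}, "rbac/team-a": {"role": "edit"}} and an actual state that has a stale quota plus a leftover "ns/old", one pass creates the RBAC entry, updates the quota, and deletes the leftover. A second pass returns an empty plan, which is the idempotence that makes continuous reconciliation safe to run in a loop.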

Open questions

Build vs. buy the topology. Netflix built a bespoke system on internal infra (Pekko/Kafka/KV store). What does this look like for teams without Netflix-scale platform engineering? Partly answered (2026-06-09): the off-the-shelf stack is opentelemetry (vendor-neutral instrumentation → any backend) + prometheus (metrics/alerting) over ebpf (low-overhead kernel telemetry, the same substrate Netflix leans on). The remaining gap is the topology-graph assembly Netflix built on top: OTel/Prom give the signals, not the merged dependency graph. Also note ebpf’s operational cost (CAP_BPF, verifier complexity limits, kernel-version dependence), which is the separate eBPF open question in this list, now sourced.
AIOps reliability paradox. Agents drafting postmortems and investigating incidents must themselves be reliable. How is agent error handled in the loop, and does it interact with the “AI doesn’t deliver reliable production software” critique parked in the hub _inbox (ai-productionization cluster)? Open.
eBPF’s operational cost. netflix-service-topology leans on eBPF for coverage; the overhead, security, and kernel-version constraints of eBPF at scale aren’t covered here. Natural next source.
Quantification. Like most founding sets, claims are qualitative (the “integration tax,” the three-source merge). Hard numbers on toil reduction or incident MTTR would strengthen the thesis. Now framed (2026-06-10 + 2026-06-12): two measurement systems — service-level-objectives (reliability of the running service) + dora-metrics (delivery throughput & stability). Remaining: applying them to this spoke’s own claims (real MTTR/toil deltas from the topology/AIOps work), not just naming the frameworks.
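
As a starting point for applying dora-metrics to real data, the four keys can be computed from a list of deployment records. The sketch below is a simplification under stated assumptions: the Deployment record shape, the four_keys function, and the use of medians are choices made for this example, not DORA's official method.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did the deploy need remediation?
    recovered_at: Optional[datetime] = None  # when a failed deploy was recovered


def _hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def four_keys(deploys: list[Deployment], window_days: int) -> dict[str, float]:
    """Throughput (frequency, lead time) and stability (fail rate, recovery)."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deploys if d.failed]
    recoveries = [
        _hours(d.deployed_at, d.recovered_at)
        for d in failures
        if d.recovered_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(_hours(d.committed_at, d.deployed_at) for d in deploys),
        "change_fail_rate": len(failures) / len(deploys),
        # 0.0 here means "no recovered failures in the window", not instant recovery
        "median_recovery_hours": median(recoveries) if recoveries else 0.0,
    }
```

Recovery time is measured from the failed deployment, matching the Failed Deployment Recovery Time scoping noted above rather than an all-incidents MTTR. If an aiops agent's applied fixes were logged as Deployment records, the same function would yield that agent's own change fail rate.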

Contradictions flagged

None yet — the founding sources and the later additions (SLOs, GitOps, tracing, DORA, the IDP) are complementary facets that reinforce the pillars rather than competing claims.

Cross-spoke adjacency

agentic-tooling-wiki owns the agent-building stack (ADK, MCP, Gemini Enterprise Agent Platform). aiops / google-sre-agentic-ai is the application of that stack to operations — the same tools, a different altitude. This split mirrors the hub’s general rule (tools vs. their application). Watch for bridge sources: agents acting over a service-topology. New seam (2026-06-15): agentic-tooling’s [[agent-loops-verification]] argues that as agent loops replace prompts, verifying agent-written cloud-native code becomes a runtime problem — validating behaviour against a real running system (Kubernetes ephemeral environments). That verification substrate is this spoke’s domain: the loop/verification paradigm is agentic-tooling’s, the runtime that makes the feedback truthful is platform-ops’. A candidate cluster if more “verify agents against live infra / CI-CD” sources arrive.
cloud-wiki owns cloud-hosting providers & pricing (where you rent servers). This spoke owns the practice of operating what runs on them. Managed-Kubernetes pricing → cloud-wiki; running kubernetes in production → here.
llm-inference-wiki shares a latency/throughput sensibility but at the model-serving layer, not the distributed-systems-ops layer.
Parked adjacency (hub _inbox): nvidia-doca-in-silicon-security (ai-infrastructure) is the hardware/silicon-security layer beneath this one — related but a distinct sub-domain. Fold in only if a hardware-ops cluster forms here.

Index — Platform Ops Wiki

Catalog of all pages, grouped by @type. The spine: synthesis (thesis), log.md (history), this file (catalog).

DefinedTerm (concepts / disciplines)

platform-ops — umbrella: running cloud-native distributed systems in production; the three pillars. · domain
site-reliability-engineering — SRE; reliability/operations as a software problem, SLOs & incident response. · practice
platform-engineering — building/integrating the internal platform teams run on (the “integration tax”). · practice
observability — metrics/traces/logs + topology; fuse multiple signals, none is complete alone. · practice
service-topology — the live service-dependency graph; blast-radius & local-vs-upstream. · mechanism
aiops — AI/agents applied to operations; constrained & accountable, not autonomous. · practice
ebpf — sandboxed in-kernel programs; low-overhead telemetry/networking/security substrate · source · mechanism
service-level-objectives — SLI/SLO/SLA + error budgets; the quantification backbone gating velocity vs reliability · source · standard
gitops — Git as source of truth; declarative + pulled + continuously reconciled (Argo CD/Flux); the deploy/operate loop · source · practice
distributed-tracing — spans/traces across services; the per-request view of the service topology; third observability signal · source · mechanism
dora-metrics — DORA Four Keys: deploy frequency/lead time (throughput) + change-fail-rate/recovery-time (stability); delivery-performance quantification, complements SLOs · source · standard
internal-developer-platform — IDP / golden paths / platform-as-a-product; the self-service product platform-engineering builds; the cure for the integration tax · mechanism
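
The "blast-radius" reading of service-topology in the catalog above can be sketched as a reverse traversal of the dependency graph: start at the failed service and walk back through its callers. This is an illustrative sketch only; the function name blast_radius and the caller → callees dictionary shape are assumptions made for the example.

```python
from collections import deque


def blast_radius(calls: dict[str, set[str]], failed: str) -> set[str]:
    """Services that depend, directly or transitively, on `failed`."""
    # Invert caller -> callees into callee -> callers.
    callers: dict[str, set[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        service = queue.popleft()
        for dependent in callers.get(service, ()):
            # Skip services already counted, and the failed service itself
            # in case the graph contains a cycle back to it.
            if dependent not in affected and dependent != failed:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

The "local-vs-upstream" question is the same traversal run in the forward direction: if a service is unhealthy, check whether anything it calls is also unhealthy before blaming the service itself.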

SoftwareApplication

kubernetes — container orchestration; the production substrate of platform engineering.
opentelemetry — vendor-neutral CNCF telemetry standard (traces/metrics/logs); the off-the-shelf instrumentation layer · source
prometheus — CNCF metrics + alerting toolkit; pull-based time series + PromQL · source

TechArticle / BlogPosting (source summaries, `source: true`)

netflix-service-topology — InfoQ: Netflix maps thousands of microservices in real time (eBPF + IPC + traces). (url-only)
google-sre-agentic-ai — Google Cloud: agentic AI across the SRE workflow. (url-only)
kubernetes-integration-tax — CNCF: the Kubernetes integration tax (Prometheus, Cilium, production reality). (url-only)
project-as-a-service — InfoQ/KubeCon: Belastingdienst’s IDP pattern (one YAML → namespaces/RBAC/quota; golden paths; enablement over support; 99+ teams). (url-only)