SRE Team Topologies: Embedded, Centralized, and Hybrid Models
How SRE teams are typically organized — embedded, centralized, and hybrid models — and the trade-offs each makes between context and consistency.
There’s no single correct way to organize an SRE function, and the right model depends heavily on company size, the diversity of services being operated, and how much standardization across teams is realistic or desirable.
Embedded SRE: deep context, less consistency
In this model, SREs sit directly within product or feature teams and share on-call and operational responsibility for that team’s services. The advantage is deep context: embedded SREs know the service’s quirks, history, and architecture intimately. The downside is consistency. Practices such as SLO definitions, incident response conventions, and tooling choices can diverge meaningfully between teams without active coordination, and staffing even a small SRE presence on every team doesn’t scale efficiently when the organization has many small teams.
Centralized SRE: consistency, less per-service context
A central SRE team owns reliability practices, and often shared infrastructure (observability tooling, the incident response process, sometimes the deployment platform itself), across the whole organization. This gives strong consistency and lets specialized expertise concentrate in one place. But the team can become a bottleneck as the number of services it is nominally responsible for grows, and it will have less context on any single service than an embedded engineer would.
Hybrid: a central platform team plus embedded liaisons
A common middle path: a central platform/SRE team builds and maintains shared tooling and sets baseline practices (this overlaps significantly with the platform engineering concept covered elsewhere on this site), while individual product teams have a designated reliability-focused engineer or rotation who applies those practices to their specific service and maintains the team-specific context a fully centralized model would lack.
“You build it, you run it” as a related but distinct model
Separate from the SRE-specific structures above, some organizations skip a dedicated SRE function and make product engineering teams fully responsible for operating what they build. Platform engineering (which, again, is distinct from SRE) provides self-service tooling rather than direct operational involvement. This works well when most teams have sufficient operational maturity and the self-service tooling is genuinely good; it works poorly when either is missing and teams end up reinventing reliability practices inconsistently and often badly.
What actually drives the right choice
- Service diversity — highly heterogeneous services (different languages, architectures, failure modes) favor embedded context; largely homogeneous services favor centralization.
- Organizational size and growth rate — small organizations often can’t afford embedded SRE per team and default to centralized or “you build it, you run it”; larger organizations have more options and often evolve toward hybrid models over time.
- Existing tooling maturity — strong, mature self-service platform tooling makes “you build it, you run it” more viable; weak tooling pushes toward needing more direct SRE involvement somewhere.
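As a rough illustration only, the three factors above can be written down as a small decision heuristic. Everything in this sketch is hypothetical: the `OrgProfile` fields, the two-level `Level` scale, the `is_large` cutoff, and the ordering of candidates are assumptions made for the example, not an established method. A real decision involves far more judgment than three inputs.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """Coarse two-level scale (an assumption; real organizations sit on a spectrum)."""
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class OrgProfile:
    """Hypothetical summary of the three factors listed above."""
    service_diversity: Level  # how heterogeneous the services are
    is_large: bool            # illustrative cutoff; define it for your own organization
    tooling_maturity: Level   # quality of self-service platform tooling


def suggest_topologies(org: OrgProfile) -> list[str]:
    """Return candidate models, roughly most favored first."""
    if not org.is_large:
        # Small organizations often can't afford an embedded SRE per team.
        candidates = ["centralized"]
    elif org.service_diversity is Level.HIGH:
        # Heterogeneous services favor per-team context.
        candidates = ["embedded", "hybrid"]
    else:
        # Largely homogeneous services favor centralization.
        candidates = ["centralized", "hybrid"]

    if org.tooling_maturity is Level.HIGH:
        # Mature self-service tooling makes team-owned operations more viable.
        candidates.append("you build it, you run it")
    return candidates


if __name__ == "__main__":
    small_org = OrgProfile(Level.LOW, is_large=False, tooling_maturity=Level.HIGH)
    print(suggest_topologies(small_org))
```

The function returns a list rather than a single answer because the factors narrow the options without picking one. Weak tooling simply leaves “you build it, you run it” off the list, which mirrors the point that weak tooling pushes toward more direct SRE involvement somewhere.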
A common evolution pattern
Many organizations start centralized, since a small dedicated team is often the most practical way to bootstrap good practices, then move toward hybrid or more embedded models as they scale and a single central team can no longer maintain sufficient context across a growing number of increasingly diverse services.
Takeaway: there’s no universally correct SRE team topology — it’s a trade-off between context and consistency that should be revisited as an organization’s size and service diversity actually change, not chosen once and assumed permanent.