A personal look at the engineering mindset behind Google’s approach to uptime.
January 29, 2025
In the summer of 2018, I was invited to attend Google Next in San Francisco, one of the tech giant’s flagship conferences. The city was filled with developers, product managers, and executives ready to learn what was new with Google Cloud. But, for me, the highlight was a chance encounter with members of Google’s Site Reliability Engineering (SRE) team. At the time, I was serving as Director of DevOps at Infor, where we had been quietly adopting SRE principles for some of our most critical, large-scale production systems. My conversation with Google’s SRE folks that day would not just affirm my commitment to the approach; it would also clarify the power of SRE to transform how organizations think about reliability, operations, and continuous improvement.
Site Reliability Engineering originated at Google in the early 2000s. At first, “site” literally meant google.com, the site where even brief outages could cost millions of dollars and damage user trust. Google needed a way to handle this growing complexity while still rolling out updates at a rapid clip. Their solution was to treat operations with the same engineering rigor used in software development: define service-level objectives (SLOs), assign error budgets, and automate everything possible. Instead of a purely reactive team babysitting servers, Google set out to build stable, scalable systems through thoughtful design and data-driven processes.
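To make the error-budget idea concrete, here is a small illustrative calculation. The 99.9% target and 30-day window are example numbers I chose for the sketch, not anything Google prescribes:

```python
# Illustrative sketch: how an availability SLO translates into an error budget.
# The 99.9% target and 30-day window are example values, not a prescription.

SLO_TARGET = 0.999   # 99.9% availability objective
WINDOW_DAYS = 30     # evaluation window

window_minutes = WINDOW_DAYS * 24 * 60
error_budget_minutes = (1 - SLO_TARGET) * window_minutes

print(f"Allowed downtime over {WINDOW_DAYS} days: {error_budget_minutes:.1f} minutes")
# -> Allowed downtime over 30 days: 43.2 minutes
```

Those 43 minutes are the budget: spend them on risky releases, experiments, or bad luck, and once they are gone, reliability work takes priority.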
Over time, the “site” moniker expanded. As the approach proved effective beyond google.com, the same principles were applied to every large-scale service Google built. Some have since referred to it as Software Reliability Engineering to better capture its broad reach.
Today, SRE is widely seen as the natural successor to DevOps. While DevOps bridges development and operations, SRE explicitly measures and enforces reliability as a core feature. Teams agree on reliability targets, watch those metrics in real time, and act quickly when thresholds are at risk. It is a continuous feedback loop that depends on both cultural shifts and disciplined engineering, no matter what “site” (or software) you are running.
My Google Next conversation started casually. I mentioned to the SRE leads that we at Infor were exploring error budgets, which are one of the signature concepts of SRE. Error budgets quantify the permissible level of downtime or errors your service can afford before you must stop shipping new features and address reliability. The Googlers nodded in approval, asking how we were implementing budgets in practice. Their curiosity revolved around data: “How are you monitoring real-time performance? How do you trigger a freeze on new deployments? Do you have a formal system to escalate critical incidents?”
What struck me was not just their knowledge, but the structured mindset they brought to the table. They did not jump straight to fancy orchestration platforms (Istio was all the rage back then). Instead, they stressed fundamental principles: define your metrics carefully, use clear alerting thresholds, automate rollbacks, and constantly review postmortems. SRE was a philosophy, not just a set of tools.
At the time, Infor was the second largest consumer of AWS EC2 Spot instances in the world, using them in production to reduce costs. Spot instances, by definition, can be “taken away” at any time if AWS needs that capacity elsewhere. That unpredictability made Spot instances seem like a no-go for mission-critical workloads. But we saw an opportunity to innovate if we could master the reliability challenge.
Following SRE principles, we built robust self-healing mechanisms. Continuous health checks identified when an instance was about to be reclaimed, and our orchestrators automatically spun up replacement instances. Load balancers shifted traffic seamlessly, and we made sure application state would not be lost during a swap. Logs and metrics fed into a real-time dashboard that alerted us at the slightest dip in performance. If performance fell below our SLO thresholds, automated remediation scripts engaged, rolling back new code or provisioning additional infrastructure.
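For illustration, here is a simplified sketch of the kind of interruption watcher this involves. AWS publishes a two-minute Spot interruption notice through the instance metadata service; the drain_from_load_balancer and request_replacement functions below are hypothetical stand-ins for our orchestration hooks, not our actual production code:

```python
# Simplified sketch of a Spot-interruption watcher (not production code).
# AWS exposes a two-minute interruption notice at the metadata endpoint below;
# drain_from_load_balancer() and request_replacement() are hypothetical
# stand-ins for real orchestration hooks.
import time
import urllib.error
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending() -> bool:
    """Return True once AWS has scheduled this Spot instance for reclaim."""
    try:
        with urllib.request.urlopen(METADATA_URL, timeout=1) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False  # 404 (no notice yet) or metadata service unreachable

def drain_from_load_balancer() -> None:
    ...  # deregister this instance so traffic shifts to healthy peers

def request_replacement() -> None:
    ...  # ask the orchestrator to launch a substitute instance

if __name__ == "__main__":
    while not interruption_pending():
        time.sleep(5)           # poll every few seconds
    drain_from_load_balancer()  # roughly two minutes to move traffic and state
    request_replacement()
```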
Were there hiccups? Certainly. Spot interruptions happened at unexpected moments, forcing us to refine our scaling policies or optimize our price bidding (which is a different topic for another post). But those challenges actually forced progress. By iterating on our SRE foundation (monitor, measure, automate, repeat) we transformed these ephemeral compute resources from a liability into a strategic advantage. Our monthly cloud bills shrank dramatically while uptime stayed high, proving that controlled chaos can yield efficiency if you engineer around it.
1. Reliability Becomes a First-Class Feature
In traditional ops, teams often rely on manual checklists and personal heroics to keep servers running. Under SRE, reliability is baked into your product roadmap, with explicit SLOs shaping how you allocate engineering resources. The days of “it works on my machine” vanish when your success hinges on meeting concrete uptime or error-rate objectives.
2. A Balance Between Velocity and Stability
The challenge of modern software is shipping fast without breaking everything in the process. SRE addresses this by creating an “error budget.” This budget tells you how much failure is tolerable for the business, given user expectations. If the system stays within that budget, you can push new features. If reliability slips, you must focus on fixes until you are back in the green zone. It is a balancing act enforced by data.
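A minimal sketch of that gate, assuming you already export total and failed request counts for the SLO window from your monitoring system (the function names and numbers are illustrative):

```python
# Minimal sketch of an error-budget gate for deployments.
# Assumes total and failed request counts for the SLO window come from
# your monitoring system; names and numbers here are illustrative.

def error_budget_remaining(total_requests: int,
                           failed_requests: int,
                           slo_target: float = 0.999) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, 0.0 = exhausted)."""
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return max(0.0, 1 - failed_requests / allowed_failures)

def can_ship_new_features(total_requests: int, failed_requests: int) -> bool:
    """Ship while budget remains; freeze and focus on reliability when it is gone."""
    return error_budget_remaining(total_requests, failed_requests) > 0.0

# Example: 10M requests at a 99.9% SLO allow 10,000 failures.
# With 4,000 failures, 60% of the budget remains, so features can ship.
print(can_ship_new_features(total_requests=10_000_000, failed_requests=4_000))  # True
```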
3. Automation and Observability at the Core
SRE compels teams to automate tedious tasks—think patching servers or restarting stuck services—so engineers can concentrate on deeper problems. Coupled with robust observability (logs, metrics, traces), this automation becomes your early warning system. You spot anomalies before they become user-facing incidents, and you can roll out fixes quickly and confidently.
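As a toy example of that kind of automation, the sketch below probes a health endpoint and restarts a stuck service when it stops answering. The endpoint URL and service name are placeholders, not references to any specific system:

```python
# Toy sketch of automating a tedious task: a watchdog that probes a health
# endpoint and restarts the service if it stops answering.
# The endpoint URL and systemd unit name are placeholders.
import subprocess
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8080/healthz"   # placeholder endpoint
SERVICE_NAME = "example-app.service"           # placeholder systemd unit

def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False

def restart_service() -> None:
    # Log the action so the restart shows up in your observability stack.
    print(f"health check failed, restarting {SERVICE_NAME}")
    subprocess.run(["systemctl", "restart", SERVICE_NAME], check=False)

if __name__ == "__main__":
    while True:
        if not healthy():
            restart_service()
        time.sleep(30)  # probe interval; tune to your alerting thresholds
```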
4. Cultural and Organizational Transformation
Perhaps the biggest barrier to SRE adoption is not technical but organizational. Shifting from a reactive, top-down ops approach to an engineering-driven reliability culture can create friction. SRE needs cross-functional collaboration, transparent communication, and continuous learning from incidents. Everyone must feel responsible for uptime, not just a small ops team.
If you want to go deeper into Site Reliability Engineering, Google offers a free resource called Site Reliability Engineering: How Google Runs Production Systems, available at https://sre.google/sre-book/table-of-contents/. Part III of that book focuses specifically on core SRE practices, covering topics such as alerting on time-series data, being on-call, managing incidents, postmortem culture, handling overload, and addressing cascading failures.
These practices work together to ensure high reliability and efficient operations, all while enabling teams to move at the speed modern users demand. If you are building or refining an SRE function, the Google SRE book is a great place to start for detailed guidance on implementation.
Since those days in 2018, the world has seen an explosion of interest in SRE. Enterprises of all sizes now realize the importance of data-driven reliability. We also see SRE frameworks popping up in surprising places, from e-commerce websites to government cloud services, as organizations grapple with scaling demands and user expectations for near-constant availability.
I left Google Next convinced that SRE was not just an advanced form of DevOps, but a philosophy that can redefine operational excellence. Over the years, that belief has only grown stronger. Running mission-critical workloads on inherently unstable resources taught me that with careful engineering, you can thrive in environments where others fear to tread. And more broadly, adopting SRE promotes a reliability mindset: one in which every team member sees reliability and continuous improvement as part of their job description.
Whether you run an online marketplace, a big data platform, or a small SaaS app, the principles remain the same: Define what good looks like, measure it, automate every repetitive step, and treat incidents as opportunities to learn. SRE is ultimately about forging resilient systems and resilient teams, and in a digital landscape that changes by the hour, resilience may be your most valuable asset of all.