The Smart Way to Run Site Reliability Engineering: Lessons for 2021

Arvind Mehrotra
Oct 17, 2021

Site reliability engineering (SRE) is a relatively new discipline in the ITOps world. However, site reliability engineers can be invaluable in maintaining infrastructure readiness, planning an emergency response, and ensuring capacity so that your organisation’s digital and business sides are always in sync. Despite this, many organisations still rely primarily on traditional software engineers and system administrators to do the job.

It was Google that first realised that wearing multiple hats isn’t always the best approach. So in 2003, Ben Treynor Sloss (currently VP of engineering at Google) started the company’s first site reliability team, which grew to over 1,000 site reliability engineers by 2016. The premise was simple: train a crack team in both software development skills and system and networking expertise so that it can balance on-ground infrastructure maintenance with high-level modernisation.

So the question facing mid-sized to large enterprises is this: provided you can afford it, does it make sense to have a dedicated SRE team for IT services?

SREs: Nice-to-Have or IT Imperative

The site reliability engineer’s workload has two parts:

· First, they have to make sure that existing systems run without a glitch and accommodate whatever demand comes in from the business end.

· Second, they are tasked with strategic development so that your systems can continuously improve, become more efficient, and be ready to support the anticipated business needs of the enterprise.

Managing an SRE role in an organisation typically requires a balance between these two parts.

The alternative to SRE is to have separate teams for these jobs, bundling maintenance with the rest of ITSM and development with your broader software or DevOps team. But this has several drawbacks:

● There is no single owner with visibility across the two jobs. As a result, opportunities for improvement may go unnoticed.

● ITSM and software teams may conflict, for instance when a new development task threatens to disrupt existing processes and ITSM perceives the risk as too high.

● You could risk cost leakage if effort is duplicated across teams. In the long term, digital transformation may slow down if the two teams do not coordinate their efforts.

For these reasons, industry leaders like Google and now LinkedIn, Dropbox, Airbnb, IBM, and Netflix have adopted an SRE model. As a result, the adoption of SREs grew from 10% in 2019 to 15% in 2020.

4 Recommendations for Companies to Max Their SRE Potential

Putting together a practical SRE function signals a structural and cultural overhaul. At a high level, you revisit what it means to be a digital company and focus more on value generation than on steadfast business continuity. On the ground, site reliability engineers require a unique mix of engineering, infrastructure, and soft skills that makes them a valuable asset for any enterprise.

Here are four ways to maximise this potential:

1. Find your 50/50 balance

While a perfect 50/50 split between operations and development is advisable, it isn’t easy to achieve in real life. A 2021 survey of SREs worldwide found that responding to incidents (44%) and conducting post-mortem analysis (35%) were the top two tasks occupying a site reliability engineer’s time. Developing new applications or capabilities came in fourth, followed by knowledge/skill expansion as a “moderate” priority. Depending on your organisation’s needs, arrive at a workload mix that’s feasible for your business and end customers. Then, tweak the SLAs (service level agreements) and SLOs (service level objectives) accordingly. As a starting point, SREs should spend more than 50% of their time on automation and less than 50% on operations. With more thrust on automation and on developing accelerators, operational incidents tend to reduce, resulting in more stable and reliable systems.
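To make the workload mix concrete, here is a minimal Python sketch that computes the operations versus engineering split from a team's logged hours and checks it against a cap. The category names, the OPERATIONS set, and the 50% default cap are illustrative assumptions, not figures or definitions from the survey.

```python
from collections import defaultdict

# Hypothetical work categories that count as operations (an assumption for
# illustration); anything not listed here is treated as engineering work.
OPERATIONS = {"incident_response", "post_mortem", "on_call", "manual_ticket"}


def workload_mix(entries):
    """Return the fraction of logged hours spent on operations and engineering.

    `entries` is an iterable of (category, hours) pairs.
    """
    totals = defaultdict(float)
    for category, hours in entries:
        bucket = "operations" if category in OPERATIONS else "engineering"
        totals[bucket] += hours
    total_hours = totals["operations"] + totals["engineering"]
    if total_hours == 0:
        raise ValueError("no hours logged")
    return {
        "operations": totals["operations"] / total_hours,
        "engineering": totals["engineering"] / total_hours,
    }


def exceeds_operations_cap(mix, cap=0.5):
    """True when operations work takes a larger share of time than the cap."""
    return mix["operations"] > cap


# Example week for one engineer (made-up numbers).
week = [
    ("incident_response", 14),
    ("post_mortem", 6),
    ("automation", 12),
    ("feature_work", 8),
]
mix = workload_mix(week)
print(mix, exceeds_operations_cap(mix))
```

Tracking this ratio week over week gives a simple signal for when operational load is crowding out automation work and the mix needs to be renegotiated.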

2. Develop a broader skillset

Their skillset is what differentiates site reliability engineers from pure-play ITOps or ITSM professionals. Some of the areas where you might need to train your existing team include business intelligence, infrastructure as code, automation, and the various aspects of product development. Indeed, Google cites hiring site reliability engineers as the number one challenge when switching to an SRE model. It is an ongoing pressure point for almost all organisations involved in outsourcing or digital transformation. This means that training needs to be central to addressing the talent shortage. In my opinion, the development and scaling of talent is best addressed through in-house training, with support from an SRE organisation to help close the gap. An SRE skill profile should include a solid understanding of operations with deep insight into systems from both an application and an infrastructure perspective. To get this mix right, we must develop engineers along multiple dimensions rather than from a singular view of development and operations.

3. Invest in automation

Monitoring is an essential part of SRE practice. It is often the starting point for managing overall system and service reliability. With dashboards of reports and charts, the team can keep a four-eye model in place for anything new and unusual. Effective automation is the secret to SRE success, as it frees engineers from routine, infrastructure-related jobs to focus on actual development. To know what to automate, investigate the causes of toil. According to the report cited above, companies struggle with:

· too much technical debt (42%),

· misalignment of priorities (32%),

· lack of collaboration (17%),

· and insufficient training (20%),

all of which add to the manual effort (i.e., non-development work) performed by site reliability engineers. Address these causes and automate to free up time for value generation.
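One simple way to decide what to automate first is to rank recurring manual tasks by the time they consume each month. The sketch below assumes you already log such tasks; the ToilTask fields, the task names, and the numbers are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToilTask:
    """A recurring manual task (hypothetical structure for illustration)."""
    name: str
    minutes_per_run: float
    runs_per_month: int

    @property
    def monthly_minutes(self) -> float:
        # Total manual time this task costs per month.
        return self.minutes_per_run * self.runs_per_month


def automation_candidates(tasks, top_n=3):
    """Return the top_n tasks that consume the most manual time per month."""
    return sorted(tasks, key=lambda t: t.monthly_minutes, reverse=True)[:top_n]


# Made-up task log.
tasks = [
    ToilTask("restart stuck batch job", 10, 40),
    ToilTask("rotate certificates by hand", 45, 4),
    ToilTask("provision test environment", 60, 10),
    ToilTask("clear full-disk alerts", 5, 30),
]

ranked = automation_candidates(tasks)
for task in ranked:
    print(f"{task.name}: {task.monthly_minutes:.0f} min/month")
```

Time spent is only one possible ranking criterion; a real prioritisation would probably also weigh the cost of automating each task and the risk of getting it wrong.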

4. Build observability

Observability tries to give software systems a definite shape or form through technologies like telemetry. As a result, site reliability engineers have a visual grasp of what they are currently working on. According to the principles of observability, an SRE model must have clear SLOs for target reliability levels and an error budget associated with them. Engineers draw from this budget for every incident, which creates an incentive to resolve incidents as efficiently and cheaply as possible. Metrics such as SLOs and error budgets aid observability.
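The relationship between an SLO and its error budget is simple arithmetic, sketched below in Python. The sketch assumes an availability SLO measured in minutes of downtime over a rolling 30-day window; the function names, the window length, and the incident durations are illustrative assumptions.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window.

    `slo_target` is a fraction such as 0.999 for "99.9% available".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def remaining_budget_minutes(slo_target, incident_minutes, window_days=30):
    """Error budget left after subtracting downtime from each incident.

    A negative result means the budget for the window is overspent.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return budget - sum(incident_minutes)


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(0.999)

# Two made-up incidents of 12 and 20 minutes leave roughly 11.2 minutes.
remaining = remaining_budget_minutes(0.999, [12, 20])
print(f"budget={budget:.1f} min, remaining={remaining:.1f} min")
```

When the remaining budget approaches zero, a common response is to slow down risky changes and prioritise reliability work until the window recovers.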

No Longer Only Toil and Trouble: How to Deliver an Enriched SRE Experience

The 2021 survey starts with the following phrase: “SREs are beautiful.”

It means that systems do not run themselves (a fact which often stays unacknowledged due to abstraction and the proliferation of digital technology), and site reliability engineers are the people powering your organisation. In this context, toil is not only a hindrance to an engineer’s work experience but also a drain on your precious SRE talent. Fortunately, toil numbers have visibly dropped from 40% in 2020 to 25% this year, signalling a shift in organisational attitudes towards the SRE function. And we must keep this going. Automated systems, aids like AIOps, constantly upgraded skill sets, and a culture of observability and collaboration can help you:

● Empower engineers to solve incidents with minimal effort and budget investment

● Gain from an engineer’s exceptional coding skills to drive genuine value

I will continue this conversation in other blogs on the subject, covering the need for an SRE platform, an SRE operating guide, and SRE pitfalls. You can email me at Arvind@AM-PMAssociates.com if you want me to address any other topic. Share your thoughts with me in the comments section below, and I will try to respond at the earliest.

Arvind Mehrotra

Board Advisor, Strategy, Culture Alignment and Technology Advisor