At cVation we have a heavy DevOps focus. With DevOps as a principle, we try to address every task with a customer-centric view. A critical driver for us is to reconcile the interests of development and operations, which, when formulated in the baldest terms, clash: "We want to launch anything, any time, without hindrance" versus "We never want to change anything in the system once it works".
The tenets for achieving customer centricity are a focus on availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning, and this shapes how we do operations. We cannot rely solely on monitoring uptime or availability; instead, we must monitor the larger user flows that directly affect users' experience of the software we deliver. To achieve this, we employ Site Reliability Engineering.
Site Reliability Engineering
Site Reliability Engineering (SRE) is an engineering discipline coined by Google, devoted to helping an organization sustainably achieve the appropriate level of reliability in its systems. "Appropriate" is an important word here, since it allows us to find a balance between agility and stability. The responsibilities of SRE teams are:
Availability, reliability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their services.
To allow a team to do this on a large project, we aim to automate or eliminate anything repetitive. The SRE team ensures that the system is designed to work reliably amidst frequent updates from the development teams. For the SRE team to keep pace with these changes, the system needs to be monitored, i.e., it must have a high level of observability: enough data is logged in the system to support proper investigations. This allows the SRE team to continuously assess new user flows and monitor the project's critical paths.
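As a minimal sketch of what we mean by observability, the Python snippet below (the flow name, step names, and fields are purely illustrative assumptions) emits one structured log line per step of a user flow, tagged with a correlation ID so the whole flow can be reconstructed during an investigation:

import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("user-flow")

def log_flow_event(flow, step, correlation_id, **fields):
    # One JSON log line per step; filtering on correlation_id later
    # reconstructs the full flow during an investigation.
    logger.info(json.dumps({
        "timestamp": time.time(),
        "flow": flow,
        "step": step,
        "correlation_id": correlation_id,
        **fields,
    }))

# Hypothetical flow spanning several steps/services:
cid = str(uuid.uuid4())
log_flow_event("device_onboarding", "request_received", cid, user_id="user-17")
log_flow_event("device_onboarding", "confirmation_shown", cid, user_id="user-17", latency_ms=113)

With log data in this shape, critical paths can be monitored by querying for flows that never reach their final step or that exceed the agreed latency.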
Agility versus Stability
Historically, the division between development and operations has resulted in a clash over what is most important: agility in releasing new features, or maintaining stability for the customer. The SRE concept allows us to discuss this conundrum by utilising Service Level Indicators (SLIs), Service Level Objectives (SLOs), and a resulting error budget. An SLI can be interpreted as the measurement of an important user flow; in an IoT setup, for example, it could be the user installing a sensor and confirming that it is alive in a management portal. The SLO would then be the rules for latency, availability, and correctness that the user should expect. The SLO allows the SRE team and the business to communicate and agree on concrete numbers for uptime and performance. The difference between the measured SLI and the defined SLO is the error budget. When we are performing better than agreed, the development teams can ship features as fast as they like. However, as soon as the budget is negative, the SRE team is charged with halting releases, investigating issues, and removing impediments, or tasking a development team to remove them. In short, a negative error budget means that the project needs to focus on quality.
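To make the relationship between SLI, SLO, and error budget concrete, here is a minimal Python sketch; the SloReport class and the numbers are illustrative assumptions, not taken from a real project:

from dataclasses import dataclass

@dataclass
class SloReport:
    slo_target: float   # e.g. 0.999 means 99.9% of events must be "good"
    good_events: int    # events that met the latency/correctness rules
    total_events: int

    @property
    def sli(self) -> float:
        # SLI: measured proportion of good events in the window.
        return self.good_events / self.total_events

    @property
    def error_budget_remaining(self) -> float:
        # Failures the SLO allows in this window, minus failures already spent.
        allowed_failures = (1 - self.slo_target) * self.total_events
        actual_failures = self.total_events - self.good_events
        return allowed_failures - actual_failures

# Illustrative window: 1,000,000 flow executions, 999,450 of them good.
report = SloReport(slo_target=0.999, good_events=999_450, total_events=1_000_000)
print(f"SLI: {report.sli:.4%}")                                        # 99.9450%
print(f"Remaining error budget: {report.error_budget_remaining:.0f}")  # 450 events

When error_budget_remaining goes negative, the SLO is breached and, per the policy above, releases halt until reliability work restores the budget.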
Thus, the battle between operations and development can be reduced to whether we are over or under the agreed SLO. The error budget can be defined explicitly as how much "unreliability" is available within the measured time frame; for example, a 99.9% availability SLO over a 30-day window leaves roughly 43 minutes of allowed unreliability. This is an important notion, since every deployment introduces a risk of unreliability; even the biggest companies, such as Microsoft, can release something that, despite all tests and best practices, results in an outage. The SRE team is often seen as focusing only on user flows related to availability or latency, but a user flow can also be viewed from a business perspective, such as revenue or churn. Let us examine this with the following example, where we have a machine learning setup that generates recommendations for our users.