At cVation we have a heavy DevOps focus. With DevOps as a principle, we try to address every task with a customer-centric view. A critical driver for us is to reconcile the interests of development and operations, which, when formulated in the baldest terms, clash: "We want to launch anything, any time, without hindrance" versus "We never want to change anything in the system once it works".
The tenets for achieving customer centricity are a focus on availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning, and this shapes how we do operations. We cannot rely solely on monitoring uptime or availability; instead, we must monitor the larger user flows that directly affect users' experience of the software we deliver. To achieve this, we employ Site Reliability Engineering.
Site Reliability Engineering
Site Reliability Engineering (SRE) is an engineering discipline coined by Google, devoted to helping an organization sustainably achieve the appropriate level of reliability in its systems. "Appropriate" is an important word here, since it allows us to find a balance between agility and stability. The responsibilities of SRE teams are:
Availability, reliability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their services.
To allow a team to do this on a large project, we aim to automate or eliminate anything repetitive. The SRE team ensures that the system is designed to work reliably amidst frequent updates from the development teams. For the SRE team to keep pace with these changes, the system needs to be monitored, i.e., it must have a high level of observability: enough data is logged in the system to support proper investigations. This allows the SRE team to continuously assess new user flows and monitor the project's critical paths.
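As a minimal sketch of what we mean by observability, the Python snippet below (the flow name, step names, and fields are purely illustrative assumptions) emits one structured log line per step of a user flow, tagged with a correlation ID so the whole flow can be reconstructed during an investigation:

import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("user-flow")

def log_flow_event(flow, step, correlation_id, **fields):
    # One JSON log line per step; filtering on correlation_id later
    # reconstructs the full flow during an investigation.
    logger.info(json.dumps({
        "timestamp": time.time(),
        "flow": flow,
        "step": step,
        "correlation_id": correlation_id,
        **fields,
    }))

# Hypothetical flow spanning several steps/services:
cid = str(uuid.uuid4())
log_flow_event("device_onboarding", "request_received", cid, user_id="user-17")
log_flow_event("device_onboarding", "confirmation_shown", cid, user_id="user-17", latency_ms=113)

With log data in this shape, critical paths can be monitored by querying for flows that never reach their final step or that exceed the agreed latency.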
Agility versus Stability
Historically, the division between development and operations has resulted in a clash over what is most important: agility in releasing new features, or maintaining stability for the customer. The SRE concept allows us to discuss this conundrum by utilising Service Level Indicators (SLIs), Service Level Objectives (SLOs), and a resulting error budget. An SLI can be interpreted as the measurement of an important user flow; in an IoT setup, for example, it could be the user installing a sensor and confirming that it is alive in a management portal. The SLO would then be the rules for latency, availability, and correctness that the user should expect. The SLO allows the SRE team and the business to communicate and agree on concrete numbers for uptime and performance. The difference between the measured SLI and the defined SLO is the error budget. When we are performing better than agreed, the development teams can ship features as fast as they like. However, as soon as the budget is negative, the SRE team is charged with halting releases, investigating issues, and removing impediments, or tasking a development team to remove them. In short, a negative error budget means that the project needs to focus on quality.
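To make the relationship between SLI, SLO, and error budget concrete, here is a minimal Python sketch; the SloReport class and the numbers are illustrative assumptions, not taken from a real project:

from dataclasses import dataclass

@dataclass
class SloReport:
    slo_target: float   # e.g. 0.999 means 99.9% of events must be "good"
    good_events: int    # events that met the latency/correctness rules
    total_events: int

    @property
    def sli(self) -> float:
        # SLI: measured proportion of good events in the window.
        return self.good_events / self.total_events

    @property
    def error_budget_remaining(self) -> float:
        # Failures the SLO allows in this window, minus failures already spent.
        allowed_failures = (1 - self.slo_target) * self.total_events
        actual_failures = self.total_events - self.good_events
        return allowed_failures - actual_failures

# Illustrative window: 1,000,000 flow executions, 999,450 of them good.
report = SloReport(slo_target=0.999, good_events=999_450, total_events=1_000_000)
print(f"SLI: {report.sli:.4%}")                                        # 99.9450%
print(f"Remaining error budget: {report.error_budget_remaining:.0f}")  # 450 events

When error_budget_remaining goes negative, the SLO is breached and, per the policy above, releases halt until reliability work restores the budget.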
Thus, the battle between operations and development can be reduced to whether we are over or under the agreed SLO. The error budget can be defined explicitly as how much "unreliability" is available within the measured time frame; for example, a 99.9% availability SLO over a 30-day window leaves roughly 43 minutes of allowed unreliability. This is an important notion, since every deployment introduces a risk of unreliability; even the biggest companies, such as Microsoft, can release something that, despite all tests and best practices, results in an outage. The SRE team is often seen as focusing only on user flows related to availability or latency, but a user flow can also be viewed from a business perspective, such as revenue or churn. Let us examine this with the following example, where we have a machine learning setup that generates recommendations for our users.