Reliability through Expectation Chains

Summary

Monitoring, observability and other buzzwords abound, but they are usually an afterthought of the design process and result in little more than a collection of metrics and dashboards that answer the known knowns about the environment. Essentially they are the pretty picture of green we want to see rather than the reality we need to see. This article discusses something I have been thinking about for a few years and have yet to see implemented properly. I call it the 'expectation chain,' and the way I see it, this is the strategic vision through which real observability could be realized in companies (regardless of the software used to collect the metrics and other data needed to measure the state of the world).

A Customer-Driven Approach to Observability

Face it, too many people in the technology space came up the easy way and didn't spend enough time face-to-face with customers, and it shows. We lose empathy and treat customers as if they serve us. Companies don't make this any better when they emphasize their short-term bottom line over their long-term sustainable growth. This manifests itself in higher support costs for companies and a larger branding issue. But more importantly, it is eroding consumers' trust. People just don't trust technology, and the result is a perpetual downward spiral of cynicism. It's disgusting and it does not have to be this way. Technology should serve the customer!

So what do customer service and observability have to do with one another? Well, to borrow from Charity Majors at Honeycomb: "Nines don't matter if the customers aren't happy." I'm proud to have bought one of her t-shirts that say just that because it's true. You can be 99.99999999% available, reliable and performant. But if your 9's measure the wrong thing, you're just 99.99999999% off the mark. It's like feeding candy to a kid and saying "hey, I am feeding the kid three times a day." But you're feeding the kid the wrong thing! Stop feeding your business bullshit.

A common problem is measuring useless system metrics (e.g. CPU, memory, disk I/O). These are diagnostic metrics. They do not measure customer value. You're taking the temperature of the patient, not listening to their problem. Customers don't value your system resources. They value the service you provide to them. To borrow from a former engineering leader who oversaw a now-defunct chat product, customers have one question: "Can I fucking login and chat?!" They don't care that your dashboards are green or that your graphs are pretty. Customers don't care about your Kibana dashboards! They are paying for a service, and if you aren't measuring your ability to deliver that service, you're measuring the wrong thing.

Customers have EXPECTATIONS! Measure your ability to satisfy those expectations. That's where we are going to start. For every interaction there exist three parts:

Customer (or consumer)
Expectation
Provider (or producer)

At the top level, the customer (defined as the persons or entities which justify the existence of the organization) has some need that is unmet by the market. The organization exists to meet that need as the provider of some complementary service. This means the customer expects his/her/its need to be met by the providing organization. Mathematically we can graph this as...

customer -[expects]-> provider

This creates the simplest construct of business value measurement. Any metric that does not directly measure the provider's ability to satisfy the expectation of the customer is useless noise in the worst case or a diagnostic tool for investigating problems in the best case. Let's go further.

Three Dimensions of Expectation

There are three dimensions of expectations in any (c)-[e]->(p) graph, where c is the customer, e the expectation and p the provider:

Availability
Reliability
Performance (Latency or Throughput)

These can best be demonstrated by example. It's Friday night. You want to watch a movie. When you arrive at the theater, you find that the doors are open, the power is on and someone is there to sell you a ticket. The theater is 'available.' You are satisfied along that dimension. You go into the movie theater with your ticket, take a seat and the movie starts within a reasonable amount of time. The theater met your latency (performance) expectation. As the movie plays, you are immersed in the film, forgetting about the world. The movie finishes without any errors or interruptions. Accordingly, you are satisfied that the theater is reliable. This means we can say that the theater has succeeded as provider in meeting your expectations as the customer on all three dimensions.

The next week you go to the same theater. You approach the window to buy a ticket, but no one is there. Twenty minutes later, you're still standing at the window with no one there to sell you a ticket. The lights are on. The schedule says the movie should be showing. You have an expectation that the theater is available, but eventually you give up and go home unsatisfied. The theater is "unavailable." The theater has failed, and though it may have had the ability to show the film you wanted to see reliably and within your latency expectations, the loss of availability prevented any measure of these two characteristics. You, the customer, are not happy. If theater management measures its success by the temperature of popcorn, the reliability of its projectors or the amount of time between doors opening and the movie starting (latency), it will eventually fail as a business. They may have the best popcorn in town. 99.999% tastier popcorn than the rest of the planet. But the customer still was not happy because no one sold the ticket.

I could go on to point out how the customer can be disappointed in different scenarios. But suffice it to say that if your business does not satisfy each of the three dimensions (availability, reliability and performance), you will not survive. This is as true of a theater as it is of a software business.
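To make the three dimensions concrete, here is a minimal Python sketch of how the two theater visits might be scored. Everything in it is an assumption for illustration: the Visit record, the score function and the 15-minute threshold standing in for "a reasonable amount of time" are hypothetical names and values, not part of any real tool. Note that reliability and performance are only computed over visits where the customer was served at all, since the loss of availability prevents any measure of the other two.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Visit:
    """One customer interaction with the provider (hypothetical record shape)."""
    served: bool                              # availability: was someone there to sell a ticket?
    start_delay_min: Optional[float] = None   # performance: minutes waited for the film to start
    completed: Optional[bool] = None          # reliability: did the film finish without interruption?


def score(visits, max_delay_min=15.0):
    """Return the fraction of visits that met each dimension of expectation.

    max_delay_min is an assumed stand-in for "a reasonable amount of time".
    Reliability and performance are unobservable when the customer was never
    served, so they are computed over served visits only.
    """
    served = [v for v in visits if v.served]
    availability = len(served) / len(visits) if visits else None
    if not served:
        return {"availability": availability, "reliability": None, "performance": None}
    reliable = sum(1 for v in served if v.completed)
    on_time = sum(
        1 for v in served
        if v.start_delay_min is not None and v.start_delay_min <= max_delay_min
    )
    return {
        "availability": availability,
        "reliability": reliable / len(served),
        "performance": on_time / len(served),
    }


visits = [
    Visit(served=True, start_delay_min=10, completed=True),  # the first Friday night
    Visit(served=False),                                     # the next week: nobody at the window
]
result = score(visits)
print(result)
```

With these two made-up visits, availability comes out at one half while reliability and performance still look perfect. That is exactly the trap: the projector and start-time numbers say nothing about the customer who went home without a ticket.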

The Recursive Nature of Expectations

Measuring success must be consistent if a business is going to really get value out of this. But that is the beauty of an expectation chain. The same graph we illustrate as (c)-[e]->(p) can apply to the end customer as well as to internal customers. And here is where customer service breaks down in many companies. No company has ever set out to intentionally fail its customers. That would be a suicidal endeavor that would guarantee business failure. Okay. No rational business has ever done this. I can think of a large computer manufacturer that screwed me as a customer a few times intentionally. But I digress.

In most otherwise rational businesses, the customer's needs are known and management tries, to a greater or lesser degree, to meet those needs. But the business fails to meet them due to internal issues (often politics and other crap that should have been left on a junior high playground). Business leaders let this exist because they can't measure its impact. We all think it's acceptable because everyone is just trying to get ahead. This usually means one person withholds information or deprioritizes actions for their team in order to focus on a personal agenda rather than the customer's needs. I know...this doesn't happen in your company. You're all a bunch of angels, right? Bullshit! This is where the expectation chain concept can help a company quantify the problem everyone knows exists and help measure the business's efforts to minimize this dysfunction.

How does this work?

Simple: we recursively dissect the business into a larger graph of expectation chains, where a provider becomes a customer of some other entity within the business. Take, for example, our friends Alice, Bob and Charlie. Alice is a customer who wants to buy a sandwich for lunch. Bob owns a sandwich shop. Charlie owns a vegetable farm. The first graph is obviously (customer:alice)-[expects: sandwich]->(provider:bob). But this decomposes further as (customer:bob)-[expects: lettuce]->(provider:charlie). Now we can create very complex graphs of every aspect of a business operation, from the paying end customer down to the vendors, individual employees and equipment. Then we can measure our ability to meet each expectation in this graph as both a means of measuring the business performance and a means of identifying areas of improvement.
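As a rough sketch of what that recursion could look like in code, the Alice, Bob and Charlie chain can be stored as a list of (customer, need, provider) edges and walked in both directions. The function names build_graph, chain and affected_by are hypothetical, chosen for this illustration only; a real business graph would be far larger and would likely live in a graph database rather than a Python list.

```python
from collections import defaultdict

# Each edge reads (customer) -[expects: need]-> (provider).
EDGES = [
    ("alice", "sandwich", "bob"),
    ("bob", "lettuce", "charlie"),
]


def build_graph(edges):
    """Index the edges by customer so the chain can be followed downstream."""
    graph = defaultdict(list)
    for customer, need, provider in edges:
        graph[customer].append((need, provider))
    return graph


def chain(graph, customer, seen=None):
    """Depth-first walk returning every expectation reachable from `customer`."""
    seen = set() if seen is None else seen
    result = []
    for need, provider in graph.get(customer, []):
        edge = (customer, need, provider)
        if edge in seen:          # guard against cycles in the graph
            continue
        seen.add(edge)
        result.append(edge)
        # The provider is in turn a customer of someone else: recurse.
        result.extend(chain(graph, provider, seen))
    return result


def affected_by(edges, failing_provider):
    """Return every customer who depends, directly or transitively, on a provider."""
    dependents = defaultdict(set)
    for customer, _need, provider in edges:
        dependents[provider].add(customer)
    affected, stack = set(), [failing_provider]
    while stack:
        node = stack.pop()
        for customer in dependents[node]:
            if customer not in affected:
                affected.add(customer)
                stack.append(customer)
    return affected


for customer, need, provider in chain(build_graph(EDGES), "alice"):
    print(f"({customer})-[expects: {need}]->({provider})")
print(sorted(affected_by(EDGES, "charlie")))
```

Walking down from Alice lists every expectation her sandwich depends on. Walking up from Charlie shows whose expectations are at risk when the lettuce does not arrive, which is the "areas of improvement" half of the idea.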

Service Level Indicators and Objectives

In recent years there has been a lot of talk about "Service Level Objectives" (SLOs) and "Service Level Indicators" (SLIs) as part of the Site Reliability Engineering movement. But most companies just use SLOs and SLIs as aliases for the same broken observability concepts they always used. This really sucks, almost as much as the known-known metrics we commonly use. A real service level objective should be a SERVICE (to the customer) objective, measured by some related indicator.

Expectation chains are a way of illustrating your business operations graphically so as to state your environment, define your indicators and assert your objectives...in a way that actually matters. Rather than measuring some useless metric that doesn't really matter to the customer, you measure the expectations that justify your existence as a business. You define objectives with regard to those measurements. From there you can even directly tie your business OKRs (Objectives and Key Results) to the SLOs.
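One way this might look, sketched in Python under my own assumptions: the objective is attached to an expectation edge rather than to a system metric, and the indicator is simply the fraction of interactions in which the expectation was met. The ExpectationSLO class, the 99% target and the interaction counts below are all hypothetical and made up for illustration; they are not measurements from any real system.

```python
from dataclasses import dataclass


@dataclass
class ExpectationSLO:
    """An objective attached to one (customer)-[expects]->(provider) edge."""
    customer: str
    expectation: str
    provider: str
    target: float  # objective: fraction of interactions that must satisfy the expectation

    def indicator(self, met, total):
        """SLI: fraction of observed interactions in which the expectation was met."""
        if total == 0:
            raise ValueError("no interactions observed; the SLI is undefined")
        return met / total

    def is_met(self, met, total):
        """True when the measured indicator reaches the objective."""
        return self.indicator(met, total) >= self.target


# Illustrative numbers only: 985 of 1000 sandwich orders satisfied Alice.
slo = ExpectationSLO("alice", "sandwich", "bob", target=0.99)
sli = slo.indicator(met=985, total=1000)
ok = slo.is_met(met=985, total=1000)
print(f"SLI={sli:.3f} target={slo.target:.2f} met={ok}")
```

The point of the design is that the unit being counted is a customer interaction, not a CPU sample. A key result in an OKR can then point at the same target value that the SLO uses.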

Rather than metrics, you get business solutions.

(Note: I really would like to see this in action some day.)