Risky Business
Posted on November 1, 2020 | 13 minute read

I have recently been thinking once again about how to continuously drive quality through improvements in availability, securability, and predictability in order to meet the needs and high expectations of developers and customers. Ideally, just like with Functional Requirements, when we define our products we also consider other parts of the contract such as SLAs. Under these conditions customers ideally expect that the software does exactly what it “says on the tin.”
In reality, as a member of a product development team (Systems Engineer, Product Owner, Software Engineer, etc.), things are never quite as simple. What matters is not just the definition but also HOW and WHEN that definition is met. In other words, we recognize that there are various competing forces, and in many agile texts these are framed as:
- Build the Right thing
- Build the thing Right
- Build the thing Fast
The consequence of not reaching a balance between these competing forces can be severe. For example, focusing too much on speed means that teams, once they ship their product, may suddenly find that the support burden is simply too high to continue innovating. At the same time, if too much pressure is placed on building things the right way, the product may never ship with even the minimum required set of features. The challenge that is often faced, however, is determining the best method of communicating the trade-offs being made. An interesting observation that I have made in a number of different Engineering Organizations I have worked in is that while many teams are able to clearly recognize if they have built the right thing (feedback on a feature, or as an engineer having tested to the requirements), and they can directly measure the speed at which code is shipped to customers (and receive feedback), those same teams struggle to measure whether the thing has been built right. Here are two quick examples…
- A manager I once worked with told me: “if we spend a week on testing of course we will find more bugs - but we will also never ship our product.” They are absolutely right!
- In spending some time talking with a small group of engineers, they complained that their team was constantly struggling in production and that they were being called by support frequently. Later in the same conversation I asked: if I could give the team a month to work on whatever they wanted, what would it be? The answer was that they would focus on building new features for customers.
The thing that I have come to observe is that there are actually two pieces at play here, and I have started to believe that they have become conflated. Specifically, how do we draw the distinction between:
- Building things the right way.
- Choosing HOW to build (fast and right).
To be clear - both are important but they are also distinct. The first relates to meeting some defined expectation (a focus on the discrete, the output) while the second relates to the overall process and behavior (a focus on the systemic) that achieves the right mix of Build the Right Thing, Build it Right, and Build it Fast. What I have observed over the years is that as soon as there is a question about some output, often as a reaction to a problem such as a system outage, many stakeholders are not careful about the questions they ask and thus conflate what has been built (was a previous decision wrong?) with the systemic question of whether the right decisions are being made. For example, I will hear the question “is the quality of our software correct?” The problem is that the question is ambiguous - are we asking if the sum of our choices is correct (are we making the right trade-offs?), or are we asking if, once we have made a choice, there is a discrete problem in the software? The end result (and outcome) can be equally messy. I have seen:
- Everyone jump to focus on solving a narrow vertical problem.
- Think about a retrospective or post-mortem that you may have contributed to recently. Was the discussion about WHY that problem occurred or was there discussion on the attributes of what happened and if the processes in place could prevent many cases of the style of problem from occurring?
- Experience tells me that likely only one was focused on, but BOTH questions are important.
- Focusing on one architectural ‘-ility’
- Think about after some large outage - did the organization become hyper-focused on availability? Or if there was a security incident, did everyone jump to ask if all teams had the right security focus? Similar to becoming over-focused on a narrow vertical problem, I argue that focusing too much attention on a single horizontal quality can equally drive too much engineering effort (and investment) with diminishing returns.
- Teams become paralyzed by FUD.
- Because there is always something more that could be done, there is always the temptation to do a little more testing. I can’t tell you how many times near the end of a release I have seen teams and management become nervous.
Understanding that both questions (discrete concerns and systemic concerns) exist and should always be talked about may seem obvious at this point, but such discussions happen infrequently. Leaders, such as myself, need to be careful about the questions we ask and not allow ourselves (and our teams) to fall victim. The results can be painful:
- If attention is always on the discrete then slowly, over time, the number of risks and the consequence of those risks is likely to grow. This isn’t because of any one problem but rather because more software is built and deployed, and therefore the likelihood of observing an undesirable consequence grows along with it. Taking an over-simplified (but directionally correct) model, we could state that every N thousand lines of code deployed will result in M human errors being made, of which L percent will cause an outage. Total outages will equal N * M * L. Assuming a company is successful, N will grow over time even if M and L stay constant. If growth in N outpaces efforts to reduce M and L, then at some point the pain of the outages will be felt. In similar fashion we could apply a model such as this to other risks such as compliance failure, security breach, etc.
- If attention always focuses on the systemic then the ability to ship new code (new features) will slow to a crawl. Teams will constantly look to re-invent and re-debate past decisions. In short, there are diminishing returns relative to the engineering cost required to achieve each incremental improvement.
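The over-simplified outage model above can be sketched in a few lines. The function name and all numbers here are illustrative, not from any real system:

```python
# Over-simplified (but directionally correct) outage model from the text:
# N thousand lines of code deployed, M human errors per thousand lines,
# L the fraction of errors that cause an outage. All inputs are made up.
def expected_outages(n_kloc, errors_per_kloc, outage_fraction):
    return n_kloc * errors_per_kloc * outage_fraction

# If M and L stay constant while the codebase triples (a successful company),
# expected outages triple too - the pain grows with N alone.
print(expected_outages(100, 2.0, 0.05))  # ~10 expected outages
print(expected_outages(300, 2.0, 0.05))  # ~30 expected outages
```

The point of the model is not precision but direction: unless M and L are actively driven down (better processes, better tooling), growth alone guarantees more incidents.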
At this point it is worth asking: what are some ways to recognize and strike a balance between any discrete decision and watching how decisions are more generally made?
Many organizations will recognize that there are behavioral and systemic changes needed (both in process and in architectural approach) once some set of risks has been realized (an outage, security vulnerability, etc.). While this is not ideal, it is also not a reason to panic (frankly, doing so won’t help anyhow - it leads to either a deer-in-headlights freeze OR squirrels all running in different directions, jumping from issue to issue). The following are some of my thoughts on how to keep an organization aligned. It focuses on the following outcomes:
- Making things measurable by talking about RISK tolerance instead of “-ility” improvement.
- Building a culture that is fearless in discussing risk.
- Maintaining a beginner’s mind.
- Creating a fast feedback cycle.
Make things measurable
It is all too easy at this point to begin a discussion focused on “improving,” but the first question to ask is whether EVERYONE agrees on a definable outcome. If I want to improve our security stance - when will we know we have achieved the goal? Setting a goal of mere improvement almost always fails (I wonder if this is actually correlated with the size of the organization) as soon as any stakeholder begins to question “if we are there yet.” If an organization has seen a streak of issues, for example, it is easy to build near-term momentum and focus on targeted improvements, BUT if things are quiet for a period of time, months for example, then stakeholders may question if everything is already where we want it to be and the motivation for continuing to invest will wane.
A simple trick that I have found can help here is to focus the discussion on setting an acceptable risk goal and to catalog the risks that exist. I have seen that getting everyone aligned that a risk exists is far easier than attempting to align on an abstract improvement initiative. The key is that a risk either exists or does not exist. While the consequence of the risk may be debated, the process of talking about risk and asking if a risk is acceptable is much easier than asking if the quality of what we build is good enough.
When thinking about measurements, understanding if/when things are deviating from a steady state is equally important. In other words, looking at the derivative of a metric that is being watched can be helpful. When it comes to risk, what we want to look at is whether there is a change in the rate at which we are discovering (or solving for) risk. Recognizing early that risk is growing (and the rate at which it is growing) can lead to taking proactive action.
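As a minimal sketch of what “watching the derivative” might look like in practice - the helper and the weekly series below are hypothetical, purely for illustration:

```python
# Hypothetical sketch: track the week-over-week change (a discrete derivative)
# of the number of open risks in a register. The data series is made up.
def weekly_growth(open_risks):
    """First difference: net risks added (discovered minus resolved) each week."""
    return [b - a for a, b in zip(open_risks, open_risks[1:])]

open_risks_per_week = [12, 13, 13, 15, 18, 23]
print(weekly_growth(open_risks_per_week))  # [1, 0, 2, 3, 5]
```

The raw count alone (12 → 23) is easy to rationalize; the accelerating first difference (1, 0, 2, 3, 5) is the early signal that it may be time to act before the backlog becomes painful.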
Building a culture that is fearless in discussing risk
The problem in just talking about Risk is that it frequently requires a culture that is willing to openly discuss and debate risk (air ALL the dirty laundry). In order to measure risk, it is important that as soon as a risk is recognized it is written down (in order to be measured). If risk isn’t written down then the measurements will be off. Getting good at this takes time and is as much culture as it is any specific technical skill. I have recognized that there are frequently two reasons that this fails:
- Talking about risk can lead to FUD. FUD can be debilitating, and I wrote a previous article talking about methods for [dealing with FUD](/post/dealingwithfud/). Because of this, and perhaps because an organization has dealt with this in the past, individuals shy away from raising risk and stoking others’ fears.
- Fear of repercussions. If an organization reacts poorly when a risk is identified then, simply put, risks will be raised less frequently. This is a learned behavior.
For leaders it is important to talk the talk and ensure that risks can be freely raised. Creation of a risk register, discussing that risk can be accepted (accepting no risk would imply not shipping code), and showing how risk leads to action are all methods that can lead to a more fearless culture.
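To illustrate how lightweight a risk register can be to start with - the field names and entries below are my own invention, not a standard:

```python
# Minimal, hypothetical risk register: just enough structure to write risks
# down, record whether they were accepted or mitigated, and count what's open.
from dataclasses import dataclass

@dataclass
class Risk:
    title: str
    impact: str             # e.g. "outage", "security breach", "compliance failure"
    decision: str = "open"  # "open", "accepted", or "mitigated"

register = [
    Risk("Single-region database", impact="outage"),
    Risk("Unrotated API keys", impact="security breach", decision="accepted"),
]

open_risks = [r for r in register if r.decision == "open"]
print(len(open_risks))  # 1
```

Note that an “accepted” risk is a first-class outcome here, not a failure: making acceptance an explicit, recorded decision is part of what makes raising risks safe.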
Interestingly enough, when looking to push a culture towards openly talking about risk, creating a focus on improving a service on behalf of a customer is helpful. This is the opposite suggestion of the previous section (which targets creation of a mechanism that can drive discrete change) but serves the intent of encouraging writing more down while building alignment on the right level of investment.
Beginner’s Mind - Thinking about step function changes in needs
I am going to focus a little on Availability for a moment simply to make a point. An interesting discussion when I talk to teams about availability is that teams want to achieve more “9’s.” The pattern I most often see is that a team has started with something that is perhaps built for 3 9’s of availability (8-9 hours of downtime allowed in a year). Over time that team wants to do better for customers and therefore pushes towards 4 9’s (just under 1 hour of downtime allowed in a year). Through engineering investments in testing and quality the team gets there. Then the team sets a goal to achieve 5 9’s (~5 minutes of downtime in a year). Looking to the past, those teams attempt to focus on up-front quality as the main method for achieving this - but I argue there is a fundamental problem with this approach. I am going to make a few assertions first:
- Likely the metrics and monitoring needed to observe any error trail by about 5 minutes. Even with 1 minute collection points there can be up to a 2 minute delay in observing a metric.
- Paging a human takes time. Let’s be optimistic and say that even with alerts that trail by exactly 1 minute, getting a human on VPN and ready to look will take on average 5 minutes (again, I am being optimistic here).
- Making a decision takes some time. If all the data is laid out or runbooks are well defined maybe that can be done in 1-2 minutes.
If I start with these few assertions then the minimum time to correct an outage-causing bug is 7-10 minutes in the best case. The problem is that with humans in the loop, achieving 5 9’s of availability takes an approach that is fundamentally different than what can be done to achieve 3 or 4 9’s of availability. This includes becoming focused on self-healing systems, automation of runbooks, and more.
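To make the arithmetic behind this concrete, here is a quick sketch computing the yearly downtime budget each level of “nines” leaves (the 7-10 minute human-response floor comes from the assertions above):

```python
# Yearly downtime budget for n "nines" of availability.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def downtime_budget_minutes(nines):
    availability = 1 - 10 ** -nines  # 3 nines -> 0.999, 4 -> 0.9999, 5 -> 0.99999
    return MINUTES_PER_YEAR * (1 - availability)

for n in (3, 4, 5):
    print(n, round(downtime_budget_minutes(n), 1))
# 3 nines -> ~526 min (~8.8 h), 4 nines -> ~52.6 min, 5 nines -> ~5.3 min.
# A single incident with a 7-10 minute human-in-the-loop response floor
# already exceeds the entire 5-nines budget - hence self-healing and automation.
```

This is why the jump from 4 to 5 nines is a step function change in approach, not an incremental quality improvement.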
Taking a step back, the same holds true for a number of different Non-Functional style requirements. The important point is that as a business grows it is important to maintain a beginner’s mind, to constantly re-evaluate if behaviors (and processes) that were sufficient to a point will still enable the business to move forward, and finally to be intellectually honest about the costs of taking on such step function changes.
Don’t just “Build the thing Fast” - focus on the “Feedback Cycle”
Often a stakeholder will presume that “Build the thing Fast” applies to getting the final product out the door. What is important to recognize is that a better interpretation is to minimize the time before getting feedback (I am personally a huge proponent of incremental development AND delivery, and have advocated with teams I have worked with for the notion of “Hypothesis Driven Development”). The distinction is important because the question becomes how to build, deploy, and improve over time (sequence delivery) while optimizing for the opportunity to collect feedback along the way. The same holds true when thinking about risk reduction, especially when a team (or organization) encounters the need to invest more heavily, such as determining that a step function change is required. When faced with such a need it seems prudent and practical to start with one area that has an acceptable ROI and that can be accomplished over a well understood and defined timeframe. I specifically use the term acceptable because I have found even myself falling victim to first looking for the biggest pain point without considering the cost to fix it; in practice I have found that an organization can move MUCH faster by taking on a smaller effort and learning aggressively through incremental development.
Wrap-up
Quality is near and dear to my heart. Being an engineer I have always seen myself as a craftsman, taking pride in the outcomes that the solutions I have produced enable over the longevity of their lives. Making smart and deliberate decisions, and at times investing in rebuilding and retooling, is important for achieving the outcomes found at the intersection of Build the Right Thing, Build it Right, and Build it Fast. What are the techniques that you have found to find your balance?