Risky Business

Posted on November 1, 2020 | 13 minute read

I have recently been thinking once again about how to continuously drive quality through improvements in availability, securability and predictability in order to meet the needs and high expectations of developers and customers. Ideally, just as with Functional Requirements, when we define our products we also consider other parts of the contract such as SLAs. Under these conditions customers expect that the software does exactly what it “says on the tin.”

In reality, as a member of a product development team (Systems Engineer, Product Owner, Software Engineer, etc.), things are never quite that simple. What matters is not just the definition but also HOW and WHEN that definition is met. In other words, we recognize that there are various competing forces, and in many agile texts these are framed as:

– Build the Right thing
– Build the thing Right
– Build the thing Fast

Failing to balance these competing forces has consequences. For example, teams that focus too much on speed may find, once they ship their product, that the support burden is simply too high to continue innovating. At the same time, if too much pressure is put on building things the right way, the product may not ship with the minimum required set of features. The challenge that is often faced, however, is determining the best method of communicating the trade-offs being made.

An interesting observation that I have made in a number of different Engineering Organizations I have worked in is that many teams are able to clearly recognize whether they have built the right thing (feedback on a feature or, as an engineer, having tested to the requirements), and they can directly measure the speed at which code is shipped to customers (and feedback is received), yet those same teams struggle to measure whether the thing has been built right. Here are two quick examples…

What I have come to observe is that there are actually two pieces at play here, and I have started to believe that they have become conflated. Specifically, how do we draw the distinction between:

To be clear - both are important but they are also distinct. The first relates to meeting some defined expectation (focus on the discrete, the output) while the second relates to the overall process and behavior (focus on the systemic) that achieves the right mix of Build the Right Thing, Build it Right and Build it Fast. What I have observed over the years is that as soon as there is a question about some output, often as a reaction to a problem such as a system outage, many stakeholders are not careful about the questions they ask. They conflate the discrete question of what has been built (was a previous decision wrong?) with the systemic question of whether the right decisions are being made. For example, I will hear the question “is the quality of our software correct?” The problem is that the question is ambiguous - are we asking whether the sum of our choices is correct (are we making the right trade-offs), or are we asking whether, once we have made a choice, there is a discrete problem in the software? The end result (and outcome) can be equally messy. I have seen:

That both questions (discrete concerns and systemic concerns) exist and should always be talked about may seem obvious at this point, but such discussions happen infrequently. Leaders, such as myself, need to be careful in the questions we ask and not allow ourselves (and our teams) to fall victim to conflating the two. The results can be painful:

At this point it is worth asking how to recognize and strike a balance between scrutinizing any discrete decision and watching how decisions are made more generally.

Many organizations will recognize that behavioral and systemic changes are needed (both in process and in architectural approach) only once some set of risks has been realized (an outage, a security vulnerability, etc.). While this is not ideal, it is also not a reason to panic (frankly, doing so won’t help anyhow - it leads to either a deer-in-headlights approach OR squirrels all running in different directions, jumping from issue to issue). The following are some of my thoughts on how to keep an organization aligned. It focuses on the following outcomes:

Make things measurable

It is all too easy at this point to begin a discussion focused on “improving,” but the first question to ask is whether EVERYONE agrees on a definable outcome. If I want to improve our security stance - when will we know we have achieved the goal? Setting a goal of improvement will almost always fail (I wonder if this is actually correlated with the size of the organization) as soon as any stakeholder begins to question “if we are there yet.” If an organization has seen a streak of issues, for example, it is easy to build near-term momentum and focus on targeted improvements, BUT if things are quiet for a period of time, months for example, then stakeholders may question whether everything is already where we want it to be, and the motivation for continuing to invest will wane.

A simple trick that I have found can help here is to focus the discussion on setting an acceptable risk goal and to catalog the risks that exist. I have seen that getting everyone aligned that a risk exists is far easier than attempting to align on an abstract improvement initiative. The key is that a risk either exists or does not exist. While the consequence of the risk may be debated, the process of talking about risk and asking whether a risk is acceptable is much easier than asking whether the quality of what we build is good enough.
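To make the idea of a catalog plus an acceptable risk goal concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Risk` and `Status` types, the three status values, the example entries, and the choice to express the goal as a maximum number of risks that have been neither accepted nor resolved. A real register would likely carry owners, dates and severity as well.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OPEN = "open"          # recognized, no decision made yet
    ACCEPTED = "accepted"  # explicitly accepted by stakeholders
    RESOLVED = "resolved"  # mitigated or no longer applicable


@dataclass
class Risk:
    title: str
    status: Status = Status.OPEN


def open_risk_count(register: list[Risk]) -> int:
    """Count risks that are neither accepted nor resolved."""
    return sum(1 for risk in register if risk.status is Status.OPEN)


def goal_met(register: list[Risk], max_open: int) -> bool:
    """True when the number of undecided risks is within the agreed goal."""
    return open_risk_count(register) <= max_open


# Hypothetical entries, purely for illustration.
register = [
    Risk("Single database instance with no failover"),
    Risk("Manual certificate rotation", Status.ACCEPTED),
    Risk("No rate limiting on the public API"),
]

print(open_risk_count(register))
print(goal_met(register, max_open=1))
```

The point of the sketch is that "is the goal met?" becomes a yes/no question everyone can check, which is harder to do with an open-ended improvement initiative.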

When thinking about measurements, understanding if and when things are deviating from a steady state is equally important. In other words, looking at the derivative of a metric that is being watched can be helpful. When it comes to risk, what we want to look at is whether there is a change in the rate at which we are discovering (or solving for) risk. Recognizing early that risk is growing (and the rate at which it is growing) can lead to taking proactive action.
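As a rough illustration of watching the derivative rather than the raw number, the sketch below takes weekly snapshots of an open-risk count and looks at the week-over-week change. The function names, the weekly cadence, the three-week window and the sample numbers are all assumptions made for the example, not a prescribed method.

```python
def weekly_change(open_risks: list[int]) -> list[int]:
    """First difference: how much the open-risk count moved each week."""
    return [later - earlier for earlier, later in zip(open_risks, open_risks[1:])]


def is_growing(open_risks: list[int], window: int = 3) -> bool:
    """True when the count rose in every one of the last `window` weeks."""
    recent = weekly_change(open_risks)[-window:]
    return len(recent) == window and all(delta > 0 for delta in recent)


# Hypothetical weekly snapshots of the number of open risks.
counts = [12, 12, 11, 13, 16, 20]

print(weekly_change(counts))
print(is_growing(counts))
```

A flat or falling count with a positive and increasing change is exactly the early signal described above: the absolute number may still look acceptable while the trend says otherwise.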

Building a culture that is fearless in discussing risk

The problem with just talking about risk is that it requires a culture that is willing to openly discuss and debate risk (air ALL the dirty laundry). In order to measure risk, it is important that as soon as a risk is recognized it is written down. If risk isn’t written down then the measurements will be off. Getting good at this takes time and is as much about culture as it is about any specific technical skill. I have recognized that there are frequently two reasons that this fails:

For leaders it is important to talk the talk and ensure that risks can be freely raised. Creating a risk register, discussing that risk can be accepted (accepting no risk would imply not shipping code), and showing how risk leads to action are all methods that can lead to a more fearless culture.

Interestingly enough, when looking to push a culture towards openly talking about risk, creating focus on improving a service on behalf of a customer is helpful. This is the opposite of the suggestion in the previous section (which targets creation of a mechanism that can drive discrete change), but it recognizes the intent of encouraging people to write down more while building alignment on the right level of investment.

Beginner’s Mind - Thinking about step function changes in needs

I am going to focus a little on Availability for a moment simply to make a point. An interesting pattern when I talk to teams about availability is that teams want to achieve more “9’s.” The approach I most often see is that a team has started with something that is perhaps built for 3 9’s of availability (8-9 hours of downtime allowed in a year). Over time that team wants to do better for customers and therefore pushes towards 4 9’s (just under 1 hour of downtime allowed in a year). Through engineering investments in testing and quality the team gets there. Then the team sets a goal to achieve 5 9’s (~5 minutes of downtime in a year). Looking to the past, those teams attempt to focus on up-front quality as the main method for achieving this - but I argue there is a fundamental problem with this approach. I am going to make a few assertions first:

If I start with these few assertions then the minimum time to correct an outage-causing bug is 7-10 minutes in the best case. The problem is that, with humans in the loop, achieving 5 9’s of availability takes an approach that is fundamentally different than what can be done to achieve 3 or 4 9’s of availability. This includes becoming focused on self-healing systems, automation of run-books, and more.
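The arithmetic behind this can be sketched in a few lines of Python. The helper names are made up for the example, and the calculation assumes a 365-day year and uses the 7-minute best-case correction time from above as the cost of one human-handled incident.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; ignores leap years


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


def incidents_that_fit(nines: int, minutes_per_incident: float) -> int:
    """How many incidents of a given length fit in the yearly budget."""
    return int(downtime_budget_minutes(nines) // minutes_per_incident)


for nines in (3, 4, 5):
    budget = downtime_budget_minutes(nines)
    fits = incidents_that_fit(nines, minutes_per_incident=7)
    print(f"{nines} nines: {budget:.2f} min/year, {fits} seven-minute incidents")
```

The yearly budgets work out to roughly 525.6 minutes (about 8.76 hours), 52.56 minutes and 5.26 minutes. At 5 9’s the whole year’s budget is smaller than a single best-case human-in-the-loop correction, which is why the step from 4 to 5 9’s calls for a different approach rather than more of the same.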

Taking a step back, the same holds true for a number of different Non-Functional Style requirements. As a business grows it is important to maintain a beginner’s mind, to constantly re-evaluate whether behaviors (and processes) that were sufficient up to a point will still enable the business to move forward, and finally to be intellectually honest about the costs of taking on such step function changes.

Don’t just “Build the thing Fast” - focus on the “Feedback Cycle”

Often a stakeholder will presume that “build the thing fast” applies to getting the final product out the door. What is important to recognize is that a better interpretation is to minimize the time before getting feedback. I am personally a huge proponent of incremental development AND delivery, and I have advocated with teams I have worked with for the notion of “Hypothesis Driven Development.” The distinction is important because the question becomes how to deploy, build, and improve over time (sequence delivery) while optimizing for the opportunity to collect feedback along the way. The same holds true when thinking about risk reduction, especially when a team (or organization) encounters the need to invest more heavily, such as determining that a step function change is required. When faced with such a need it seems prudent and practical to start with one area that has an acceptable ROI and that can be accomplished over a well understood and defined timeframe. I specifically use the term acceptable because I have found even myself falling victim to first looking for the biggest pain point without considering the cost to fix it. In practice I have found that an organization can move MUCH faster by taking on a smaller effort and learning aggressively through incremental development.

Wrap-up

Quality is near and dear to my heart. As an engineer I have always seen myself as a craftsman, proud of the outcomes that the solutions I have produced enable over their lifetimes. Making smart and deliberate decisions, and at times investing in rebuilding and retooling, is important for achieving the outcomes found at the intersection of Build the Right Thing, Build it Right, and Build it Fast. What techniques have you found to strike your balance?

