A Strategy and Proposal for Assessing Cloud Service Readiness

Posted on October 11, 2020 | 8 minute read

Throughout my career I have been held accountable for the maturity and quality of code deployed into production. When I was first coming out of school this was fairly easy to assess (did the code I wrote ever cause a problem, and when it did, did I learn from it?). As I grew as an engineer, playing combinations of Management and Lead Engineer styles of roles, I started to work with my team(s) to ensure we were on top of code reviews, looked at our testing procedures, and more. As my scope increased I eventually held titles like “architect” and I started to realize that by the time code was being written, large problems may have already been baked into the design (perhaps you have heard the term “one-way door”).

It is around this time that I started to think about a more holistic set of criteria. I started to bake those criteria into templates, such as a design document template that laid out things to think about (I like to use the term non-functional requirements because these are the things the team has decided are important in HOW they will approach and build the service, and given those choices, these are requirements that themselves must be met). Over time I further realized that simply writing things down in a design document was often insufficient, because by the time a service was ready to be deployed that document may itself have become stale. My solution back then… the creation of more templates (a service readiness template, for example) which could be filled in at different points during development.

I will also say I probably didn’t make many friends among engineers (or managers) at this time, as the consequence of all of my templates was the need to fill many things in, and often the content was duplicative - but getting teams to take the time to think about quality in this way did lead to desired outcomes and improved the overall success of the software (and systems) being deployed. There were two gaps, however, that I started to identify at this time:

This has led to my more current thinking, which is to treat Service Readiness as a continuous process where both the requirements AND the assessment against those requirements can always be considered. The question then is what this can (or should) look like. The remainder of this document focuses on:

Assessing Service Readiness

First, I want to state that for the purposes of this document a service is any piece of code that is independently built and delivered so that other teams can use it. I don’t presume that this only applies to micro-services, as I have found this thinking can apply to monoliths as well (especially what I have defined as Modular Monoliths).

So what does a Service Readiness Assessment look like? There are three components that make up the assessment:

OK - so with these three building blocks, what happens next? This will depend a bit on the culture of the organization, but I have found that asking teams to revisit such an assessment yearly, and any time there is a large outage (or frequent smaller outages), is a good starting point. The yearly timeframe ensures that new controls are caught and assessed over time. The ad hoc assessments ensure that other processes (such as a retrospective) look over previously documented (known) risks and enable an open discussion about whether such risks should still be accepted.
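To make this a bit more concrete, here is a minimal sketch of what tracking such an assessment could look like if you chose to keep it in code rather than a template. The field names, categories, and the 365-day review cadence are illustrative assumptions on my part, not a prescribed schema; the point is simply that the requirements (controls), the assessment against them, and any explicitly accepted risks can live in one structure that is easy to revisit on a schedule or after an outage.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List


class Status(Enum):
    MET = "met"
    NOT_MET = "not met"
    RISK_ACCEPTED = "risk accepted"


@dataclass
class Control:
    """A single non-functional requirement the team has agreed to meet."""
    name: str          # hypothetical example: "Alerting on error-rate SLO"
    category: str      # hypothetical example: "Observability", "Recoverability"
    status: Status
    evidence: str = "" # link or note showing how the control is satisfied
    notes: str = ""    # rationale recorded when a risk is accepted


@dataclass
class ReadinessAssessment:
    """A point-in-time assessment of a service against its controls."""
    service: str
    assessed_on: date
    controls: List[Control] = field(default_factory=list)

    def open_risks(self) -> List[Control]:
        """Controls that are unmet or whose risk has been explicitly accepted."""
        return [c for c in self.controls if c.status is not Status.MET]

    def due_for_review(self, today: date, max_age_days: int = 365) -> bool:
        """True when the yearly cadence (or an ad hoc trigger) warrants a new pass."""
        return (today - self.assessed_on).days >= max_age_days
```

Kept this way, a retrospective can simply pull up `open_risks()` for the affected service and ask whether each previously accepted risk still deserves to be accepted.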

An Open Source and Standards-Accepted Approach

While I have seen a lot of examples (as linked in the previous section, or just search for many more), there seems to be a missed opportunity both in how companies can advocate for the maturity of their offerings and in how they can learn from each other. IaaS, PaaS, SaaS, and many other aaS-style offerings are becoming ever more important, and yet outages (think about the recent Microsoft 365 and Teams outage, the Cloudflare outages, the Salesforce outage in 2019, and many more) still occur. As companies rely on such services, and as their own offerings become more critical to those companies’ own success or failure, simply looking at past performance is no longer enough.

One way of looking at this is that there has been movement in other critical areas, such as Cyber and Information Security, where the NIST CSF has taken on a larger role (as have standards such as ISO/IEC 27001). Now seems like the time to build this concept out further into a framework that leads to an overall industry improvement in Availability, Observability, Recoverability, Durability, etc., and helps companies better understand their overall risk profile in order to best serve their customers. I posit further that creating a broad discussion in this area can more widely educate the users of aaS offerings on the importance of engineers investing in these areas, which are often overlooked until there is a problem.

I would love to know if there are others who think like me. Are others aware of efforts that may already exist? If not, please feel free to send me a note (I may build out a Discord server should there be enough broader interest in building something out).

