Keeping Systems Decoupled

Posted on May 20, 2020 | 10 minute read

A common problem I have been asked to solve on a variety of the systems I have worked on is how to move data between independent components (different monoliths, micro-services, databases, etc.). My involvement usually began after these systems had grown organically as their respective businesses expanded, and the result was that many point-to-point solutions had been created to move data between components. When integrating two components for the first time this makes sense, but as the number of systems we attempt to connect together grows it is important to watch for signs that things are turning into a “Big Ball of Mud.” There are some early warning signs (behaviors) that this is happening:

I have heard plenty of discussion that point-to-point solutions are more optimal (and I suspect some reading this will agree). The question I always ask myself when faced with this, though, is: how have we defined optimality? Is it based on the speed at which bits can move? Is it based on the ability to deliver a solution to market? Is it based on the “Keep the Lights On” maintenance cost? As a reader you may have heard the phrase “beginner’s mind” tossed around, and that phrase applies here. What is important over time is to refresh the set of Architectural Qualities (Non-Functional Requirements) that we are attempting to optimize for (at some point in the not too distant future I will post something on setting up a Fitness Function (writing down tenets) to track this).

Let us assume that at this point we have observed some form of coupling, or that we would like to avoid some of the smells described above. First, as engineers we should appreciate everything that has been built to date, as it was likely an excellent choice at the time. The question now is: what is important to consider in keeping systems decoupled? Here are some characteristics of a decoupled system:

With these characteristics in mind, let us consider three common solutions and how we might select among them.

  1. Time Series Event Streams
  2. Message Queues
  3. Micro-Batching

Step 1: Frame the problem you are trying to solve

The first step is to take the time to write down the problem you are attempting to solve. This may sound obvious, but I cannot tell you how many times I have seen teams start with a solution in mind (such as “we will use Kafka”) without first understanding the problem at hand. As we will see shortly, the cost of different solutions can vary widely (both in terms of monetary cost as well as engineering cost and complexity).

Step 2: Define your Functional and Non-Functional requirements

Once we document the problem (WHY we are approaching the design to begin with), the next step is to make sure we take the time to align on both the Functional (what we should build) and Non-Functional (how we should build) requirements. Like the problem statement, skipping this step can lead to miscommunication. For example, in my work on Robotics at Amazon it was not uncommon for someone to state: “we need the information in real-time.” If I went around a room of engineers and asked “What does real-time mean?” I would receive a different answer from almost every engineer. In fact, I will tell you from my work at Raytheon on missiles and Large Robotic Safety Systems that the term “Real-Time” can have life or death consequences and has little to do with implying some notion of the “speed” at which information is moved around. Again, being specific about the requirements at this point can quickly prune many of the choices that may follow.

Step 3: Select the data movement pattern

Of the three patterns discussed there are some real cost implications. Naively, we can look at the price difference between the AWS offerings for Kinesis, SNS+SQS, and S3. For the moment, assume that we want to produce information so that consumers can observe a discrete set of events within 5 minutes of the event occurring. In addition, let us assume that events arrive at an average rate of 100 events/second over that 5-minute window, with an average size of 35 KB per event.
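A quick back-of-envelope sketch of what those assumptions imply (plain arithmetic, nothing AWS-specific; the rounding mirrors the figures used in the rest of this post):

```python
# Back-of-envelope throughput implied by the example assumptions.
RECORDS_PER_SEC = 100
RECORD_SIZE_KB = 35
SECONDS_PER_MONTH = 60 * 60 * 24 * 31                          # 31-day month

mb_per_sec = RECORDS_PER_SEC * RECORD_SIZE_KB / 1024           # ~3.4 MB/sec
monthly_gb = round(mb_per_sec, 1) * SECONDS_PER_MONTH / 1024   # ~8,893 GB/month
monthly_records = RECORDS_PER_SEC * SECONDS_PER_MONTH          # 267,840,000 records/month

print(f"{mb_per_sec:.1f} MB/sec, {monthly_gb:,.0f} GB/month, {monthly_records:,} records/month")
```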

Kinesis (Time Series Event Streams)

For Kinesis the total data input rate is 3.4 MB/sec (100 records/sec * 35 KB/record, divided by 1024, is approximately 3.4 MB/sec).

We then calculate our monthly Kinesis Data Streams costs using Kinesis Data Streams pricing in the US-East Region. At 3.4 MB/sec of input we need 4 shards (each shard accepts up to 1 MB/sec of writes), which at $0.015 per shard hour works out to $44.64 per month. Each 35 KB record also consumes two 25 KB PUT payload units, which at $0.014 per million units adds roughly $7.50 per month.

Finally, to achieve decoupling and scale so that each consumer is independent, we need to consider the cost of enhanced fan-out. For Kinesis this is $0.015 per shard hour per consumer. In the example above we had 4 shards, so on a monthly basis this comes out to $44.64 per consumer. In addition, we must pay for the data retrieved by each consumer, which is charged at a rate of $0.013 per GB. In the example above we produced data at a rate of 3.4 MB/sec; on a monthly basis this is 3.4 MB/sec * 60 seconds/minute * 60 minutes/hour * 24 hours/day * 31 days/month, which divided by 1024 comes out to 8,893 GB. At $0.013 per GB this is $115.61 per month.

So in total our monthly cost is $44.64 + $7.50 + $44.64 + $115.61 = $212.39 for one consumer, and an incremental $160.25 for each additional consumer, assuming all data produced is consumed.
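Pulling the Kinesis arithmetic into one place (a sketch based on the US-East list prices quoted above; check current pricing before relying on it):

```python
# Kinesis Data Streams monthly cost sketch using the prices quoted in this post.
HOURS_PER_MONTH = 24 * 31
SHARDS = 4                                      # ~3.4 MB/sec of input at 1 MB/sec per shard

shard_cost = SHARDS * HOURS_PER_MONTH * 0.015   # shard hours                    -> $44.64
put_units = 100 * 2 * 60 * 60 * 24 * 31         # two 25 KB units per 35 KB record
put_cost = put_units / 1e6 * 0.014              # PUT payload units              -> ~$7.50
fanout_cost = SHARDS * HOURS_PER_MONTH * 0.015  # enhanced fan-out, per consumer -> $44.64
retrieval_cost = 8893 * 0.013                   # GB retrieved, per consumer     -> ~$115.61

first_consumer = shard_cost + put_cost + fanout_cost + retrieval_cost   # ~$212.39
each_additional_consumer = fanout_cost + retrieval_cost                 # ~$160.25
```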

SNS + SQS (Message Queues)

To achieve a similar system using SNS and SQS we again need to consider the producer and consumer costs. To simplify this example we will send each event independently (some readers may recognize that multiple messages could be packed into a single request; since requests are charged in 64 KB chunks there is some potential cost savings there).

SNS is charged at $0.50 per 1 million requests. At 100 records per second this results in 267,840,000 records per month and therefore 267,840,000 SNS requests. On a monthly basis this costs $133.92.

On the consumer end we need to pay for the SQS requests. These are charged at $0.40 per million requests, which results in $107.14 per consumer.

So, for a single consumer our cost is $241.06 and each incremental consumer is $107.14. Compared to the previous solution, the dominating factor at scale is the incremental consumer cost, which is roughly a 33% savings over Kinesis ($107.14 versus $160.25 per consumer).
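And the same sketch for SNS + SQS, using the list prices quoted above and one request per event:

```python
# SNS + SQS monthly cost sketch using the prices quoted in this post.
MONTHLY_EVENTS = 100 * 60 * 60 * 24 * 31                 # 267,840,000 events in a 31-day month

sns_cost = MONTHLY_EVENTS / 1e6 * 0.50                   # publish requests -> ~$133.92
sqs_cost_per_consumer = MONTHLY_EVENTS / 1e6 * 0.40      # receive requests -> ~$107.14

first_consumer = sns_cost + sqs_cost_per_consumer        # ~$241.06
each_additional_consumer = sqs_cost_per_consumer         # ~$107.14
```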

S3 (Micro-Batching)

For some reason I tend not to hear many people talk about Micro-Batching, but it is actually one of my favorite patterns for moving data around when building decoupled systems. We can think of micro-batching as the process of producing small files at a set rate (period). For our example we wanted to make sure consumers could see data within 5 minutes of production. To achieve this, let us simplify our lives and produce data at a rate of 1 file per minute. Each file would be approximately 204 MB in size, and over the course of a month we would produce 60 * 24 * 31 = 44,640 files and roughly 8,894 GB of data.
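To make the pattern concrete, here is a minimal producer sketch. The bucket name, key layout, and in-memory buffer are illustrative assumptions, not a prescription; in practice flush() would be driven by a scheduler (cron, a Lambda on a timer, etc.):

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "example-event-batches"   # hypothetical bucket name
_buffer = []                       # events accumulated since the last flush


def record_event(event: dict) -> None:
    """Producers append events locally instead of calling a downstream system."""
    _buffer.append(event)


def flush() -> None:
    """Write the buffered events to S3 as one newline-delimited JSON file per period."""
    global _buffer
    if not _buffer:
        return
    batch, _buffer = _buffer, []
    key = datetime.now(timezone.utc).strftime("events/%Y/%m/%d/%H%M.json")
    body = "\n".join(json.dumps(e) for e in batch).encode("utf-8")
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)
```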

There are 3 cost components with S3: Storage, Requests, Data Transfer.

Storage is charged at $0.023 per GB. If we assume we keep 7 days of data we would be charged $46.19 per month for storage.

Requests are charged at $0.005 per 1,000 requests. We will need to PUT 44,640 files, and let us assume we run some number of LIST requests as well (at 10% of all PUTs). This would lead to $0.25 per month. In addition, clients need to issue GET requests, which are charged at $0.0004 per 1,000 requests. We will assume that clients poll twice per minute looking for new files as they are produced, which leads to an incremental cost of $0.04 for each client.

Finally, we need to pay for data transfer. There is no charge when storing data, and when retrieving data (assuming the clients are within AWS) we are charged $0.01 per GB. This equates to $88.94 per month per consumer.

In total this solution would cost $46.19 + $0.25 + $0.04 + $88.94 = $135.42, and each incremental consumer would cost $88.98 per month. Per consumer, this is roughly a 17% savings compared to SNS + SQS and a little more than half the cost of Kinesis.
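For completeness, the same cost sketch for the micro-batching approach, again using the list prices quoted above:

```python
# S3 micro-batching monthly cost sketch using the prices quoted in this post.
FILES_PER_MONTH = 60 * 24 * 31                               # 44,640 one-minute files
MONTHLY_GB = 8894                                            # ~204 MB per file, as computed above

storage_cost = MONTHLY_GB * 7 / 31 * 0.023                   # 7 days retained          -> ~$46.19
put_list_cost = FILES_PER_MONTH * 1.1 / 1000 * 0.005         # PUTs plus ~10% LISTs     -> ~$0.25
get_cost_per_consumer = 2 * FILES_PER_MONTH / 1000 * 0.0004  # polling twice per minute -> ~$0.04
transfer_cost_per_consumer = MONTHLY_GB * 0.01               # data out within AWS      -> ~$88.94

first_consumer = (storage_cost + put_list_cost
                  + get_cost_per_consumer + transfer_cost_per_consumer)   # ~$135.42
each_additional_consumer = get_cost_per_consumer + transfer_cost_per_consumer  # ~$88.98
```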

One under-appreciated aspect of this pattern is that producers can greatly reduce the cost for consumers in two common ways: first, by publishing aggregated data when raw events are not required; second, by compressing the data produced (for example with GZip). Combined, it is not uncommon to reduce the data transfer cost by 50 to 90% (or more).
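As a rough illustration of the compression point, here is a standalone sketch (the synthetic batch below is far more repetitive than real event data, so treat the ratio it prints as illustrative only):

```python
import gzip
import json

# A synthetic batch of events, serialized the same way the producer sketch does.
events = [{"order_id": i, "total_cents": 4599} for i in range(1000)]
raw = "\n".join(json.dumps(e) for e in events).encode("utf-8")
compressed = gzip.compress(raw)

# Text-like event data often shrinks substantially, which cuts both storage
# and data-transfer charges for every consumer.
print(f"raw={len(raw):,} bytes, gzip={len(compressed):,} bytes")
```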

Conclusion

What I really want to do is make people aware of these asynchronous forms of data movement, but also to remind developers of the importance of designing clean and repeatable interfaces, just as we would for a REST API. Once we have defined a pattern in which we can move data around, PLEASE take the time to document the format of the data that transits the wire, think about how you will version these “documents”, and favor simple solutions that are easy for your customers to reason about.
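On the versioning point, one lightweight convention (purely illustrative, not a standard) is to wrap every record in a small envelope so consumers can tell which schema they are reading and when it was produced:

```python
import json

# A hypothetical versioned envelope for documents that transit the wire.
envelope = {
    "schema": "com.example.orders.OrderPlaced",  # illustrative document type
    "schema_version": 2,                         # bumped only on breaking changes
    "produced_at": "2020-05-20T12:00:00Z",
    "payload": {                                 # the actual event body
        "order_id": "1234",
        "total_cents": 4599,
    },
}

wire_format = json.dumps(envelope)
```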

I have plenty of additional thoughts rolling around having written this down: #MicroBatching, #BackwardsCompatibility, #FitnessFunctions
