Most messaging and streaming technologies can persist data, either to replay it later or to allow for stateless application design (persistence), and can guarantee a quality of service by ensuring message delivery (durability). In most messaging technologies this involves storing data on the file system.
Some systems, like Kafka, require at least some level of persistent storage, which can be tuned via data retention configuration and the replication factor. In other systems, like MQTT, a per-message QoS dictates durability, ranging from none at all to exactly-once delivery, with replication options varying by vendor. NATS.io provides durability and persistence through its JetStream subsystem, allowing persistence/durability and at-most-once delivery to be used at the same time.
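To make these knobs concrete, here is a minimal sketch using the kafka-python admin client to create a topic with explicit retention and replication settings; the broker address, topic name, and values are placeholders for illustration, not recommendations.

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Connect to a (hypothetical) local broker.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Retention and replication are per-topic decisions, not afterthoughts.
orders = NewTopic(
    name="orders",                        # placeholder topic name
    num_partitions=6,
    replication_factor=3,                 # how many copies of every message
    topic_configs={
        "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # keep data for 7 days
        "min.insync.replicas": "2",       # acks require 2 in-sync replicas
    },
)
admin.create_topics([orders])
```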
Regardless of which messaging system you choose, how you configure data storage and replication has real implications. Many organizations struggle to find the best choices for their business and make the mistake of defaulting to what is perceived as the safe bet: a data persistence strategy where data is replicated far beyond what is required. This comes with a cost that should not be ignored.
Should organizations default to maximizing data persistence and minimizing risk?
No! I’ve seen teams default to producer and consumer acknowledgements and choose to replicate every message, all the time, in five places. When asked why, the answer is simply “we cannot miss a message,” without any knowledge of the business SLAs (Service Level Agreements) or the consequences of an SLA violation.
This stance of replication-to-the-max is certainly understandable, especially in large enterprises where there’s a disconnect between business requirements and technical requirements. When there’s a missing or delayed transaction, who takes the blame? What is the cost of temporarily lost data or a lost transaction? Is the transaction truly lost, or can it be retried? Fear and uncertainty around these questions result in overprovisioned systems that are expensive and negatively impact performance and scalability. While you can reduce the probability of downtime (or even data loss) to something tenable, you’ll never get to zero. Ever.
Back to the question: I challenge you to never default to persistence. In fact, go the other way, and work up from the minimum to what you need and what will let you sleep at night. Default to what best fits your business in terms of cost and meeting SLAs by balancing throughput (units of work per second), latency (how fast a unit of work can be completed), and availability (the probability a unit of work will complete). Consider that persistence is not even necessary in some data flows, and that retries and errors can be ignored or handled at the application layer.
Think of the many resilient applications built atop HTTP while still adhering to stringent business SLAs. It is of course cumbersome and pushes a lot of logic to the application layer, but the point is that you don’t always need the persistence provided by your distributed communication technology. Durability and persistence are a tremendous convenience, and they come at the cost of increased operational complexity, additional hardware and network resources, or additional SaaS cost (if you’re using a service).
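To make the application-layer alternative concrete, here is a minimal sketch of retrying a unit of work over plain HTTP with exponential backoff; the endpoint, payload, and retry parameters are illustrative assumptions, not prescriptions.

```python
import time
import requests

def post_with_retry(url, payload, attempts=5, backoff=0.5):
    """Send a unit of work over plain HTTP, handling transient failures
    with retries at the application layer instead of relying on broker
    persistence."""
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.post(url, json=payload, timeout=2)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == attempts:
                raise  # retries exhausted; surface the failure to the caller
            time.sleep(backoff * 2 ** (attempt - 1))  # exponential backoff

# Hypothetical usage:
# post_with_retry("https://example.com/orders", {"id": 123, "qty": 2})
```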
There are a number of factors to consider around data persistence and durability. Note that these come from business requirements, which should drive technical decisions. Some questions to think about: What are the business SLAs, and what does an SLA violation actually cost? Who takes the blame when a transaction is missing or delayed, and what does that temporarily lost data cost? Is a lost transaction truly lost, or can it be retried or handled at the application layer? How long does the data really need to be retained? And what balance of throughput, latency, and availability does the business actually need?
It’s easy and safe just to decide you want to cover your bases and go all out, then request the additional budget for more machines, clusters, storage, etc. However, let’s run through a quick scenario.
Assume you require a 99.9% uptime guarantee for your business and are deciding between a server cluster of three nodes and one of five nodes, with quorum-based replication. We’ll start with a cloud provider instance-level guarantee of 99.5% uptime per instance (messaging server). For the sake of argument, let’s pessimistically subtract 0.5% for software and network errors, leaving 99.0% uptime for software and hardware resources. This works out to several hours of downtime per month (roughly seven) per instance… but remember, we’re clustered.
With three nodes in a quorum-based system and a replication factor of three (meaning at least two nodes must be available for business continuity), a binomial probability calculation with each node at 99.0% uptime gives a cumulative probability P(X >= 2) of about 99.97% uptime.
With five nodes, you’ll need at least three nodes available to continue business, resulting in a cumulative probability P(X >= 3) of about 99.999% uptime.
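These figures follow from the binomial model described above; here is a small sketch of the arithmetic, assuming independent node failures at 99.0% per-node uptime.

```python
from math import comb

def cluster_uptime(nodes, quorum, node_uptime=0.99):
    """Probability that at least `quorum` of `nodes` are up at once,
    assuming independent failures (binomial model)."""
    return sum(
        comb(nodes, k) * node_uptime**k * (1 - node_uptime)**(nodes - k)
        for k in range(quorum, nodes + 1)
    )

print(f"3 nodes, quorum of 2: {cluster_uptime(3, 2):.4%}")  # ~99.9702%
print(f"5 nodes, quorum of 3: {cluster_uptime(5, 3):.4%}")  # ~99.9990%
```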
In this example, either choice meets the business SLA requirements. If you choose the three-node deployment, you’ve saved yourself from provisioning two extra nodes and potentially gained performance from the lower replication factor: lean and mean.
This could reduce TCO in terms of resources by 40%, at the expense of potentially losing 0.029% of uptime. In large multi-clustered deployments, this can represent hundreds of thousands of dollars per year.
If you measure your uptime and continue to meet the business SLAs, then you’ve justified the cost savings and become a hero to those who hold the budget. Bonuses, anyone?
Since we’re talking about risk and persistence, let’s go the other direction and discuss a relatively risky way of using messaging persistence that ends up being expensive as well. One should not use a messaging system in lieu of a database/datastore or other source of truth. There are blogs and articles about using your messaging system to replace a database, and honestly some good arguments can be made on paper. Some vendors even offer database-like interfaces to access the data (e.g. KSQL).
In practice, I suggest you do not rely on your messaging system as a database or long-term datastore. There’s no problem using it as such; the point is not to rely on it. Messaging systems are highly complex and less reliable than time-tested database technologies that have been production hardened for decades. Messaging technology may get close someday, but it is not quite there yet. Just think about the distributed nature of all the moving parts and the unstructured way data can be handled in applications, and this makes sense. Bugs are a fact of life, data corruption is possible, and occasionally a user implementation mistake, such as an inadvertent poison-pill message, will take down an application ecosystem (where the best option to recover is usually to wipe out and replace the data in the messaging system).
Instead, think of your messaging system’s persistence features as a buffer, albeit a potentially very large and long-lived buffer, that enables you to meet your SLAs and provides some short- to medium-term guarantees. With this mindset you’ll implement the ability to repopulate data in your messaging system from a source of truth, like a database, in case there’s a catastrophic incident.
This also opens up other opportunities to shed costs, such as shutting a cluster down at night, or increasing performance and saving resources by keeping data for a shorter period of time, potentially on high-speed local SSDs.
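As a sketch of the repopulation idea above, here is one way it might look, assuming a hypothetical relational source of truth (a SQLite table named orders) republished into a Kafka topic via kafka-python; your actual source of truth, schema, and messaging client will differ.

```python
import json
import sqlite3
from kafka import KafkaProducer

def repopulate(db_path="orders.db", topic="orders"):
    """Rebuild the messaging system's 'buffer' from the source of truth,
    e.g. after a catastrophic incident or a deliberate wipe-and-replace."""
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",             # placeholder broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    conn = sqlite3.connect(db_path)
    # Hypothetical schema: orders(id INTEGER, payload TEXT)
    for order_id, payload in conn.execute("SELECT id, payload FROM orders"):
        producer.send(topic, {"id": order_id, "payload": payload})
    producer.flush()  # block until every record is acknowledged by the broker
    conn.close()
```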
The decision to persist data in a messaging system is more nuanced than it seems at first glance. Don’t let fear default you into overprovisioning; instead, calculate the uptime you need from your messaging system, design and provision appropriately, and of course measure and test.
Business groups and technical groups should collaborate on the SLAs required from the messaging system and track them. If you’re meeting your agreed upon uptime SLAs you can continue to justify decisions to reduce cost.
If you have questions around designing your distributed system, certainly reach out at colin@luxantsolutions.com. We’re happy to help.