Little’s law is a theorem that relates a queuing system’s throughput to the number of requests in flight and the time it takes to process each one. When developing Cloud (as in direct to object storage) Topics in Redpanda, we ran into it very practically: each request in Redpanda normally hits NVMe storage and is processed in microseconds, but with object storage in the produce path for the first time, each request now takes hundreds of milliseconds. Understandably, this drastic change in the latency profile surfaced new bottlenecks. As we first tested scaling up from MB/s to GB/s, we found that some pipelined processing code we had introduced a few years ago was preventing us from reaching the throughput we needed. For low latency topics, produce request processing looks something like this:

[Diagram: Producer batch → Stage 1: enqueue in Raft replication buffer → Stage 2: majority ack + flush to disk. Ack 1: producer sends next batch. Ack 2: write fully committed. Ordering preserved for idempotency sequence numbers.]
Figure 1: Produce processing for low latency topics

We process each produce request sequentially per producer/network connection: we take the producer’s batch and (after various validations, idempotency checks, etc.) enqueue it in a buffer on the Raft leader to be sent to all replicas. At that point, the order in which the batch will be inserted into the log is guaranteed, and we can ask the request processing layer to release the next batch into the system. This preserves the order of requests, which is required for idempotent producers in Kafka. However, we do not acknowledge the request to the producer until we have replicated the batch to a majority of the Raft group. This allows for nicely pipelined produce request processing.
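The two-stage flow above can be sketched in a few lines of asyncio. This is an illustrative model, not Redpanda’s actual code (which is C++/Seastar); `replicate_to_majority` and `raft_enqueue` are hypothetical stand-ins for the real stages. The key property it shows: stage 1 fixes log order and releases the next batch, while stage 2 (majority replication) completes in the background before the producer ack is delivered.

```python
import asyncio

async def replicate_to_majority(batch):
    # Stand-in for stage 2: Raft replication + flush to disk.
    await asyncio.sleep(0.001)
    return f"ack:{batch}"

async def raft_enqueue(log, batch):
    # Stage 1: appending to the leader's replication buffer is cheap
    # and determines the batch's final position in the log.
    log.append(batch)

async def handle_producer(batches):
    log, pending_acks = [], []
    for batch in batches:
        # Validation / idempotency checks would happen here.
        await raft_enqueue(log, batch)  # stage 1: order is now fixed
        # Stage 2 runs concurrently; the loop is free to take the
        # next batch without waiting for replication to finish.
        pending_acks.append(asyncio.create_task(replicate_to_majority(batch)))
    # Producer acks are only delivered once replication completes.
    acks = [await task for task in pending_acks]
    return log, acks

log, acks = asyncio.run(handle_producer(["b1", "b2", "b3"]))
print(log)  # order preserved: ['b1', 'b2', 'b3']
```

Note that per-producer processing stays sequential up to stage 1, which is what keeps idempotency sequence numbers in order.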

You might ask: where is the bottleneck, then? With Cloud Topics, we write the data from the producer to object storage, then replace the batch with metadata and replicate that via Raft like a “normal” batch from a low latency topic. However, we’ve now multiplied our latency from producer to stage one by over 100 times by introducing batching and a write to object storage! Little’s law states that Concurrency = Throughput × Latency: to sustain the same throughput with the increased latency, we need more concurrency. We ended up fixing this by… introducing another layer of queuing.
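To make the arithmetic concrete, here is Little’s law with illustrative numbers (the batch size and latencies are assumptions for the example, not measured Redpanda figures):

```python
def required_concurrency(throughput_rps: float, latency_s: float) -> float:
    """Little's law: in-flight requests needed to sustain a throughput.

    Concurrency = Throughput * Latency.
    """
    return throughput_rps * latency_s

# Assume 1 MiB batches, so 1 GiB/s is ~1024 batches/s.
batches_per_s = 1024

# At ~100 us NVMe-path latency, one in-flight request per producer is plenty.
print(required_concurrency(batches_per_s, 0.0001))  # ~0.1 batches in flight

# At ~500 ms object-storage latency, the same throughput needs
# hundreds of batches in flight.
print(required_concurrency(batches_per_s, 0.5))  # 512 batches in flight
```

With sequential per-producer processing, concurrency is capped at one request in flight, which is why throughput collapsed once latency jumped.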

[Diagram: Producer batch → Queue (accepts batches, sends ack 1) → Upload → Stage 1: pre-replicate Raft enqueue + ack → Stage 2: Raft replicate. Ack 1: producer sends next batch. Stage 1 ack: releases next metadata in order. Ack 2: metadata replicated on majority, data committed.]
Figure 2: Produce processing for Cloud Topics

The additional queue lets the producer/networking layer release more batches into the system, and while batches sit in this new layer we can upload multiple batches from a single producer in parallel. Once the data is uploaded, we preserve the producer’s ordering and release the batches into the Raft layer, waiting for the Raft layer to enqueue each request before releasing the next one. We still hold the producer acknowledgement until the metadata is fully replicated in Raft. This preserves the correct ordering and data durability requirements at every stage while admitting more concurrency into the latency-bound part of request processing.
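The core trick — parallel uploads with in-order release into Raft — can be sketched like this. Again a hypothetical model (names like `upload_to_object_storage` are mine, not Redpanda’s), but it shows why ordering survives: uploads run concurrently, yet we await them in submission order before handing metadata to Raft.

```python
import asyncio

async def upload_to_object_storage(batch):
    # Simulate a slow object-store PUT; a later batch may finish
    # its upload before an earlier one.
    await asyncio.sleep(0.01 if batch == "b1" else 0.001)
    return f"meta:{batch}"

async def raft_enqueue(log, meta):
    # Stage 1: once enqueued, the metadata's log position is fixed.
    log.append(meta)

async def producer_queue(batches):
    log = []
    # Ack 1 to the producer happens on enqueue here (not shown),
    # so all uploads for this producer run concurrently...
    uploads = [asyncio.create_task(upload_to_object_storage(b)) for b in batches]
    # ...but we await them in submission order, so metadata enters
    # Raft in exactly the order the producer sent the batches.
    for task in uploads:
        meta = await task
        await raft_enqueue(log, meta)  # wait for enqueue before the next release
    return log

log = asyncio.run(producer_queue(["b1", "b2", "b3"]))
print(log)  # ['meta:b1', 'meta:b2', 'meta:b3'] even though b1 uploads slowest
```

The concurrency lives entirely in the slow (upload) segment; the ordered segment (Raft enqueue) stays sequential and cheap, which is exactly where Little’s law says the extra in-flight requests were needed.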

What’s the diff? Running the Open Messaging Benchmark with this change let us easily push 10x more data and hit GB/s scale without changing any producer configurations. If you’re interested in learning more about Cloud Topics, check out this blog I wrote on the overall architecture. The code for the producer queue is available here and here, with its main usage here and here.