
Pagecache & Historical Reads

💡 Facts
  • 75% of consume traffic reads from the tail of the log
  • all tail reads are served from pagecache and do not touch the disk
  • 25% of consume traffic is historical
  • historical reads hit the disk when tiered storage is disabled, and do not touch the disk when tiered storage is enabled

Historical Reads & IOPS

We assume that 75% of consume traffic reads from the tail of the log (the latest possible data), and the remaining 25% reads older offsets (e.g. from the start of the log).

This has implications for the load your disks see. Historical reads can go directly to the disk, which a) uses up precious IOPS and b) thrashes the pagecache with stale data.

Our assumptions are simple:

  • if tiered storage is disabled, these reads go directly to the disk and use IOPS
  • if tiered storage is enabled, these reads are served directly from the remote store and do not use IOPS, nor do they thrash the pagecache (because they use the network driver) - see the sketch below
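
Here is a minimal Python sketch of these two assumptions. The function name, the fixed 25% historical fraction and the throughput figures are illustrative only - this is not the calculator's actual code:

```python
def disk_read_mibps(consume_mibps: float, tiered_storage: bool,
                    historical_fraction: float = 0.25) -> float:
    """Estimate how much consume traffic actually hits the local disk.

    Tail reads (the remaining 75%) are assumed to be served from pagecache,
    so only historical reads can generate disk I/O - and only when tiered
    storage is disabled (with it enabled, they come from the remote store).
    """
    historical_mibps = consume_mibps * historical_fraction
    return 0.0 if tiered_storage else historical_mibps


# e.g. 60 MiB/s of consume traffic on one broker:
print(disk_read_mibps(60, tiered_storage=False))  # 15.0 -> MiB/s read from the local disk
print(disk_read_mibps(60, tiered_storage=True))   # 0.0  -> served by the remote store
```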

Pagecache

The pagecache is the portion of the machine's RAM that the OS uses to cache file data. Kafka is commonly deployed with a lot of RAM precisely so that it can be used as pagecache, allowing the OS/Kafka to batch writes efficiently and serve reads efficiently (without touching the slow HDD). Don't be fooled by the fact that the Kafka JVM doesn't use much memory - the OS is allocating the rest to pagecache.

Whether reads are served directly from pagecache or have to go to disk is a delicate question that's actually pretty hard to predict. It depends greatly on:

  • how your OS is tuned (hint: default VM settings aren't always favorable to Kafka)
  • how much RAM your VM has
  • how much RAM is used up by other things
  • your consumers' reading patterns
note

💡 The calculator assumes that all tail consumer reads are served directly from pagecache.

Why does this assumption make sense?

  • the calculator currently (indirectly) limits how much read throughput a machine can serve (15 MiB/s per vCPU)
  • we choose RAM-heavy (memory-optimized) VMs (e.g. r4.xlarge).

As an example, the r4.xlarge has 4 vCPUs and 30.5 GiB of RAM.

That'd be a maximum of 20 MiB/s of new data (producer writes) and 60 MiB/s of reads that a broker can serve. The broker can therefore only:

  • write 1.17 GiB of new producer data in a minute.
  • serve 3.52 GiB of consumer traffic in a minute.

That's a lot of spare RAM to hold that latest data in. In other words, if your consumer lag is less than a minute, you can be confident you're reading from memory - maybe even up to 10-20 minutes of lag (1.17 GiB × 10-20 ≈ 12-23 GiB, which still fits within the 30.5 GiB of RAM).
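
For reference, the same back-of-the-envelope arithmetic as a small Python sketch. The 20 MiB/s of writes and 15 MiB/s-per-vCPU read limit come from the example above; the script itself and its names are just an illustration:

```python
# Back-of-the-envelope check for the r4.xlarge example above.
VCPUS = 4
RAM_GIB = 30.5
READ_MIBPS_PER_VCPU = 15

read_mibps = VCPUS * READ_MIBPS_PER_VCPU           # 60 MiB/s of consumer reads
write_mibps = 20                                   # producer writes in the example

writes_per_minute_gib = write_mibps * 60 / 1024    # ~1.17 GiB of new data per minute
reads_per_minute_gib = read_mibps * 60 / 1024      # ~3.52 GiB served per minute

# How many minutes of the freshest data fit entirely in RAM (ignoring other users of memory)?
minutes_of_data_in_ram = RAM_GIB / writes_per_minute_gib
print(f"{writes_per_minute_gib:.2f} GiB written/min, "
      f"{reads_per_minute_gib:.2f} GiB read/min, "
      f"~{minutes_of_data_in_ram:.0f} minutes of fresh data fit in RAM")
# -> 1.17 GiB written/min, 3.52 GiB read/min, ~26 minutes of fresh data fit in RAM
```

The ~26 minutes is the theoretical ceiling with nothing else competing for memory, which is why the text above conservatively says 10-20 minutes.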

We're most probably overly conservative here, especially in the tiered storage case.

There's a nice talk/blog by Adithya Chandra from Confluent about how they were able to deploy instances with less memory once they adopted tiered storage, thanks to avoiding pagecache thrashing.

When Does This Assumption Matter Anyway?

It matters most when we're using HDDs - which is to say, when tiered storage is disabled.

It's precisely then that we need to deploy very large disks to hold all the stored data, and HDDs are the most cost-effective option.

There is just one problem - HDD IOPS are limited. If you exhaust them, or even push them close to the edge, you can cripple your cluster's performance, and that can cascade into a domino-like effect where the cluster breaks down. So it's important to be conservative. See how we model IOPS very conservatively for more information.
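
To illustrate what "conservative" could look like, here is a rough Python sketch of a headroom check. The IOPS budget, safety factor and average read size are ballpark assumptions made up for illustration, not values from the calculator:

```python
# Illustrative only - ballpark numbers, not the calculator's exact model.
# The conservative stance: keep historical-read IOPS well below the disk's budget.
HDD_IOPS_BUDGET = 150     # roughly what a single spinning disk sustains
SAFETY_FACTOR = 0.3       # plan to use at most ~30% of that budget for historical reads

def fits_on_hdd(historical_mibps: float, avg_read_kib: float = 128) -> bool:
    """Pessimistically treat every historical fetch chunk as a random disk read."""
    iops_demand = historical_mibps * 1024 / avg_read_kib
    return iops_demand <= HDD_IOPS_BUDGET * SAFETY_FACTOR

print(fits_on_hdd(15))  # 15 MiB/s of historical reads -> 120 IOPS -> False (too risky)
print(fits_on_hdd(4))   # 4 MiB/s -> 32 IOPS -> True (comfortably within budget)
```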

With tiered storage, we need to store much less data locally, so we opt for SSDs. Those have so much extra IOPS that we can practically forget about the topic. And while there is somewhat more pagecache thrashing with tiered storage (the plugin has to read each newly-closed segment file in order to upload it), the extra IOPS and the extra memory we put on brokers give us more than enough buffer.


🤗 Comments Welcome!

Leave feedback, report bugs or just complain at: