Size Matters (with disk)
- max disk size is 16 TiB
- max allowed disk usage is 50% (i.e., 50% of the disk capacity is kept free)
This is a heavy topic (pun intended). The feature needs more work. Let's start exploring the space:
Why Extra Space
Each disk needs some free space in order to:
- give operators ample time to react in case of an incident scenario like write throughput doubling
- give the cluster time to perform partition reassignments when you want to move partitions away from a broker that's gonna hit the disk limit
- give cloud providers time to expand the disk (if you aren't at the max 16 TiB capacity)
For now, we hardcode this to a made-up 50% free space value. This is not ideal by ANY means, because it means that:
- 👎 large deployments leave a lot of free space unused (read: waste money), whereas small ones may not have enough.
Here are the scenarios we modelled, assuming throughput doubles and stays doubled for a sustained period of time:
Without Tiered Storage
✅ Scenarios by Throughput and Retention Time
As you can see, the lower retention settings (1, 2 days) give you just one or two days to respond in case of a doubling.
At 7d retention, 50% free space is overkill in the deployments we make. It would take 7 full days to fill the storage in case of double throughput. That's overly conservative for a default.
The higher the retention goes, especially above 7 days, the more redundant capacity you carry (who needs the ability to react in 31 days?!)
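To make these scenarios concrete, here's a minimal sketch of the model behind them. It assumes a constant write rate, a disk sized so that steady-state usage sits at exactly 50%, and a throughput doubling at t = 0; the 100 GiB/day figure is a made-up example, not a real deployment number.

```python
# Minimal sketch of the "throughput doubles" model (illustrative numbers only).
def days_until_disk_full(retention_days: float, throughput_gib_per_day: float) -> float:
    """How long operators have to react after write throughput doubles."""
    steady_state_gib = retention_days * throughput_gib_per_day  # usage before the spike
    disk_size_gib = steady_state_gib / 0.5                      # disk sized for 50% max usage
    # After the doubling, the retention window gradually refills with data written at 2x,
    # so usage grows by roughly one day's worth of normal throughput per day.
    growth_per_day_gib = throughput_gib_per_day
    return (disk_size_gib - steady_state_gib) / growth_per_day_gib  # works out to retention_days

for retention in (1, 2, 7, 14, 31):
    print(f"{retention:>2}d retention -> ~{days_until_disk_full(retention, 100):.0f} days to react")
```

Under this model the reaction window works out to roughly the retention period itself, which is why 1-2 day retention feels tight while 31 days feels wasteful.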
With Tiered Storage
The free space is easier to model here, because the scenarios in which you can run out of disk are rarer:
- we assume that local.retention.bytes is set close to the disk's maximum (e.g. 80%), so we never run out of disk as long as we're uploading to the remote store.
- therefore - when would we not be able to upload to the remote store?
  - in case the remote store goes offline, then we do risk running out of disk. Note that S3 is only "allowed" to be down 52 minutes a year as per their SLA.
  - in case the broker can't keep up uploading to the remote store (e.g. network exhausted, CPU, etc.). We provision with ample spare capacity, so we don't expect this to be the case.
Following this train of thought, the only thing the tiered storage free space needs to ensure is that, at some elevated Kafka throughput (e.g. double the usual), we can sustain a full S3 outage of 52 minutes.
This is why we currently have local.retention.ms set to 60 minutes - to give us 60 minutes of free disk space in case of incidents.
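As a back-of-the-envelope check, here's a hedged sketch of how much local headroom a 52-minute S3 outage costs at doubled throughput. The 50 MiB/s per-broker ingest figure is an assumption for illustration, not our actual sizing.

```python
# Sketch: local disk headroom needed to ride out an S3 outage at doubled throughput.
S3_SLA_DOWNTIME_MIN = 52      # minutes/year of "allowed" S3 downtime per its SLA
NORMAL_INGEST_MIB_S = 50      # assumed per-broker ingest rate (illustrative)
doubled_ingest_mib_s = 2 * NORMAL_INGEST_MIB_S

headroom_gib = doubled_ingest_mib_s * S3_SLA_DOWNTIME_MIN * 60 / 1024
print(f"~{headroom_gib:.0f} GiB of local headroom to survive the outage")
# Rule of thumb from above: with local.retention.ms at 60 minutes, only about an
# hour of data lives on local disk, leaving the rest of the disk free to absorb
# an outage like this.
```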
What Is The Perfect Solution?
The right answer seems to require opening a big can of worms. It becomes a question of
- a) how much could your throughput increase in unexpected ways, and how long could that unexpected increase be sustained?
- b) what's the minimum amount of time you want to allow your on-call ops personnel to react and fix the problem before disks fill up? (i.e. before bad problems happen)
- c) a + b - what's the worst-case minimum amount of time you want to give your ops personnel at the maximum throughput increase you've designed for?
This answer is very personal and depends on a lot of factors - for example, how long it can take your engineers to fix the problem. In some cases, like when using AWS EBS volumes below the 16 TiB limit, it can be as simple as pressing one button and having AWS double the disk capacity within minutes. In other cases, it may involve adding new brokers and reassigning partitions away from the old ones, which can take days in the worst case.
A better solution would incorporate total throughput per broker and target a desired time it would take to fill its disk. We didn't do it because we wanted to ship something quickly (let's go champ).
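A rough sketch of what that could look like: derive the free-space target from each broker's measured throughput and a desired reaction window instead of a flat 50%. The function name, the spike multiplier, and the example numbers are all illustrative assumptions, not the shipped behaviour.

```python
# Illustrative sketch: size free space from per-broker throughput and a reaction target.
def required_free_space_gib(broker_ingest_mib_s: float,
                            reaction_hours: float,
                            spike_multiplier: float = 2.0) -> float:
    """Free space needed for a spike of `spike_multiplier`x to run for `reaction_hours`.

    Conservative: ignores data that retention would expire during the window.
    """
    spike_mib_s = broker_ingest_mib_s * spike_multiplier
    return spike_mib_s * reaction_hours * 3600 / 1024

# e.g. a broker ingesting 40 MiB/s with a 12-hour on-call reaction target:
print(f"~{required_free_space_gib(40, 12):.0f} GiB of free space")  # ~3375 GiB
```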
We will work on it. For now, remember this rule of thumb:
- if the deployment is not using tiered storage and has a lot of retention, its disk spend can probably be reduced quite a lot.
- if the deployment is spending very little on disk (using tiered storage), then you may want to increase it a bit depending on your risk appetite.
Please feel free to give us your input.