Investigate CPU-efficient compression for raft snapshot writes. #9620

Open
banks opened this issue Jan 22, 2021 · 1 comment
Labels
theme/performance, theme/reliability

Comments

banks commented Jan 22, 2021

Background

This is a follow-up to several incidents where the failure described in #9609 was the root cause.

This issue is a general performance improvement that could be low-hanging fruit to reduce disk IO during snapshotting. This could be a significant performance enhancement for many write-heavy workloads in general, and if successful it would give clusters at risk of the replication failure described in the other issue additional headroom.

Proposal

When there is a large amount of data stored in the state store, raft snapshots serialize it and write it all out to disk every 16k updates. When the write rate is high, this snapshotting can occur frequently, perhaps every minute.

Often these workloads are IO-bound rather than CPU-bound, so using an efficient compression algorithm (especially one like Snappy, designed specifically to trade minimal CPU for significant IO reductions) seems like it could be a relatively easy win.

This would likely need to be done in the raft lib and enabled as an option. The file snapshot implementation is https://github.com/hashicorp/raft/blob/e3c5b666287bb8dfe4131ae8759eacc75bbb39c0/file_snapshot.go. It would be possible for Consul to have its own implementation of FileSnapshot, but if this optimization works well there's no real reason we shouldn't make it available to other users of the raft library.

The fact that the snapshot is compressed (and with which algorithm) should probably be stored in the metadata file and read during restore so the appropriate decompression is used. This allows a single implementation to correctly handle a mixture of compression configurations that could be present before and after changing this config, or after an older snapshot is restored (from a disk backup, not an external snapshot, since that has a separate API).
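To make that concrete, here's a rough sketch of what wrapping the snapshot sink with a compressor could look like. This is an illustration only, not working raft-library code: `compressedSink` and `newCompressedSink` are hypothetical names, while `raft.SnapshotSink` and the `github.com/golang/snappy` writer are real APIs.

```go
package snapshot

import (
	"github.com/golang/snappy"
	"github.com/hashicorp/raft"
)

// compressedSink wraps a raft.SnapshotSink so everything the FSM writes is
// compressed before hitting disk. The chosen algorithm would also need to be
// recorded in the snapshot's metadata file so restores pick the right decoder.
type compressedSink struct {
	raft.SnapshotSink
	zw *snappy.Writer
}

func newCompressedSink(inner raft.SnapshotSink) *compressedSink {
	return &compressedSink{SnapshotSink: inner, zw: snappy.NewBufferedWriter(inner)}
}

// Write routes the FSM's snapshot bytes through the compressor.
func (s *compressedSink) Write(p []byte) (int, error) {
	return s.zw.Write(p)
}

// Close flushes the compressor before closing the underlying file sink.
func (s *compressedSink) Close() error {
	if err := s.zw.Close(); err != nil {
		s.SnapshotSink.Cancel()
		return err
	}
	return s.SnapshotSink.Close()
}
```

The restore path would do the reverse: read the recorded algorithm from the metadata and wrap the snapshot reader in the matching decompressor.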

Note that we already (in Consul, IIRC) gzip the external cluster snapshots users can take; this issue is specifically about the internal snapshots raft persists automatically, which are not currently compressed.

Since this is meant to be a "quick win" rather than a comprehensive analysis, I suggest time-boxing the investigation to a couple of days: work with a single dev-mode Consul agent and a script that writes a fixed-size (1M) set of large KVs as quickly as possible, and observe snapshot timing (from logs, or set up a local Prometheus or similar). The fixed size means that eventually every snapshot should be the same size, so the time taken to write it out can be compared before and after compression.
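For illustration, a rough sketch of that kind of write script using the public Consul API client against a local dev agent; the key count and value size below are placeholders rather than an exact reading of the numbers above.

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Assumes `consul agent -dev` on localhost:8500 (the client default).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	kv := client.KV()

	// Keep the keyspace and value size fixed so every snapshot converges on
	// roughly the same size and write times are comparable before/after
	// compression. Payload contents matter a lot for compression benchmarks:
	// fully random bytes are a worst case, repetitive text a best case.
	value := bytes.Repeat([]byte("the quick brown fox jumps over the lazy dog "), 8192) // ~360KB

	for i := 0; ; i++ {
		key := fmt.Sprintf("bench/%05d", i%2048)
		if _, err := kv.Put(&api.KVPair{Key: key, Value: value}, nil); err != nil {
			log.Fatal(err)
		}
	}
}
```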

With baseline stats in hand, try wrapping the compression lib around the FileSnapshot writes (along the lines of the sketch above) and see how write times are affected.

If the results show enough gain, we can then do the plumbing for config, detect the right compression type on read, etc., and plumb the config through to Consul too.

boxofrad commented Nov 1, 2021

Hey @banks 👋🏻

I spent a day or so experimenting with, and benchmarking, compression of snapshots using a small handful of algorithms (LZ4, Snappy, and S2).

Here are the headline results:

| Compression | Average | Slowest | Fastest | state.bin Size |
| --- | --- | --- | --- | --- |
| None (Baseline) | 6.2s | 6.8s | 5.6s | 500M |
| Snappy | 35.5s | 49.6s | 32.3s | 307M |
| S2 ⭐️ | 1.2s | 1.5s | 0.9s | 274M |
| LZ4 | 5.2s | 5.3s | 4.6s | 237M |

My experiment consisted of a single Consul server, running on my laptop, with the following configuration chosen to increase the frequency of snapshotting:

```hcl
raft_snapshot_threshold = 1000
raft_snapshot_interval  = "5s"
```

I then seeded the KV store with 1,000 entries, each ~512K in size, filled with a random series of words from "Alice's Adventures in Wonderland" (a common test case for compression). In my first attempt I had used purely random bytes (from math/rand), but then realised that this made compression pretty much impossible, and probably isn't representative of real workloads anyway.

After seeding the KV, I started 100 goroutines, each sitting in a busy-loop putting KV entries (1KB of random bytes), to generate write load and advance the raft log towards snapshotting. I then observed the snapshot persistence timings for 10 minutes.
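For reference, here's a minimal sketch of that write-load phase (key layout and error handling are illustrative; it assumes the standard Consul API client pointed at the local agent).

```go
package main

import (
	"crypto/rand"
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// 100 workers busy-looping 1KB random-byte puts to advance the raft log
	// towards the (lowered) snapshot threshold.
	for i := 0; i < 100; i++ {
		go func(worker int) {
			kv := client.KV()
			buf := make([]byte, 1024)
			for {
				if _, err := rand.Read(buf); err != nil {
					log.Fatal(err)
				}
				key := fmt.Sprintf("load/%d", worker)
				if _, err := kv.Put(&api.KVPair{Key: key, Value: buf}, nil); err != nil {
					log.Printf("put failed: %v", err)
				}
			}
		}(i)
	}

	select {} // run until interrupted; snapshot persist timings show up in the agent logs
}
```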

It's certainly not a perfect or scientific benchmark, but in the spirit of time-boxing I'm sharing the results now!

Some thoughts:

  1. Whether compression is helpful or not is extremely workload-specific. My previous attempt at using purely random data is a pathological example of this, where compression was pure CPU overhead without the benefit of a reduced snapshot size (and therefore less I/O). That said, using heuristics, S2 is able to detect such uncompressible data and mostly avoid the overhead.
  2. I suspect that part of the reason S2 performed so favorably is that it is able to use many CPU cores to compress data in parallel (see the sketch after this list). My benchmark didn't observe CPU usage, so I'm not sure to what extent this would negatively impact other operations (anecdotally, write throughput seemed unaffected).
  3. Given this, I think that adding support for compression is worthwhile but would need to be tunable, such that operators are able to, for example, try out different configurations against a read-replica before enabling it in earnest.
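To illustrate point 2, here's a small example of the S2 streaming writer from `github.com/klauspost/compress/s2` (the file names are placeholders, and a real integration would go through the snapshot sink rather than standalone files). It compresses independent blocks on multiple goroutines by default, and `WriterConcurrency` can cap that if we want to bound the CPU impact.

```go
package main

import (
	"io"
	"log"
	"os"

	"github.com/klauspost/compress/s2"
)

func main() {
	src, err := os.Open("state.bin") // uncompressed snapshot data (placeholder path)
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dst, err := os.Create("state.bin.s2")
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	// Limit the encoder to 4 goroutines to reduce contention with other work.
	enc := s2.NewWriter(dst, s2.WriterConcurrency(4))
	if _, err := io.Copy(enc, src); err != nil {
		log.Fatal(err)
	}
	if err := enc.Close(); err != nil { // Close flushes and finalizes the stream
		log.Fatal(err)
	}
}
```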
