Add a Count-Min-based probabilistic relative frequency sketch #32840

Merged: 2 commits into master from vekterli/count-min-relative-frequency-sketch on Nov 14, 2024

Conversation

vekterli
Member

@havardpe please review. This is arguably a rather experimental addition (that is currently not wired to anything), but it could come in handy for gluing into the caching subsystem as part of a cache admission controller (or something else shiny) 😳👉👈

Adds an implementation of a probabilistic frequency sketch that allows for estimating the relative frequency of elements from a stream of events. That is, the sketch does not capture the absolute frequency of a given element over time.

To reduce the number of bits needed for the sketch's underlying counters, the sketch automatically decays counter values once the number of recorded samples reaches a certain point (relative to the sketch's size). Decaying divides all counters by 2.
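As a rough illustration of the decay step (the function name and the counter representation here are assumptions for illustration, not the actual code):

```cpp
#include <cstdint>
#include <span>

// Hypothetical decay step: once the number of recorded samples reaches a
// threshold derived from the sketch size, every counter is halved. Old
// observations fade away while recently hot elements keep their relative
// ordering.
void decay(std::span<uint8_t> counters) {
    for (auto& c : counters) {
        c >>= 1; // divide by 2, rounding down
    }
}
```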

The underlying data structure is a Count-Min sketch [1][2] with automatic decaying of counters based on TinyLFU [3].

This implementation has certain changes from a "textbook" CM sketch, inspired by the approach used in [4]. In particular, instead of having `d` logical rows each with width `w` that are accessed with hash-derived indexes (and thus likely triggering `d` cache misses for large values of `w`), we subdivide into w/64 blocks, each with a fixed number of d=4 rows of 32 4-bit counters, i.e. each block is exactly 64 bytes. Counter updates or reads always happen within the scope of a single block. We also ensure the block array is allocated with at least 64-byte alignment. This ensures that a given sketch update will touch exactly 1 cache line of the underlying sketch buffer (not counting cache lines occupied by the sketch object itself, as we assume these are already present in the CPU cache). Similarly, comparing the frequency of two elements will always touch at most 2 cache lines.
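For illustration, one possible layout matching the description above (the struct name is hypothetical; 4 rows × 32 4-bit counters = 512 bits = 64 bytes):

```cpp
#include <array>
#include <cstdint>

// Hypothetical 64-byte block: 4 logical rows, each holding 32 4-bit counters
// packed into two 64-bit words (4 rows * 32 counters * 4 bits = 512 bits).
// alignas(64) keeps a suitably allocated block within a single cache line.
struct alignas(64) CounterBlock {
    std::array<uint64_t, 8> words; // words[row * 2 + 0..1] hold the row's 32 nibbles
};
static_assert(sizeof(CounterBlock) == 64);
```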

The Count-Min sketch (and its cousin, the Counting Bloom Filter) using `k` counters is usually described as requiring `k` pairwise independent hash functions. This implementation treats that requirement as unnecessary given a hash function with good entropy; we instead extract non-overlapping subsets of bits from a single 64-bit hash value and use these as indices into our data structure components.
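A minimal sketch of how such non-overlapping bit groups could be carved out of a single 64-bit hash (bit positions and names are illustrative assumptions, not the actual implementation):

```cpp
#include <array>
#include <cstdint>

// Hypothetical index derivation from one well-mixed 64-bit hash: the low bits
// pick a 64-byte block, and four separate 5-bit groups pick one of the 32
// counters in each of the block's 4 rows.
struct Indexes {
    uint64_t block;                   // which block to touch
    std::array<uint32_t, 4> counter;  // counter index (0..31) per row
};

inline Indexes derive_indexes(uint64_t hash, uint64_t num_blocks) {
    Indexes idx;
    idx.block = hash & (num_blocks - 1); // num_blocks assumed to be a power of two
    for (uint32_t row = 0; row < 4; ++row) {
        idx.counter[row] = static_cast<uint32_t>((hash >> (32 + row * 5)) & 0x1f);
    }
    return idx;
}
```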

Footnotes

  1. The Count-Min Sketch and its Applications (2003)

  2. https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch

  3. TinyLFU: A Highly Efficient Cache Admission Policy (2015)

  4. Caffeine FrequencySketch: https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/FrequencySketch.java

@vekterli vekterli requested a review from havardpe November 12, 2024 13:03
@ben-manes

ben-manes commented Nov 12, 2024

fwiw, I recently added the sketch benchmarks to CI and the results were opposite of what I recall. For some reason the previous version, which was not optimized for a single cache line, was faster in those results. I then saw the same on my laptop.

The commit introducing this said it was 2.4-2.6x faster. I haven't reverted to that commit to benchmark it and see if there was a regression.

Strangely, I was not the only one to observe the block sketch being faster. The improvement was ported to C# (results) and Go (results).

I plan on revisiting this once Java supports a SIMD API, since that would justify optimizations that could lead to some rework. The performance difference wasn't a problem originally, and isn't now, given how the sketch is used, so this was mostly done for fun. If your usage is more sensitive then you might dig into it, as I'm unsure which benchmark results are accurate at the moment.

@havardpe havardpe left a comment
Member

Looks good. I have some minor comments:

 * Only use 5 bits from the hash per counter selection.
 * Add a distinct `concept` for the sketch hasher (see the sketch below).
 * Add a separate `add_and_count` method that combines updating the
   frequency and returning the new min-count.
 * Remove the heuristic for _not_ counting a sample towards decay. Might
   be re-added later once we have a realistic data set to test with.
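As a rough idea of what the suggested hasher `concept` could look like (the name and exact requirements are illustrative, not taken from the patch; `add_and_count` would similarly fold the counter update and the min-count read into a single block access):

```cpp
#include <concepts>
#include <cstdint>

// Illustrative shape of a dedicated hasher concept for the sketch: any
// callable that maps a value to a full 64-bit hash with good entropy.
template <typename H, typename T>
concept SketchHasher = requires(const H& hasher, const T& value) {
    { hasher(value) } -> std::same_as<uint64_t>;
};
```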
@vekterli
Member Author

@havardpe PTAL

@vekterli vekterli requested a review from havardpe November 14, 2024 13:58
@vekterli
Member Author

@ben-manes that is a very interesting observation—and surprising! I would usually expect a significant reduction in cache misses to be highly beneficial almost by default.

If I'm interpreting the linked benchmark results correctly (using a highly sophisticated "green means good, red means bad" heuristic) it looks like frequency estimation/updates did show an increase in performance while counter decay (reset) showed a decrease in performance?

Just thinking out loud—did the previous version of reset also include counting how many odd-numbered counters were divided by 2? If not, could the inclusion of `count += Long.bitCount` in the loop have introduced a data dependency that limits instruction parallelism? Could be interesting to explicitly stride the loop by 4 and have 4 independent counters that are summed at the end...
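For illustration, a C++ take on that idea, assuming counters are packed as 4-bit nibbles in 64-bit words and the word count is a multiple of 4 (names and masks are illustrative):

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>
#include <span>

// Halve all packed 4-bit counters while counting how many odd counters were
// rounded down, using four independent accumulators so the popcount does not
// serialize successive loop iterations.
uint64_t decay_and_count_odd(std::span<uint64_t> words) {
    constexpr uint64_t one_mask   = 0x1111'1111'1111'1111ULL; // low bit of each nibble
    constexpr uint64_t halve_mask = 0x7777'7777'7777'7777ULL; // keep 3 low bits per nibble
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (size_t i = 0; i < words.size(); i += 4) {
        c0 += std::popcount(words[i + 0] & one_mask);
        c1 += std::popcount(words[i + 1] & one_mask);
        c2 += std::popcount(words[i + 2] & one_mask);
        c3 += std::popcount(words[i + 3] & one_mask);
        words[i + 0] = (words[i + 0] >> 1) & halve_mask;
        words[i + 1] = (words[i + 1] >> 1) & halve_mask;
        words[i + 2] = (words[i + 2] >> 1) & halve_mask;
        words[i + 3] = (words[i + 3] >> 1) & halve_mask;
    }
    return c0 + c1 + c2 + c3;
}
```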

@vekterli vekterli merged commit ccc22f9 into master Nov 14, 2024
3 checks passed
@vekterli vekterli deleted the vekterli/count-min-relative-frequency-sketch branch November 14, 2024 14:39
@ben-manes

The summary is hard to read since it is comparing across Java versions. Instead click on the result json link above for an individual bar chart. I refer to the old version as “flat” for lack of a better term.

My guess is that it’s Java specific and maybe I broke escape analysis. To avoid data dependencies for instruction parallelism, I accumulate the results into an array, assuming the JIT will loop unroll, etc. An array is heap allocated, like all objects, but since it never escapes the method it should be optimized to the stack, etc. If that’s no longer happening then allocations add memory pressure and limit performance.

It’s really hard to rely on implicit optimizations that you cannot force explicitly, like I can by using explicit shifts instead of power-of-2 multiplication. I might have broken some delicate structure that helped the compiler realize it could apply these types of improvements, but that’s very difficult to design or test for. I’m assuming it’s my mistake and haven’t investigated yet, so it’s just an FYI of an unexpected change.

@vekterli
Member Author

Understood, thank you very much for the heads up!

The lack of a priori insight into what the JVM is going to do with a piece of performance-critical code is a not-too-infrequent headache, which I will readily whine about to anyone unfortunate enough to be within whining distance. Godbolt can be used with Java, but it only shows the JVM bytecode, which is of dubious value.

The JIT giveth, and the JIT taketh away...

@ben-manes

I took a quick glance at the benchmark and realized a subtle refactoring might explain part of it. I originally benchmarked by simply converting the FrequencySketch between the two versions for side-by-side runs. When I moved it into a longer-term commit and did later clean-ups, I copied the old one into the test code, eagerly initialized its capacity, removed the initialization "if" guard checks on the methods, and had it implement a common interface (while using a proxy for Caffeine's actual version). That was enough to skew the results, since we are talking about instruction-level parallelism, branch prediction, inlining, etc. in the final output. What I should have done is keep it identical so there would be no unfair compiler advantage.

That now shows an improvement for large sketch sizes due to the single memory access. It is not as pronounced as 2x better, and at smaller sizes it is still slower. So there is probably more to come when I have time to dig into it; I simply stole a little time during a work meeting when my mind drifted...

@ben-manes

ben-manes commented Nov 16, 2024

Removing the `count[]` and `index[]` arrays fixed the remaining issues, so in the worst case the block-based version has equal performance. Ideally these should have been stack allocated and easily optimized by the compiler into their individual components. The intent was to avoid data dependencies so that the loop would be independent and trivially unrolled. The `Math.min` is a branch-free single-cycle intrinsic, so despite the stall it is easy for a compiler and OOO pipeline to optimize. For increment, manually unrolling the loop (like the flat version does) added another 10M ops/s to close the gap at the smallest size and increase its lead at the largest. In Java 11 the block version is always much faster, whereas in Java 23 it is slightly slower at small table sizes and shows a strong lead as the table size increases, as expected due to cache effects.

I'm still confused as to why the benchmark results differ so much from my original analysis or those by independent ports. I'm probably still missing something or was foolish before, since the gains are smaller and I did try to benchmark each optimization idea. If I restore the original commit there is only a +/- 10M ops/s difference that favors flat on small tables and block on large. That's not enough to have justified a change, nor to explain why others saw it, but perhaps the older machine (M3 Max now, 2016 Intel then) caused some skew.

Anyway, sorry for the noise but it was confusing me, I saw your changes, and felt a need to sort it out.

@ben-manes

Oh, the last remaining issue is simply the extra multiplication latency for a small table. Since it's 128 KB and the benchmark uses a Zipf distribution to mimic hot/cold cache access, this mostly stays in L1. The flat sketch does a 2-round hash and each loop iteration does its own 3rd round for the index, whereas the block-based version does a 3rd round prior to the loop to calculate all indexes. That data dependency means the cost of the multiplication is visible in a microbenchmark, whereas in the flat version it is parallelized so the cost is amortized to be cheaper. One could use bitwise mixing operations instead, but that is a bit ridiculous as it is unrelated to visible performance and merely serves to explain the benchmark results for a small table.

@vekterli
Member Author

Interesting! I don't have any instruction timings for the M-series CPUs, but at least on Skylake-era x64 a MUL looks like it has a 3-4 cycle latency, so although it's not much it's still time spent twiddling thumbs in the pipeline.

This implementation by default assumes that the provided hash function has poor entropy (which is more often than not the case with standard library hashes—these tend to be the identity function for integers 😬) and mixes it up with XXH3 once prior to use. Since this is a full 64-bit hash of (presumably) good entropy we have enough distinct bits to feed to both the block and counter index calculations.

Hmm, I suppose another difference between x64 and the M-series is 128-byte cache lines. A long array with unfortunate alignment risks pulling in (and for increments—updating) 2 cache lines of 128 bytes each for half the addressable blocks. But I would expect that large cache lines should affect the flat table even more...

@ben-manes

I think 64 bytes is still the standard cache line size, but often 2-4 lines are pulled in at once if there are no MESI restrictions. The spatial prefetcher guesses that locality might avoid compulsory misses, today's caches are large, and it's cheaper to waste space than wait for a memory access. They try to be aggressive but also back off to avoid bandwidth saturation and false sharing problems. I'm guessing that my older 2016 Intel was a lot simpler with its smaller caches, so it probably made the block version look better, whereas current hardware is so over-provisioned that it's harder to observe the benefits. Regardless, it's all for fun and it doesn't matter what we do in practice; it's already fast enough.
