Add a Count-Min-based probabilistic relative frequency sketch #32840
Conversation
Adds an implementation of a probabilistic frequency sketch that allows for estimating the relative frequency of elements from a stream of events. That is, the sketch does not capture the _absolute_ frequency of a given element over time.

To reduce the number of bits required for the sketch's underlying counters, this sketch uses automatic decaying of counter values once the number of recorded samples reaches a certain point (relative to the sketch's size). Decaying divides all counters by 2.

The underlying data structure is a Count-Min sketch [0][1] with automatic decaying of counters based on TinyLFU [2].

This implementation has certain changes from a "textbook" CM sketch, inspired by the approach used in [3]. In particular, instead of having `d` logical rows, each with width `w`, that are accessed with hash-derived indexes (and thus likely triggering `d` cache misses for large values of `w`), we subdivide into `w/64` blocks, each with a fixed d=4 rows of 32 4-bit counters, i.e. each block is exactly 64 bytes. Counter updates or reads always happen within the scope of a single block. We also ensure the block array is allocated with at least a 64-byte alignment. This ensures that a given sketch update will touch exactly 1 cache line of the underlying sketch buffer (not counting cache lines occupied by the sketch object itself, as we assume these are already present in the cache). Similarly, comparing the frequency of two elements will always touch at most 2 cache lines.

The Count-Min sketch (and its cousin, the Counting Bloom Filter) using `k` counters is usually described as requiring `k` pairwise independent hash functions. This implementation assumes that requirement is unnecessary given a hash function with good entropy; we instead extract non-overlapping subsets of bits of a single hash value and use these as indices into our data structure components.

References:

[0]: The Count-Min Sketch and its Applications (2003)
[1]: https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
[2]: TinyLFU: A Highly Efficient Cache Admission Policy (2015)
[3]: https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/FrequencySketch.java
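To make the layout above concrete, here is a minimal, hedged C++ sketch of the block-based structure. The class name, the exact bit-slicing of the hash, and the modulo-based block selection are illustrative assumptions, not the PR's actual vespalib implementation (which would more likely use a power-of-two block count and masking, and trigger decay automatically).

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch only: each 64-byte block holds d=4 rows of 32 4-bit
// counters; one element touches one counter per row, all inside a single block
// (i.e. a single cache line).
class BlockSketch {
public:
    explicit BlockSketch(size_t num_blocks) : _blocks(num_blocks) {}

    // Record one occurrence of the element identified by a well-mixed 64-bit hash.
    void add(uint64_t hash) {
        Block& b = _blocks[block_index(hash)];
        for (int row = 0; row < 4; ++row) {
            increment(b, row, counter_index(hash, row));
        }
    }

    // Count-Min estimate: the minimum over the element's 4 per-row counters.
    uint8_t count(uint64_t hash) const {
        const Block& b = _blocks[block_index(hash)];
        uint8_t est = 15;
        for (int row = 0; row < 4; ++row) {
            est = std::min(est, read(b, row, counter_index(hash, row)));
        }
        return est;
    }

    // Decay: halve every counter. The real sketch triggers this once the number
    // of recorded samples reaches a threshold relative to the sketch size.
    void decay() {
        for (Block& b : _blocks) {
            for (uint64_t& word : b.words) {
                // Shift all 16 nibbles right by one and clear bits that crossed
                // nibble boundaries, halving each 4-bit counter in parallel.
                word = (word >> 1) & 0x7777777777777777ULL;
            }
        }
    }

private:
    // 4 rows x 32 counters x 4 bits = 64 bytes; alignas keeps each block on its
    // own cache line (std::vector honors over-alignment since C++17).
    struct alignas(64) Block {
        std::array<uint64_t, 8> words{}; // 2 words per row, 16 counters per word
    };

    size_t block_index(uint64_t hash) const {
        return static_cast<size_t>(hash >> 32) % _blocks.size(); // upper bits pick the block
    }
    // Non-overlapping 5-bit slices of the low bits each pick one of 32 counters.
    static uint32_t counter_index(uint64_t hash, int row) {
        return static_cast<uint32_t>(hash >> (row * 5)) & 31;
    }
    static uint8_t read(const Block& b, int row, uint32_t idx) {
        return (b.words[row * 2 + (idx >> 4)] >> ((idx & 15) * 4)) & 0xF;
    }
    static void increment(Block& b, int row, uint32_t idx) {
        uint64_t& word = b.words[row * 2 + (idx >> 4)];
        unsigned shift = (idx & 15) * 4;
        if (((word >> shift) & 0xF) != 0xF) { // saturate at 15
            word += (uint64_t{1} << shift);
        }
    }

    std::vector<Block> _blocks;
};
```

Note how both `add` and `count` only ever touch one `Block`, which is exactly the single-cache-line property the description emphasizes.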
FWIW, I recently added the sketch benchmarks to CI and the results were the opposite of what I recall. For some reason the previous version, which was not optimized for a single cache line, was faster in those results. I then saw the same on my laptop. The commit introducing this said it was 2.4-2.6x faster. I haven't reverted to benchmark at that commit to see whether there was a regression. Strangely, I was not the only one to observe the block sketch being faster; the improvement was ported to C# (results) and Go (results). I plan on revisiting this once Java supports a SIMD API, since that would justify optimizations that could lead to some rework. Due to how the sketch is used, the performance difference wasn't a problem originally and isn't now, so it was mostly done for fun. If your usage is more sensitive then you might dig into it, as I'm unsure which benchmark results are accurate at the moment.
Looks good. I have some minor comments
* Only use 5 bits from the hash per counter selection.
* Add a distinct `concept` for the sketch hasher (a possible shape is sketched below).
* Add a separate `add_and_count` method that combines updating the frequency and returning the new min-count.
* Remove the heuristic for _not_ counting a sample towards decay. It might be re-added later once we have a realistic data set to test with.
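For illustration, a possible shape of the hasher `concept` and the `add_and_count` call mentioned in the list above. The concept name and the exact signatures are assumptions, not the PR's actual API.

```cpp
#include <concepts>
#include <cstdint>

// Hypothetical constraint for the sketch hasher: invocable on the element type
// and yielding a full 64-bit hash value (name and exact form are assumptions).
template <typename H, typename T>
concept SketchHasher = requires(const H& h, const T& v) {
    { h(v) } -> std::convertible_to<uint64_t>;
};

// Assumed usage of the combined update + estimate operation: record one
// occurrence of `elem` and get the updated Count-Min estimate back in one call,
// so the element is only hashed and its block only located once.
//
//   uint32_t freq = sketch.add_and_count(elem);
```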
@havardpe PTAL
@ben-manes that is a very interesting observation, and a surprising one! I would usually expect a significant reduction in cache misses to be highly beneficial almost by default. If I'm interpreting the linked benchmark results correctly (using a highly sophisticated "green means good, red means bad" heuristic), it looks like frequency estimation/updates did show an increase in performance while counter decay [...]. Just thinking out loud: did the previous version of [...]?
The summary is hard to read since it is comparing across Java versions. Instead, click on the result json link above for an individual bar chart. I refer to the old version as "flat" for lack of a better term. My guess is that it's Java-specific and maybe I broke escape analysis. To avoid data dependencies for instruction parallelism, I accumulate the results into an array, assuming the JIT will loop unroll, etc. An array is heap allocated, like all objects, but since it never escapes the method it should be optimized to the stack, etc. If that's no longer happening then allocations add memory pressure and limit performance. It's really hard to rely on implicit optimizations that you cannot force explicitly, like I can by using explicit shifts instead of power-of-two multiplication. I might have broken some delicate structure that helped the compiler realize it could apply these types of improvements, but that's very difficult to design or test for. I'm assuming it's my mistake and haven't investigated yet, so this is just an FYI about an unexpected change.
Understood, thank you very much for the heads-up! The lack of a priori insight into what the JVM is going to do with a piece of performance-critical code is a not-too-infrequent headache, one I will readily whine about to anyone unfortunate enough to be within whining distance. Godbolt can be used with Java, but it only shows the JVM bytecode, which is of dubious value. The JIT giveth, and the JIT taketh away...
I took a quick glance at the benchmark and realized a subtle refactoring might explain part of it. I originally benchmarked by simply converting the [...]. That now shows an improvement for large sketch sizes due to the single memory access. It is not as pronounced as 2x better, and at smaller sizes it is still slower. So probably more to come when I have time to dig into it; I only stole a little time during a work meeting when my mind drifted...
Removing the [...]. I'm still confused as to why the benchmark results differ so much from my original analysis or those of the independent ports. I'm probably still missing something, or was foolish before, since the gains are smaller and I did try to benchmark each optimization idea. If I restore the original commit there is only a +/- 10M ops/s difference, favoring flat on small tables and block on large. That's not enough to have justified a change, nor to explain why others saw it, but perhaps the older machine (M3 Max now, 2016 Intel then) caused some skew. Anyway, sorry for the noise, but it was confusing me, I saw your changes, and felt a need to sort it out.
Oh, the last remaining issue is simply the extra multiplication latency for a small table. Since it's 128 KB and the benchmark uses a Zipf distribution to mimic hot/cold cache access, it mostly stays in L1. The flat sketch does a 2-round hash and each loop iteration does its own 3rd round for the index, whereas the block-based sketch does a 3rd round prior to the loop to calculate all indexes. That data dependency makes the cost of the multiplication visible in a microbenchmark, whereas in the flat sketch it is parallelized, so the cost is amortized to be cheaper. One could use bitwise mixing operations instead, but that is a bit ridiculous, as it is unrelated to visible performance and merely serves to explain the benchmark results for a small table.
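A hedged illustration (not Caffeine's or this PR's actual code) of the dependency difference described above: the block-based variant puts one extra mixing round on the critical path before all four indexes, while the flat variant lets each row's mix proceed independently.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Stand-in mixing round (a murmur-style finalizer), representing the
// "3rd hash round" mentioned above.
static uint64_t rehash(uint64_t x) {
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
    return x ^ (x >> 33);
}

int main() {
    uint64_t h = rehash(0x12345678ULL); // the common "2-round" base hash
    std::array<uint32_t, 4> block_idx{};
    std::array<uint32_t, 4> flat_idx{};

    // Block-based: one extra mix up front; every counter index depends on its
    // result, so the multiply latency sits on the critical path of the loop.
    uint64_t mixed = rehash(h);
    for (int row = 0; row < 4; ++row) {
        block_idx[row] = (mixed >> (row * 5)) & 31;
    }

    // Flat: each row derives its index with its own independent mixing round,
    // so the CPU can overlap the multiplies across rows and the same arithmetic
    // cost is amortized.
    for (int row = 0; row < 4; ++row) {
        flat_idx[row] = static_cast<uint32_t>(rehash(h + row)) & 31;
    }

    std::printf("block: %u, flat: %u\n", block_idx[0], flat_idx[0]);
    return 0;
}
```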
Interesting! I don't have any instruction timings for the M-series CPUs, but at least on Skylake-era x64 a [...]. This implementation by default assumes that the provided hash function has poor entropy (which is more often than not the case with standard library hashes; these tend to be the identity function for integers 😬) and mixes it up with XXH3 once prior to use. Since this is a full 64-bit hash of (presumably) good entropy, we have enough distinct bits to feed both the block and counter index calculations. Hmm, I suppose another difference between x64 and the M-series is 128-byte cache lines. A [...]
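To make the "mix once before use" step concrete, here is a hedged sketch. The comment above says XXH3 is used; a splitmix64-style finalizer stands in here so the example stays dependency-free, and `sketch_hash` is a hypothetical helper name, not part of the PR.

```cpp
#include <cstdint>
#include <functional>

// splitmix64-style finalizer, standing in for the single XXH3 mixing pass
// described above (the actual implementation's choice of mixer differs).
static uint64_t mix64(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Hypothetical helper: std::hash<int> is typically the identity function, so
// without the extra mix both the low bits (counter selection) and high bits
// (block selection) would carry very little entropy for small integer keys.
template <typename T>
uint64_t sketch_hash(const T& value) {
    return mix64(static_cast<uint64_t>(std::hash<T>{}(value)));
}
```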
I think 64 bytes is still the standard cache line size, but often 2-4 lines are pulled in at once if there are no MESI restrictions. The spatial prefetcher guesses that locality might avoid compulsory misses; today's caches are large, and it's cheaper to waste space than to wait for a memory access. They try to be aggressive but also back off to avoid bandwidth saturation and false-sharing problems. I'm guessing that my older 2016 Intel was a lot simpler with its smaller caches, so it probably made the block version look better, whereas current hardware is so over-provisioned that it's harder to observe the benefits. Regardless, it's all for fun and it doesn't matter what we do in practice; it's already fast enough.
@havardpe please review. This is arguably a rather experimental addition (that is currently not wired to anything), but it could come in handy for gluing into the caching subsystem as part of a cache admission controller (or something else shiny) 😳👉👈