Add a Count-Min-based probabilistic relative frequency sketch #32840

Merged: 2 commits into master from vekterli/count-min-relative-frequency-sketch on Nov 14, 2024

Conversation

vekterli
Member

@havardpe please review. This is arguably a rather experimental addition (that is currently not wired to anything), but it could come in handy for gluing into the caching subsystem as part of a cache admission controller (or something else shiny) 😳👉👈

Adds an implementation of a probabilistic frequency sketch that allows for estimating the relative frequency of elements from a stream of events. That is, the sketch does not capture the absolute frequency of a given element over time.

To reduce the number of bits needed for the sketch's underlying counters, the sketch automatically decays counter values once the number of recorded samples reaches a certain point (relative to the sketch's size). Decaying divides all counters by 2.
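As a rough illustration of the decay step (the function name and the counter representation here are assumptions for illustration, not the actual code):

```cpp
#include <cstdint>
#include <span>

// Hypothetical decay step: once the number of recorded samples reaches a
// threshold derived from the sketch size, every counter is halved. Old
// observations fade away while recently hot elements keep their relative
// ordering.
void decay(std::span<uint8_t> counters) {
    for (auto& c : counters) {
        c >>= 1; // divide by 2, rounding down
    }
}
```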

The underlying data structure is a Count-Min sketch [1][2] with automatic decaying of counters based on TinyLFU [3].

This implementation has certain changes from a "textbook" CM sketch, inspired by the approach used in [4]. In particular, instead of having `d` logical rows each with width `w` that are accessed with hash-derived indexes (and thus likely triggering `d` cache misses for large values of `w`), we subdivide into w/64 blocks, each with a fixed number of d=4 rows of 32 4-bit counters, i.e. each block is exactly 64 bytes. Counter updates or reads always happen within the scope of a single block. We also ensure the block array is allocated with at least 64-byte alignment. This ensures that a given sketch update will touch exactly 1 cache line of the underlying sketch buffer (not counting cache lines occupied by the sketch object itself, as we assume these are already present in the CPU cache). Similarly, comparing the frequency of two elements will always touch at most 2 cache lines.
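For illustration, one possible layout matching the description above (the struct name is hypothetical; 4 rows × 32 4-bit counters = 512 bits = 64 bytes):

```cpp
#include <array>
#include <cstdint>

// Hypothetical 64-byte block: 4 logical rows, each holding 32 4-bit counters
// packed into two 64-bit words (4 rows * 32 counters * 4 bits = 512 bits).
// alignas(64) keeps a suitably allocated block within a single cache line.
struct alignas(64) CounterBlock {
    std::array<uint64_t, 8> words; // words[row * 2 + 0..1] hold the row's 32 nibbles
};
static_assert(sizeof(CounterBlock) == 64);
```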

The Count-Min sketch (and its cousin, the Counting Bloom Filter) using `k` counters is usually described as requiring `k` pairwise independent hash functions. This implementation treats that requirement as unnecessary given a hash function with good entropy; we instead extract non-overlapping subsets of bits from a single 64-bit hash value and use these as indices into our data structure components.
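A minimal sketch of how such non-overlapping bit groups could be carved out of a single 64-bit hash (bit positions and names are illustrative assumptions, not the actual implementation):

```cpp
#include <array>
#include <cstdint>

// Hypothetical index derivation from one well-mixed 64-bit hash: the low bits
// pick a 64-byte block, and four separate 5-bit groups pick one of the 32
// counters in each of the block's 4 rows.
struct Indexes {
    uint64_t block;                   // which block to touch
    std::array<uint32_t, 4> counter;  // counter index (0..31) per row
};

inline Indexes derive_indexes(uint64_t hash, uint64_t num_blocks) {
    Indexes idx;
    idx.block = hash & (num_blocks - 1); // num_blocks assumed to be a power of two
    for (uint32_t row = 0; row < 4; ++row) {
        idx.counter[row] = static_cast<uint32_t>((hash >> (32 + row * 5)) & 0x1f);
    }
    return idx;
}
```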

Footnotes

  1. The Count-Min Sketch and its Applications (2003)

  2. https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch

  3. TinyLFU: A Highly Efficient Cache Admission Policy (2015)

  4. Caffeine FrequencySketch: https://github.com/ben-manes/caffeine/blob/master/caffeine/src/main/java/com/github/benmanes/caffeine/cache/FrequencySketch.java

@vekterli vekterli requested a review from havardpe November 12, 2024 13:03
@ben-manes

ben-manes commented Nov 12, 2024

fwiw, I recently added the sketch benchmarks to CI and the results were opposite of what I recall. For some reason the previous version, which was not optimized for a single cache line, was faster in those results. I then saw the same on my laptop.

The commit introducing this said it was 2.4-2.6x faster. I haven't reverted to that commit to benchmark it and see if there was a regression.

Strangely, I was not the only one to observe the block sketch being faster. The improvement was ported to C# (results) and Go (results).

I plan on revisiting this once Java supports a SIMD API, since that would justify optimizations that could lead to some rework. The performance difference wasn't a problem originally, and isn't now, given how the sketch is used, so this was mostly done for fun. If your usage is more sensitive then you might dig into it, as I'm unsure which benchmark results are accurate at the moment.

@havardpe havardpe left a comment
Member

Looks good. I have some minor comments:

 * Only use 5 bits from the hash per counter selection.
 * Add a distinct `concept` for the sketch hasher (see the sketch below).
 * Add a separate `add_and_count` method that combines updating the
   frequency and returning the new min-count.
 * Remove the heuristic for _not_ counting a sample towards decay. Might
   be re-added later once we have a realistic data set to test with.
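As a rough idea of what the suggested hasher `concept` could look like (the name and exact requirements are illustrative, not taken from the patch; `add_and_count` would similarly fold the counter update and the min-count read into a single block access):

```cpp
#include <concepts>
#include <cstdint>

// Illustrative shape of a dedicated hasher concept for the sketch: any
// callable that maps a value to a full 64-bit hash with good entropy.
template <typename H, typename T>
concept SketchHasher = requires(const H& hasher, const T& value) {
    { hasher(value) } -> std::same_as<uint64_t>;
};
```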
@vekterli
Member Author

@havardpe PTAL

@vekterli vekterli requested a review from havardpe November 14, 2024 13:58
@vekterli
Member Author

@ben-manes that is a very interesting observation—and surprising! I would usually expect a significant reduction in cache misses to be highly beneficial almost by default.

If I'm interpreting the linked benchmark results correctly (using a highly sophisticated "green means good, red means bad" heuristic) it looks like frequency estimation/updates did show an increase in performance while counter decay (reset) showed a decrease in performance?

Just thinking out loud—did the previous version of reset also include counting how many odd-numbered counters were divided by 2? If not, could the inclusion of `count += Long.bitCount` in the loop have introduced a data dependency that limits instruction parallelism? Could be interesting to explicitly stride the loop by 4 and have 4 independent counters that are summed at the end...
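For illustration, a C++ take on that idea, assuming counters are packed as 4-bit nibbles in 64-bit words and the word count is a multiple of 4 (names and masks are illustrative):

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>
#include <span>

// Halve all packed 4-bit counters while counting how many odd counters were
// rounded down, using four independent accumulators so the popcount does not
// serialize successive loop iterations.
uint64_t decay_and_count_odd(std::span<uint64_t> words) {
    constexpr uint64_t one_mask   = 0x1111'1111'1111'1111ULL; // low bit of each nibble
    constexpr uint64_t halve_mask = 0x7777'7777'7777'7777ULL; // keep 3 low bits per nibble
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (size_t i = 0; i < words.size(); i += 4) {
        c0 += std::popcount(words[i + 0] & one_mask);
        c1 += std::popcount(words[i + 1] & one_mask);
        c2 += std::popcount(words[i + 2] & one_mask);
        c3 += std::popcount(words[i + 3] & one_mask);
        words[i + 0] = (words[i + 0] >> 1) & halve_mask;
        words[i + 1] = (words[i + 1] >> 1) & halve_mask;
        words[i + 2] = (words[i + 2] >> 1) & halve_mask;
        words[i + 3] = (words[i + 3] >> 1) & halve_mask;
    }
    return c0 + c1 + c2 + c3;
}
```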

@vekterli vekterli merged commit ccc22f9 into master Nov 14, 2024
3 checks passed
@vekterli vekterli deleted the vekterli/count-min-relative-frequency-sketch branch November 14, 2024 14:39
@ben-manes

The summary is hard to read since it is comparing across Java versions. Instead click on the result json link above for an individual bar chart. I refer to the old version as “flat” for lack of a better term.

My guess is that it’s Java specific and maybe I broke escape analysis. To avoid data dependencies for instruction parallelism, I accumulate the results into an array, assuming the JIT will loop unroll, etc. An array is heap allocated, like all objects, but since it never escapes the method it should be optimized to the stack, etc. If that’s no longer happening then allocations add memory pressure and limit performance.

It’s really hard to rely on implicit optimizations that you cannot force explicitly, like I can by using explicit shifts instead of power-of-2 multiplication. I might have broken some delicate structure that helped the compiler realize it could apply these types of improvements, but that’s very difficult to design or test for. I’m assuming it’s my mistake and haven’t investigated yet, so it’s just an FYI of an unexpected change.

@vekterli
Member Author

Understood, thank you very much for the heads up!

The lack of a priori insight into what the JVM is going to do with a piece of performance-critical code is a not-too-infrequent headache, which I will readily whine about to anyone unfortunate enough to be within whining distance. Godbolt can be used with Java, but it only shows the JVM bytecode, which is of dubious value.

The JIT giveth, and the JIT taketh away...

@ben-manes

I took a quick glance at the benchmark and realized a subtle refactoring might explain part of it. I originally benchmarked by simply converting the FrequencySketch between the two versions for side-by-side runs. When I moved it into a longer-term commit and did later clean-ups, I copied the old one into the test code, eagerly initialized its capacity, removed the initialization "if" guard checks on the methods, and had it implement a common interface (while using a proxy for Caffeine's actual version). That was enough to skew the results, since we are talking about instruction-level parallelism, branch prediction, inlining, etc. in the final output. What I should have done is keep it identical so there would be no unfair compiler advantage.

That now shows an improvement for large sketch sizes due to the single memory access. It is not as pronounced as 2x better, and at smaller sizes it is still slower. So there is probably more to come when I have time to dig into it; I simply stole a little time during a work meeting when my mind drifted...

@ben-manes

ben-manes commented Nov 16, 2024

Removing the `count[]` and `index[]` arrays fixed the remaining issues, so in the worst case the block-based version has equal performance. Ideally these should have been stack allocated and easily optimized by the compiler into their individual components. The intent was to avoid data dependencies so that the loop would be independent and trivially unrolled. The `Math.min` is a branch-free single-cycle intrinsic, so despite the stall it is easy for a compiler and OOO pipeline to optimize. For increment, manually unrolling the loop (like the flat version does) added another 10M ops/s to close the gap at the smallest size and increase its lead at the largest. In Java 11 the block version is always much faster, whereas in Java 23 it is slightly slower at small table sizes and shows a strong lead as the table size increases, as expected due to cache effects.

I'm still confused as to why the benchmark results differ so much from my original analysis or those by independent ports. I'm probably still missing something or was foolish before, since the gains are smaller and I did try to benchmark each optimization idea. If I restore the original commit there is only a +/- 10M ops/s difference that favors flat on small tables and block on large. That's not enough to have justified a change, nor to explain why others saw it, but perhaps the older machine (M3 Max now, 2016 Intel then) caused some skew.

Anyway, sorry for the noise but it was confusing me, I saw your changes, and felt a need to sort it out.

@ben-manes

Oh, the last remaining issue is simply the extra multiplication latency for a small table. Since it's 128 KB and the benchmark uses a Zipf distribution to mimic hot/cold cache access, this mostly stays in L1. The flat sketch does a 2-round hash and each loop iteration does its own 3rd round for the index, whereas the block-based version does a 3rd round prior to the loop to calculate all indexes. That data dependency means the cost of the multiplication is visible in a microbenchmark, whereas in the flat version it is parallelized so the cost is amortized to be cheaper. One could use bitwise mixing operations instead, but that is a bit ridiculous as it is unrelated to visible performance and merely serves to explain the benchmark results for a small table.

@vekterli
Member Author

Interesting! I don't have any instruction timings for the M-series CPUs, but at least on Skylake-era x64 a MUL looks like it has a 3-4 cycle latency, so although it's not much it's still time spent twiddling thumbs in the pipeline.

This implementation by default assumes that the provided hash function has poor entropy (which is more often than not the case with standard library hashes—these tend to be the identity function for integers 😬) and mixes it up with XXH3 once prior to use. Since this is a full 64-bit hash of (presumably) good entropy we have enough distinct bits to feed to both the block and counter index calculations.

Hmm, I suppose another difference between x64 and the M-series is 128-byte cache lines. A long array with unfortunate alignment risks pulling in (and for increments—updating) 2 cache lines of 128 bytes each for half the addressable blocks. But I would expect that large cache lines should affect the flat table even more...

@ben-manes

I think 64 bytes is still the standard cache line size, but often 2-4 lines are pulled in at once if there are no MESI restrictions. The spatial prefetcher guesses that locality might avoid compulsory misses, today's caches are large, and it's cheaper to waste space than wait for a memory access. They try to be aggressive but also back off to avoid bandwidth saturation and false sharing problems. I'm guessing that my older 2016 Intel was a lot simpler with its smaller caches, so it probably made the block version look better, whereas current hardware is so over-provisioned that it's harder to observe the benefits. Regardless, it's all for fun and it doesn't matter what we do in practice; it's already fast enough.
