
percentiles metric aggregation #1763

Closed
PSeitz opened this issue Jan 6, 2023 · 6 comments · Fixed by #1984
PSeitz commented Jan 6, 2023

A percentiles aggregation returns the value at each requested percentile (e.g. the 75th, 85th, 95th, and 99th).

Percentiles show the point at which a certain percentage of observed values occur. For example, the 95th percentile is the value which is greater than 95% of the observed values.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-percentile-aggregation.html
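To make the definition concrete, here is a minimal sketch of the nearest-rank method on a sorted slice (illustrative Rust, not tantivy's implementation):

```rust
// Hedged sketch: nearest-rank percentile over a sorted slice.
// `pct` is in (0, 100]; names are illustrative, not tantivy's API.
fn percentile(sorted: &[f64], pct: f64) -> f64 {
    assert!(!sorted.is_empty() && pct > 0.0 && pct <= 100.0);
    // Rank of the smallest value that is >= pct% of the observations.
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank - 1]
}

fn main() {
    let mut values: Vec<f64> = (1..=100).map(|v| v as f64).collect();
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Under nearest-rank, the 95th percentile of 1..=100 is 95.
    println!("{}", percentile(&values, 95.0)); // prints "95"
}
```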


PSeitz commented Apr 3, 2023

Algorithm Requirements

  1. Unknown Bounds
    We have a global min and max, but they may differ a lot from the actual min/max of a bucket. Filtering can be done by the search query or by sub-aggregations. This means that the algorithm should be able to handle variable bounds and adapt to the given data.

  2. Distributed
    Every segment is separate in tantivy. The algorithm should be able to handle data that is distributed across different segments, and be able to merge the results from these segments to produce the final output.

  3. Precision
    The algorithm should provide an accurate representation of the percentiles being calculated. Some algorithms guarantee a maximum relative error; others, like t-digest, don't provide specific guarantees.

  4. Streaming
    The algorithm should be single-pass.

  5. Static memory allocation
    This is not a hard requirement, but ideally a percentiles aggregation would not have a large static upfront allocation (e.g. a preallocated histogram). A query could compute the percentiles per service, and there may be hundreds of services (= hundreds of percentile aggregations).
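Requirements 2 and 4 essentially describe a mergeable, single-pass summary. A hypothetical interface (names are illustrative, not tantivy's actual aggregation API), together with the trivial exact baseline, could look like:

```rust
// Hypothetical interface capturing requirement 2 (distributed) and
// requirement 4 (streaming); not tantivy's actual API.
trait QuantileSketch: Sized {
    /// Single-pass: observe one value at a time (requirement 4).
    fn insert(&mut self, value: f64);
    /// Merge per-segment sketches into one (requirement 2).
    fn merge(&mut self, other: Self);
    /// Query the final estimate, `pct` in (0, 100].
    fn quantile(&self, pct: f64) -> Option<f64>;
}

// Trivial exact implementation (the "AllValues" baseline): unbounded
// memory, but it demonstrates the per-segment collect-then-merge flow.
#[derive(Default)]
struct AllValues(Vec<f64>);

impl QuantileSketch for AllValues {
    fn insert(&mut self, value: f64) {
        self.0.push(value);
    }
    fn merge(&mut self, other: Self) {
        self.0.extend(other.0);
    }
    fn quantile(&self, pct: f64) -> Option<f64> {
        if self.0.is_empty() {
            return None;
        }
        let mut sorted = self.0.clone();
        sorted.sort_by(|a, b| a.total_cmp(b));
        // Nearest-rank selection on the merged, sorted values.
        let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
        Some(sorted[rank.saturating_sub(1)])
    }
}

fn main() {
    // One sketch per segment, merged at the end.
    let mut seg1 = AllValues::default();
    let mut seg2 = AllValues::default();
    (1..=50).for_each(|v| seg1.insert(v as f64));
    (51..=100).for_each(|v| seg2.insert(v as f64));
    seg1.merge(seg2);
    println!("{:?}", seg1.quantile(95.0)); // prints "Some(95.0)"
}
```

A bounded-memory algorithm (t-digest, HDR Histogram, DDSketch) would implement the same shape of interface while replacing the raw `Vec` with a compact summary.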

Elasticsearch

Elasticsearch uses two algorithms: T-Digest and HDR Histogram.

Crates in Rust


PSeitz commented Apr 4, 2023

Here are some insights from a comparison of different algorithms:

- AllValues stores all values in a `Vec` and retrieves the exact values from the sorted data.
- Worse than just storing AllValues (memory, speed, accuracy).
- If there are not many values, keeping only the array is preferable.

COUNT=1_000, TDIGEST_BATCH=500, TDIGEST_MAX_SIZE=300, HDR_SIGFIG=3, DDSketch2Err=0.01

| Distribution | Algorithm | Time | Peak Memory | Serialized Size | p50 | p95 | p99 | p99.9 | p99.99 |
|---|---|---|---|---|---|---|---|---|---|
| LogNorm Distribution | AllValues | 0.000s | 12k | 7k | 20.59 | 102.24 | 213.66 | 341.40 | 341.40 |
| LogNorm Distribution | TDigest | 0.000s | 13k | 4k | 19.89 | 101.11 | 203.38 | 458.16 | 527.81 |
| LogNorm Distribution | HDRHistogram | 0.000s | 16k | 226 | 19.00 | 110.00 | 212.00 | 595.00 | 595.00 |
| LogNorm Distribution | DDSketch | 0.000s | 4k | 3k | 19.89 | 100.49 | 186.82 | 820.71 | 820.71 |
| LogNorm Distribution | DDSketch2 | 0.000s | 4k | unavailable | 19.70 | 92.28 | 180.94 | 361.88 | 361.88 |

COUNT=1_000_000, TDIGEST_BATCH=500, TDIGEST_MAX_SIZE=300, HDR_SIGFIG=3, DDSketch2Err=0.01

| Distribution | Algorithm | Time | Peak Memory | Serialized Size | p50 | p95 | p99 | p99.9 | p99.99 |
|---|---|---|---|---|---|---|---|---|---|
| LogNorm Distribution | AllValues | 0.091s | 12099k | 7812k | 19.99 | 100.23 | 195.43 | 406.74 | 756.09 |
| LogNorm Distribution | TDigest | 0.038s | 13k | 4k | 20.02 | 100.18 | 193.34 | 413.25 | 744.75 |
| LogNorm Distribution | HDRHistogram | 0.026s | 32k | 1232 | 20.00 | 99.00 | 194.00 | 415.00 | 759.00 |
| LogNorm Distribution | DDSketch | 0.032s | 4k | 4k | 19.89 | 100.49 | 194.44 | 415.78 | 727.90 |
| LogNorm Distribution | DDSketch2 | 0.023s | 8k | unavailable | 20.09 | 99.92 | 195.89 | 407.74 | 768.17 |

COUNT=1_000_000, TDIGEST_BATCH=500, TDIGEST_MAX_SIZE=300, HDR_SIGFIG=3, DDSketch2Err=0.01

| Distribution | Algorithm | Time | Peak Memory | Serialized Size | p50 | p95 | p99 | p99.9 | p99.99 |
|---|---|---|---|---|---|---|---|---|---|
| LogNorm Distribution 1000x | AllValues | 0.092s | 12099k | 7812k | 19987.59 | 100233.58 | 195429.27 | 406742.90 | 756094.39 |
| LogNorm Distribution 1000x | TDigest | 0.039s | 13k | 4k | 20015.22 | 100182.47 | 193339.03 | 413251.05 | 744747.28 |
| LogNorm Distribution 1000x | HDRHistogram | 0.026s | 192k | 14k | 20015.00 | 100031.00 | 194047.00 | 415487.00 | 760319.00 |
| LogNorm Distribution 1000x | DDSketch | 0.030s | 4k | 4k | 20136.32 | 99741.16 | 192982.67 | 412660.72 | 722447.25 |
| LogNorm Distribution 1000x | DDSketch2 | 0.022s | 8k | unavailable | 20176.12 | 100318.96 | 196688.92 | 409372.62 | 756158.34 |

While HDRHistogram seems to do better, it has a severe limitation: it only operates on u64 values, so some use cases cannot be covered with it.
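A common workaround is to scale floats to integers with a fixed precision before recording. The sketch below is an assumption-laden illustration (the `SCALE` constant and helper names are arbitrary, not part of any crate's API) and shows why this doesn't cover all use cases: negative and out-of-range values cannot be encoded at all.

```rust
// Hedged sketch: mapping f64 values onto u64 so they can be fed to an
// integer-only histogram such as HDRHistogram. SCALE is an assumption:
// it fixes how many decimal digits survive, and negative values or
// values overflowing u64 after scaling are not representable.
const SCALE: f64 = 1_000.0; // keep ~3 decimal digits

fn to_u64(value: f64) -> Option<u64> {
    let scaled = value * SCALE;
    if scaled < 0.0 || scaled > u64::MAX as f64 {
        return None; // outside the representable range
    }
    Some(scaled.round() as u64)
}

fn from_u64(bucket: u64) -> f64 {
    bucket as f64 / SCALE
}

fn main() {
    println!("{:?}", to_u64(20.594)); // prints "Some(20594)"
    println!("{}", from_u64(20594)); // prints "20.594"
    println!("{:?}", to_u64(-1.0)); // prints "None": negatives are lost
}
```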

@fulmicoton
Collaborator

@PSeitz I naively assumed that t-digest would have a footprint close to TDIGEST_MAX_SIZE x SOME_CONSTANT_CLOSE_TO_8. Here the amount of memory per bucket seems to be 1 kB, which seems like a lot. How do we explain that?
Are you reporting memory footprint here, or serialized size? The latter is also important, as we will have to send those digests over the wire.


PSeitz commented Apr 4, 2023

@fulmicoton I use allocator hooks to track peak allocation (PSeitz/stats_alloc@d925d3c). A bugfix was missing in the earlier measurement.

T-Digest handles updates in batches, the previous number of 20_000 values was too high, 1_000 seems to be a better fit.

I updated the tables above with the new measurements and added a serialized-size column (serialized with bincode).

@fulmicoton
Collaborator

Another challenger: DDSketch, https://arxiv.org/pdf/1908.10693.pdf


PSeitz commented Apr 5, 2023

I updated the tables above to include two DDSketch implementations.

Full results and benchmark source code are here:
https://github.com/PSeitz/quantile_compare
