
percentiles metric aggregation #1763

Closed
PSeitz opened this issue Jan 6, 2023 · 6 comments · Fixed by #1984
PSeitz commented Jan 6, 2023

A percentiles aggregation returns the value at each requested percentile (e.g. the 75th, 85th, 95th, and 99th).

Percentiles show the point at which a certain percentage of observed values occur. For example, the 95th percentile is the value which is greater than 95% of the observed values.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-percentile-aggregation.html
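To make the definition concrete, here is a minimal sketch of the nearest-rank method on a sorted slice (illustrative Rust, not tantivy's implementation):

```rust
// Hedged sketch: nearest-rank percentile over a sorted slice.
// `pct` is in (0, 100]; names are illustrative, not tantivy's API.
fn percentile(sorted: &[f64], pct: f64) -> f64 {
    assert!(!sorted.is_empty() && pct > 0.0 && pct <= 100.0);
    // Rank of the smallest value that is >= pct% of the observations.
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank - 1]
}

fn main() {
    let mut values: Vec<f64> = (1..=100).map(|v| v as f64).collect();
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Under nearest-rank, the 95th percentile of 1..=100 is 95.
    println!("{}", percentile(&values, 95.0)); // prints "95"
}
```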


PSeitz commented Apr 3, 2023

Algorithm Requirements

  1. Unknown Bounds
    We have a global min and max, but they may differ a lot from the actual min/max of a bucket. Filtering can be done by the search query or by sub-aggregations. This means that the algorithm should be able to handle variable bounds and adapt to the given data.

  2. Distributed
    Every segment is separate in tantivy. The algorithm should be able to handle data that is distributed across different segments, and be able to merge the results from these segments to produce the final output.

  3. Precision
    The algorithm should provide an accurate representation of the percentiles being calculated. Some algorithms guarantee a maximum relative error; others, like t-digest, don't provide specific guarantees.

  4. Streaming
    The algorithm should be single-pass.

  5. Static memory allocation
    This is not a hard requirement, but ideally a percentiles aggregation would not have a large static upfront allocation (e.g. a preallocated histogram). A query could compute the percentiles per service, and there may be hundreds of services (= hundreds of percentile aggregations).
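Requirements 2 and 4 essentially describe a mergeable, single-pass summary. A hypothetical interface (names are illustrative, not tantivy's actual aggregation API), together with the trivial exact baseline, could look like:

```rust
// Hypothetical interface capturing requirement 2 (distributed) and
// requirement 4 (streaming); not tantivy's actual API.
trait QuantileSketch: Sized {
    /// Single-pass: observe one value at a time (requirement 4).
    fn insert(&mut self, value: f64);
    /// Merge per-segment sketches into one (requirement 2).
    fn merge(&mut self, other: Self);
    /// Query the final estimate, `pct` in (0, 100].
    fn quantile(&self, pct: f64) -> Option<f64>;
}

// Trivial exact implementation (the "AllValues" baseline): unbounded
// memory, but it demonstrates the per-segment collect-then-merge flow.
#[derive(Default)]
struct AllValues(Vec<f64>);

impl QuantileSketch for AllValues {
    fn insert(&mut self, value: f64) {
        self.0.push(value);
    }
    fn merge(&mut self, other: Self) {
        self.0.extend(other.0);
    }
    fn quantile(&self, pct: f64) -> Option<f64> {
        if self.0.is_empty() {
            return None;
        }
        let mut sorted = self.0.clone();
        sorted.sort_by(|a, b| a.total_cmp(b));
        // Nearest-rank selection on the merged, sorted values.
        let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
        Some(sorted[rank.saturating_sub(1)])
    }
}

fn main() {
    // One sketch per segment, merged at the end.
    let mut seg1 = AllValues::default();
    let mut seg2 = AllValues::default();
    (1..=50).for_each(|v| seg1.insert(v as f64));
    (51..=100).for_each(|v| seg2.insert(v as f64));
    seg1.merge(seg2);
    println!("{:?}", seg1.quantile(95.0)); // prints "Some(95.0)"
}
```

A bounded-memory algorithm (t-digest, HDR Histogram, DDSketch) would implement the same shape of interface while replacing the raw `Vec` with a compact summary.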

Elasticsearch

Elasticsearch uses two algorithms: T-Digest and HDR Histogram.

Crates in Rust


PSeitz commented Apr 4, 2023

Here are some insights from a comparison of different algorithms:

- AllValues stores all values in a `Vec` and retrieves the exact values from the sorted data.
- Worse than just storing AllValues (memory, speed, accuracy).
- If there are not many values, keeping only the array is preferable.

COUNT=1_000, TDIGEST_BATCH=500, TDIGEST_MAX_SIZE=300, HDR_SIGFIG=3, DDSketch2Err=0.01

| Distribution | Algorithm | Time | Peak Memory | Serialized Size | p50 | p95 | p99 | p99.9 | p99.99 |
|---|---|---|---|---|---|---|---|---|---|
| LogNorm Distribution | AllValues | 0.000s | 12k | 7k | 20.59 | 102.24 | 213.66 | 341.40 | 341.40 |
| LogNorm Distribution | TDigest | 0.000s | 13k | 4k | 19.89 | 101.11 | 203.38 | 458.16 | 527.81 |
| LogNorm Distribution | HDRHistogram | 0.000s | 16k | 226 | 19.00 | 110.00 | 212.00 | 595.00 | 595.00 |
| LogNorm Distribution | DDSketch | 0.000s | 4k | 3k | 19.89 | 100.49 | 186.82 | 820.71 | 820.71 |
| LogNorm Distribution | DDSketch2 | 0.000s | 4k | unavailable | 19.70 | 92.28 | 180.94 | 361.88 | 361.88 |

COUNT=1_000_000, TDIGEST_BATCH=500, TDIGEST_MAX_SIZE=300, HDR_SIGFIG=3, DDSketch2Err=0.01

| Distribution | Algorithm | Time | Peak Memory | Serialized Size | p50 | p95 | p99 | p99.9 | p99.99 |
|---|---|---|---|---|---|---|---|---|---|
| LogNorm Distribution | AllValues | 0.091s | 12099k | 7812k | 19.99 | 100.23 | 195.43 | 406.74 | 756.09 |
| LogNorm Distribution | TDigest | 0.038s | 13k | 4k | 20.02 | 100.18 | 193.34 | 413.25 | 744.75 |
| LogNorm Distribution | HDRHistogram | 0.026s | 32k | 1232 | 20.00 | 99.00 | 194.00 | 415.00 | 759.00 |
| LogNorm Distribution | DDSketch | 0.032s | 4k | 4k | 19.89 | 100.49 | 194.44 | 415.78 | 727.90 |
| LogNorm Distribution | DDSketch2 | 0.023s | 8k | unavailable | 20.09 | 99.92 | 195.89 | 407.74 | 768.17 |

COUNT=1_000_000, TDIGEST_BATCH=500, TDIGEST_MAX_SIZE=300, HDR_SIGFIG=3, DDSketch2Err=0.01

| Distribution | Algorithm | Time | Peak Memory | Serialized Size | p50 | p95 | p99 | p99.9 | p99.99 |
|---|---|---|---|---|---|---|---|---|---|
| LogNorm Distribution 1000x | AllValues | 0.092s | 12099k | 7812k | 19987.59 | 100233.58 | 195429.27 | 406742.90 | 756094.39 |
| LogNorm Distribution 1000x | TDigest | 0.039s | 13k | 4k | 20015.22 | 100182.47 | 193339.03 | 413251.05 | 744747.28 |
| LogNorm Distribution 1000x | HDRHistogram | 0.026s | 192k | 14k | 20015.00 | 100031.00 | 194047.00 | 415487.00 | 760319.00 |
| LogNorm Distribution 1000x | DDSketch | 0.030s | 4k | 4k | 20136.32 | 99741.16 | 192982.67 | 412660.72 | 722447.25 |
| LogNorm Distribution 1000x | DDSketch2 | 0.022s | 8k | unavailable | 20176.12 | 100318.96 | 196688.92 | 409372.62 | 756158.34 |

While HDRHistogram seems to do better, it has a severe limitation: it only operates on u64 values, so some use cases cannot be covered with it.
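A common workaround is to scale floats to integers with a fixed precision before recording. The sketch below is an assumption-laden illustration (the `SCALE` constant and helper names are arbitrary, not part of any crate's API) and shows why this doesn't cover all use cases: negative and out-of-range values cannot be encoded at all.

```rust
// Hedged sketch: mapping f64 values onto u64 so they can be fed to an
// integer-only histogram such as HDRHistogram. SCALE is an assumption:
// it fixes how many decimal digits survive, and negative values or
// values overflowing u64 after scaling are not representable.
const SCALE: f64 = 1_000.0; // keep ~3 decimal digits

fn to_u64(value: f64) -> Option<u64> {
    let scaled = value * SCALE;
    if scaled < 0.0 || scaled > u64::MAX as f64 {
        return None; // outside the representable range
    }
    Some(scaled.round() as u64)
}

fn from_u64(bucket: u64) -> f64 {
    bucket as f64 / SCALE
}

fn main() {
    println!("{:?}", to_u64(20.594)); // prints "Some(20594)"
    println!("{}", from_u64(20594)); // prints "20.594"
    println!("{:?}", to_u64(-1.0)); // prints "None": negatives are lost
}
```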

@fulmicoton
Collaborator

@PSeitz I naively assumed that t-digest would have a footprint close to TDIGEST_MAX_SIZE x SOME_CONSTANT_CLOSE_TO_8. Here the amount of memory per bucket seems to be 1 kB, which seems like a lot. How do we explain that?
Are you reporting memory footprint here, or serialized size? The latter is also important, as we will have to send those digests over the wire.


PSeitz commented Apr 4, 2023

@fulmicoton I use allocator hooks to track peak allocation (PSeitz/stats_alloc@d925d3c). A bugfix was missing in the earlier measurement.

T-Digest handles updates in batches, the previous number of 20_000 values was too high, 1_000 seems to be a better fit.

I updated the tables above with the new measurements and added a serialized-size column (serialized with bincode).

@fulmicoton
Collaborator

Another challenger: DDSketch, https://arxiv.org/pdf/1908.10693.pdf


PSeitz commented Apr 5, 2023

I updated the tables above to include two DDSketch implementations.

Full results and benchmark source code are here:
https://github.com/PSeitz/quantile_compare
