feat: ApproxPercentileCont supports sketches from data source #2004

jychen7 · 2022-03-13T15:42:51Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, approx_percentile_cont(column, quantile) (from #1538) takes raw data as input and builds sketches at query time.
In low-latency OLAP systems (e.g. Druid), a common approach is to pre-aggregate sketches at ingestion time (e.g. Spark/Flink -> DataStore) and then merge those sketches at query time (e.g. DataStore -> DataFusion).
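
To make the ingest-then-merge flow concrete, here is a minimal sketch in Python. It is an illustration only, not DataFusion's or Druid's TDigest: a sketch is modeled as a plain list of (mean, weight) centroids, fixed-width bucketing stands in for real TDigest clustering, and `build_sketch` and `merge_sketches` are hypothetical helper names.

```python
from collections import defaultdict


def build_sketch(values, bucket_width=10.0):
    """Ingestion time: collapse raw values into (mean, weight) centroids.

    Assumption: fixed-width buckets replace real TDigest clustering.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v in values:
        key = int(v // bucket_width)
        sums[key] += v
        counts[key] += 1
    return [(sums[k] / counts[k], counts[k]) for k in sorted(sums)]


def merge_sketches(sketches):
    """Query time: combine centroid lists without reading any raw data."""
    merged = [centroid for sketch in sketches for centroid in sketch]
    merged.sort(key=lambda c: c[0])  # order by centroid mean
    return merged


# Two ingestion partitions, each pre-aggregated independently.
partition_a = build_sketch([1, 2, 3, 15])
partition_b = build_sketch([12, 18, 25])
merged = merge_sketches([partition_a, partition_b])
```

The point of the design is that the query engine only ever sees the small centroid lists, so the total weight is preserved while the raw rows never leave the ingestion side.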

Describe the solution you'd like
Improve approx_percentile_cont(column, quantile) to accept a weight, since a pre-aggregated TDigest is just an array of (mean, weight) centroids. Since we probably cannot change the arguments of approx_percentile_cont without breaking backward compatibility, we can add a new aggregate function called approx_percentile_cont_with_weight.

Trino has implemented similar functions: approx_percentile(x, percentage) and approx_percentile(x, w, percentage).
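
As a rough illustration of what a weighted percentile means for (value, weight) pairs, here is a small Python sketch. It is not the DataFusion or Trino implementation: it uses a simple nearest-rank rule with no interpolation between centroids (a real TDigest interpolates), and `weighted_percentile` is a hypothetical name.

```python
def weighted_percentile(centroids, q):
    """Return the smallest value whose cumulative weight reaches q of the total.

    Assumption: nearest-rank rule, no interpolation between centroids.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    if not centroids:
        raise ValueError("centroids must not be empty")
    ordered = sorted(centroids, key=lambda c: c[0])
    total = sum(weight for _, weight in ordered)
    threshold = q * total
    running = 0.0
    for value, weight in ordered:
        running += weight
        if running >= threshold:
            return value
    return ordered[-1][0]


# Pre-aggregated input: three centroids carrying a total weight of 5.
centroids = [(10.0, 1), (20.0, 3), (30.0, 1)]
median = weighted_percentile(centroids, 0.5)

# Raw input is the special case where every row has weight 1.
raw_median = weighted_percentile([(v, 1) for v in [5, 1, 3]], 0.5)
```

Under this model the unweighted function is just the weighted one with every weight set to 1, which is why a separate with-weight variant can cover both raw data and pre-aggregated sketches.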

Describe alternatives you've considered
Improve approx_quantile(column, quantile) to accept an optional third parameter, e.g. approx_quantile(column, quantile, format), where format can be:

raw (default)
tdigest_base64
etc. (for future sketch algorithms, e.g. KLLSketch, MomentSketch, DDSketch)

Here is an example tsv file from Druid, which also uses the TDigest algorithm. The data contains the columns ["timestamp", "product", "sketch"], and the sketch is encoded using TDigest Verbose mode.

However, I eventually concluded that this approach is not elegant, because it relies on the store encoding sketches with TDigest's Java implementation.

jychen7 · 2022-03-17T03:04:54Z

Improve approx_quantile(column, quantile) to accept an optional third parameter, e.g. approx_quantile(column, quantile, format)

I have implemented approx_percentile_cont_from_sketch in my fork (approx_percentile_cont_from_sketch.rs, with a test case in tests/sql/aggregates.rs and a CSV sketch file similar to Druid's), but I find it is not elegant enough.

So I want to try another way: introduce approx_percentile_cont_with_weight(column, weight_column, percentile), similar to Trino's approx_percentile(x, w, percentage).

* Add new aggregate function in multiple places
* implement new aggregator and test case
* rename to SessionContext (follow latest change on master branch)
* fix clippy
* fix clippy
* fix error message and add test cases for error ones

jychen7 added the enhancement New feature or request label Mar 13, 2022

jychen7 changed the title ~~feat: ApproxPercentileCont supports sketches as input~~ feat: ApproxPercentileCont supports sketches from data source Mar 13, 2022

jychen7 mentioned this issue Mar 18, 2022

feat: #2004 approx percentile with weight #2031

Merged

alamb mentioned this issue Mar 24, 2022

Document approx_percentile_cont_with_weight in users guide #2078

Closed

alamb closed this as completed in #2031 Mar 24, 2022
