Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, approx_percentile_cont(column, quantile) (from #1538) takes raw data as input and builds sketches at query time.
In low-latency OLAP query systems (e.g. Druid), a common approach is to pre-aggregate sketches at ingestion time (e.g. Spark/Flink -> DataStore), then merge the sketches at query time (e.g. DataStore -> DataFusion).
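As a rough illustration of the ingest-then-merge pattern described above, here is a minimal sketch. It assumes each pre-aggregated sketch is stored as a plain list of (mean, weight) pairs. The function name `merge_sketches` is hypothetical and is not part of DataFusion or Druid, and a real TDigest would also compress the centroids after merging, which is omitted here.

```python
from collections import defaultdict


def merge_sketches(sketches):
    """Merge several pre-aggregated sketches into one.

    Each sketch is assumed to be a list of (mean, weight) pairs.
    Centroids that share the same mean have their weights summed;
    the result is sorted by mean. No TDigest compression is applied.
    """
    combined = defaultdict(float)
    for sketch in sketches:
        for mean, weight in sketch:
            combined[mean] += weight
    return sorted(combined.items())


# Two sketches produced at ingestion time, merged at query time.
ingested = [
    [(1.0, 2.0), (3.0, 1.0)],
    [(1.0, 1.0), (2.0, 4.0)],
]
merged = merge_sketches(ingested)
```

The point of the pattern is that the query engine only ever sees the small centroid lists, never the raw rows.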
Describe the solution you'd like
Improve approx_percentile_cont(column, quantile) to accept a weight, since a pre-aggregated TDigest is just an array of (mean, weight) centroids. Since we cannot change the arguments of approx_percentile_cont without breaking backward compatibility, we can add a new aggregator called approx_percentile_cont_with_weight.
Trino has implemented similar functions: approx_percentile(x, percentage) and approx_percentile(x, w, percentage).
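To make the role of the weight argument concrete, here is a small sketch of a weighted percentile over (value, weight) pairs. It is an exact cumulative-weight computation written for illustration only. It is not the TDigest interpolation that DataFusion or Trino actually perform, and the function name `weighted_percentile` is an assumption, not an existing API.

```python
def weighted_percentile(pairs, q):
    """Return the smallest value whose cumulative weight reaches q * total.

    `pairs` is an iterable of (value, weight) tuples, `q` is in [0, 1].
    This is an exact, non-interpolating illustration, not a TDigest.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(pairs)
    if not ordered:
        return None
    total = sum(weight for _, weight in ordered)
    threshold = q * total
    cumulative = 0.0
    for value, weight in ordered:
        cumulative += weight
        if cumulative >= threshold:
            return value
    return ordered[-1][0]


# The value 20.0 carries weight 3, i.e. it stands for three raw rows.
centroids = [(10.0, 1), (20.0, 3), (30.0, 1)]
median = weighted_percentile(centroids, 0.5)
```

With a weight column, each input row stands in for `weight` raw observations, which is why pre-aggregated centroids can be passed in directly.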
Describe alternatives you've considered
Improve approx_quantile(column, quantile) to accept an optional third parameter, e.g. approx_quantile(column, quantile, format), where format can be
Here is an example TSV file from Druid, which also uses the TDigest algorithm. The data contains the columns ["timestamp", "product", "sketch"], and the sketch is encoded using TDigest verbose mode.
However, I eventually realized that this approach is not elegant, because it relies on the store encoding its sketches with TDigest's Java implementation.
jychen7 changed the title from "feat: ApproxPercentileCont supports sketches as input" to "feat: ApproxPercentileCont supports sketches from data source" on Mar 13, 2022
So I want to try another way: introduce approx_percentile_cont_with_weight(column, weight_column, percentile), similar to Trino's approx_percentile(x, w, percentage).
* Add new aggregate function in multiple places
* implement new aggregator and test case
* rename to SessionContext (follow latest change on master branch)
* fix clippy
* fix clippy
* fix error message and add test cases for error ones