feat: ApproxPercentileCont supports sketches from data source #2004

Closed
jychen7 opened this issue Mar 13, 2022 · 1 comment · Fixed by #2031
Labels
enhancement New feature or request

Comments

@jychen7
Contributor

jychen7 commented Mar 13, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, approx_percentile_cont(column, quantile) (from #1538) takes raw data as input and builds sketches at query time.
In low-latency OLAP query systems (e.g. Druid), a common approach is to pre-aggregate sketches at ingestion time (e.g. Spark/Flink -> DataStore) and then merge them at query time (e.g. DataStore -> DataFusion).

Describe the solution you'd like
Improve approx_percentile_cont(column, quantile) to accept a weight, since a pre-aggregated TDigest is just an array of (mean, weight) centroids. Because we cannot change the arguments of approx_percentile_cont without breaking backward compatibility, we can add a new aggregator called approx_percentile_cont_with_weight.

Trino has implemented a similar pair of functions: approx_percentile(x, percentage) and approx_percentile(x, w, percentage).
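
To illustrate why a weight argument is enough to consume pre-aggregated sketches, here is a minimal sketch in plain Rust. It is not DataFusion's implementation: `Centroid` and `weighted_quantile` are hypothetical names introduced only for this example, and the estimate simply picks the first centroid whose cumulative weight reaches the target rank, whereas a real TDigest also compresses centroids and interpolates between them.

```rust
/// One pre-aggregated centroid: the mean of a cluster of values and
/// the number of values it stands for. (Hypothetical type for illustration.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Centroid {
    pub mean: f64,
    pub weight: f64,
}

/// Estimate the `q`-quantile (0.0..=1.0) over weighted centroids.
/// Simplification: no interpolation and no compression; this only shows
/// how the weights enter the rank computation.
pub fn weighted_quantile(centroids: &[Centroid], q: f64) -> Option<f64> {
    if !(0.0..=1.0).contains(&q) {
        return None;
    }
    // Ignore centroids that carry no weight, then order by mean.
    let mut sorted: Vec<Centroid> = centroids
        .iter()
        .copied()
        .filter(|c| c.weight > 0.0)
        .collect();
    if sorted.is_empty() {
        return None;
    }
    sorted.sort_by(|a, b| a.mean.total_cmp(&b.mean));

    // Walk the centroids until the cumulative weight reaches the target rank.
    let total: f64 = sorted.iter().map(|c| c.weight).sum();
    let target = q * total;
    let mut cumulative = 0.0;
    for c in &sorted {
        cumulative += c.weight;
        if cumulative >= target {
            return Some(c.mean);
        }
    }
    sorted.last().map(|c| c.mean)
}
```

Under this simplification, merging sketches from several ingestion batches amounts to feeding all of their (mean, weight) rows into the same aggregation, which is what a SQL aggregate over a `column` and a `weight_column` would see.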

Describe alternatives you've considered
Improve approx_quantile(column, quantile) to accept an optional third parameter, e.g. approx_quantile(column, quantile, format), where format can be

Here is an example TSV file from Druid, which also uses the TDigest algorithm. The data contains ["timestamp", "product", "sketch"], and the sketch is encoded using TDigest verbose mode.

However, I eventually figured out that this approach is not elegant, because it relies on the store encoding sketches with TDigest's Java implementation.

@jychen7 jychen7 added the enhancement New feature or request label Mar 13, 2022
@jychen7 jychen7 changed the title from "feat: ApproxPercentileCont supports sketches as input" to "feat: ApproxPercentileCont supports sketches from data source" Mar 13, 2022
@jychen7
Contributor Author

jychen7 commented Mar 17, 2022

Improve approx_quantile(column, quantile) to accept an optional third parameter, e.g. approx_quantile(column, quantile, format)

I have implemented approx_percentile_cont_from_sketch in my fork (approx_percentile_cont_from_sketch.rs), with a test case in tests/sql/aggregates.rs and a CSV sketch file similar to Druid's, but I find it is not elegant enough.

So I want to try another way: introduce approx_percentile_cont_with_weight(column, weight_column, percentile), similar to Trino's approx_percentile(x, w, percentage).

alamb pushed a commit that referenced this issue Mar 24, 2022
* Add new aggregate function in multiple places

* implement new aggregator and test case

* rename to SessionContext (follow latest change on master branch)

* fix clippy

* fix clippy

* fix error message and add test cases for error ones