Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement approx_distinct function using HyperLogLog #1087

Merged
merged 1 commit into from
Oct 12, 2021

Conversation

jimexist
Copy link
Member

@jimexist jimexist commented Oct 8, 2021

Which issue does this PR close?

Based on #1095 so review that first

Closes #1083

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added ballista datafusion Changes in the datafusion crate documentation Improvements or additions to documentation labels Oct 8, 2021
@@ -54,6 +54,10 @@ arrow = { version = "^5.3", features = ["prettyprint"] }
parquet = { version = "^5.3", features = ["arrow"] }
sqlparser = "0.11"
paste = "^1.0"
# these two are for approx_distinct
twox-hash = "1.6.1"
pdatastructs = "0.6.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pdatastructs author here: I should probably release a new version since it contains a few smaller fixes. Also if you have any trouble w/ this crate or need any features, feel free to ping me.

Copy link
Member Author

@jimexist jimexist Oct 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crepererum thanks for the help! Indeed I can think of something missing that can be useful here, because in aggregation sometimes hll values will be merged (from different record batches). in that case it's vital that HyperLogLog values shall be serde compatible, i.e. able to write to and then read form binary form (e.g. cbor format or bincode format) so it's then encoded as Arrow values. I wonder if that's something possible to be added

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am all for reusing code if possible, but I wonder what you think of bringing any code needed from pdatastructs into the DataFusion crate itself?

I am thinking that given the goals of DataFusion as an embedded query engine as well as its active body of maintainers, we might better serve our users and ourselves by keeping the dependency chain smaller, especially for code that may be unlikely to change much

I also noticed that not much in crates.io seems to depend yet on pdatastructs (which I take as a measure of their relative maturity): https://crates.io/crates/pdatastructs/reverse_dependencies

On the other hand, perhaps starting to use it more actively will bring in more users to the pdatastructs community

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think of bringing any code needed from pdatastructs into the DataFusion crate itself?

i think it's a good idea. @crepererum do you mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Serde support can definitely be arranged (crepererum-oss/pdatastructs.rs#61).

Regarding the code move:
The goal of the pdatastructs project is to provide these kind of data structures (sketches as they're mostly called) to a wider audience, because I my view they are undervalued by developers and Rust is a good platform to implement them in a performant yet readable way. It is also for that reason that the documentation tries to explain quite some internals of the data structures. I don't mind if you copy code around, but I'm won't give up the original pdatastructs crate for that. With the latter I think it's unavoidable that the code will diverge on the long run. I you're worried about the dependency set pdatastructs itself, we can talk about that however (e.g. by using feature flags for the different structs or splitting it into smaller crates).

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really cool @jimexist -- very cool

In terms of naming: I think approx_distinct is a good choice as it seems both presto link and timescale link uses approx_distinct

@@ -54,6 +54,10 @@ arrow = { version = "^5.3", features = ["prettyprint"] }
parquet = { version = "^5.3", features = ["arrow"] }
sqlparser = "0.11"
paste = "^1.0"
# these two are for approx_distinct
twox-hash = "1.6.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to use XXHash compared to ahash which is already a dependency: https://docs.rs/ahash/0.7.4/ahash/?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes i think ahash64 is a good candidate

@@ -54,6 +54,10 @@ arrow = { version = "^5.3", features = ["prettyprint"] }
parquet = { version = "^5.3", features = ["arrow"] }
sqlparser = "0.11"
paste = "^1.0"
# these two are for approx_distinct
twox-hash = "1.6.1"
pdatastructs = "0.6.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am all for reusing code if possible, but I wonder what you think of bringing any code needed from pdatastructs into the DataFusion crate itself?

I am thinking that given the goals of DataFusion as an embedded query engine as well as its active body of maintainers, we might better serve our users and ourselves by keeping the dependency chain smaller, especially for code that may be unlikely to change much

I also noticed that not much in crates.io seems to depend yet on pdatastructs (which I take as a measure of their relative maturity): https://crates.io/crates/pdatastructs/reverse_dependencies

On the other hand, perhaps starting to use it more actively will bring in more users to the pdatastructs community

@@ -59,6 +59,8 @@ pub enum AggregateFunction {
Max,
/// avg
Avg,
/// Approximate aggregate function
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Approximate aggregate function
/// Approximate avg function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not Avg? Approx distinct count.

{
/// new approx_distinct accumulator
pub fn new() -> Self {
// TODO use xx_hash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems as if xxhash is not used yet -- I do wonder if we can use ahash here instead which is used elsewhere in DataFusion for hashing (e.g. gby hash and joins)

}

fn state(&self) -> Result<Vec<ScalarValue>> {
// TODO: maybe use binary type so that merge can work?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a binary type as the intermediate so calling [hll::merge()](https://docs.rs/pdatastructs/0.6.0/x86_64-pc-windows-msvc/pdatastructs/hyperloglog/struct.HyperLogLog.html#method.merge) would be a good idea

@jimexist jimexist force-pushed the add-approx-distinct-fun branch 5 times, most recently from 3bf646c to 612dafe Compare October 11, 2021 10:48
@jimexist jimexist marked this pull request as ready for review October 11, 2021 10:50
@jimexist jimexist force-pushed the add-approx-distinct-fun branch 2 times, most recently from fc93237 to eec2da7 Compare October 11, 2021 12:52
@jimexist jimexist changed the title implement approx_distinct implement approx_distinct function using HyperLogLog Oct 11, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the code and it looks quite good @jimexist 👍 Looks like it may need a rebase after #1095

I also played around with it for a bit:

alamb@MacBook-Pro arrow-datafusion % cat /tmp/foo.csv 
1
2
3
NULL
NULL
NULL
NULL
5
5
DataFusion CLI v5.1.0-SNAPSHOT

CREATE EXTERNAL TABLE foo(x varchar)
STORED AS CSV
LOCATION '/tmp/foo.csv';

> select cast(x as varchar) from foo;

+---------------------+
| CAST(foo.x AS Utf8) |
+---------------------+
| 1                   |
| 2                   |
| 3                   |
| NULL                |
| NULL                |
| NULL                |
| NULL                |
| 5                   |
| 5                   |
+---------------------+
9 rows in set. Query took 0.010 seconds.

+-------+----------------+-----------------+
| count | count_distinct | approx_distinct |
+-------+----------------+-----------------+
| 9     | 5              | 5               |
+-------+----------------+-----------------+
1 row in set. Query took 0.025 seconds.

👍

fn create_accumulator(&self) -> Result<Box<dyn Accumulator>> {
let accumulator: Box<dyn Accumulator> = match &self.input_data_type {
// TODO u8, i8, u16, i16 shall really be done using bitmap, not HLL
// TODO support for boolean (trivial case)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supporting these data types might be good projects for new contributors ("good first project") type things. If you would like I can file the tickets.

impl<T: Hash> From<&HyperLogLog<T>> for ScalarValue {
fn from(v: &HyperLogLog<T>) -> ScalarValue {
let values = v.as_ref().to_vec();
ScalarValue::Binary(Some(values))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jimexist jimexist force-pushed the add-approx-distinct-fun branch 4 times, most recently from 8bd7976 to 12ad1ff Compare October 11, 2021 23:53
Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks really good!

Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks solid 👍

@jimexist jimexist merged commit 80c309a into apache:master Oct 12, 2021
@jimexist jimexist deleted the add-approx-distinct-fun branch October 12, 2021 10:44
@alamb
Copy link
Contributor

alamb commented Oct 12, 2021

🎉 can't wait to try this out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement approx_distinct using HyperLogLog
5 participants