implement `approx_distinct` function using HyperLogLog #1087

jimexist · 2021-10-08T13:34:24Z

Which issue does this PR close?

Based on #1095 so review that first

Closes #1083

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

crepererum · 2021-10-08T13:59:15Z

datafusion/Cargo.toml

@@ -54,6 +54,10 @@ arrow = { version = "^5.3", features = ["prettyprint"] }
 parquet = { version = "^5.3", features = ["arrow"] }
 sqlparser = "0.11"
 paste = "^1.0"
+# these two are for approx_distinct
+twox-hash = "1.6.1"
+pdatastructs = "0.6.0"


pdatastructs author here: I should probably release a new version since it contains a few smaller fixes. Also if you have any trouble w/ this crate or need any features, feel free to ping me.

@crepererum thanks for the help! Indeed, I can think of something missing that would be useful here: in aggregation, HLL values are sometimes merged (from different record batches). In that case it's vital that HyperLogLog values be serde compatible, i.e. able to be written to and then read from a binary form (e.g. CBOR or bincode format) so that they can be encoded as Arrow values. I wonder if that's something that could be added.

I am all for reusing code if possible, but I wonder what you think of bringing any code needed from pdatastructs into the DataFusion crate itself?

I am thinking that given the goals of DataFusion as an embedded query engine as well as its active body of maintainers, we might better serve our users and ourselves by keeping the dependency chain smaller, especially for code that may be unlikely to change much

I also noticed that not much in crates.io seems to depend yet on pdatastructs (which I take as a measure of their relative maturity): https://crates.io/crates/pdatastructs/reverse_dependencies

On the other hand, perhaps starting to use it more actively will bring in more users to the pdatastructs community

> think of bringing any code needed from pdatastructs into the DataFusion crate itself?

i think it's a good idea. @crepererum do you mind?

Serde support can definitely be arranged (crepererum-oss/pdatastructs.rs#61).

Regarding the code move:
The goal of the pdatastructs project is to provide these kinds of data structures (sketches, as they're mostly called) to a wider audience, because in my view they are undervalued by developers and Rust is a good platform to implement them in a performant yet readable way. It is also for that reason that the documentation tries to explain quite some internals of the data structures. I don't mind if you copy code around, but I won't give up the original pdatastructs crate for that. With the latter, I think it's unavoidable that the code will diverge in the long run. If you're worried about the dependency set of pdatastructs itself, we can talk about that, however (e.g. by using feature flags for the different structs or splitting it into smaller crates).

alamb

This looks really cool @jimexist -- very cool

In terms of naming: I think approx_distinct is a good choice, as it seems both Presto and Timescale use approx_distinct

alamb · 2021-10-08T17:26:18Z

datafusion/Cargo.toml

@@ -54,6 +54,10 @@ arrow = { version = "^5.3", features = ["prettyprint"] }
 parquet = { version = "^5.3", features = ["arrow"] }
 sqlparser = "0.11"
 paste = "^1.0"
+# these two are for approx_distinct
+twox-hash = "1.6.1"


Is there any reason to use XXHash compared to ahash which is already a dependency: https://docs.rs/ahash/0.7.4/ahash/?

yes i think ahash64 is a good candidate

alamb · 2021-10-08T17:26:39Z

datafusion/Cargo.toml

@@ -54,6 +54,10 @@ arrow = { version = "^5.3", features = ["prettyprint"] }
 parquet = { version = "^5.3", features = ["arrow"] }
 sqlparser = "0.11"
 paste = "^1.0"
+# these two are for approx_distinct
+twox-hash = "1.6.1"
+pdatastructs = "0.6.0"


I am all for reusing code if possible, but I wonder what you think of bringing any code needed from pdatastructs into the DataFusion crate itself?

I am thinking that given the goals of DataFusion as an embedded query engine as well as its active body of maintainers, we might better serve our users and ourselves by keeping the dependency chain smaller, especially for code that may be unlikely to change much

I also noticed that not much in crates.io seems to depend yet on pdatastructs (which I take as a measure of their relative maturity): https://crates.io/crates/pdatastructs/reverse_dependencies

On the other hand, perhaps starting to use it more actively will bring in more users to the pdatastructs community

alamb · 2021-10-08T17:27:12Z

datafusion/src/physical_plan/aggregates.rs

@@ -59,6 +59,8 @@ pub enum AggregateFunction {
    Max,
    /// avg
    Avg,
+    /// Approximate aggregate function


Suggested change

-    /// Approximate aggregate function
+    /// Approximate avg function

Not Avg? Approx distinct count.

alamb · 2021-10-08T17:28:12Z

datafusion/src/physical_plan/expressions/approx_distinct.rs

+{
+    /// new approx_distinct accumulator
+    pub fn new() -> Self {
+        // TODO use xx_hash


It seems as if xxhash is not used yet -- I do wonder if we can use ahash here instead, which is used elsewhere in DataFusion for hashing (e.g. group by hashing and joins)
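One thing worth keeping in mind when picking the hasher: sketches that are later merged have to hash equal values identically, so the hasher should not carry per-instance random state. A minimal sketch of that property using only the standard library (`hash64` and `FixedState` are hypothetical names for illustration, and `DefaultHasher` merely stands in for whichever hash function the PR ends up using):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{BuildHasher, BuildHasherDefault, Hash, Hasher};

/// Hypothetical helper: hash one value to 64 bits with the given builder.
fn hash64<T: Hash, S: BuildHasher>(value: &T, build: &S) -> u64 {
    let mut hasher = build.build_hasher();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Assumed alias: a builder without per-instance random keys, so every
/// accumulator created from it maps equal values to equal hashes.
type FixedState = BuildHasherDefault<DefaultHasher>;
```

Two independently constructed `FixedState` builders produce the same hash for the same value, which is what lets partial states from different accumulators be combined.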

alamb · 2021-10-08T17:34:53Z

datafusion/src/physical_plan/expressions/approx_distinct.rs

+    }
+
+    fn state(&self) -> Result<Vec<ScalarValue>> {
+        // TODO: maybe use binary type so that merge can work?


I think using a binary type as the intermediate state, so that [hll::merge()](https://docs.rs/pdatastructs/0.6.0/x86_64-pc-windows-msvc/pdatastructs/hyperloglog/struct.HyperLogLog.html#method.merge) can be called, would be a good idea
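To illustrate why a byte-oriented intermediate state makes merging straightforward, here is a toy register array. It is not the pdatastructs API and not the code in this PR: `ToyHll`, `P`, `to_bytes` and `from_bytes` are assumed names, and the cardinality estimate is omitted entirely. It only shows that registers can round-trip through bytes and merge by element-wise maximum:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assumed precision for this toy: 2^P one-byte registers.
const P: u32 = 8;
const M: usize = 1 << P;

#[derive(Clone, Debug, PartialEq)]
struct ToyHll {
    registers: Vec<u8>,
}

impl ToyHll {
    fn new() -> Self {
        ToyHll { registers: vec![0; M] }
    }

    /// The top P hash bits pick a register; the register keeps the largest
    /// "leading zeros + 1" seen among the remaining bits.
    fn add<T: Hash>(&mut self, value: &T) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let hash = hasher.finish();
        let index = (hash >> (64 - P)) as usize;
        let rest = hash << P; // drop the index bits
        let rank = (rest.leading_zeros().min(64 - P) + 1) as u8;
        if rank > self.registers[index] {
            self.registers[index] = rank;
        }
    }

    /// Intermediate state as plain bytes, one byte per register.
    fn to_bytes(&self) -> Vec<u8> {
        self.registers.clone()
    }

    /// Rebuild a sketch from bytes; rejects input of the wrong length.
    fn from_bytes(bytes: &[u8]) -> Option<Self> {
        if bytes.len() == M {
            Some(ToyHll { registers: bytes.to_vec() })
        } else {
            None
        }
    }

    /// Merging keeps the element-wise maximum of the two register arrays.
    fn merge(&mut self, other: &ToyHll) {
        for (mine, theirs) in self.registers.iter_mut().zip(&other.registers) {
            *mine = (*mine).max(*theirs);
        }
    }
}
```

Merging a sketch rebuilt from bytes gives the same registers as feeding every value into a single sketch, which is the property partial aggregation across record batches relies on.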

alamb

I reviewed the code and it looks quite good @jimexist 👍 Looks like it may need a rebase after #1095

I also played around with it for a bit:

alamb@MacBook-Pro arrow-datafusion % cat /tmp/foo.csv 
1
2
3
NULL
NULL
NULL
NULL
5
5

DataFusion CLI v5.1.0-SNAPSHOT

CREATE EXTERNAL TABLE foo(x varchar)
STORED AS CSV
LOCATION '/tmp/foo.csv';

> select cast(x as varchar) from foo;

+---------------------+
| CAST(foo.x AS Utf8) |
+---------------------+
| 1                   |
| 2                   |
| 3                   |
| NULL                |
| NULL                |
| NULL                |
| NULL                |
| 5                   |
| 5                   |
+---------------------+
9 rows in set. Query took 0.010 seconds.

+-------+----------------+-----------------+
| count | count_distinct | approx_distinct |
+-------+----------------+-----------------+
| 9     | 5              | 5               |
+-------+----------------+-----------------+
1 row in set. Query took 0.025 seconds.

👍

alamb · 2021-10-11T20:20:34Z

datafusion/src/physical_plan/expressions/approx_distinct.rs

+    fn create_accumulator(&self) -> Result<Box<dyn Accumulator>> {
+        let accumulator: Box<dyn Accumulator> = match &self.input_data_type {
+            // TODO u8, i8, u16, i16 shall really be done using bitmap, not HLL
+            // TODO support for boolean (trivial case)


Supporting these data types might make good projects for new contributors ("good first project" type things). If you would like, I can file the tickets.
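For the small integer case mentioned in the TODO, a bitmap gives an exact answer in fixed space. A rough sketch for `u8`, assuming a hypothetical `U8Bitmap` type (an `i8` could be cast to `u8` first; a 16-bit type would need 65,536 bits, i.e. 8 KiB):

```rust
/// Hypothetical exact distinct counter for u8: one bit per possible value.
#[derive(Clone, Debug, Default, PartialEq)]
struct U8Bitmap {
    words: [u64; 4], // 4 * 64 = 256 bits
}

impl U8Bitmap {
    /// Set the bit that corresponds to `value`.
    fn add(&mut self, value: u8) {
        let v = value as usize;
        self.words[v / 64] |= 1u64 << (v % 64);
    }

    /// Merging two partial states is a bitwise OR.
    fn merge(&mut self, other: &U8Bitmap) {
        for (mine, theirs) in self.words.iter_mut().zip(other.words.iter()) {
            *mine |= *theirs;
        }
    }

    /// The number of distinct values is the number of set bits.
    fn count(&self) -> u32 {
        self.words.iter().map(|w| w.count_ones()).sum()
    }
}
```

Unlike HLL, the result here is exact, and the state is only 32 bytes.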

alamb · 2021-10-11T20:21:31Z

datafusion/src/physical_plan/expressions/approx_distinct.rs

+impl<T: Hash> From<&HyperLogLog<T>> for ScalarValue {
+    fn from(v: &HyperLogLog<T>) -> ScalarValue {
+        let values = v.as_ref().to_vec();
+        ScalarValue::Binary(Some(values))


Dandandan

Overall looks really good!

houqp

Looks solid 👍

alamb · 2021-10-12T14:25:48Z

🎉 can't wait to try this out

github-actions bot added the ballista, datafusion, and documentation labels Oct 8, 2021

alamb mentioned this pull request Oct 8, 2021

Implement approx_distinct using HyperLogLog #1083

Closed

crepererum reviewed Oct 8, 2021

alamb reviewed Oct 8, 2021

jimexist force-pushed the add-approx-distinct-fun branch from 6df147d to f023477 Compare October 9, 2021 14:16

jimexist mentioned this pull request Oct 10, 2021

add hyperloglog implementation (add and count) #1095

Merged

jimexist force-pushed the add-approx-distinct-fun branch 5 times, most recently from 3bf646c to 612dafe Compare October 11, 2021 10:48

jimexist requested review from alamb, Dandandan and houqp October 11, 2021 10:48

jimexist marked this pull request as ready for review October 11, 2021 10:50

jimexist force-pushed the add-approx-distinct-fun branch 2 times, most recently from fc93237 to eec2da7 Compare October 11, 2021 12:52

jimexist changed the title ~~implement approx_distinct~~ implement approx_distinct function using HyperLogLog Oct 11, 2021

jimexist force-pushed the add-approx-distinct-fun branch from eec2da7 to 55f7dbc Compare October 11, 2021 12:56

alamb approved these changes Oct 11, 2021

jimexist force-pushed the add-approx-distinct-fun branch 4 times, most recently from 8bd7976 to 12ad1ff Compare October 11, 2021 23:53

add approx_distinct function

4cf3c1c

jimexist force-pushed the add-approx-distinct-fun branch from 070a405 to 4cf3c1c Compare October 12, 2021 00:44

Dandandan approved these changes Oct 12, 2021

houqp approved these changes Oct 12, 2021

jimexist merged commit 80c309a into apache:master Oct 12, 2021

jimexist deleted the add-approx-distinct-fun branch October 12, 2021 10:44
