-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement approx_distinct
function using HyperLogLog
#1087
Conversation
datafusion/Cargo.toml
Outdated
@@ -54,6 +54,10 @@ arrow = { version = "^5.3", features = ["prettyprint"] } | |||
parquet = { version = "^5.3", features = ["arrow"] } | |||
sqlparser = "0.11" | |||
paste = "^1.0" | |||
# these two are for approx_distinct | |||
twox-hash = "1.6.1" | |||
pdatastructs = "0.6.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pdatastructs
author here: I should probably release a new version since it contains a few smaller fixes. Also if you have any trouble w/ this crate or need any features, feel free to ping me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@crepererum thanks for the help! Indeed I can think of something missing that can be useful here, because in aggregation sometimes hll values will be merged (from different record batches). in that case it's vital that HyperLogLog
values shall be serde
compatible, i.e. able to write to and then read form binary form (e.g. cbor format or bincode format) so it's then encoded as Arrow values. I wonder if that's something possible to be added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am all for reusing code if possible, but I wonder what you think of bringing any code needed from pdatastructs
into the DataFusion crate itself?
I am thinking that given the goals of DataFusion as an embedded query engine as well as its active body of maintainers, we might better serve our users and ourselves by keeping the dependency chain smaller, especially for code that may be unlikely to change much
I also noticed that not much in crates.io seems to depend yet on pdatastructs
(which I take as a measure of their relative maturity): https://crates.io/crates/pdatastructs/reverse_dependencies
On the other hand, perhaps starting to use it more actively will bring in more users to the pdatastructs
community
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think of bringing any code needed from pdatastructs into the DataFusion crate itself?
i think it's a good idea. @crepererum do you mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Serde support can definitely be arranged (crepererum-oss/pdatastructs.rs#61).
Regarding the code move:
The goal of the pdatastructs project is to provide these kind of data structures (sketches as they're mostly called) to a wider audience, because I my view they are undervalued by developers and Rust is a good platform to implement them in a performant yet readable way. It is also for that reason that the documentation tries to explain quite some internals of the data structures. I don't mind if you copy code around, but I'm won't give up the original pdatastructs crate for that. With the latter I think it's unavoidable that the code will diverge on the long run. I you're worried about the dependency set pdatastructs itself, we can talk about that however (e.g. by using feature flags for the different structs or splitting it into smaller crates).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datafusion/Cargo.toml
Outdated
@@ -54,6 +54,10 @@ arrow = { version = "^5.3", features = ["prettyprint"] } | |||
parquet = { version = "^5.3", features = ["arrow"] } | |||
sqlparser = "0.11" | |||
paste = "^1.0" | |||
# these two are for approx_distinct | |||
twox-hash = "1.6.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason to use XXHash
compared to ahash
which is already a dependency: https://docs.rs/ahash/0.7.4/ahash/?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes i think ahash64 is a good candidate
datafusion/Cargo.toml
Outdated
@@ -54,6 +54,10 @@ arrow = { version = "^5.3", features = ["prettyprint"] } | |||
parquet = { version = "^5.3", features = ["arrow"] } | |||
sqlparser = "0.11" | |||
paste = "^1.0" | |||
# these two are for approx_distinct | |||
twox-hash = "1.6.1" | |||
pdatastructs = "0.6.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am all for reusing code if possible, but I wonder what you think of bringing any code needed from pdatastructs
into the DataFusion crate itself?
I am thinking that given the goals of DataFusion as an embedded query engine as well as its active body of maintainers, we might better serve our users and ourselves by keeping the dependency chain smaller, especially for code that may be unlikely to change much
I also noticed that not much in crates.io seems to depend yet on pdatastructs
(which I take as a measure of their relative maturity): https://crates.io/crates/pdatastructs/reverse_dependencies
On the other hand, perhaps starting to use it more actively will bring in more users to the pdatastructs
community
@@ -59,6 +59,8 @@ pub enum AggregateFunction { | |||
Max, | |||
/// avg | |||
Avg, | |||
/// Approximate aggregate function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Approximate aggregate function | |
/// Approximate avg function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not Avg? Approx distinct count.
{ | ||
/// new approx_distinct accumulator | ||
pub fn new() -> Self { | ||
// TODO use xx_hash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems as if xxhash is not used yet -- I do wonder if we can use ahash
here instead which is used elsewhere in DataFusion for hashing (e.g. gby hash and joins)
} | ||
|
||
fn state(&self) -> Result<Vec<ScalarValue>> { | ||
// TODO: maybe use binary type so that merge can work? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a binary type as the intermediate so calling [hll::merge()
](https://docs.rs/pdatastructs/0.6.0/x86_64-pc-windows-msvc/pdatastructs/hyperloglog/struct.HyperLogLog.html#method.merge) would be a good idea
6df147d
to
f023477
Compare
3bf646c
to
612dafe
Compare
fc93237
to
eec2da7
Compare
approx_distinct
function using HyperLogLog
eec2da7
to
55f7dbc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the code and it looks quite good @jimexist 👍 Looks like it may need a rebase after #1095
I also played around with it for a bit:
alamb@MacBook-Pro arrow-datafusion % cat /tmp/foo.csv
1
2
3
NULL
NULL
NULL
NULL
5
5
DataFusion CLI v5.1.0-SNAPSHOT
CREATE EXTERNAL TABLE foo(x varchar)
STORED AS CSV
LOCATION '/tmp/foo.csv';
> select cast(x as varchar) from foo;
+---------------------+
| CAST(foo.x AS Utf8) |
+---------------------+
| 1 |
| 2 |
| 3 |
| NULL |
| NULL |
| NULL |
| NULL |
| 5 |
| 5 |
+---------------------+
9 rows in set. Query took 0.010 seconds.
+-------+----------------+-----------------+
| count | count_distinct | approx_distinct |
+-------+----------------+-----------------+
| 9 | 5 | 5 |
+-------+----------------+-----------------+
1 row in set. Query took 0.025 seconds.
👍
fn create_accumulator(&self) -> Result<Box<dyn Accumulator>> { | ||
let accumulator: Box<dyn Accumulator> = match &self.input_data_type { | ||
// TODO u8, i8, u16, i16 shall really be done using bitmap, not HLL | ||
// TODO support for boolean (trivial case) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supporting these data types might be good projects for new contributors ("good first project") type things. If you would like I can file the tickets.
impl<T: Hash> From<&HyperLogLog<T>> for ScalarValue { | ||
fn from(v: &HyperLogLog<T>) -> ScalarValue { | ||
let values = v.as_ref().to_vec(); | ||
ScalarValue::Binary(Some(values)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
8bd7976
to
12ad1ff
Compare
070a405
to
4cf3c1c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks really good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks solid 👍
🎉 can't wait to try this out |
Which issue does this PR close?
Based on #1095 so review that first
Closes #1083
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?