-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add dictionary_expresions feature (#4386) #4999
Add dictionary_expresions feature (#4386) #4999
Conversation
@@ -204,7 +204,7 @@ async fn csv_filter_with_file_col() -> Result<()> { | |||
); | |||
|
|||
let result = ctx | |||
.sql("SELECT c1, c2 FROM t WHERE date='2021-10-27' and date!=c1 LIMIT 5") | |||
.sql("SELECT c1, c2 FROM t WHERE date='2021-10-27' and c1!='2021-10-27' LIMIT 5") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't see a compelling reason why this test needed to test comparison of dictionary columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the partitioning columns in a ListingTable are dictionary encoded and this test is verifying the encoding. I think we should put the test back (and gate it with a #cfg
directive)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to me -- it would be nice to have dictionary support enabled by default, but not required
@@ -35,12 +35,15 @@ path = "src/lib.rs" | |||
[features] | |||
crypto_expressions = ["md-5", "sha2", "blake2", "blake3"] | |||
default = ["crypto_expressions", "regex_expressions", "unicode_expressions"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep dictionary support as a default, if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fairly strongly disagree, it is pretty esoteric. As a data point, none of IOx's integration tests require this, and we use dictionaries a LOT 😄
It is important to highlight this isn't "dictionary support" but non-scalar, binary dictionary kernels which are pretty unusual in practice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fairly strongly disagree, it is pretty esoteric.
I am not sure how to quantify how esoteric the feature is or how commonly it is used. Clearly IOx uses it. I was just thinking that this PR changes the default behavior
But maybe that is ok.
Perhaps some other committers have thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clearly IOx uses it
That is the point I'm trying to make, IOx doesn't use it, at least not within any of its tests. A user theoretically could construct a query that directly compares dictionary columns, in practice there are extremely limited use-cases that come to mind of this.
This feature was only enabled in #4168 prior to that it was disabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at least not within any of its tests.
🤔 when I didn't enable the dyn dictionary kernels in arrow iox tests failed in some past version
We have it enabled here: https://github.com/influxdata/influxdb_iox/blob/6f39ae342e64848bd6555bddbc1d3fa30050f75e/arrow_util/Cargo.toml#L12
Benchmark runs are scheduled for baseline = 5d4038a and contender = 9f498bb. 9f498bb is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
It would appear enabling the dyn_arith_dict kernels causes the CI runner to exceed the available disk space 😱 Not really sure how best to workaround this... |
Which issue does this PR close?
Closes #4386
Rationale for this change
Release build of datafusion-cli on master
With this feature
What changes are included in this PR?
Adds a
dictionary_expressions
feature that gatesarrow/dyn_cmp_dict
andarrow/dyn_arith_dict
. I vacillated a bit on the naming and whether this should enablearrow/dyn_arith_dict
but figured as this is just plumbing features through, and does not alter the actual DataFusion code, downstreams could not use this feature and manually enabled the desiredarrow
features themselves if they so wish.Are these changes tested?
Are there any user-facing changes?