Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dictionary_expresions feature (#4386) #4999

Merged
merged 2 commits into from
Jan 24, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions .github/workflows/rust.yml
Original file line number Diff line number Diff line change
Expand Up @@ -95,11 +95,11 @@ jobs:
- name: Build tests
run: |
export PATH=$PATH:$HOME/d/protoc/bin
cargo test --features avro,jit,scheduler,json --no-run
cargo test --features avro,jit,scheduler,json,dictionary_expressions --no-run
- name: Run tests
run: |
export PATH=$PATH:$HOME/d/protoc/bin
cargo test --features avro,jit,scheduler,json
cargo test --features avro,jit,scheduler,json,dictionary_expressions
- name: Run examples
run: |
export PATH=$PATH:$HOME/d/protoc/bin
Expand Down
4 changes: 3 additions & 1 deletion datafusion/core/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,9 @@ avro = ["apache-avro", "num-traits", "datafusion-common/avro"]
compression = ["xz2", "bzip2", "flate2", "async-compression"]
crypto_expressions = ["datafusion-physical-expr/crypto_expressions"]
default = ["crypto_expressions", "regex_expressions", "unicode_expressions", "compression"]
# Enables support for non-scalar, binary operations on dictionaries
# Note: this results in significant additional codegen
dictionary_expressions = ["datafusion-physical-expr/dictionary_expressions"]
# Used for testing ONLY: causes all values to hash to the same value (test for collisions)
force_hash_collisions = []
# Used to enable JIT code generation
Expand Down Expand Up @@ -102,7 +105,6 @@ xz2 = { version = "0.1", optional = true }


[dev-dependencies]
arrow = { version = "31.0.0", features = ["prettyprint", "dyn_cmp_dict"] }
async-trait = "0.1.53"
criterion = "0.4"
csv = "1.1.6"
Expand Down
2 changes: 1 addition & 1 deletion datafusion/core/tests/path_partition.rs
Original file line number Diff line number Diff line change
Expand Up @@ -204,7 +204,7 @@ async fn csv_filter_with_file_col() -> Result<()> {
);

let result = ctx
.sql("SELECT c1, c2 FROM t WHERE date='2021-10-27' and date!=c1 LIMIT 5")
.sql("SELECT c1, c2 FROM t WHERE date='2021-10-27' and c1!='2021-10-27' LIMIT 5")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't see a compelling reason why this test needed to test comparison of dictionary columns

Copy link
Contributor

@alamb alamb Jan 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the partitioning columns in a ListingTable are dictionary encoded and this test is verifying the encoding. I think we should put the test back (and gate it with a #cfg directive)

.await?
.collect()
.await?;
Expand Down
1 change: 1 addition & 0 deletions datafusion/core/tests/sql/select.rs
Original file line number Diff line number Diff line change
Expand Up @@ -621,6 +621,7 @@ async fn query_nested_get_indexed_field_on_struct() -> Result<()> {
}

#[tokio::test]
#[cfg(feature = "dictionary_expressions")]
async fn query_on_string_dictionary() -> Result<()> {
// Test to ensure DataFusion can operate on dictionary types
// Use StringDictionary (32 bit indexes = keys)
Expand Down
7 changes: 5 additions & 2 deletions datafusion/physical-expr/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ repository = "https://github.com/apache/arrow-datafusion"
readme = "README.md"
authors = ["Apache Arrow <[email protected]>"]
license = "Apache-2.0"
keywords = [ "arrow", "query", "sql" ]
keywords = ["arrow", "query", "sql"]
edition = "2021"
rust-version = "1.62"

Expand All @@ -35,12 +35,15 @@ path = "src/lib.rs"
[features]
crypto_expressions = ["md-5", "sha2", "blake2", "blake3"]
default = ["crypto_expressions", "regex_expressions", "unicode_expressions"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep dictionary support as a default, if possible

Copy link
Contributor Author

@tustvold tustvold Jan 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fairly strongly disagree, it is pretty esoteric. As a data point, none of IOx's integration tests require this, and we use dictionaries a LOT 😄

It is important to highlight this isn't "dictionary support" but non-scalar, binary dictionary kernels which are pretty unusual in practice

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fairly strongly disagree, it is pretty esoteric.

I am not sure how to quantify how esoteric the feature is or how commonly it is used. Clearly IOx uses it. I was just thinking that this PR changes the default behavior

But maybe that is ok.

Perhaps some other committers have thoughts.

Copy link
Contributor Author

@tustvold tustvold Jan 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearly IOx uses it

That is the point I'm trying to make, IOx doesn't use it, at least not within any of its tests. A user theoretically could construct a query that directly compares dictionary columns, in practice there are extremely limited use-cases that come to mind of this.

This feature was only enabled in #4168 prior to that it was disabled

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at least not within any of its tests.

🤔 when I didn't enable the dyn dictionary kernels in arrow iox tests failed in some past version

We have it enabled here: https://github.com/influxdata/influxdb_iox/blob/6f39ae342e64848bd6555bddbc1d3fa30050f75e/arrow_util/Cargo.toml#L12

# Enables support for non-scalar, binary operations on dictionaries
# Note: this results in significant additional codegen
dictionary_expressions = ["arrow/dyn_cmp_dict", "arrow/dyn_arith_dict"]
regex_expressions = ["regex"]
unicode_expressions = ["unicode-segmentation"]

[dependencies]
ahash = { version = "0.8", default-features = false, features = ["runtime-rng"] }
arrow = { version = "31.0.0", features = ["prettyprint", "dyn_cmp_dict"] }
arrow = { version = "31.0.0", features = ["prettyprint"] }
arrow-buffer = "31.0.0"
arrow-schema = "31.0.0"
blake2 = { version = "^0.10.2", optional = true }
Expand Down
1 change: 1 addition & 0 deletions datafusion/physical-expr/src/expressions/binary.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1502,6 +1502,7 @@ mod tests {
// is no way at the time of this writing to create a dictionary
// array using the `From` trait
#[test]
#[cfg(feature = "dictionary_expressions")]
fn test_dictionary_type_to_array_coersion() -> Result<()> {
// Test string a string dictionary
let dict_type =
Expand Down