Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: get_prune_stats returns homogenous ArrayRef #1413

Merged
merged 4 commits into from
Jun 2, 2023

Conversation

cmackenzie1
Copy link
Contributor

Description

Switch the get_prune_stats functions to use None to represent null
instead of ScalarValue::Null as ArrayRef must be of all the same
type.

Related Issue(s)

Documentation

https://github.com/apache/arrow-datafusion/blob/dd5e1dbbfd20539b40ae65acb8883f7e392cba92/datafusion/core/src/physical_optimizer/pruning.rs#L54-L72

Test table `issue_1374` was created by hand to have 2 data files where
only one file has the `min_values` for the statistics in the
`checkpoint.parquet` file set to null in order to trigger the bug.
There is no other significance to the table other than to demonstrate
issue delta-io#1374.

```
internal error: entered unreachable code
thread 'test_issue_1374' panicked at 'internal error: entered unreachable code', /Users/cole/.cargo/registry/src/index.crates.io-6f17d22bba15001f/datafusion-common-24.0.0/src/scalar.rs:2472:26
```
@cmackenzie1
Copy link
Contributor Author

Looks like this change causes this test to begin failing and I am unsure of what the correct behavior should be here.

---- operations::delete::tests::test_delete_on_mixed_columns stdout ----
thread 'operations::delete::tests::test_delete_on_mixed_columns' panicked at 'Could not determine null type: Generic("Unsupported data type for Delta Lake Dictionary(UInt16, Utf8)")', rust/src/operations/transaction/state.rs:205:55
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    operations::delete::tests::test_delete_on_mixed_columns

test result: FAILED. 94 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.42s

rtyler
rtyler previously approved these changes May 30, 2023
@cmackenzie1
Copy link
Contributor Author

cmackenzie1 commented May 31, 2023

Thoughts on this diff? My concern is there is no explicit restriction on the types, not sure if that matters though

diff --git a/rust/src/delta_datafusion.rs b/rust/src/delta_datafusion.rs
index 43aed52..6ce3f1e 100644
--- a/rust/src/delta_datafusion.rs
+++ b/rust/src/delta_datafusion.rs
@@ -590,11 +590,14 @@ pub(crate) fn get_null_of_arrow_type(t: &ArrowDataType) -> DeltaResult<ScalarVal
                 TimeUnit::Nanosecond => ScalarValue::TimestampNanosecond(None, tz),
             })
         }
+        ArrowDataType::Dictionary(k, v) => Ok(ScalarValue::Dictionary(
+            k.clone(),
+            Box::new(get_null_of_arrow_type(v).unwrap()),
+        )),
         //Unsupported types...
         ArrowDataType::Float16
         | ArrowDataType::Decimal256(_, _)
         | ArrowDataType::Union(_, _)
-        | ArrowDataType::Dictionary(_, _)
         | ArrowDataType::LargeList(_)
         | ArrowDataType::Struct(_)
         | ArrowDataType::List(_)

Switch the `get_prune_stats` functions to use `None` to represent `null`
instead of `ScalarValue::Null` as `ArrayRef` must be of all the same
type.

https://github.com/apache/arrow-datafusion/blob/dd5e1dbbfd20539b40ae65acb8883f7e392cba92/datafusion/core/src/physical_optimizer/pruning.rs#L54-L72
Copy link
Member

@rtyler rtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective returning the appropriate nullsy type for the column (which is how I understand this) makes sense. It does feel like some Arrow type system wackiness that there's a Null type and also a nullsy boolean, etc 😛

@rtyler rtyler enabled auto-merge (rebase) June 1, 2023 22:13
@roeap roeap disabled auto-merge June 1, 2023 22:28
@roeap roeap enabled auto-merge (squash) June 1, 2023 22:28
@roeap roeap merged commit f67ac09 into delta-io:main Jun 2, 2023
@cmackenzie1 cmackenzie1 deleted the cole/issue-1374 branch June 2, 2023 04:49
roeap pushed a commit to roeap/delta-rs that referenced this pull request Jun 2, 2023
# Description
Switch the `get_prune_stats` functions to use `None` to represent `null`
instead of `ScalarValue::Null` as `ArrayRef` must be of all the same
type.

# Related Issue(s)

- closes delta-io#1374 

# Documentation


https://github.com/apache/arrow-datafusion/blob/dd5e1dbbfd20539b40ae65acb8883f7e392cba92/datafusion/core/src/physical_optimizer/pruning.rs#L54-L72

---------

Co-authored-by: R. Tyler Croy <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Datafusion: unreachable code reached when parsing statistics with missing columns
3 participants