-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41667: [C++][Parquet] Refuse writing non-nullable column that contains nulls #44921
GH-41667: [C++][Parquet] Refuse writing non-nullable column that contains nulls #44921
Conversation
…t contains nulls A non-nullable column that contains nulls would result in an invalid Parquet file.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix looks good. I have some minor comments.
@@ -1301,6 +1301,10 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter< | |||
bool single_nullable_element = | |||
(level_info_.def_level == level_info_.repeated_ancestor_def_level + 1) && | |||
leaf_field_nullable; | |||
if (!leaf_field_nullable && leaf_array.null_count() != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know should we check single_nullable_element
rather than leaf_field_nullable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't even know what single_nullable_element
means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As what parquet handle nulls, leaf_field_nullable
might including nulls in parents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As what parquet handle nulls,
leaf_field_nullable
might including nulls in parents?
Not according to my reading of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E.g. List< int notnull> nullable
, we may have the single_nullable_element == false
but leaf_field_nullable == true
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leaf_field_nullable
is computed from PathBuilder.nullable_in_parent_
, which itself is initialized in e.g.:
arrow/cpp/src/parquet/arrow/path_internal.cc
Lines 792 to 801 in fb8e812
Status Visit(const ::arrow::StructArray& array) { | |
MaybeAddNullable(array); | |
PathInfo info_backup = info_; | |
for (int x = 0; x < array.num_fields(); x++) { | |
nullable_in_parent_ = array.type()->field(x)->nullable(); | |
RETURN_NOT_OK(VisitInline(*array.field(x))); | |
info_ = info_backup; | |
} | |
return Status::OK(); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And for lists:
arrow/cpp/src/parquet/arrow/path_internal.cc
Lines 725 to 741 in fb8e812
template <typename T> | |
::arrow::enable_if_t<std::is_same<::arrow::ListArray, T>::value || | |
std::is_same<::arrow::LargeListArray, T>::value, | |
Status> | |
Visit(const T& array) { | |
MaybeAddNullable(array); | |
// Increment necessary due to empty lists. | |
info_.max_def_level++; | |
info_.max_rep_level++; | |
// raw_value_offsets() accounts for any slice offset. | |
ListPathNode<VarRangeSelector<typename T::offset_type>> node( | |
VarRangeSelector<typename T::offset_type>{array.raw_value_offsets()}, | |
info_.max_rep_level, info_.max_def_level - 1); | |
info_.path.emplace_back(std::move(node)); | |
nullable_in_parent_ = array.list_type()->value_field()->nullable(); | |
return VisitInline(*array.values()); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, there should be leaf_field_nullable
. single_nullable_element
is for preparing validity child buffer for List< int notnull> nullable
. I got it wrong here
Java JNI failure looks related. |
cc79516
to
5b66a5c
Compare
5b66a5c
to
bb71157
Compare
The Java code has been moved, so we could ignore the JNI failure here and open an issue in the other repo (I'm still getting all the CI set up, though) |
Well, I think I got the failure fixed anyway (at least locally it works, let's wait for CI). |
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit ded148c. There were 132 benchmark results with an error:
There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 11 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
A non-nullable column that contains nulls would result in an invalid Parquet file, so we'd rather raise an error when writing.
This detection is only implemented for leaf columns. Implementing it for non-leaf columns would be more involved, and also doesn't actually seem necessary.
Are these changes tested?
Yes.
Are there any user-facing changes?
Raising a clear error when trying to write invalid data to Parquet, instead of letting the Parquet writer silently generate an invalid file.