GH-41667: [C++][Parquet] Refuse writing non-nullable column that contains nulls #44921

Merged
merged 2 commits into from
Dec 4, 2024

Conversation

pitrou
Member

@pitrou pitrou commented Dec 3, 2024

Rationale for this change

A non-nullable column that contains nulls would result in an invalid Parquet file, so we'd rather raise an error when writing.

This detection is only implemented for leaf columns. Implementing it for non-leaf columns would be more involved, and also doesn't actually seem necessary.
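
The leaf-level check described above can be illustrated without Arrow. This is a minimal sketch only: `LeafColumn` and `CheckLeafNulls` are hypothetical names invented for illustration and exist in neither Arrow nor Parquet. The actual change is in cpp/src/parquet/column_writer.cc.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for a leaf column handed to a writer (not an Arrow type).
struct LeafColumn {
  std::string name;
  bool field_nullable;  // nullability declared by the schema field
  int64_t null_count;   // number of null slots actually present in the data
};

// Returns an error message if the column should be refused, or std::nullopt
// if it is safe to write. The condition has the same shape as the one in the
// PR: the field is non-nullable and the null count is non-zero.
std::optional<std::string> CheckLeafNulls(const LeafColumn& col) {
  assert(col.null_count >= 0);
  if (!col.field_nullable && col.null_count != 0) {
    return "Column '" + col.name + "' is declared non-nullable but contains nulls";
  }
  return std::nullopt;
}
```

A caller would run such a check before emitting any pages for the column, so that the error surfaces at write time rather than as an unreadable file later.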

Are these changes tested?

Yes.

Are there any user-facing changes?

Raising a clear error when trying to write invalid data to Parquet, instead of letting the Parquet writer silently generate an invalid file.

…t contains nulls

A non-nullable column that contains nulls would result in an invalid Parquet file.

github-actions bot commented Dec 3, 2024

⚠️ GitHub issue #41667 has been automatically assigned in GitHub to PR creator.

@pitrou pitrou marked this pull request as ready for review December 3, 2024 18:18
@pitrou pitrou requested a review from wgtmac as a code owner December 3, 2024 18:18
Member

@wgtmac wgtmac left a comment

The fix looks good. I have some minor comments.

cpp/src/parquet/column_writer.cc
@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label Dec 4, 2024
@@ -1301,6 +1301,10 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
      bool single_nullable_element =
          (level_info_.def_level == level_info_.repeated_ancestor_def_level + 1) &&
          leaf_field_nullable;
      if (!leaf_field_nullable && leaf_array.null_count() != 0) {
Member

I'm not sure whether we should check single_nullable_element rather than leaf_field_nullable here.

Member Author

I don't even know what single_nullable_element means.

Member

Given how Parquet handles nulls, might leaf_field_nullable also account for nulls in parent fields?

Member Author

Given how Parquet handles nulls, might leaf_field_nullable also account for nulls in parent fields?

Not according to my reading of the code.

Member

E.g. for a nullable List<int not null>, could we have single_nullable_element == false but leaf_field_nullable == true?

Member Author

leaf_field_nullable is computed from PathBuilder.nullable_in_parent_, which itself is initialized in e.g.:

  Status Visit(const ::arrow::StructArray& array) {
    MaybeAddNullable(array);
    PathInfo info_backup = info_;
    for (int x = 0; x < array.num_fields(); x++) {
      nullable_in_parent_ = array.type()->field(x)->nullable();
      RETURN_NOT_OK(VisitInline(*array.field(x)));
      info_ = info_backup;
    }
    return Status::OK();
  }
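
The point of the snippet above is that the nullability flag is reset from each child's own field before descending, so a leaf's flag reflects only its immediate field declaration. A rough, Arrow-free sketch of that idea follows; `ToyField` and `CollectLeafNullability` are hypothetical names made up for this illustration and are not part of the path builder.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy schema node (not an Arrow type). Leaves have no children.
struct ToyField {
  std::string name;
  bool nullable;
  std::vector<ToyField> children;
};

// Appends one (leaf name, leaf nullable) pair per leaf. The nullability
// recorded for a leaf is the leaf field's own flag; nullable ancestors do not
// change it, which is the behaviour the Visit(StructArray) snippet suggests.
void CollectLeafNullability(const ToyField& field,
                            std::vector<std::pair<std::string, bool>>* out) {
  assert(out != nullptr);
  if (field.children.empty()) {
    out->emplace_back(field.name, field.nullable);
    return;
  }
  for (const ToyField& child : field.children) {
    CollectLeafNullability(child, out);
  }
}
```

With a nullable struct holding a non-nullable child `x` and a nullable child `y`, this reports `x` as non-nullable and `y` as nullable, regardless of the struct's own flag.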

Member Author

And for lists:

  template <typename T>
  ::arrow::enable_if_t<std::is_same<::arrow::ListArray, T>::value ||
                           std::is_same<::arrow::LargeListArray, T>::value,
                       Status>
  Visit(const T& array) {
    MaybeAddNullable(array);
    // Increment necessary due to empty lists.
    info_.max_def_level++;
    info_.max_rep_level++;
    // raw_value_offsets() accounts for any slice offset.
    ListPathNode<VarRangeSelector<typename T::offset_type>> node(
        VarRangeSelector<typename T::offset_type>{array.raw_value_offsets()},
        info_.max_rep_level, info_.max_def_level - 1);
    info_.path.emplace_back(std::move(node));
    nullable_in_parent_ = array.list_type()->value_field()->nullable();
    return VisitInline(*array.values());
  }
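
To make the level bookkeeping in the list visitor concrete, here is a small sketch that computes maximum definition and repetition levels for a path from root to leaf. It assumes that a nullable field adds one definition level (the body of MaybeAddNullable is not shown in this thread, so that part is an assumption) and that a list adds one definition level and one repetition level, as in the snippet above. `Kind`, `Step`, `Levels`, and `ComputeLevels` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

enum class Kind { kPrimitive, kStruct, kList };

// One node on the path from the root column down to a leaf (toy model).
struct Step {
  Kind kind;
  bool nullable;
};

struct Levels {
  int max_def_level = 0;
  int max_rep_level = 0;
};

// Walks the path and accumulates levels:
//  - assumed: each nullable node contributes one definition level;
//  - each list contributes one definition level (to distinguish empty lists)
//    and one repetition level.
Levels ComputeLevels(const std::vector<Step>& path) {
  assert(!path.empty() && path.back().kind == Kind::kPrimitive);
  Levels levels;
  for (const Step& step : path) {
    if (step.nullable) levels.max_def_level++;
    if (step.kind == Kind::kList) {
      levels.max_def_level++;
      levels.max_rep_level++;
    }
  }
  return levels;
}
```

Under these assumptions, a nullable list of non-nullable ints ends up with a maximum definition level of 2 and a maximum repetition level of 1, while making the element nullable raises the definition level to 3.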

Member

You're right, it should be leaf_field_nullable. single_nullable_element is used for preparing the child validity buffer in cases like a nullable List<int not null>. I got it wrong here.
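
For readers following the thread, the difference between the two flags can be sketched by isolating the predicate from the diff. `ToyLevelInfo` and `SingleNullableElement` are hypothetical names, and the level values mentioned afterwards are assumptions about how levels are assigned, not values taken from this PR.

```cpp
#include <cassert>

// Hypothetical mirror of the two level fields used by the predicate in the diff.
struct ToyLevelInfo {
  int def_level;
  int repeated_ancestor_def_level;
};

// Same expression as in the diff: true only when the leaf is nullable and sits
// exactly one definition level above its closest repeated ancestor (or above
// the root, when there is no repeated ancestor).
bool SingleNullableElement(const ToyLevelInfo& info, bool leaf_field_nullable) {
  return (info.def_level == info.repeated_ancestor_def_level + 1) &&
         leaf_field_nullable;
}
```

Assuming a nullable list of non-nullable ints has def_level 2 and repeated_ancestor_def_level 2, both flags are false for it. A nullable int nested in a nullable struct (assumed def_level 2, no repeated ancestor) would have leaf_field_nullable true but single_nullable_element false, which is why the null check keys on leaf_field_nullable.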

Member Author

pitrou commented Dec 4, 2024

Java JNI failure looks related.

@pitrou pitrou requested a review from lidavidm as a code owner December 4, 2024 09:16
@pitrou pitrou force-pushed the gh41667-parquet-write-non-nullable-nulls branch from cc79516 to 5b66a5c Compare December 4, 2024 10:16
@pitrou pitrou force-pushed the gh41667-parquet-write-non-nullable-nulls branch from 5b66a5c to bb71157 Compare December 4, 2024 10:16
Member

lidavidm commented Dec 4, 2024

The Java code has been moved, so we could ignore the JNI failure here and open an issue in the other repo (I'm still getting all the CI set up, though).

Member Author

pitrou commented Dec 4, 2024

Well, I think I got the failure fixed anyway (at least locally it works, let's wait for CI).

Member Author

pitrou commented Dec 4, 2024

@lidavidm @jduo Are the Java changes ok?

@github-actions github-actions bot added the awaiting merge label and removed the awaiting committer review label Dec 4, 2024
@pitrou pitrou merged commit ded148c into apache:main Dec 4, 2024
45 of 47 checks passed
@pitrou pitrou removed the awaiting merge label Dec 4, 2024
@pitrou pitrou deleted the gh41667-parquet-write-non-nullable-nulls branch December 4, 2024 12:28
@github-actions github-actions bot added the awaiting committer review label Dec 4, 2024

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit ded148c.

There were 132 benchmark results with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 11 possible false positives for unstable benchmarks that are known to sometimes produce them.
