-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Drop force_nullable_schema
from chunked parquet writer
#12996
[REVIEW] Drop force_nullable_schema
from chunked parquet writer
#12996
Conversation
python/cudf/cudf/_lib/parquet.pyx
Outdated
@@ -698,7 +698,14 @@ cdef _set_col_metadata( | |||
column_in_metadata& col_meta, | |||
bool force_nullable_schema | |||
): | |||
col_meta.set_nullability(force_nullable_schema or col.nullable) | |||
if force_nullable_schema or not col.nullable: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not sure if I agree with the comment below. Here is what I expected:
- Col is nullable and forced nullable. This will be written as nullable by default so no action is needed.
- Col is nullable and not forced nullable. This will be written as nullable by default so no action is needed.
- Col is not nullable and forced nullable. The call to tell C++ to write nulls is needed.
- Col is not nullable and not forced nullable. No action is needed.
So can we write this?
if force_nullable_schema and not col.nullable:
col_meta.set_nullability(True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All are right, except this:
- Col is not nullable and not forced nullable. No action is needed.
We need action for this incase of chunked parquet writer. Because if _nullability
isn't defined, the default libcudf behavior is to return null
schema. So when someone asks for force_nullable_schema=False
& column isn't having any nulls the current logic will yield not null
as requested by the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. The conditional can still be simplified if only cases 3 and 4 (with non-nullable columns) need action:
if force_nullable_schema or not col.nullable: | |
if not col.nullable: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logically I feel this is right, but seems like there is something else going on too..investigating..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After doing some investigation @vuule found a bug on libcudf parquet writer side for a sliced struct column. We decided to scope down the force_nullable_schema
to only single writer instead. Thus, dropping it's support in chunked parquet writer will not let the bug surface. We don't want to introduce wide-ranging changes at parquet writer side to address the actual problem. So this PR should now be ready for review.
force_nullable_schema
interaction with libcudf
APIforce_nullable_schema
from chunked parquet writer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that was short-lived. The changes seem okay. Thanks for describing the problem in further detail here: #12996
/merge |
Hey, we still got the relevant use case supported :) |
Description
force_nullable_schema
was introduced in #12952, however strangely only after it has been merged tobranch-23.04
we are seeing the following pytest failure occur locally:This PR fixes the issue by dropping
force_nullable_schema
from chunked parquet writer.Checklist