ARROW-16651 : [Python] Casting Table to new schema ignores nullability of fields #14048
Thank you for solving this @kshitij12345, looks great to me!
We should probably check if a field actually has nulls, instead of only the nullability flag of the field?
Wouldn't that be much slower if we have to check whether the field has nulls in its data, or is there metadata stored for that?
Yeah, that can indeed be slower, so that is certainly a trade-off to make. An array can store an optional …
@jorisvandenbossche Have updated to check the …
Thanks for the update, looking good! Just a small optimization comment
python/pyarrow/table.pxi
Outdated
@@ -3401,6 +3401,9 @@ cdef class Table(_PandasConvertible):
                             .format(self.schema.names, target_schema.names))

        for column, field in zip(self.itercolumns(), target_schema):
            if column.null_count > 0 and not field.nullable:
-            if column.null_count > 0 and not field.nullable:
+            if not field.nullable and column.null_count > 0:
Switching the order avoids checking null_count (potentially expensive) when the field is nullable, which will be the most common case since that is the default.
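To illustrate why the operand order matters, here is a minimal pure-Python sketch. `CountingColumn` and `has_invalid_nulls` are hypothetical stand-ins made up for this example, not pyarrow classes; the property simply counts how often the null count is requested so the short-circuit behaviour of `and` becomes visible.

```python
class CountingColumn:
    """Hypothetical stand-in for a column whose null count is costly to compute."""

    def __init__(self, values):
        self.values = values
        self.null_count_calls = 0

    @property
    def null_count(self):
        # Simulates a full scan and records that it was triggered.
        self.null_count_calls += 1
        return sum(v is None for v in self.values)


def has_invalid_nulls(column, nullable):
    # `and` short-circuits: null_count is only read for non-nullable fields.
    return not nullable and column.null_count > 0


col = CountingColumn([None, 1, 2])

# Nullable field: the cheap flag check decides, null_count is never touched.
result_nullable = has_invalid_nulls(col, nullable=True)
calls_after_nullable = col.null_count_calls

# Non-nullable field: now the null count has to be computed once.
result_non_nullable = has_invalid_nulls(col, nullable=False)
calls_after_non_nullable = col.null_count_calls
```

With the original order (`column.null_count > 0 and not field.nullable`) the scan would run for every column, including the nullable ones.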
Makes sense! Thanks!
python/pyarrow/table.pxi
Outdated
@@ -3401,6 +3401,9 @@ cdef class Table(_PandasConvertible):
                             .format(self.schema.names, target_schema.names))

        for column, field in zip(self.itercolumns(), target_schema):
            if column.null_count > 0 and not field.nullable:
                raise RuntimeError("Casting field {!r} with null values to non-nullable"
I think this can be a ValueError
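For reference, a rough sketch of how the check could look with `ValueError`, written in plain Python rather than against the real `Table` class. `Field`, `Column` and `check_nullability` are hypothetical names for illustration only; the error message is the one from the diff above, and the sample data mirrors the example in the PR description.

```python
from collections import namedtuple

# Hypothetical stand-ins for the field and column objects used in the diff.
Field = namedtuple("Field", ["name", "nullable"])
Column = namedtuple("Column", ["values"])


def null_count(column):
    # Count missing values by scanning the data.
    return sum(v is None for v in column.values)


def check_nullability(columns, fields):
    # Reject nulls in any column whose target field is declared non-nullable.
    for column, field in zip(columns, fields):
        if not field.nullable and null_count(column) > 0:
            raise ValueError(
                "Casting field {!r} with null values to non-nullable"
                .format(field.name))


columns = [Column([None, 1]), Column([None, True])]
fields = [Field("a", nullable=True), Field("b", nullable=False)]

try:
    check_nullability(columns, fields)
    message = None
except ValueError as exc:
    message = str(exc)
```

`ValueError` fits here in the sense that the input data is the problem (it contains nulls the target schema does not allow), not the runtime state.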
@jorisvandenbossche Have addressed the review. Thanks! PTAL :)
Benchmark runs are scheduled for baseline = 43670af and contender = df121b7. df121b7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…y of fields (apache#14048)

```python
table = pa.table({'a': [None, 1], 'b': [None, True]})
new_schema = pa.schema([pa.field("a", "int64", nullable=True), pa.field("b", "bool", nullable=False)])
casted = table.cast(new_schema)
```

Now leads to

```
RuntimeError: Casting field 'b' with null values to non-nullable
```

Authored-by: kshitij12345 <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>