ARROW-16651 : [Python] Casting Table to new schema ignores nullability of fields #14048

kshitij12345 · 2022-09-05T18:51:19Z

```python
import pyarrow as pa

table = pa.table({'a': [None, 1], 'b': [None, True]})
new_schema = pa.schema([pa.field("a", "int64", nullable=True), pa.field("b", "bool", nullable=False)])
casted = table.cast(new_schema)
```

Now leads to

```
RuntimeError: Casting field 'b' with null values to non-nullable
```

github-actions · 2022-09-05T18:51:39Z

https://issues.apache.org/jira/browse/ARROW-16651

github-actions · 2022-09-05T18:51:40Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

kshitij12345 · 2022-09-05T18:55:59Z

cc: @AlenkaF @jorisvandenbossche

AlenkaF

Thank you for solving this @kshitij12345, looks great to me!

jorisvandenbossche · 2022-09-06T07:17:49Z

We should probably check if a field actually has nulls, instead of only the nullability flag of the field?
Meaning: I think it should still be possible to cast a nullable field to non-nullable, if that field has no nulls in it?
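
To make the distinction above concrete, here is a minimal sketch in plain Python that contrasts a flag-only rule with a data-aware rule. The function names `rejects_by_flag` and `rejects_by_data` are hypothetical and are not part of pyarrow; they only model the two policies being discussed.

```python
def rejects_by_flag(source_nullable: bool, target_nullable: bool) -> bool:
    """Flag-only rule: reject any nullable -> non-nullable cast."""
    return source_nullable and not target_nullable


def rejects_by_data(null_count: int, target_nullable: bool) -> bool:
    """Data-aware rule: reject only when nulls are actually present."""
    return not target_nullable and null_count > 0


# A source field that is flagged nullable but happens to contain no nulls.
values = [1, 2, 3]
null_count = sum(v is None for v in values)

flag_result = rejects_by_flag(True, False)
data_result = rejects_by_data(null_count, False)
print(flag_result, data_result)  # True False
```

Under the flag-only rule this cast is rejected even though no value is missing; the data-aware rule lets it through.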

kshitij12345 · 2022-09-06T07:21:11Z

Wouldn't that be much slower if we have to check whether the field has nulls in its data? Or is there metadata stored for that?

jorisvandenbossche · 2022-09-06T07:37:32Z

Yeah, that can indeed be slower, so that is certainly a trade-off to make.

An array can store an optional null_count, which is cached once known, so getting the null count _can_ be fast (depending on whether it is already available).
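
The caching behaviour described above can be modelled roughly as follows. This is a plain-Python illustration, not Arrow's implementation: the class `CachedNullCountColumn` and its `scans` counter are hypothetical names used only to show that the expensive pass happens at most once.

```python
from typing import Optional, Sequence


class CachedNullCountColumn:
    """Hypothetical stand-in for an array whose null count is computed lazily."""

    def __init__(self, values: Sequence[object]) -> None:
        self._values = list(values)
        self._null_count: Optional[int] = None  # unknown until first requested
        self.scans = 0  # number of full passes over the data

    @property
    def null_count(self) -> int:
        if self._null_count is None:
            # First access: O(n) scan, then remember the result.
            self.scans += 1
            self._null_count = sum(v is None for v in self._values)
        return self._null_count


column = CachedNullCountColumn([None, True, False])
first = column.null_count   # triggers the scan
second = column.null_count  # served from the cached value
print(first, second, column.scans)  # 1 1 1
```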

kshitij12345 · 2022-09-06T16:05:40Z

@jorisvandenbossche Have updated to check the null_count. PTAL :)

jorisvandenbossche

Thanks for the update, looking good! Just a small optimization comment

jorisvandenbossche · 2022-09-08T07:22:31Z

python/pyarrow/table.pxi

@@ -3401,6 +3401,9 @@ cdef class Table(_PandasConvertible):
                             .format(self.schema.names, target_schema.names))

        for column, field in zip(self.itercolumns(), target_schema):
+            if column.null_count > 0 and not field.nullable:


Suggested change

if column.null_count > 0 and not field.nullable:

if not field.nullable and column.null_count > 0:

Switching the order will avoid checking the null_count (potentially expensive) if the field is nullable (which will be the most common case, since this is the default)

kshitij12345 · 2022-09-08

Makes sense! Thanks!
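
The effect of the reordered condition can be sketched without pyarrow. In the model below, `Field`, `Column`, and `validate` are hypothetical stand-ins (not the classes in `table.pxi`), and `null_count` deliberately rescans on every access so the number of calls is observable. The error type follows the `ValueError` suggested later in this review.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical stand-in for a schema field."""
    name: str
    nullable: bool = True


@dataclass
class Column:
    """Hypothetical column that counts how often null_count is evaluated."""
    values: List[Optional[int]]
    null_count_calls: int = 0

    @property
    def null_count(self) -> int:
        self.null_count_calls += 1
        return sum(v is None for v in self.values)


def validate(columns: List[Column], fields: List[Field]) -> None:
    for column, fld in zip(columns, fields):
        # Cheap flag first: `and` short-circuits, so null_count is only
        # evaluated for non-nullable target fields.
        if not fld.nullable and column.null_count > 0:
            raise ValueError(
                "Casting field {!r} with null values to non-nullable"
                .format(fld.name))


col_a = Column([None, 1])
col_b = Column([1, 2])
validate([col_a, col_b], [Field("a", nullable=True), Field("b", nullable=False)])
print(col_a.null_count_calls, col_b.null_count_calls)  # 0 1
```

The nullable column is never scanned, while the non-nullable target triggers exactly one count.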

jorisvandenbossche · 2022-09-08T07:22:57Z

python/pyarrow/table.pxi

@@ -3401,6 +3401,9 @@ cdef class Table(_PandasConvertible):
                             .format(self.schema.names, target_schema.names))

        for column, field in zip(self.itercolumns(), target_schema):
+            if column.null_count > 0 and not field.nullable:
+                raise RuntimeError("Casting field {!r} with null values to non-nullable"


I think this can be a ValueError

kshitij12345 · 2022-09-08T07:57:39Z

@jorisvandenbossche Have addressed the review. Thanks! PTAL :)

ursabot · 2022-09-08T14:52:19Z

Benchmark runs are scheduled for baseline = 43670af and contender = df121b7. df121b7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.37% ⬆️0.1%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] df121b7f ec2-t3-xlarge-us-east-2
[Failed] df121b7f test-mac-arm
[Failed] df121b7f ursa-i9-9960x
[Finished] df121b7f ursa-thinkcentre-m75q
[Finished] 43670af0 ec2-t3-xlarge-us-east-2
[Failed] 43670af0 test-mac-arm
[Failed] 43670af0 ursa-i9-9960x
[Finished] 43670af0 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

…y of fields (apache#14048)

```python
table = pa.table({'a': [None, 1], 'b': [None, True]})
new_schema = pa.schema([pa.field("a", "int64", nullable=True), pa.field("b", "bool", nullable=False)])
casted = table.cast(new_schema)
```

Now leads to

```
RuntimeError: Casting field 'b' with null values to non-nullable
```

Authored-by: kshitij12345 <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>

stricter casting for table with new schema

38759eb

github-actions bot added the Component: Python label Sep 5, 2022

add test and run linter

a641a83

AlenkaF approved these changes Sep 6, 2022

check for null_count on the chunked array

f8f8635

kshitij12345 marked this pull request as ready for review September 6, 2022 16:27

jorisvandenbossche reviewed Sep 8, 2022

address review

4049b9d

make linter happy

bd9a60f

jorisvandenbossche approved these changes Sep 8, 2022

jorisvandenbossche merged commit df121b7 into apache:master Sep 8, 2022

kshitij12345 deleted the dev/strict/table-casting branch September 8, 2022 12:45

asfimport mentioned this pull request Sep 8, 2022

[Python] Casting Table to new schema ignores nullability of fields #32000

Closed
