[WIP] Orc writes don't fully support Booleans with nulls #11763

Open
wants to merge 3 commits into
base: branch-24.12
from

kuhushukla
Collaborator

Fixes #11736 and exposes #11762, which is why I am marking this as WIP while I look for a workaround that does not affect many tests in orc_write_test.py.

@kuhushukla kuhushukla self-assigned this Nov 25, 2024
@kuhushukla kuhushukla marked this pull request as draft November 25, 2024 20:03
@kuhushukla kuhushukla changed the title Orc writes don't fully support Booleans with nulls [WIP] Orc writes don't fully support Booleans with nulls Nov 25, 2024
@kuhushukla kuhushukla marked this pull request as ready for review November 25, 2024 20:41
@kuhushukla
Collaborator Author

build

@@ -26,10 +26,17 @@
pytestmark = pytest.mark.nightly_resource_consuming_test

orc_write_basic_gens = [byte_gen, short_gen, int_gen, long_gen, float_gen, double_gen,
-string_gen, boolean_gen, DateGen(start=date(1590, 1, 1)),
+string_gen, BooleanGen(nullable=False), DateGen(start=date(1590, 1, 1)),
Collaborator

This removes the test for nullable boolean values. Can we add one or more explicit tests that have a non-nullable struct containing nullable values, or many different types? I am fine if this is a follow-on issue.
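
To picture the kind of data such a test would need, here is a minimal sketch in plain Python. It does not use the repository's data generators. `gen_struct_rows` and its parameters are hypothetical names, and the sketch only shows the shape of the data: a struct that is never null, whose boolean and integer fields may each be null.

```python
import random


def gen_struct_rows(num_rows, null_ratio=0.2, seed=0):
    """Build rows for one struct column that is never null.

    Each struct is modeled as a dict whose fields may individually be None.
    Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)

    def maybe_null(value):
        # Replace the value with None for roughly null_ratio of the calls.
        return None if rng.random() < null_ratio else value

    rows = []
    for _ in range(num_rows):
        # The struct itself always exists; only its children can be null.
        rows.append({
            "flag": maybe_null(rng.choice([True, False])),
            "count": maybe_null(rng.randint(-100, 100)),
        })
    return rows


rows = gen_struct_rows(1000)
```

A real test would feed equivalent data through the ORC write path on both CPU and GPU and compare the results; that part depends on the project's test harness and is not shown here.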

@@ -1243,6 +1243,14 @@ val GPU_COREDUMP_PIPE_PATTERN = conf("spark.rapids.gpu.coreDump.pipePattern")
.booleanConf
.createWithDefault(true)

+val ENABLE_ORC_NULLABLE_BOOL = conf("spark.rapids.sql.format.orc.write.boolType.enabled")
Collaborator

Can we just fall back for all booleans instead of only nullable ones? Spark already marks almost everything as nullable, so there is very little value in trying to distinguish between the two. But then I see things like #11762, where it scares me that CUDF might end up writing out something that it thinks is valid but in practice is not.

Collaborator Author

ack. yes.

@revans2
Collaborator

revans2 commented Nov 26, 2024

When I updated my tests for #11781 to write out 128000 rows, I got crashes for boolean columns under ORC with the same error message that this PR is trying to work around. So even for non-nullable boolean columns under a struct that is nullable, we are going to have to fall back to the CPU. I think in general we just want to fall back to the CPU for all boolean columns on ORC writes.
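
Falling back whenever a boolean appears anywhere in the write schema amounts to a recursive walk over nested types. The following is a rough Python model of that check, not the plugin's actual code (which is Scala). The tuple-based schema encoding and the names `contains_boolean` and `should_fall_back_to_cpu` are assumptions made for illustration.

```python
def contains_boolean(dtype):
    """Return True if a boolean appears anywhere in a (possibly nested) type.

    Types are modeled as: a string for primitives, ("struct", {name: type}),
    ("array", element_type), or ("map", key_type, value_type).
    """
    if isinstance(dtype, str):
        return dtype == "boolean"
    kind = dtype[0]
    if kind == "struct":
        return any(contains_boolean(t) for t in dtype[1].values())
    if kind in ("array", "map"):
        # Everything after the kind tag is a child type.
        return any(contains_boolean(t) for t in dtype[1:])
    raise ValueError(f"unknown type: {dtype!r}")


def should_fall_back_to_cpu(schema):
    # Nullability is deliberately ignored: per the discussion, even
    # non-nullable booleans under a nullable struct hit the problem.
    return contains_boolean(schema)


flat = ("struct", {"a": "int", "b": "string"})
nested = ("struct", {"a": "int", "s": ("struct", {"flag": "boolean"})})
```

In this model `should_fall_back_to_cpu(flat)` is false and `should_fall_back_to_cpu(nested)` is true, because the boolean is found one level down inside the inner struct.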

@sameerz sameerz added the bug Something isn't working label Nov 27, 2024
@kuhushukla
Collaborator Author

Thank you for the above finding, @revans2. I will update my patch, and I see I have a few more tests to fix for the fallback as well. I expect the test changes to be larger than the actual change here.

Development

Successfully merging this pull request may close these issues.

[BUG] Orc writes don't fully support Booleans with nulls
3 participants