[WIP] Orc writes don't fully support Booleans with nulls #11763
base: branch-24.12
Signed-off-by: Kuhu Shukla <[email protected]>
build
@@ -26,10 +26,17 @@
 pytestmark = pytest.mark.nightly_resource_consuming_test

 orc_write_basic_gens = [byte_gen, short_gen, int_gen, long_gen, float_gen, double_gen,
-string_gen, boolean_gen, DateGen(start=date(1590, 1, 1)),
+string_gen, BooleanGen(nullable=False), DateGen(start=date(1590, 1, 1)),
This removes the test for nullable boolean values. Can we have explicit tests that cover a non-nullable struct with nullable values, or many different types, in it? I am fine if this is a follow-on issue.
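To make the request above concrete, here is a rough, standalone sketch of the data shape being asked for: a struct that is never null itself but whose children (including a boolean) can be. It deliberately uses only the Python standard library rather than the project's generators; `gen_struct_rows`, the field names `flag` and `count`, and the 20% null rate are all hypothetical choices for illustration.

```python
import random


def gen_struct_rows(num_rows, child_null_rate=0.2, seed=0):
    """Hypothetical helper: build rows for a non-nullable struct
    with a nullable boolean child and a nullable int child."""
    rng = random.Random(seed)  # seeded so the data is reproducible
    rows = []
    for _ in range(num_rows):
        # Each child is independently null with probability child_null_rate.
        flag = None if rng.random() < child_null_rate else rng.random() < 0.5
        count = None if rng.random() < child_null_rate else rng.randint(-100, 100)
        # The struct itself is always present, only its children may be None.
        rows.append({"flag": flag, "count": count})
    return rows


rows = gen_struct_rows(1000)
```

A real test would presumably express the same shape with the integration-test generators and then compare the CPU and GPU write results.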
@@ -1243,6 +1243,14 @@ val GPU_COREDUMP_PIPE_PATTERN = conf("spark.rapids.gpu.coreDump.pipePattern")
 .booleanConf
 .createWithDefault(true)

+val ENABLE_ORC_NULLABLE_BOOL = conf("spark.rapids.sql.format.orc.write.boolType.enabled")
Can we just fall back for all booleans instead of only nullable ones? Spark already marks almost everything as nullable, so there is very little value in trying to distinguish between the two. But then I see things like #11762 where it scares me that CUDF might end up writing something out that they think is valid, but in practice is not.
ack. yes.
When I updated my tests for #11781 to write out 128000 rows, I got crashes for boolean columns under ORC with the same error message that this is trying to work around. So even for non-nullable boolean columns under a struct that is nullable, we are going to have to fall back to the CPU. I think in general we just want to fall back to the CPU for all boolean columns on ORC writes.
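As a rough illustration of the "fall back for every boolean" rule described above, the check has to look inside nested types, not only at top-level columns. The sketch below is not the plugin's code: the schema encoding (plain strings for primitives, tuples for `struct`, `array`, and `map`) and the names `contains_boolean` and `should_fall_back_to_cpu` are assumptions made up for this example.

```python
def contains_boolean(dtype):
    """Return True if a (hypothetically encoded) type is, or contains, a boolean."""
    if isinstance(dtype, str):
        # Primitive types are plain strings in this toy encoding.
        return dtype == "boolean"
    kind = dtype[0]
    if kind == "struct":
        # ("struct", {field_name: field_type, ...})
        return any(contains_boolean(t) for t in dtype[1].values())
    if kind == "array":
        # ("array", element_type)
        return contains_boolean(dtype[1])
    if kind == "map":
        # ("map", key_type, value_type)
        return contains_boolean(dtype[1]) or contains_boolean(dtype[2])
    raise ValueError(f"unknown type kind: {kind!r}")


def should_fall_back_to_cpu(schema):
    """Fall back if any column holds a boolean at any depth, regardless of nullability."""
    return any(contains_boolean(t) for t in schema.values())


nested = {"id": "int", "payload": ("struct", {"ok": "boolean", "n": "long"})}
flat = {"id": "int", "tags": ("array", "string")}
```

Nullability is intentionally ignored here, which matches the point that distinguishing nullable from non-nullable booleans buys very little.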
Thank you for the above finding, @revans2. I will update my patch, and I see I have a few more tests to fix for the fallback as well. I expect the test changes to be bigger than the actual change here.
Fixes #11736 and exposes #11762, which is why I am marking this WIP while I see how to work around it without impacting many tests in orc_write_test.py.