-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix arrays_zip to not rely on broken segmented gather #7484
Conversation
Signed-off-by: Robert (Bobby) Evans <[email protected]>
build |
I filed #7485 as follow on work for this. |
I started to read the code in this PR and wonder if this is a bug or something we need to change in |
@ttnghia you have it backwards. The old version would not return a zero length segment for NULL. Your patch to make_list rapidsai/cudf#12370 fixed it and now it does the right thing and produce a zero length segment. I don't know if it is guaranteed to always do that or if rapidsai/cudf#12370 is fixing it up afterwards. Either way this patch is dealing with that new reality. |
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/collectionOperations.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/collectionOperations.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/collectionOperations.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/collectionOperations.scala
Outdated
Show resolved
Hide resolved
build |
This fixes #7469
Sorry that the patch is so large. I had to fundamentally change how some parts of this work and then did some refactoring to clean it up.
Before the code relied on segmented gather returning a child value for every gather entry even if the input column was null. then they would hack a null back on the output, possibly creating an invalid output where a row was marked as NULL but had a non-null offset.
This changed it so that we clean up the inputs so that the validity is the same everywhere. This should cause the children output from the segmented gather to all have the exact same offsets, so we can pull them apart properly. This is inefficient, because we have to make a copy of all the input if any of them have a NULL. But I didn't want to risk passing an invalid column to segmented gather and having it produce an incorrect result.
I plan on filing another issue for us to stop using
mergeAndSetValidity
on any nested type. This includes strings.