Add support for `element_at` and `GetMapValue` #5883
Signed-off-by: Raza Jafri <[email protected]>
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/collectionOperations.scala
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/complexTypeExtractors.scala
(_, _) => throw new UnsupportedOperationException("non-literal key is not supported")
(map, indices) => {
  if (failOnError) {
    withResource(map.getMapKeyExistence(indices)) { keyExists =>
In conventional programming, checking for key existence before doing the value lookup is considered a performance antipattern because it leads to an unnecessary second lookup.
Is there a way to build the logic with a single `getMapValue`?
I thought about doing it without an extra lookup, but there is no way for us to know whether the map actually held a null value or the key was not found in the map.
This is very easy to achieve by re-implementing the map lookup to return a pair of `{value, success}`. During the lookup you already get the `success` value for free, but it is discarded. Otherwise, checking for existence and then retrieving is doubly expensive.
I would prefer to go back and do that rather than accepting this double computation.
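To illustrate the `{value, success}` idea from this thread, here is a minimal CPU-side sketch in plain Scala. It is not the cudf or spark-rapids API; `SingleLookup`, `lookupTwice`, and `lookupOnce` are hypothetical names used only for illustration. It shows how a single probe can still distinguish a stored null from a missing key.

```scala
object SingleLookup {
  // Two probes: `contains` searches the map, then `apply` searches it again.
  def lookupTwice(m: Map[String, String], key: String): (String, Boolean) =
    if (m.contains(key)) (m(key), true) else (null, false)

  // One probe: `get` returns Some(value) when the key is present (even if the
  // stored value is null) and None when the key is absent, so the success flag
  // comes out of the same search that produced the value.
  def lookupOnce(m: Map[String, String], key: String): (String, Boolean) =
    m.get(key) match {
      case Some(v) => (v, true)
      case None    => (null, false)
    }
}
```

With `Map("a" -> "1", "b" -> null)`, both functions report success for `"b"` (a stored null) and failure for an absent key, but `lookupOnce` searches the map only once.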
build
CI is failing because of a dependency in cudf
@revans2 I think I have addressed all your concerns.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/ColumnVectorUtil.scala
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/collectionOperations.scala
build
origin: Origin): ColumnVector = {
  withResource(map.getMapKeyExistence(indices)) { keyExists =>
    withResource(keyExists.all()) { exist =>
      if (!exist.isValid && exist.getBoolean) {
Wait, if `!exist.isValid`, then it should not contain a valid boolean value, right?
Great catch! I have added a test for this
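The point raised above can be sketched with a small stand-in type. `NullableBool` is a hypothetical model of a nullable scalar, not the cudf `Scalar` class, and treating an invalid result as "not all keys exist" is an assumption of this sketch rather than the behavior settled on in the PR.

```scala
// Hypothetical stand-in for a nullable boolean scalar (not the cudf class).
final case class NullableBool(isValid: Boolean, value: Boolean) {
  def getBoolean: Boolean = value
}

object AllKeysExist {
  // The payload is only meaningful when the scalar is valid, so validity is
  // tested first and the value is read only afterwards.
  def allKeysExist(exist: NullableBool): Boolean =
    exist.isValid && exist.getBoolean
}
```

Written this way, the condition can never be satisfied by an invalid scalar, which is the case the original `!exist.isValid && exist.getBoolean` check got backwards.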
withResource(map.getMapKeyExistence(indices)) { keyExists =>
  withResource(keyExists.all()) { exist =>
    if (!exist.isValid && exist.getBoolean) {
      map.getMapValue(indices)
    } else {
Overall, this is not much different from the last time: first it checks for key existence, then it pulls the map values. Both operations are expensive, yet internally they call the same thing in cudf (`lists::index_of`).
We can only get rid of this double computation by:
- Calling `index_of` directly (in JNI), returning a lists column of the key indices found in the map.
- Replacing the key-check step with a check that all the returned indices are non-negative.
- Replacing the `getMapValue` step with a call to `lists::segmented_gather` on these returned indices.
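The three steps above can be sketched on the CPU with plain Scala collections. This only models the idea; it does not call cudf, and `IndexThenGather`, `MapRow`, `keyIndices`, `allKeysFound`, and `gatherValues` are hypothetical names. Each map row is modeled as a sequence of key/value pairs, with `None` standing in for a null value.

```scala
object IndexThenGather {
  // One map per row, modeled as a list of (key, value) entries.
  type MapRow = Seq[(String, Option[Int])]

  // Step 1: a single search per row; -1 marks a key missing from that row.
  def keyIndices(maps: Seq[MapRow], keys: Seq[String]): Seq[Int] =
    maps.zip(keys).map { case (row, key) => row.indexWhere(_._1 == key) }

  // Step 2: the existence check reuses the indices instead of searching again.
  def allKeysFound(indices: Seq[Int]): Boolean = indices.forall(_ >= 0)

  // Step 3: gather values by position; a negative index yields None (null).
  def gatherValues(maps: Seq[MapRow], indices: Seq[Int]): Seq[Option[Int]] =
    maps.zip(indices).map { case (row, i) => if (i >= 0) row(i)._2 else None }
}
```

Because the indices are kept, a stored null (non-negative index, `None` value) remains distinguishable from a missing key (index `-1`) without a second search.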
I believe we can do much better and reduce the run time by half.
Update: I have filed an issue to improve this later: #5919
I will come back to improve it in follow-up work.
build
merged rapidsai/cudf#11147 and kicked off spark-rapids-jni nightly build
build
@revans2 @gerashegalov @ttnghia I think I have addressed your concerns. PTAL
It looks like the style or docs checks failed.
build
This PR adds support for `GetMapValue` and `element_at` to take a column of keys and return their values if the key exists, otherwise null.
This PR depends on rapidsai/cudf#11128
Signed-off-by: Raza Jafri [email protected]
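As a rough illustration of the row-wise semantics described above, here is a plain-Scala model, not the GPU implementation. `ElementAtSketch` and `elementAt` are hypothetical names. Throwing when `failOnError` is set and a key is missing is an assumption inferred from the flag name in the diff, and the exception type is chosen only for the sketch.

```scala
object ElementAtSketch {
  // Row i looks up keys(i) in maps(i); None stands in for a SQL null.
  def elementAt(
      maps: Seq[Map[String, Int]],
      keys: Seq[String],
      failOnError: Boolean): Seq[Option[Int]] =
    maps.zip(keys).map { case (m, k) =>
      m.get(k) match {
        // Assumed behavior: a missing key is an error only when failOnError is set.
        case None if failOnError =>
          throw new NoSuchElementException(s"Key $k does not exist.")
        case result => result
      }
    }
}
```

For example, looking up keys `"a"` and `"z"` in the rows `Map("a" -> 1)` and `Map("b" -> 2)` gives a value for the first row and a null for the second when `failOnError` is false.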