Add support for `element_at` and `GetMapValue` #5883

razajafri · 2022-06-22T00:36:18Z

This PR adds support for `GetMapValue` and `element_at` to take a column of keys and return the corresponding values, or null where a key does not exist.

This PR depends on rapidsai/cudf#11128

Signed-off-by: Raza Jafri <[email protected]>

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/collectionOperations.scala

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/complexTypeExtractors.scala

integration_tests/src/main/python/map_test.py

gerashegalov · 2022-06-22T21:53:44Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/collectionOperations.scala

-        (_, _) => throw new UnsupportedOperationException("non-literal key is not supported")
+        (map, indices) => {
+          if (failOnError) {
+            withResource(map.getMapKeyExistence(indices)) { keyExists =>


In conventional programming, checking for key existence before doing the value lookup is considered a performance antipattern, because it leads to an unnecessary second lookup.

Is there a way to build this logic with a single getMapValue call?

razajafri · Jun 22, 2022

I thought about doing it without an extra lookup, but then there is no way for us to know whether the map actually held a null value for the key or the key was not found in the map.

ttnghia · Jun 24, 2022

This is very easy to achieve by re-implementing the map lookup to return a pair of {value, success}. During the lookup you already get the success flag for free, but it is discarded. Otherwise, checking for existence and then retrieving the value is doubly expensive.

I would prefer to go back and do that rather than accepting this double computation.
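
The {value, success} idea above can be illustrated on the CPU with plain Scala collections. This is only a minimal sketch under stated assumptions: `SingleLookup` and `lookup` are hypothetical names, the map is modeled as a sequence of key/value entries, and a null value is modeled as `None`. It is not the cudf or plugin API.

```scala
object SingleLookup {
  // Hypothetical helper: one pass over the entries yields both the value and a
  // "found" flag, so a null value (None) can be told apart from a missing key.
  def lookup[K, V](entries: Seq[(K, Option[V])], key: K): (Option[V], Boolean) =
    entries.collectFirst { case (k, v) if k == key => v } match {
      case Some(v) => (v, true)     // key present; its value may still be null (None)
      case None    => (None, false) // key absent
    }
}
```

With the flag returned alongside the value, the caller can decide between returning null and raising an error without searching the keys a second time.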

razajafri · 2022-06-22T23:22:18Z

build

razajafri · 2022-06-23T18:21:07Z

CI is failing because of a dependency in cudf

razajafri · 2022-06-24T19:54:21Z

@revans2 I think I have addressed all your concerns.

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/ColumnVectorUtil.scala

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/collectionOperations.scala

razajafri · 2022-06-27T17:59:57Z

build

ttnghia · 2022-06-27T18:14:58Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMapUtils.scala

+      origin: Origin): ColumnVector = {
+    withResource(map.getMapKeyExistence(indices)) { keyExists =>
+      withResource(keyExists.all()) { exist =>
+        if (!exist.isValid && exist.getBoolean) {


Wait, if !exist.isValid then it should not contain a valid boolean value, right?

razajafri · Jun 27, 2022

Great catch! I have added a test for this
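
To make the concern concrete, here is a minimal sketch that models the reduced scalar with a hypothetical `BoolScalar` case class (not the cudf `Scalar` class). It assumes the intended condition is "valid and true"; the thread itself does not show the corrected line.

```scala
object ValidityCheck {
  // Hypothetical stand-in for a nullable boolean scalar.
  final case class BoolScalar(isValid: Boolean, value: Boolean)

  // The condition as written in the diff: it reads the value only when the scalar is invalid.
  def asWritten(s: BoolScalar): Boolean = !s.isValid && s.value

  // Assumed intent: trust the value only when the scalar is valid.
  def validFirst(s: BoolScalar): Boolean = s.isValid && s.value
}
```

Under this model, a valid `true` result (every key exists) fails the condition as written and would fall through to the error branch, whereas the valid-first form accepts it.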

ttnghia · 2022-06-27T18:25:04Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMapUtils.scala

+    withResource(map.getMapKeyExistence(indices)) { keyExists =>
+      withResource(keyExists.all()) { exist =>
+        if (!exist.isValid && exist.getBoolean) {
+          map.getMapValue(indices)
+        } else {


Overall, this is not much different from last time: it first checks for key existence, then pulls the map values. Both operations are expensive, yet internally they call the same thing in cudf (lists::index_of).

The only way to get rid of this double computation is to:

1. Call index_of directly (in JNI), returning a lists column of the key indices found in the map.
2. Replace the key check with a check that all the returned indices are non-negative.
3. Replace the getMapValue step with a call to lists::segmented_gather on the returned indices.
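
The three steps above can be sketched on the CPU with plain Scala collections. This is only an illustration of the data flow, not the JNI or cudf implementation: `MapRow`, `keyIndices`, `allFound`, and `gather` are hypothetical names, and each map row is modeled as parallel key and value vectors.

```scala
object IndexThenGather {
  // One row holds one map, modeled as parallel key and value lists (hypothetical type).
  final case class MapRow[K, V](keys: Vector[K], values: Vector[Option[V]])

  // Step 1: a single search per row; -1 marks "key not found".
  def keyIndices[K, V](rows: Seq[MapRow[K, V]], lookupKeys: Seq[K]): Seq[Int] =
    rows.zip(lookupKeys).map { case (row, k) => row.keys.indexOf(k) }

  // Step 2: the existence check becomes a check on the indices already computed.
  def allFound(indices: Seq[Int]): Boolean = indices.forall(_ >= 0)

  // Step 3: gather the values at those indices; a missing key yields null (None).
  def gather[K, V](rows: Seq[MapRow[K, V]], indices: Seq[Int]): Seq[Option[V]] =
    rows.zip(indices).map { case (row, i) => if (i >= 0) row.values(i) else None }
}
```

The key search runs once, and both the existence check and the value retrieval reuse its result, which is the saving the comment is after.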

ttnghia

I believe that we can do much better to reduce the run time by half.

Update: I have filed an issue to improve this later: #5919

Will go back to improve it later in follow-up work.

razajafri · 2022-06-27T19:52:56Z

build

razajafri · 2022-06-27T20:15:54Z

merged rapidsai/cudf#11147 and kicked off spark-rapids-jni nightly build

razajafri · 2022-06-28T18:16:58Z

build

razajafri · 2022-06-28T22:53:00Z

@revans2 @gerashegalov @ttnghia I think I have addressed your concerns. PTAL

revans2 · 2022-06-29T13:02:12Z

It looks like the style or docs checks failed.

razajafri · 2022-06-29T19:00:54Z

build

Add support for element_at and GetMapValue

06fac95

Signed-off-by: Raza Jafri <[email protected]>

sameerz added the feature request label Jun 22, 2022

revans2 reviewed Jun 22, 2022

gerashegalov reviewed Jun 22, 2022

addressed review comments

36cb63c

Signed-off-by: Raza Jafri <[email protected]>

razajafri added 2 commits June 23, 2022 12:38

added tests for decimal 128

44c1417

Signed-off-by: Raza Jafri <[email protected]>

added a utility method to return the first bad key

f46f10f

Signed-off-by: Raza Jafri <[email protected]>

revans2 reviewed Jun 24, 2022

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/ColumnVectorUtil.scala (outdated)

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/collectionOperations.scala

added test with d128 as key and moved the util function

aeaad54

Signed-off-by: Raza Jafri <[email protected]>

ttnghia reviewed Jun 27, 2022

ttnghia previously requested changes Jun 27, 2022

ttnghia mentioned this pull request Jun 27, 2022

[FEA] Improve performance for GetMapValue #5919

Open

addressed review comments

1f4c7d6

Signed-off-by: Raza Jafri <[email protected]>

revans2 approved these changes Jun 29, 2022

trigger the verify checks

29916ef

Signed-off-by: Raza Jafri <[email protected]>

razajafri merged commit f273ff2 into NVIDIA:branch-22.08 Jun 29, 2022

razajafri deleted the SP-5204-getMapValue branch June 29, 2022 22:33

razajafri mentioned this pull request Jun 29, 2022

[FEA] Support Key vectors for GetMapValue and ElementAt for maps. #5204

Closed

pxLi mentioned this pull request Jun 30, 2022

[BUG] test_get_map_value_string_col_keys_ansi_fail in databricks321 runtime #5937

Closed
