Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for element_at and GetMapValue #5883

Merged
merged 7 commits into from
Jun 29, 2022

Conversation

razajafri
Copy link
Collaborator

This PR adds support for GetMapValue and element_at to take a column of keys and return their values if the key exists otherwise null.

This PR depends on rapidsai/cudf#11128

Signed-off-by: Raza Jafri [email protected]

@sameerz sameerz added the feature request New feature or request label Jun 22, 2022
(_, _) => throw new UnsupportedOperationException("non-literal key is not supported")
(map, indices) => {
if (failOnError) {
withResource(map.getMapKeyExistence(indices)) { keyExists =>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in a conventional programming checking for key existence before doing the value lookup is considered a peroformance antipattern leading to an unnecessary second lookup.

Is there a way to just build a logic with a single getMapValue?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about doing it without an extra lookup but there is no way for us to know if the map actually had a null value or was the key not found in the map.

Copy link
Collaborator

@ttnghia ttnghia Jun 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very easy to achieve by re-implementing map lookup to return a pair of {value, success}. During lookup, you already get the sucess value for free but discarded. Otherwise, checking for existence then retrieving is doubly expensive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to go back and do that rather than accepting this double computation.

Signed-off-by: Raza Jafri <[email protected]>
@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

CI is failing because of a dependency in cudf

@razajafri
Copy link
Collaborator Author

@revans2 I have addressed all your concerns I think

@razajafri
Copy link
Collaborator Author

build

origin: Origin): ColumnVector = {
withResource(map.getMapKeyExistence(indices)) { keyExists =>
withResource(keyExists.all()) { exist =>
if (!exist.isValid && exist.getBoolean) {
Copy link
Collaborator

@ttnghia ttnghia Jun 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, if !exist.isValid then it should not contain a valid boolean value, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! I have added a test for this

Comment on lines 49 to 53
withResource(map.getMapKeyExistence(indices)) { keyExists =>
withResource(keyExists.all()) { exist =>
if (!exist.isValid && exist.getBoolean) {
map.getMapValue(indices)
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this is not much different from the last time: Firstly it checks for key existence, then it pulls map values. These two operations are both expensive but they internally call to the same thing in cudf (lists::index_of).

We can only get rid of such double computation by:

  • Calling to index_of directly (in JNI), returning a lists column of the key indices found in the map
  • The check keys step can be replaced by checking if all the returned indices are not negative
  • The getMapValue step can be replaced by calling lists::segmented_gather on these returned indices.

Copy link
Collaborator

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that we can do much better to reduce the run time by half.

Update: I have filed an issue to improve this later: #5919

@ttnghia ttnghia dismissed their stale review June 27, 2022 18:52

Will go back to improve it later in a follow up work.

Signed-off-by: Raza Jafri <[email protected]>
@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

merged rapidsai/cudf#11147 and kicked off spark-rapids-jni nightly build

@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

@revans2 @gerashegalov @ttnghia I think I have addressed your concerns. PTAL

@revans2
Copy link
Collaborator

revans2 commented Jun 29, 2022

It looks like the style or docs checks failed.

Signed-off-by: Raza Jafri <[email protected]>
@razajafri
Copy link
Collaborator Author

build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants