Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Legacy join APIs return extraneous columns #7762

Closed
jlowe opened this issue Mar 30, 2021 · 7 comments · Fixed by #11274
Closed

[BUG] Legacy join APIs return extraneous columns #7762

jlowe opened this issue Mar 30, 2021 · 7 comments · Fixed by #11274
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments

@jlowe
Copy link
Member

jlowe commented Mar 30, 2021

Describe the bug
While updating the Java bindings to the new join gather map APIs added in #7454, I noticed that the old join APIs now return more columns than they did before. They seem to be returning a table that contains the key columns from both the left and right tables, whereas before they returned only the key columns as they appeared in the left table (updated for the join result).

For example take two tables, the left having columns A,B and the right having columns C,D and we want to perform a join on A == C. Previously the join APIs would return a table with 3 columns, A, B, and D, but now they return 4 columns.

Steps/Code to reproduce bug
Call the left_join, inner_join, and full_join APIs before and after #7454.

Expected behavior
Redundant key columns should not be manifested. Manifesting the right key columns on an inner or left join seems to serve no purpose, as the left key columns already contain the proper answer for the join.

@jlowe jlowe added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. labels Mar 30, 2021
@harrism
Copy link
Member

harrism commented Apr 23, 2021

@shwina did you see this?

@shwina
Copy link
Contributor

shwina commented Apr 23, 2021

This was by design, although I can concede a lazy one on my part. @jrhemstad wanted the "common columns" parameter from the legacy APIs gone and I didn't replace it with reasonable default behaviour

One solution is to get rid of these APIs entirely. Python (and I believe Spark) don't use them anymore. If we still want them for the convenience of libcudf users, we could make it such that only key columns from the left are preserved for inner and left joins.

@shwina shwina closed this as completed Apr 23, 2021
@shwina
Copy link
Contributor

shwina commented Apr 23, 2021

Sorry for closing! Wrong button

@shwina shwina reopened this Apr 23, 2021
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jlowe
Copy link
Member Author

jlowe commented May 24, 2021

Still relevant. The API should either be fixed or removed.

@github-actions
Copy link

github-actions bot commented Feb 7, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@jlowe
Copy link
Member Author

jlowe commented Feb 7, 2022

I think this is still relevant, but low priority.

One solution is to get rid of these APIs entirely.

I think this is the most appropriate solution, since I cannot see a valid reason callers would want or expect duplicated key columns in the join output.

@jlowe jlowe removed the inactive-90d label Feb 7, 2022
rapids-bot bot pushed a commit that referenced this issue Jul 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants