-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Join APIs that return gathermaps #7454
Merged
rapids-bot
merged 183 commits into
rapidsai:branch-0.19
from
shwina:gathermap-based-join-apis
Mar 30, 2021
Merged
Join APIs that return gathermaps #7454
rapids-bot
merged 183 commits into
rapidsai:branch-0.19
from
shwina:gathermap-based-join-apis
Mar 30, 2021
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…athermap-based-join-apis
…athermap-based-join-apis
…athermap-based-join-apis
…athermap-based-join-apis
Co-authored-by: Mark Harris <[email protected]>
…to gathermap-based-join-apis
kkraus14
approved these changes
Mar 26, 2021
hyperbolic2346
approved these changes
Mar 26, 2021
@harrism can you review again when you get a chance? Then we'll merge! |
jlowe
reviewed
Mar 26, 2021
brandon-b-miller
approved these changes
Mar 26, 2021
jrhemstad
added
breaking
Breaking change
and removed
non-breaking
Non-breaking change
labels
Mar 27, 2021
harrism
approved these changes
Mar 30, 2021
@gpucibot merge |
rapids-bot bot
pushed a commit
that referenced
this pull request
Apr 1, 2021
Adds support for equijoin on structs. This PR is leveraging the [struct PR](#7422) and the [rewrite for join API](#7454). It enables equijoin on structs by flattening the struct for the hash calculation. closes #7543 Authors: - Mike Wilson (https://github.com/hyperbolic2346) - Ashwin Srinath (https://github.com/shwina) - Karthikeyan (https://github.com/karthikeyann) - Vukasin Milovanovic (https://github.com/vuule) - Alessandro Bellina (https://github.com/abellina) - Devavret Makkar (https://github.com/devavret) - David Wendt (https://github.com/davidwendt) - Liangcai Li (https://github.com/firestarman) - Paul Taylor (https://github.com/trxcllnt) - Kumar Aatish (https://github.com/kaatish) - Jason Lowe (https://github.com/jlowe) - Dillon Cullinan (https://github.com/dillon-cullinan) - Raza Jafri (https://github.com/razajafri) - https://github.com/rwlee - Michael Wang (https://github.com/isVoid) - Dante Gama Dessavre (https://github.com/dantegd) - Keith Kraus (https://github.com/kkraus14) - Robert Maynard (https://github.com/robertmaynard) - GALI PREM SAGAR (https://github.com/galipremsagar) - https://github.com/ChrisJar - AJ Schmidt (https://github.com/ajschmidt8) - https://github.com/nvdbaranec - Nghia Truong (https://github.com/ttnghia) - https://github.com/chenrui17 - Conor Hoekstra (https://github.com/codereport) - Mike Wendt (https://github.com/mike-wendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Devavret Makkar (https://github.com/devavret) - Jake Hemstad (https://github.com/jrhemstad) URL: #7720
rapids-bot bot
pushed a commit
that referenced
this pull request
Oct 1, 2021
The `method` parameter to `DataFrame.join` and `DataFrame.merge` [isn't used internally](https://github.com/rapidsai/cudf/blob/e2098e56f0cb209b1d916ce617c04533444a056a/python/cudf/cudf/core/join/join.py#L90) after changes in #7454. This PR updates the docstrings and adds deprecation notices via `FutureWarning` as discussed in #9347. The parameter is now deprecated in the public API. I removed all internal uses of the `method` parameter. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Charles Blackmon-Luca (https://github.com/charlesbluca) URL: #9291
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
breaking
Breaking change
improvement
Improvement / enhancement to an existing function
libcudf
Affects libcudf (C++/CUDA) code.
Python
Affects Python cuDF API.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Closes #6480
C++ changes
TL;DR
unique_ptr<rmm::device_uvector<size_type>>>
(rather than aunique_ptr<column>
), to accommodate join results that can be larger thanINT32_MAX
rowsThe problem
The work in this PR was motivated by the need for simpler join APIs that give the user more flexibility in how they want to construct the result of a join. To explain the current problem, consider the
inner_join
API:In addition to the left and right tables (and corresponding keys), the API also accepts a
columns_in_common
argument. This is argument specifies pairs of columns from the LHS and RHS respectively, for which only a single column should appear in the result. That single column appears on the "left" side of the result. This makes the API somewhat complicated as well as inflexible.There is a "lower-level" join API that gives more control on which side the "common" columns should go, by providing an additional
common_columns_output_side
argument:But even that offers only limited flexibility: for example, it doesn't allow the user to specify an arbitrary ordering of result columns, or omit columns altogether from the result.
Proposed API
The proposed API in this PR is:
Note:
out_of_bounds_policy::NULLIFY
argument togather
, this will produce nulls in the appropriate locations of the result table.std::unique_ptr<rmm::device_uvector>>
rather than justrmm::device_uvector
because of a Cython limitation that prevents wrapping functions whose return types do not provide a nullary (default) constructor.rmm::device_uvector
allows the API to return results of size >INT32_MAX
, which can occur easily in outer joins.Python changes
TL;DR
Changes/Improvements
_Indexer
One major change introduced in the join internals is the use of a new type
_Indexer
to represent a key column.Previously, join keys were represented by a numeric offset. This was for two reasons:
left_on
andright_on
arguments_Indexer
provides a more convenient way to construct and represent join keys by allowing one to refer unambiguosly to an index or data column of aFrame
:Casting logic
Some of the casting logic has been simplified since we no longer need to post-process (cast) the result returned by libcudf. Previously, we were accounting for
"right"
joins in our casting functions. But, since a right join is implemented in terms of a left join with the operands reversed, it turns out we never really needed to handle right joins separately. I have removed that and it simplifies casting logic further.Others
casting_logic.py
to_join_helpers.py
and included other join utilities there.Merge
for handling semi/anti joinsassert_join_results_equal
helper to compare join results between Pandas and cuDF. libcudf can return join results with arbitrary row ordering, and we weren't accounting for that in some of our tests previously. I'm a bit surprised we never ran into any test failures :)