Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Add a join API that returns gather maps rather than a table #6480

Closed
shwina opened this issue Oct 9, 2020 · 3 comments · Fixed by #7454
Closed

[FEA] Add a join API that returns gather maps rather than a table #6480

shwina opened this issue Oct 9, 2020 · 3 comments · Fixed by #7454
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@shwina
Copy link
Contributor

shwina commented Oct 9, 2020

This would depend on #6479 and #6478

It would be great if libcudf had a join API that returned gather map(s) (one in case of an inner join and two in case of outer joins). For example:

std::pair<std::unique_ptr<cudf::column>, std::unique_ptr<cudf::column>> left_join(...);

As explained in #6479 , this could result in simpler join code, both in the C++ as well as in Python.

As follow up tasks to this feature, we should:

  • Update Python bindings to use the new gather map API
  • Remove the columns_in_common option from the existing libcudf join APIs - since Python will no longer need it
  • Refactor the existing libcudf join APIs to just call the new API that returns a gather map
@shwina shwina added feature request New feature or request Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Oct 9, 2020
@shwina
Copy link
Contributor Author

shwina commented Oct 9, 2020

cc: @jrhemstad

@jlowe
Copy link
Member

jlowe commented Oct 20, 2020

We would love to see this for the RAPIDS plugin for Spark as well. There are many cases where a join occurs in the query and one or more keys involved in the join do not appear in the output table. The current cudf join APIs always manifest the key columns in the join result. Having the gather maps join API allows us to gather only the output columns that are needed.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Mar 30, 2021
Closes #6480 

# C++ changes

## TL;DR

* Adds join APIs that accept join keys and return gathermaps
* Return type is a `unique_ptr<rmm::device_uvector<size_type>>>` (rather than a `unique_ptr<column>`), to accommodate join results that can be larger than `INT32_MAX` rows
* Simplifies previous join APIs to not accept arguments relating to "common columns" -- instead, those APIs always return all the columns from the LHS/RHS. Users wanting finer control can use the gathermap-based APIs

## The problem

The work in this PR was motivated by the need for simpler join APIs that give the user more flexibility in how they want to construct the result of a join. To explain the current problem, consider the `inner_join` API:

```c++
std::unique_ptr<cudf::table> inner_join(
  cudf::table_view const& left,
  cudf::table_view const& right,
  std::vector<cudf::size_type> const& left_on,
  std::vector<cudf::size_type> const& right_on,
  std::vector<std::pair<cudf::size_type, cudf::size_type>> const& columns_in_common,
  null_equality compare_nulls         = null_equality::EQUAL,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
```

In addition to the left and right tables (and corresponding keys), the API also accepts a `columns_in_common` argument. This is argument specifies pairs of columns from the LHS and RHS respectively, for which only a single column should appear in the result. That single column appears on the "left" side of the result. This makes the API somewhat complicated as well as inflexible.

There is a "lower-level" join API that gives more control on which side the "common" columns should go, by providing an additional `common_columns_output_side` argument:

```c++
  std::pair<std::unique_ptr<cudf::table>, std::unique_ptr<cudf::table>> inner_join(
    cudf::table_view const& probe,
    std::vector<size_type> const& probe_on,
    std::vector<std::pair<cudf::size_type, cudf::size_type>> const& columns_in_common,
    common_columns_output_side common_columns_output_side = common_columns_output_side::PROBE,
    null_equality compare_nulls                           = null_equality::EQUAL,
    rmm::cuda_stream_view stream                          = rmm::cuda_stream_default,
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()) const;
```

But even that offers only limited flexibility: for example, it doesn't allow the user to specify an arbitrary ordering of result columns, or omit columns altogether from the result.

## Proposed API

The proposed API in this PR is:

```c++
std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
          std::unique_ptr<rmm::device_uvector<size_type>>>
inner_join(cudf::table_view const& left_keys,
           cudf::table_view const& right_keys,
           null_equality compare_nulls         = null_equality::EQUAL,
           rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
```

Note:

* Rather than requiring the full left and right tables of the join, this API only needs the key columns from the left and right tables.
* Rather than constructing the result of the join, this API returns the gathermaps which can be used to construct it.
* For outer join, non-matches are represented by out-of-bound values in the gathermap. In conjunction with the `out_of_bounds_policy::NULLIFY` argument to `gather`, this will produce nulls in the appropriate locations of the result table.
* The API returns a `std::unique_ptr<rmm::device_uvector>>` rather than just `rmm::device_uvector` because of a Cython limitation that prevents wrapping functions whose return types do not provide a nullary (default) constructor.
* The use of `rmm::device_uvector` allows the API to return results of size > `INT32_MAX`, which can occur easily in outer joins. 


# Python changes

## TL;DR

* Add Cython bindings for the new C++ APIs
* Rework join internals to interface with the new Cython APIs

## Changes/Improvements

### _Indexer

One major change introduced in the join internals is the use of a new type `_Indexer` to represent a key column.

Previously, join keys were represented by a numeric offset. This was for two reasons:

* A join key could be either an index column or a data column, and the only way to refer to it unambiguously was by its offset -- a DataFrame can have an index column and a data column with the same name.
* The C++ API required numeric offsets for the `left_on` and `right_on` arguments

`_Indexer` provides a more convenient way to construct and represent join keys by allowing one to refer unambiguosly to an index or data column of a `Frame`:

```
    # >>> df
    #    a
    # b
    # 4  1
    # 5  2
    # 6  3
    # >>> _Indexer("a", column=True).get(df)  # returns column "a" of df
    # >>> _Indexer("b", index=True).get(df)  # returns index level "b" of df
```

### Casting logic

Some of the casting logic has been simplified since we no longer need to post-process (cast) the result returned by libcudf. Previously, we were accounting for `"right"` joins in our casting functions. But, since a right join is implemented in terms of a left join with the operands reversed, it turns out we never really needed to handle right joins separately. I have removed that and it simplifies casting logic further.

### Others

* Renamed `casting_logic.py` to `_join_helpers.py` and included other join utilities there.
* Added a subclass of `Merge` for handling semi/anti joins
* Added a `assert_join_results_equal` helper to compare join results between Pandas and cuDF. libcudf can return join results with arbitrary row ordering, and we weren't accounting for that in some of our tests previously. I'm a bit surprised we never ran into any test failures :)

Authors:
  - Ashwin Srinath (@shwina)
  - Vyas Ramasubramani (@vyasr)

Approvers:
  - Jake Hemstad (@jrhemstad)
  - Keith Kraus (@kkraus14)
  - Mike Wilson (@hyperbolic2346)
  - @brandon-b-miller
  - Mark Harris (@harrism)

URL: #7454
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants