Refactor cudf::detail::contains(table_view, table_view) #11325

Closed
wants to merge 7 commits into from

Conversation

ttnghia
Contributor

@ttnghia ttnghia commented Jul 21, 2022

This PR modifies the overload of cudf::detail::contains that accepts a pair of table_view objects. The main change is switching from cuco::static_multimap to cuco::static_map. This avoids the performance regression reported in #11299, which is caused by a limitation of the multimap and shows up when the input tables have many duplicate rows.
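To illustrate why a container that stores each distinct key once suits a membership check, here is a minimal host-side sketch. It uses `std::unordered_set`, `std::unordered_map`, and `std::unordered_multimap` as stand-ins; it is not the cuco API and not the code in this PR, and the helper names `contains_with_set`, `multimap_entries`, and `map_entries` are hypothetical.

```cpp
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical helper: for each needle, report whether it occurs in haystack.
// Only distinct keys are stored, so duplicates in haystack add no entries.
std::vector<bool> contains_with_set(std::vector<int> const& haystack,
                                    std::vector<int> const& needles)
{
  std::unordered_set<int> const distinct(haystack.begin(), haystack.end());
  std::vector<bool> result;
  result.reserve(needles.size());
  for (int const n : needles) { result.push_back(distinct.count(n) > 0); }
  return result;
}

// Number of entries a multimap holds: one per input row, duplicates included.
std::size_t multimap_entries(std::vector<int> const& haystack)
{
  std::unordered_multimap<int, std::size_t> mm;
  for (std::size_t i = 0; i < haystack.size(); ++i) { mm.emplace(haystack[i], i); }
  return mm.size();
}

// Number of entries a map holds: one per distinct key, later duplicates are ignored.
std::size_t map_entries(std::vector<int> const& haystack)
{
  std::unordered_map<int, std::size_t> m;
  for (std::size_t i = 0; i < haystack.size(); ++i) { m.emplace(haystack[i], i); }
  return m.size();
}
```

With many duplicate rows, the multimap stand-in keeps growing while the map stand-in stays at the number of distinct keys, and the membership answer is the same either way.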

@ttnghia ttnghia added 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Jul 21, 2022
@ttnghia ttnghia self-assigned this Jul 21, 2022
@ttnghia ttnghia requested a review from a team as a code owner July 21, 2022 19:22
@ttnghia ttnghia added the bug Something isn't working label Jul 21, 2022
@codecov

codecov bot commented Jul 21, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@719f4c8).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #11325   +/-   ##
===============================================
  Coverage                ?   86.39%           
===============================================
  Files                   ?      143           
  Lines                   ?    22753           
  Branches                ?        0           
===============================================
  Hits                    ?    19658           
  Misses                  ?     3095           
  Partials                ?        0           


Member

@PointKernel PointKernel left a comment

LGTM

  template <typename T>
  __device__ inline auto operator()(T const idx) const noexcept
  {
    return _hasher(static_cast<size_type>(idx));
Member

Will there be narrowing conversion issues?

Contributor Author

This functor is only used locally in this file, and the input idx is always a strong index type (either lhs_index_type or rhs_index_type).
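A minimal host-side sketch of that argument follows. It assumes the strong index types are scoped enums whose underlying type is the size type, and that the size type is a 32-bit signed integer; the adaptor name `strong_index_hasher_adaptor` is hypothetical, and `__device__` is omitted so the snippet compiles as plain C++17. Under those assumptions the `static_cast` only recovers the underlying value, so nothing is narrowed.

```cpp
#include <cstdint>
#include <functional>
#include <type_traits>

// Stand-in for cudf::size_type (assumed here to be a 32-bit signed integer).
using size_type = std::int32_t;

// Assumed shape of the strong index types: scoped enums over size_type.
enum class lhs_index_type : size_type {};
enum class rhs_index_type : size_type {};

// Hypothetical adaptor wrapping a hasher that takes a plain size_type.
template <typename Hasher>
struct strong_index_hasher_adaptor {
  Hasher hasher;

  template <typename T>
  auto operator()(T const idx) const noexcept
  {
    // Reject anything that is not an enum over size_type at compile time,
    // so the cast below can never drop bits.
    static_assert(std::is_enum_v<T>, "expected a strong index type");
    static_assert(std::is_same_v<std::underlying_type_t<T>, size_type>,
                  "underlying type must be size_type");
    return hasher(static_cast<size_type>(idx));
  }
};
```

Passing a wider integer such as `std::int64_t` to this adaptor fails to compile instead of being silently truncated, which is one way to make the local-use assumption explicit.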

@mythrocks
Contributor

This was looking good to me. I'd like to get my head around the null-checking before I approve.

@hyperbolic2346
Contributor

Do we have any performance results for this yet?

@ttnghia ttnghia added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Jul 22, 2022
@ttnghia
Contributor Author

ttnghia commented Jul 22, 2022

Closing this, as it is covered by a new PR: #11330.

@ttnghia ttnghia closed this Jul 22, 2022
rapids-bot bot pushed a commit that referenced this pull request Jul 22, 2022
…ins` (#11330)

The current implementation of `cudf::detail::contains` can process input with arbitrary nested types. However, it was reported to have a severe performance issue when the input tables have many duplicate rows (#11299). To fix the issue, #11310 and #11325 were created.

Unfortunately, #11310 separates semi-anti-join from `cudf::detail::contains`, resulting in a duplicate implementation. On the other hand, #11325 addresses issue #11299, but semi-anti-join built on it still performs worse than the previous semi-anti-join implementation.

The changes in this PR include the following:
 * Fix the performance issue reported in #11299 for the current `cudf::detail::contains` implementation that supports nested types.
 * Add a separate code path to `cudf::detail::contains` such that:
     * Input without any lists column (at any nesting level) is processed by the same code path as the old semi-anti-join implementation, so that semi-anti-join performance remains the same as before.
     * Input with a nested lists column, or with NaNs compared as unequal, is processed by another code path that supports nested types and different NaN behavior, so that support for nested types is not dropped.
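
The selection rule in the list above can be sketched on the host as follows. This is only an illustration under stated assumptions: `type_id`, `column_desc`, `nan_equality`, `code_path`, `has_nested_lists`, and `select_code_path` are hypothetical stand-ins defined here, not libcudf types or functions, and the real dispatch may differ.

```cpp
#include <vector>

// Hypothetical, simplified column description (not a cudf type).
enum class type_id { INT32, FLOAT64, STRING, STRUCT, LIST };

struct column_desc {
  type_id type;
  std::vector<column_desc> children;  // child columns of STRUCT or LIST
};

// Hypothetical stand-ins for the NaN comparison policy and the two code paths.
enum class nan_equality { ALL_EQUAL, UNEQUAL };
enum class code_path { FLAT, NESTED };

// True if the column is a lists column or contains one at any nesting level.
bool has_nested_lists(column_desc const& col)
{
  if (col.type == type_id::LIST) { return true; }
  for (auto const& child : col.children) {
    if (has_nested_lists(child)) { return true; }
  }
  return false;
}

// Pick the nested path for lists columns or unequal NaNs, else the flat path.
code_path select_code_path(std::vector<column_desc> const& table, nan_equality compare_nans)
{
  if (compare_nans == nan_equality::UNEQUAL) { return code_path::NESTED; }
  for (auto const& col : table) {
    if (has_nested_lists(col)) { return code_path::NESTED; }
  }
  return code_path::FLAT;
}
```

In this sketch, a struct column without any lists descendant still takes the flat path, matching the wording "without any lists column (at any nesting level)".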

Closes #11299.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)
  - MithunR (https://github.com/mythrocks)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Alessandro Bellina (https://github.com/abellina)

URL: #11330
@ttnghia ttnghia deleted the refactor_contains branch July 22, 2022 23:09