Improve performance for cudf::contains when searching for a scalar #11202

Merged on Jul 8, 2022 (20 commits)

@ttnghia ttnghia commented Jul 6, 2022

The current implementation of cudf::contains(column_view, scalar) uses thrust::find and thrust::any_of (which also calls thrust::find_if under the hood). These Thrust APIs are known to have a performance regression (NVIDIA/cccl#720).

This PR replaces thrust::find and thrust::any_of in cudf::contains with thrust::count_if, which improves performance significantly.
Benchmarks show that the run time is reduced by up to 80%, i.e. a speedup of up to 5x.

Closes #3806.
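
The change described above swaps an early-exit search for a full counting pass. As a rough host-side analogue only (standard C++ algorithms stand in for the Thrust calls, the function names are hypothetical, and this is not the libcudf implementation), the two formulations can be sketched as:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Early-exit formulation: stop at the first element equal to `needle`.
// This mirrors the shape of the old thrust::find-based approach.
bool contains_by_find(std::vector<int> const& haystack, int needle)
{
  return std::find(haystack.begin(), haystack.end(), needle) != haystack.end();
}

// Counting formulation: scan everything, count the matches, test for non-zero.
// This mirrors the shape of the thrust::count_if-based approach.
bool contains_by_count(std::vector<int> const& haystack, int needle)
{
  auto const matches = std::count_if(
    haystack.begin(), haystack.end(), [needle](int v) { return v == needle; });
  return matches > 0;
}

// Debug-time check that both formulations give the same answer.
void check_equivalent(std::vector<int> const& haystack, int needle)
{
  assert(contains_by_find(haystack, needle) == contains_by_count(haystack, needle));
}
```

The counting version always reads the whole input, so any advantage it has comes from how the reduction runs on the GPU, not from doing less work.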

@ttnghia ttnghia added the labels 3 - Ready for Review, libcudf, Performance, improvement, and non-breaking Jul 6, 2022
@ttnghia ttnghia self-assigned this Jul 6, 2022

ttnghia commented Jul 6, 2022

Here are the benchmark results comparing performance after vs. before this PR (the Time and CPU columns show the relative change; negative means faster):

Benchmark                                                              Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------
Contains/SearchScalar_AllValid/32768/manual_time                    -0.4414         -0.3216             0             0             0             0
Contains/SearchScalar_AllValid/262144/manual_time                   -0.3719         -0.2666             0             0             0             0
Contains/SearchScalar_AllValid/2097152/manual_time                  -0.4689         -0.4047             0             0             0             0
Contains/SearchScalar_AllValid/16777216/manual_time                 -0.7573         -0.7391             1             0             1             0
Contains/SearchScalar_AllValid/134217728/manual_time                -0.8053         -0.8005             5             1             5             1
Contains/SearchScalar_AllValid/268435456/manual_time                -0.8057         -0.8054            10             2            10             2
Contains/SearchScalar_Nulls/32768/manual_time                       -0.3815         -0.2611             0             0             0             0
Contains/SearchScalar_Nulls/262144/manual_time                      -0.3703         -0.2673             0             0             0             0
Contains/SearchScalar_Nulls/2097152/manual_time                     -0.4685         -0.4092             0             0             0             0
Contains/SearchScalar_Nulls/16777216/manual_time                    -0.7554         -0.7358             1             0             1             0
Contains/SearchScalar_Nulls/134217728/manual_time                   -0.8056         -0.8008             5             1             5             1
Contains/SearchScalar_Nulls/268435456/manual_time                   -0.8099         -0.8044            10             2            10             2

ttnghia commented Jul 6, 2022

Here are the raw benchmark results:

Before:

-----------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------------------
Contains/SearchScalar_AllValid/32768/manual_time          0.049 ms        0.068 ms        11579
Contains/SearchScalar_AllValid/262144/manual_time         0.049 ms        0.067 ms        12872
Contains/SearchScalar_AllValid/2097152/manual_time        0.100 ms        0.113 ms         6656
Contains/SearchScalar_AllValid/16777216/manual_time       0.631 ms        0.644 ms         1042
Contains/SearchScalar_AllValid/134217728/manual_time       4.90 ms         4.91 ms          144
Contains/SearchScalar_AllValid/268435456/manual_time       9.80 ms         9.81 ms           71
Contains/SearchScalar_Nulls/32768/manual_time             0.044 ms        0.062 ms        15785
Contains/SearchScalar_Nulls/262144/manual_time            0.049 ms        0.066 ms        13046
Contains/SearchScalar_Nulls/2097152/manual_time           0.101 ms        0.114 ms         6572
Contains/SearchScalar_Nulls/16777216/manual_time          0.637 ms        0.650 ms         1007
Contains/SearchScalar_Nulls/134217728/manual_time          4.91 ms         4.92 ms          134
Contains/SearchScalar_Nulls/268435456/manual_time          9.82 ms         9.83 ms           71

After:

-----------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------------------
Contains/SearchScalar_AllValid/32768/manual_time          0.028 ms        0.046 ms        25542
Contains/SearchScalar_AllValid/262144/manual_time         0.031 ms        0.049 ms        22731
Contains/SearchScalar_AllValid/2097152/manual_time        0.053 ms        0.067 ms        12746
Contains/SearchScalar_AllValid/16777216/manual_time       0.153 ms        0.168 ms         4571
Contains/SearchScalar_AllValid/134217728/manual_time      0.954 ms        0.980 ms          732
Contains/SearchScalar_AllValid/268435456/manual_time       1.90 ms         1.91 ms          375
Contains/SearchScalar_Nulls/32768/manual_time             0.027 ms        0.046 ms        25510
Contains/SearchScalar_Nulls/262144/manual_time            0.031 ms        0.049 ms        22713
Contains/SearchScalar_Nulls/2097152/manual_time           0.054 ms        0.067 ms        12385
Contains/SearchScalar_Nulls/16777216/manual_time          0.156 ms        0.172 ms         4496
Contains/SearchScalar_Nulls/134217728/manual_time         0.955 ms        0.981 ms          733
Contains/SearchScalar_Nulls/268435456/manual_time          1.87 ms         1.92 ms          375
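
For reference, the headline figures can be reproduced from the rounded timings in the two tables above. Taking the largest AllValid case (9.80 ms before, 1.90 ms after), the speedup is about 5.2x and the run-time reduction is about 80.6%. A minimal sketch of that arithmetic, with hypothetical helper names:

```cpp
#include <cassert>

// Ratio of old to new run time; a value above 1 means the new code is faster.
double speedup(double before_ms, double after_ms)
{
  assert(after_ms > 0.0);
  return before_ms / after_ms;
}

// Fraction of the old run time that was eliminated (0.8 means 80% less time).
double reduction(double before_ms, double after_ms)
{
  assert(before_ms > 0.0);
  return (before_ms - after_ms) / before_ms;
}
```

Calling `speedup(9.80, 1.90)` and `reduction(9.80, 1.90)` gives the two figures quoted here. The comparison table reports slightly different values because it works from unrounded timings.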

@ttnghia ttnghia marked this pull request as ready for review July 6, 2022 04:45
@ttnghia ttnghia requested a review from a team as a code owner July 6, 2022 04:45
@vuule vuule left a comment

Looks good, just a few suggestions.

Comment on lines 91 to 92
BINARY_SEARCH_BENCHMARK_DEFINE(Column_AllValid, false)
BINARY_SEARCH_BENCHMARK_DEFINE(Column_HasNulls, true)
Contributor:

Not sure if this is intentional, but AFAICT we don't need separate definitions here. Validity can be another parameter, something like the following (if CreateRange is available in the version that we use):

    ->ArgsProduct({
      benchmark::CreateRange(100000, 100000000, 10),
      {0,1}
    })
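
To illustrate what a product of argument lists like the one above would enumerate, here is a standalone sketch that does not use Google Benchmark at all. Both helpers are hypothetical, and `geometric_range` is a simplification that assumes the range is just the multiples of 10 between the bounds, inclusive, which may not match `benchmark::CreateRange` exactly in every case:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: lo, lo*mult, lo*mult^2, ... up to and including hi.
std::vector<std::int64_t> geometric_range(std::int64_t lo, std::int64_t hi, std::int64_t mult)
{
  assert(lo > 0 && lo <= hi && mult > 1);
  std::vector<std::int64_t> out;
  for (std::int64_t v = lo; v <= hi; v *= mult) {
    out.push_back(v);
  }
  return out;
}

// Hypothetical helper: every (size, has_nulls) combination, i.e. the
// Cartesian product of the two argument lists.
std::vector<std::pair<std::int64_t, bool>> arg_product(std::vector<std::int64_t> const& sizes,
                                                       std::vector<bool> const& validity)
{
  std::vector<std::pair<std::int64_t, bool>> out;
  for (auto const size : sizes) {
    for (bool const has_nulls : validity) {
      out.emplace_back(size, has_nulls);
    }
  }
  return out;
}
```

With sizes from 100000 to 100000000 and validity `{false, true}`, this yields 4 sizes and 8 combinations, so a single benchmark definition would cover both the all-valid and has-nulls cases.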

Contributor Author:

Oh, I didn't know there was such a way. I'm trying that now...

Contributor Author:

The current Google benchmark version used in cudf doesn't support CreateRange. I've created another PR to update it: #11209

Contributor Author:

@vuule It seems that we can't upgrade Google benchmark, so unfortunately your suggestion here can't be implemented.

Contributor:

I think we should look at this as an opportunity to convert this benchmark to nvbench. The reason not to allow something like #11209 is to encourage us to switch (which is also important for adjacent initiatives like building dashboards around better benchmark tracking).

@ttnghia ttnghia requested a review from bdice July 6, 2022 18:45
@github-actions github-actions bot added the CMake label Jul 6, 2022
@bdice bdice left a comment

One suggestion, otherwise LGTM!

ttnghia commented Jul 8, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 4d4632a into rapidsai:branch-22.08 Jul 8, 2022
@ttnghia ttnghia deleted the fix_contains branch July 8, 2022 15:19
Linked issue: [FEA] Performance - search contains using slow approach