Improve performance for `cudf::contains` when searching for a scalar #11202

ttnghia · 2022-07-06T04:42:02Z

The current implementation of `cudf::contains(column_view, scalar)` uses `thrust::find` and `thrust::any_of` (which also calls `thrust::find_if` under the hood). These thrust APIs are known to have a performance regression (NVIDIA/cccl#720).

This PR replaces `thrust::find` and `thrust::any_of` in `cudf::contains` with `thrust::count_if`, which improves performance significantly.
Benchmarks show that the run time is reduced by as much as 80% after the change, i.e., up to a 5X speedup.

Closes #3806.
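
As a rough host-side illustration of the idea (not the actual libcudf code), the sketch below expresses a scalar `contains` as a count rather than a find. It uses `std::count_if` from the C++ standard library as a stand-in for `thrust::count_if`; the `contains_scalar` helper and the use of `std::optional` to model nulls are assumptions made for this example only.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical helper: returns true if any non-null element equals `needle`.
// std::optional models a nullable column element (assumption for this sketch).
template <typename T>
bool contains_scalar(std::vector<std::optional<T>> const& haystack, T const& needle)
{
  // Count the matches instead of searching for the first one.
  // std::count_if visits every element and has no early exit.
  auto const num_matches =
    std::count_if(haystack.begin(), haystack.end(), [&needle](std::optional<T> const& element) {
      return element.has_value() && *element == needle;
    });
  return num_matches > 0;
}

// Small usage example: nulls never match, repeated matches still mean "found".
void example_usage()
{
  std::vector<std::optional<int>> const column{1, std::nullopt, 3, 3};
  assert(contains_scalar(column, 3));
  assert(!contains_scalar(column, 2));
}
```

The result is the same as with a find-based search; only the way the answer is computed differs.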

ttnghia · 2022-07-06T04:43:00Z

Here are the benchmark results, comparing the performance after vs. before this PR:

Benchmark                                                              Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------
Contains/SearchScalar_AllValid/32768/manual_time                    -0.4414         -0.3216             0             0             0             0
Contains/SearchScalar_AllValid/262144/manual_time                   -0.3719         -0.2666             0             0             0             0
Contains/SearchScalar_AllValid/2097152/manual_time                  -0.4689         -0.4047             0             0             0             0
Contains/SearchScalar_AllValid/16777216/manual_time                 -0.7573         -0.7391             1             0             1             0
Contains/SearchScalar_AllValid/134217728/manual_time                -0.8053         -0.8005             5             1             5             1
Contains/SearchScalar_AllValid/268435456/manual_time                -0.8057         -0.8054            10             2            10             2
Contains/SearchScalar_Nulls/32768/manual_time                       -0.3815         -0.2611             0             0             0             0
Contains/SearchScalar_Nulls/262144/manual_time                      -0.3703         -0.2673             0             0             0             0
Contains/SearchScalar_Nulls/2097152/manual_time                     -0.4685         -0.4092             0             0             0             0
Contains/SearchScalar_Nulls/16777216/manual_time                    -0.7554         -0.7358             1             0             1             0
Contains/SearchScalar_Nulls/134217728/manual_time                   -0.8056         -0.8008             5             1             5             1
Contains/SearchScalar_Nulls/268435456/manual_time                   -0.8099         -0.8044            10             2            10             2

ttnghia · 2022-07-06T04:44:26Z

Here are the original benchmark results:

Before:

-----------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------------------
Contains/SearchScalar_AllValid/32768/manual_time          0.049 ms        0.068 ms        11579
Contains/SearchScalar_AllValid/262144/manual_time         0.049 ms        0.067 ms        12872
Contains/SearchScalar_AllValid/2097152/manual_time        0.100 ms        0.113 ms         6656
Contains/SearchScalar_AllValid/16777216/manual_time       0.631 ms        0.644 ms         1042
Contains/SearchScalar_AllValid/134217728/manual_time       4.90 ms         4.91 ms          144
Contains/SearchScalar_AllValid/268435456/manual_time       9.80 ms         9.81 ms           71
Contains/SearchScalar_Nulls/32768/manual_time             0.044 ms        0.062 ms        15785
Contains/SearchScalar_Nulls/262144/manual_time            0.049 ms        0.066 ms        13046
Contains/SearchScalar_Nulls/2097152/manual_time           0.101 ms        0.114 ms         6572
Contains/SearchScalar_Nulls/16777216/manual_time          0.637 ms        0.650 ms         1007
Contains/SearchScalar_Nulls/134217728/manual_time          4.91 ms         4.92 ms          134
Contains/SearchScalar_Nulls/268435456/manual_time          9.82 ms         9.83 ms           71

After:

-----------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------------------
Contains/SearchScalar_AllValid/32768/manual_time          0.028 ms        0.046 ms        25542
Contains/SearchScalar_AllValid/262144/manual_time         0.031 ms        0.049 ms        22731
Contains/SearchScalar_AllValid/2097152/manual_time        0.053 ms        0.067 ms        12746
Contains/SearchScalar_AllValid/16777216/manual_time       0.153 ms        0.168 ms         4571
Contains/SearchScalar_AllValid/134217728/manual_time      0.954 ms        0.980 ms          732
Contains/SearchScalar_AllValid/268435456/manual_time       1.90 ms         1.91 ms          375
Contains/SearchScalar_Nulls/32768/manual_time             0.027 ms        0.046 ms        25510
Contains/SearchScalar_Nulls/262144/manual_time            0.031 ms        0.049 ms        22713
Contains/SearchScalar_Nulls/2097152/manual_time           0.054 ms        0.067 ms        12385
Contains/SearchScalar_Nulls/16777216/manual_time          0.156 ms        0.172 ms         4496
Contains/SearchScalar_Nulls/134217728/manual_time         0.955 ms        0.981 ms          733
Contains/SearchScalar_Nulls/268435456/manual_time          1.87 ms         1.92 ms          375
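
For reference, the relative changes in the comparison table follow from these raw timings. Below is a minimal sketch of the arithmetic, assuming the comparison reports (new - old) / old; the helper names are made up for illustration, and the inputs are the rounded values shown above.

```cpp
#include <cassert>
#include <cmath>

// Relative change between two timings: negative means the new run is faster.
double relative_change(double old_time, double new_time)
{
  return (new_time - old_time) / old_time;
}

// Speedup factor: how many times faster the new run is.
double speedup(double old_time, double new_time)
{
  return old_time / new_time;
}

void check_largest_all_valid_case()
{
  // Contains/SearchScalar_AllValid/268435456: 9.80 ms before, 1.90 ms after.
  double const change = relative_change(9.80, 1.90);
  double const factor = speedup(9.80, 1.90);
  assert(std::abs(change - (-0.806)) < 0.001);  // about an 80% reduction
  assert(factor > 5.0 && factor < 5.2);         // roughly a 5x speedup
}
```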

vuule

Looks good, just a few suggestions.

vuule · 2022-07-06T14:39:10Z

cpp/benchmarks/search/search.cpp

+BINARY_SEARCH_BENCHMARK_DEFINE(Column_AllValid, false)
+BINARY_SEARCH_BENCHMARK_DEFINE(Column_HasNulls, true)


Not sure if intentional, but AFAICT we don't need separate definitions here. Validity can be another parameter, something like the following (if `CreateRange` is available in the version that we use):

->ArgsProduct({ benchmark::CreateRange(100000, 100000000, 10), {0,1} })
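
To make the suggestion concrete, here is a small standalone sketch of the argument combinations such a call would cover, assuming `ArgsProduct` enumerates the Cartesian product of its argument lists. It does not use Google Benchmark itself: `args_product` is a hypothetical helper, the sizes are written out by hand as the powers of 10 from 100000 to 100000000, and reading the second argument as a validity flag is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the argument expansion: every size is paired
// with every validity flag (assumed meaning: 0 = all valid, 1 = has nulls).
std::vector<std::vector<std::int64_t>> args_product(
  std::vector<std::int64_t> const& sizes, std::vector<std::int64_t> const& validity_flags)
{
  std::vector<std::vector<std::int64_t>> combinations;
  combinations.reserve(sizes.size() * validity_flags.size());
  for (auto const size : sizes) {
    for (auto const flag : validity_flags) {
      combinations.push_back({size, flag});
    }
  }
  return combinations;
}

void check_args_product()
{
  auto const combos = args_product({100000, 1000000, 10000000, 100000000}, {0, 1});
  assert(combos.size() == 8);  // 4 sizes x 2 validity settings
  assert((combos.front() == std::vector<std::int64_t>{100000, 0}));
  assert((combos.back() == std::vector<std::int64_t>{100000000, 1}));
}
```

With a single benchmark definition, the body would read the size and the validity flag from its arguments instead of relying on two separate macro instantiations.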

Oh, I didn't know there was such a way. I'm trying that...

The current Google benchmark version used in cudf doesn't support CreateRange. I've created another PR to update it: #11209

@vuule It seems that we can't upgrade Google benchmark, so unfortunately your suggestion here can't be applied.

I think we should look at this as an opportunity to convert this benchmark to nvbench. The reason not to allow something like #11209 is to encourage us to switch (which is also important for adjacent initiatives like building dashboards around better benchmark tracking).

bdice

One suggestion, otherwise LGTM!

ttnghia · 2022-07-08T13:44:55Z

@gpucibot merge

ttnghia added 7 commits July 5, 2022 19:43

Fix warnings

0d4747e

Extract create_table_data function

9c5709c

Run benchmark using random data

a09154a

Use macro in benchmark

fb32f2d

Add more benchmark

4b43d6d

Change benchmark range

1debe3b

Use count_if instead of find

b4b4847

ttnghia added the 3 - Ready for Review, libcudf, Performance, improvement, and non-breaking labels Jul 6, 2022

ttnghia self-assigned this Jul 6, 2022

ttnghia marked this pull request as ready for review July 6, 2022 04:45

ttnghia requested a review from a team as a code owner July 6, 2022 04:45

ttnghia requested review from karthikeyann, vuule and GregoryKimball July 6, 2022 04:45

vuule reviewed Jul 6, 2022

Change float to size_type for benchmark

7528ea0

ttnghia mentioned this pull request Jul 6, 2022

Upgrade Google Benchmark #11209

Closed

bdice requested changes Jul 6, 2022

ttnghia added 3 commits July 6, 2022 10:07

Remove [[maybe_unused]]

9da73de

Remove separator

b8dc84a

MISC

19c73d3

Change seed from static into dynamic, and add back [[maybe_unused]]

d4d063b

bdice reviewed Jul 6, 2022

cpp/benchmarks/search/search.cpp (outdated, resolved)

ttnghia added 4 commits July 6, 2022 10:53

Reverse benchmark

75d5584

Avoid memory round trip

61cd767

Fix typo

0ea3794

Add benchmark

fc9d7bb

ttnghia requested a review from bdice July 6, 2022 18:45

github-actions bot added the CMake label Jul 6, 2022

Add tparam

0223ec7

karthikeyann reviewed Jul 7, 2022

cpp/benchmarks/search/contains.cpp (outdated, resolved)

cpp/benchmarks/search/contains.cpp (outdated, resolved)

ttnghia added 2 commits July 7, 2022 06:35

Fix null frequency

3d5acaf

Remove random seed

3d82ba4

karthikeyann approved these changes Jul 7, 2022

bdice reviewed Jul 8, 2022

cpp/src/search/contains.cu (outdated, resolved)

bdice approved these changes Jul 8, 2022

Simplify code

fdfa9f7

rapids-bot bot merged commit 4d4632a into rapidsai:branch-22.08 Jul 8, 2022

ttnghia deleted the fix_contains branch July 8, 2022 15:19
