`thrust::all_of` is slower than a naive reduction #720
Comments
I updated the original issue to reflect a correction I made in my benchmark code, where the location of the "early out" element now changes for each trial. This exposed significantly more variance in the `all_of` timings.

Title changed from "`thrust::all_of` is very slow without an early exit" to "`thrust::all_of` is slower than a naive reduction".
could implementing this
@jrhemstad index is not necessary for `all_of` or `any_of`. This `tuple<bool, size_t>` will consume more registers too. Alternative implementations:

```cpp
auto reduce_and = [](auto const& values) {
  return thrust::transform_reduce(thrust::device, values.begin(), values.end(),
                                  is_positive{}, true, thrust::logical_and<bool>{});
};
auto reduce_min = [](auto const& values) {
  return thrust::transform_reduce(thrust::device, values.begin(), values.end(),
                                  is_positive{}, false, thrust::minimum<bool>{});
};
```

Runtimes are similar for `count`, `logical_and`, and `minimum`. I created new benchmarks.

No early out (µs):

With early out (µs):
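The `is_positive` functor these lambdas rely on is not shown in the excerpt above; a minimal definition (an assumption, not quoted from the thread) would be:

```cpp
// Assumed transform functor for the snippet above: maps each element to
// whether it is positive, so the reduction can AND/min the results together.
struct is_positive {
  __host__ __device__ bool operator()(int64_t x) const { return x > 0; }
};
```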
I don't know what
If no early out exists, then you need to read all 100,000,000 elements: (100,000,000 × 8 B) / 352 µs → 2.2 TB/s. That's well over the ~900 GB/s theoretical peak memory bandwidth of a V100 GPU.
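A back-of-the-envelope check of that number (a sketch, assuming 352 µs is the measured time for one full pass over 100,000,000 `int64_t` values):

```cpp
#include <cstdio>

int main()
{
  // Bytes that must be read if there is no early out: 100,000,000 x 8 B.
  double const bytes   = 100'000'000.0 * 8.0;
  double const seconds = 352e-6;  // reported runtime of ~352 us
  // Implied effective bandwidth, ~2.27 TB/s, well above a V100's ~900 GB/s peak.
  std::printf("implied bandwidth: %.2f TB/s\n", bytes / seconds / 1e12);
  return 0;
}
```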
I would expect any reduction-based implementation to perform the same (as your results show). Since reduction is bandwidth-bound, it doesn't really matter what your binary operator is (sum, or, and, etc.) in the reduction. Furthermore, your results are fishy, because if an early out does not exist, then the
You are right. My implementation has a bug.
@karthikeyann Those results look much more like what I would expect. While the
…11202) The current implementation of `cudf::contains(column_view, scalar)` uses `thrust::find` and `thrust::any_of` (which also calls `thrust::find_if` under the hood). These thrust APIs were known to have a performance regression (https://github.com/NVIDIA/thrust/issues/1016). This PR replaces `thrust::find` and `thrust::any_of` in `cudf::contains` with `thrust::count_if`, which improves performance significantly. Benchmarks show that the run time can be reduced by as much as 80% after the modification, i.e., up to a 5X speedup. Closes #3806.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Karthikeyan (https://github.com/karthikeyann)
- Bradley Dice (https://github.com/bdice)

URL: #11202
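A minimal sketch of the `count_if`-style approach that PR describes (illustrative only, not the actual cudf implementation; the function name `contains` here is just an example):

```cpp
#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>

#include <cstdint>

// Instead of thrust::find / thrust::any_of, express "does values contain key?"
// as a single full-pass count, which maps to one well-tuned reduction kernel.
bool contains(thrust::device_vector<std::int64_t> const& values, std::int64_t key)
{
  return thrust::count(thrust::device, values.begin(), values.end(), key) > 0;
}
```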
In a `thrust::all_of`, when the first element that violates the predicate is discovered, the computation can be aborted, i.e., an "early exit".

For example, imagine you are given a `thrust::device_vector<int64_t>` and want to check if any of the values are negative. You could do this with a `thrust::all_of` or with a `thrust::count_if`:
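A minimal sketch of the two approaches (the original snippet is not reproduced here; the functor names below are illustrative):

```cpp
#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/logical.h>

#include <cstdint>

struct is_non_negative {
  __host__ __device__ bool operator()(std::int64_t x) const { return x >= 0; }
};

struct is_negative {
  __host__ __device__ bool operator()(std::int64_t x) const { return x < 0; }
};

// all_of may, in principle, stop as soon as a negative value is found.
bool none_negative_all_of(thrust::device_vector<std::int64_t> const& values)
{
  return thrust::all_of(thrust::device, values.begin(), values.end(), is_non_negative{});
}

// count_if always reads every element, then compares the count against zero.
bool none_negative_count_if(thrust::device_vector<std::int64_t> const& values)
{
  return thrust::count_if(thrust::device, values.begin(), values.end(), is_negative{}) == 0;
}
```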
`count_if` must read everything in `values`, whereas `all_of` can shortcut if an early exit exists. Therefore, I would expect `all_of` to outperform `count_if` when one or more negative values exist. If no negative values are present, then both `all_of` and `count_if` must read everything in `values`, and I would expect their performance to be roughly equivalent.

However, this is not the case. I have found that the performance of `thrust::all_of` is extremely erratic, with a 10x difference between the best and worst performance. Furthermore, an `all_of` is always slower than a naive reduction as in `count_if`.

Here are the results of performing 100 trials of the example I described above on an input of 100,000,000 `int64_t` elements on a GV100.

No Early Exit: `all_of` vs. `count_if`

Single Early Exit: `all_of` vs. `count_if`
As you can see, whether or not an early exit exists, `all_of` is always significantly slower than a `count_if`.

Looking at the profile of `all_of` (attached), it appears that the reason it is so slow is that a single invocation of `all_of` results in ~50 invocations of `DeviceReduceKernel`. I suspect this is because the implementation of `all_of` does a series of batched reductions in an attempt to avoid reading the entire input when an early exit exists. However, launching all of these small kernels (each with its own allocation/free) results in a significant amount of overhead. This overhead is exacerbated by the fact that each batch is executed on the same stream, meaning there is no overlap or concurrency between batches.

I suspect a better implementation would launch a single kernel, where threads occasionally poll an atomic flag to check if an early exit exists, at which point they exit the computation. Or, forgo an attempt at an early exit and just do the naive reduction like in `count_if`.
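A rough sketch of that single-kernel idea (an illustration of the suggestion, not Thrust's actual implementation; the launch configuration and polling interval are arbitrary):

```cpp
#include <cstdint>

// Grid-stride loop over the input; every thread periodically polls a global
// flag and stops scanning once any thread has reported a violating element.
// On the host, all_of == (*violation_found == 0) after the kernel completes.
__global__ void all_of_early_exit_kernel(std::int64_t const* data,
                                         std::size_t n,
                                         int* violation_found)
{
  std::size_t const stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
  std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  for (std::size_t iter = 0; i < n; i += stride, ++iter) {
    // Poll the shared flag occasionally (atomicAdd with 0 acts as an atomic read).
    if (iter % 64 == 0 && atomicAdd(violation_found, 0) != 0) { return; }
    if (data[i] < 0) {  // predicate violated: a negative value was found
      atomicExch(violation_found, 1);
      return;
    }
  }
}
```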
nsys_profile.zip
Reproducer code:
Tasks
- `thrust::all_of` #2113