Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

thrust::all_of is slower than a naive reduction #720

Open
1 task
jrhemstad opened this issue Sep 17, 2019 · 8 comments
Open
1 task

thrust::all_of is slower than a naive reduction #720

jrhemstad opened this issue Sep 17, 2019 · 8 comments
Assignees
Labels
thrust For all items related to Thrust.

Comments

@jrhemstad
Copy link
Collaborator

jrhemstad commented Sep 17, 2019

In a thrust::all_of, when the first element that violates the predicate is discovered, the computation can be aborted, i.e., an "early exit".

For example, imagine you are given a thrust::device_vector<int64_t> and want to check if any of the values are negative. You could do this with a thrust::all_of or with a thrust::count_if:

thrust::device_vector<int64_t> values(...);
bool all_positive = thrust::all_of(values.begin(), values.end(), [](auto v){return v > 0;});
bool all_positive = values.size() == thrust::count_if(values.begin(), values.end(), [](auto v){return v > 0;});

count_if must read everything in values, whereas all_of can shortcut if an early exit exists. Therefore, I would expect all_of to out perform count_if when one or more negative values exist. If no negative values are present, then both all_of and count_if must read everything in values and I would expect their performance to be roughly equivalent.

However, this is not the case. I have found that the performance of thrust::all_of is extremely erratic with a 10x difference between the best and worst performance. Furthermore, an all_of is always slower than a naive reduction as in count_if.

Here are the results of performing 100 trials of the example I described above on an input size of 100,000,000 million int64_t elements on a GV100.

No Early Exit

mean (us) min (us) max (us)
all_of 75269 56403 104922
count_if 3124 1686 4413

Single Early Exit

mean (us) min (us) max (us)
all_of 51620 9346 370845
count_if 3100 1703 5158

As you can see, whether or not an early exit exists, all_of is always significantly slower than a count_if.

Looking at the profile of all_of (attached), it appears that the reason it is so slow is because a single invocation of all_of results in ~50 invocations of DeviceReduceKernel. I suspect this is because the implementation of all_of does a set of batched reductions in attempt to avoid reading the entire input when an early exit exists. However, launching all of these small kernels (each with their own allocation/free) results in a significant amount of overhead. This overhead is exacerbated by the fact that each batch is executed on the same stream, meaning there is no overlap or concurrency between batches.

I suspect a better implementation would launch a single kernel, where threads occasionally poll an atomic flag to check if an early exit exists, at which point they exit the computation. Or, forgo an attempt at an early exit and just do the naive reduction like in count_if.

profile
nsys_profile.zip

Reproducer code:

// compile with `nvcc --std=c++14 -O3 --expt-extended-lambda thrust_logical.cu -o thrust_logical -lnvToolsExt`
#include <cxxabi.h>
#include <nvToolsExt.h>
#include <thrust/device_vector.h>
#include <thrust/logical.h>
#include <thrust/random.h>
#include <chrono>
#include <limits>

template <typename T>
struct time_result {
  T min{std::numeric_limits<T>::max()};
  T max{std::numeric_limits<T>::lowest()};
  T mean{0};
  T sum{0};
  std::size_t count{0};

  void add_measurement(T new_duration) {
    ++count;
    sum += new_duration;
    mean = sum / count;
    min = std::min(min, new_duration);
    max = std::max(max, new_duration);
  }

  std::string to_string() {
    return std::string{
        "count: " + std::to_string(count) + " mean: " + std::to_string(mean) +
        " min: " + std::to_string(min) + " max: " + std::to_string(max)};
  }
};

template <typename Duration = std::chrono::microseconds, typename F,
          typename... Args>
typename Duration::rep time_it(std::string const& name, F&& fun,
                               Args&&... args) {
  const auto begin = std::chrono::high_resolution_clock::now();
  nvtxRangePushA(name.c_str());
  std::forward<F>(fun)(std::forward<Args>(args)...);
  nvtxRangePop();
  const auto end = std::chrono::high_resolution_clock::now();
  return std::chrono::duration_cast<Duration>(end - begin).count();
}

template <typename Duration = std::chrono::microseconds,
          typename InputGenerator, typename F, typename... Args>
auto time_trial(std::string const& name, std::size_t num_trials, InputGenerator&& generator, F&& f,
                Args&&... args) {
  time_result<typename Duration::rep> result{};
  for (auto i = 0; i < num_trials; ++i) {
    auto input = generator();
    result.add_measurement(time_it<Duration>(name, std::forward<F>(f), input,
                                             std::forward<Args>(args)...));
  }
  return result;
}

struct is_positive {
  template <typename T>
  bool __device__ operator()(T v) {
    return v > 0;
  }
};

int main(void) {
  constexpr std::size_t input_size{100'000'000};
  constexpr std::size_t num_trials{100};

  auto all_of = [](auto const& values) {
    return thrust::all_of(thrust::device, values.begin(), values.end(),
                          is_positive{});
  };

  auto count_if = [](auto const& values) {
    return thrust::count_if(thrust::device, values.begin(), values.end(),
                            is_positive{});
  };

  auto no_early_out = []() {
    nvtxRangePushA("no early out input");
    thrust::device_vector<int64_t> values(input_size, 1);
    cudaDeviceSynchronize();
    nvtxRangePop();
    return values;
  };

  auto early_out = []() {
    nvtxRangePushA("early out input");
    thrust::device_vector<int64_t> values(input_size, 1);
    thrust::default_random_engine engine(
        std::chrono::high_resolution_clock::now().time_since_epoch().count());
    thrust::uniform_int_distribution<std::size_t> distribution{0, input_size};
    auto random_location = distribution(engine);
    values[random_location] = -1;
    cudaDeviceSynchronize();
    nvtxRangePop();
    return values;
  };

  std::cout << "No early out(us):\n";
  std::cout << "all of: "
            << time_trial("all_of", num_trials, no_early_out, all_of).to_string()
            << std::endl;
  std::cout << "count if: "
            << time_trial("count_if", num_trials, no_early_out, count_if).to_string()
            << std::endl;


  std::cout << std::endl << "With early out(us):\n";
  std::cout << "all of: "
            << time_trial("all_of", num_trials, early_out, all_of).to_string()
            << std::endl;
  std::cout << "count if: "
            << time_trial("count_if", num_trials, early_out, count_if).to_string()
            << std::endl;


  return 0;
}

Tasks

@jrhemstad
Copy link
Collaborator Author

I updated the original issue to reflect a correction I made in my benchmark code where the location of the "early out" element changes for each trial. This exposed significantly more variance in the all_of results and shows that it is always slower than a count_if.

@jrhemstad jrhemstad changed the title thrust::all_of is very slow without an early exit thrust::all_of is slower than a naive reduction Sep 18, 2019
@karthikeyann
Copy link
Contributor

karthikeyann commented Oct 16, 2019

thrust::all_of and thrust::any_of are implemented using thrust::find_if,
/usr/local/cuda-10.0/targets/x86_64-linux/include/thrust/system/detail/generic/find.inl

 91   // this implementation breaks up the sequence into separate intervals
 92   // in an attempt to early-out as soon as a value is found
 93
 94   // TODO incorporate sizeof(InputType) into interval_threshold and round to multiple of 32
 95   const difference_type interval_threshold = 1 << 20;
 96   const difference_type interval_size = (thrust::min)(interval_threshold, n);

could implementing this // TODO solve the performance issue?
(if this issue does not arise or is less severe for smaller datatype, this might solve the issue)

@karthikeyann
Copy link
Contributor

@jrhemstad
Additional details:
thrust::count_if uses transform_reduce, which uses thrust::plus
thrust::find_if uses reduce on a tuple(pred, index) with thrust::min operator on index.

index is not necessary for all_of or any_of. This tuple<bool, size_t> will consume more registers too.

Alternative implementation of thrust::any_of could be using transform_reduce with thrust::maximum or thrust::logical_or operator on pred result.
Similarly for thrust::all_of using transform_reduce with thrust::minimum or thrust::logical_and.
Benchmarked using following extra code.

  auto reduce_and = [](auto const& values) {
    return thrust::transform_reduce(thrust::device, values.begin(), values.end(),
                            is_positive{}, true, thrust::logical_and<bool>{} );
  };

  auto reduce_min = [](auto const& values) {
    return thrust::transform_reduce(thrust::device, values.begin(), values.end(),
                            is_positive{}, false, thrust::minimum<bool>{} );
  };

Runtime are similar for count, logical_and, minimum.

I created new thrust::any_of using transform_reduce with logical_or, and used it for thrust::all_of (along with early exit). This is faster.

No early out(us):
all of: count: 100 mean: 62023 min: 57710 max: 65030
count if: count: 100 mean: 2478 min: 1577 max: 3471
reduce and: count: 100 mean: 2469 min: 1588 max: 3476
reduce min: count: 100 mean: 2494 min: 1566 max: 3477
new all_of: count: 100 mean: 933 min: 352 max: 1942

With early out(us):
all of: count: 100 mean: 33574 min: 4148 max: 114289
count if: count: 100 mean: 2466 min: 1615 max: 2926
reduce and: count: 100 mean: 2536 min: 1577 max: 3792
reduce min: count: 100 mean: 2518 min: 1585 max: 3528
new all_of: count: 100 mean: 918 min: 381 max: 1949

@karthikeyann
Copy link
Contributor

new_any_of and new_all_of implementation attached with

Reproducer code:

thrust_logical.cu.zip

@jrhemstad
Copy link
Collaborator Author

I don't know what new all_of is, but there must be something wrong with the implementation because these numbers are impossible:

No early out(us):
new all_of: count: 100 mean: 933 min: 352 max: 1942

With early out(us):
new all_of: count: 100 mean: 918 min: 381 max: 1949

If no early out exists, then you need to read all 100,000,000 int64_t elements in the input.

(100,000,000 * 8B) / 352us -> 2.2 TB/s

That's well over the 900GB/s theoretical peak of a V100 GPU.

@jrhemstad
Copy link
Collaborator Author

I would expect any reduction based implementation to perform the same (as your results show). Since reduction is bandwidth bound, it doesn't really matter what your binary operator is (sum, or, and, etc.) in the reduction.

Furthermore, your results are fishy because if an early out does not exist, then the new all_of implementation should not be any faster than any of the other reduction based implementations. Since your new all_of is just doing a batched transform_reduce, how could it be faster than just doing a single transform_reduce?

@karthikeyann
Copy link
Contributor

You are right. My implementation has a bug.
I fixed it and have the updated benchmarks.

No early out(us):
all of: count: 100 mean: 63981 min: 59478 max: 67889
count if: count: 100 mean: 2515 min: 1588 max: 3482
reduce and: count: 100 mean: 2454 min: 1582 max: 2946
reduce min: count: 100 mean: 2462 min: 1573 max: 3475
new all_of: count: 100 mean: 38071 min: 34610 max: 102533

With early out(us):
all of: count: 100 mean: 36221 min: 6262 max: 64545
count if: count: 100 mean: 2434 min: 1565 max: 2843
reduce and: count: 100 mean: 2448 min: 1624 max: 3183
reduce min: count: 100 mean: 2462 min: 1562 max: 3470
new all_of: count: 100 mean: 24840 min: 934 max: 94396

new_all_of is slower. In fact, max time is worst among all. (early out min is only faster!)

@jrhemstad
Copy link
Collaborator Author

@karthikeyann Those results look much more like what I would expect.

While the new all_of can be faster, the fact that on average it is 10x slower confirms in my mind that the extra complexity of trying to take advantage of an early out actually harms performance in the general case.

rapids-bot bot referenced this issue in rapidsai/cudf Jul 8, 2022
…11202)

The current implementation of `cudf::contains(column_view, scalar)` uses `thrust::find` and `thrust::any_of` (which also calls `thrust::find_if` under the hood). These thrust APIs were known to have performance regression (https://github.com/NVIDIA/thrust/issues/1016).

This PR replaces `thrust::find` and `thrust::any_of` in `cudf::contains` by `thrust::count_if`, which improves performance significantly.
Benchmarks show that the run time can be reduced as much as 80% after modification, or up to 5X speedup.

Closes #3806.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Bradley Dice (https://github.com/bdice)

URL: #11202
@jrhemstad jrhemstad added the thrust For all items related to Thrust. label Feb 22, 2023
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 8, 2023
@jarmak-nv jarmak-nv transferred this issue from NVIDIA/thrust Nov 8, 2023
bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Jun 14, 2024
bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Jun 24, 2024
bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Jun 24, 2024
bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Jul 5, 2024
bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Jul 5, 2024
bernhardmgruber added a commit to bernhardmgruber/cccl that referenced this issue Jul 8, 2024
@wmaxey wmaxey moved this from Todo to In Progress in CCCL Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
thrust For all items related to Thrust.
Projects
Status: In Progress
Development

No branches or pull requests

3 participants