This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

thrust::sort fails for > 2.1B keys #1453

Closed
maltenbergert opened this issue Jun 6, 2021 · 5 comments
@maltenbergert

maltenbergert commented Jun 6, 2021

Context

We are benchmarking the performance of thrust::sort with a pre-allocated temporary buffer. In a nutshell, we generate the data on the host, copy it onto the device, initialize a stream, pre-allocate a temporary buffer for thrust::sort, and measure the sort duration.

Example

#include <thrust/sort.h>
#include <thrust/generate.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <stdlib.h>
#include <string>
#include <iomanip>
#include <iostream>
#include <algorithm>
#include <chrono>

struct DeviceAllocator {
  using value_type = uint8_t;

  void Malloc(size_t num_bytes) {
    cudaMalloc(reinterpret_cast<void**>(&begin_pointer), num_bytes);
  }
  
  uint8_t* allocate(size_t num_bytes) {
    return begin_pointer;
  }
  void deallocate(uint8_t* current, size_t num_bytes) {}

  uint8_t* begin_pointer = nullptr;
};

int main(int argc, char* argv[]) {
  const size_t num_elements = std::stoull(argv[1]);

  thrust::host_vector<int> host_elements(num_elements);
  std::generate(host_elements.begin(), host_elements.end(), rand);
  thrust::device_vector<int> elements = host_elements;

  cudaSetDevice(0);
  cudaDeviceSynchronize();
  cudaStream_t stream;
  cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

  DeviceAllocator device_allocator;
  device_allocator.Malloc(sizeof(int) * num_elements + 128000000);

  auto t1 = std::chrono::high_resolution_clock::now();
  thrust::sort(thrust::cuda::par(device_allocator).on(stream), elements.begin(), elements.end());
  cudaStreamSynchronize(stream);
  std::chrono::duration<double> t2 = std::chrono::high_resolution_clock::now() - t1;

  std::cout << num_elements << "," << std::fixed << std::setprecision(9) << t2.count() << "\n";

  if (!thrust::is_sorted(elements.begin(), elements.end())) {
    std::cout << "Error: Invalid sort order.\n";
  }

  return 0;
}

We compile the example with nvcc -O3 -std=c++17 -o thrust_sort thrust_sort.cu and run it with ./thrust_sort <num_elements> on two different platforms.

  • IBM AC922: 4x NVIDIA Tesla V100 SXM2 32 GB, CUDA 11.2, Thrust 1.11/1.12
  • NVIDIA DGX A100: 8x NVIDIA A100 SXM4 40 GB, CUDA 11.0, Thrust 1.11/1.12

Observation

When varying the number of elements (through num_elements), the sort duration grows (almost) linearly with the number of elements up to ~2.1B elements, and all output elements are valid and in sorted order. At ~2.1B elements (i.e., around 2^31), however, the sort duration drops sharply, and from there on the output buffer contains only 0s, even though the sort duration resumes growing linearly.

We observe this behavior consistently on both systems, regardless of the thrust version (i.e., 1.11 or 1.12).

Moreover, cub::DeviceRadixSort::SortKeys fails at precisely the same point. Unlike thrust::sort, however, it fails instantly (i.e., in less than 0.0001s) and does not touch the input elements at all, making it immediately apparent that something went wrong.

[Figure: measured sort duration over the number of elements]

Conclusion

thrust::sort and cub::DeviceRadixSort::SortKeys fail for > 2.1B elements.

@griwes
Collaborator

griwes commented Jun 6, 2021

That is suspicious indeed, especially since that is right around the point where we know of indexing bugs caused by the code truncating sizes to 32 bits.

My suspicion is that thrust::is_sorted has the same bug as thrust::sort. If you copy the data back to the host and check with std::is_sorted instead, does it still say the range is fully sorted?

@maltenbergert
Author

maltenbergert commented Jun 6, 2021

@griwes A sanity check, as well as a run with cub::DeviceRadixSort::SortKeys, reveals the underlying issue. For > 2.1B keys, thrust::sort writes only 0s into the result buffer, whereas cub::DeviceRadixSort::SortKeys returns instantly, leaving the input data untouched. I updated the issue description accordingly.

@maltenbergert maltenbergert changed the title Inexplicable performance of thrust::sort for > 2.1B keys thrust::sort fails for > 2.1B keys Jun 6, 2021
@alliepiper
Collaborator

This is related to NVIDIA/cub#212. I'm hoping to take a look at this in the next release or two.

@maltenbergert
Author

Thanks for your replies!

We could apply a local hotfix along the lines of NVIDIA/cub#129 by changing the following files:

agent_radix_sort_downsweep.cuh:253
for (OffsetT block_jdx = block_idx - 1; /*block_jdx >= 0*/ block_jdx != UINT32_MAX; --block_jdx)

device_radix_sort.cuh:601
uint32_t            num_items,                              ///< [in] Number of items to sort

device_radix_sort.cuh:608
typedef uint32_t OffsetT;

However, since our multi-GPU sorting approach also utilizes thrust::merge, we now encounter a similar issue there: #1454.

@jrhemstad
Collaborator

Closing in favor of NVIDIA/cccl#744

@jrhemstad closed this as not planned on Mar 7, 2023
@github-project-automation github-project-automation bot moved this to Done in CCCL Mar 7, 2023
5 participants