This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

thrust::sort fails for > 2.1B keys #1453

Closed
maltenbergert opened this issue Jun 6, 2021 · 5 comments
@maltenbergert

maltenbergert commented Jun 6, 2021

Context

We are benchmarking the performance of thrust::sort with a pre-allocated temporary buffer. In a nutshell, we generate the data on the host, copy it onto the device, initialize a stream, pre-allocate a temporary buffer for thrust::sort, and measure the sort duration.

Example

#include <thrust/sort.h>
#include <thrust/generate.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <stdlib.h>
#include <string>
#include <iomanip>
#include <iostream>
#include <algorithm>
#include <chrono>

struct DeviceAllocator {
  using value_type = uint8_t;

  void Malloc(size_t num_bytes) {
    cudaMalloc(reinterpret_cast<void**>(&begin_pointer), num_bytes);
  }
  
  uint8_t* allocate(size_t num_bytes) {
    return begin_pointer;
  }
  void deallocate(uint8_t* current, size_t num_bytes) {}

  uint8_t* begin_pointer = nullptr;
};

int main(int argc, char* argv[]) {
  const size_t num_elements = std::stoull(argv[1]);

  thrust::host_vector<int> host_elements(num_elements);
  std::generate(host_elements.begin(), host_elements.end(), rand);
  thrust::device_vector<int> elements = host_elements;

  cudaSetDevice(0);
  cudaDeviceSynchronize();
  cudaStream_t stream;
  cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

  DeviceAllocator device_allocator;
  device_allocator.Malloc(sizeof(int) * num_elements + 128000000);

  auto t1 = std::chrono::high_resolution_clock::now();
  thrust::sort(thrust::cuda::par(device_allocator).on(stream), elements.begin(), elements.end());
  cudaStreamSynchronize(stream);
  std::chrono::duration<double> t2 = std::chrono::high_resolution_clock::now() - t1;

  std::cout << num_elements << "," << std::fixed << std::setprecision(9) << t2.count() << "\n";

  if (!thrust::is_sorted(elements.begin(), elements.end())) {
    std::cout << "Error: Invalid sort order.\n";
  }

  return 0;
}

We compile the example with nvcc -O3 -std=c++17 -o thrust_sort thrust_sort.cu and run it with ./thrust_sort <num_elements> on two different platforms.

  • IBM AC922: 4x NVIDIA Tesla V100 SXM2 32 GB, CUDA 11.2, Thrust 1.11/1.12
  • NVIDIA DGX A100: 8x NVIDIA A100 SXM4 40 GB, CUDA 11.0, Thrust 1.11/1.12

Observation

When varying the number of elements (through num_elements), the sort duration grows (almost) linearly with the number of elements up to ~2.1B elements, and all output elements are valid and in sorted order. At ~2.1B elements (i.e., around 2^31), however, the sort duration drops sharply, and from there on the output buffer contains only 0s, even though the sort duration resumes growing linearly.

We observe this behavior consistently on both systems, regardless of the thrust version (i.e., 1.11 or 1.12).

Moreover, cub::DeviceRadixSort::SortKeys fails at precisely the same point. Unlike thrust::sort, however, it fails instantly (i.e., in less than 0.0001s) and does not touch the input elements at all, making it immediately apparent that something went wrong.

[Figure: measured sort duration over the number of elements]

Conclusion

thrust::sort and cub::DeviceRadixSort::SortKeys fail for > 2.1B elements.

@griwes
Collaborator

griwes commented Jun 6, 2021

That is suspicious indeed, especially since that is right around the point where we know of indexing bugs caused by the code truncating sizes to 32 bits.

My suspicion is that thrust::is_sorted has the same bug as thrust::sort. If you copy the data back to the host and check with std::is_sorted instead, does it still say the range is fully sorted?

@maltenbergert
Author

maltenbergert commented Jun 6, 2021

@griwes A sanity check, as well as a run with cub::DeviceRadixSort::SortKeys, reveals the underlying issue. For > 2.1B keys, thrust::sort writes only 0s into the result buffer, whereas cub::DeviceRadixSort::SortKeys returns instantly, leaving the input data untouched. I updated the issue description accordingly.

@maltenbergert maltenbergert changed the title Inexplicable performance of thrust::sort for > 2.1B keys thrust::sort fails for > 2.1B keys Jun 6, 2021
@alliepiper
Collaborator

This is related to NVIDIA/cub#212. I'm hoping to take a look at this in the next release or two.

@maltenbergert
Author

Thanks for your replies!

We could apply a local hotfix along the lines of NVIDIA/cub#129 by changing the following files:

agent_radix_sort_downsweep.cuh:253
for (OffsetT block_jdx = block_idx - 1; /*block_jdx >= 0*/ block_jdx != UINT32_MAX; --block_jdx)

device_radix_sort.cuh:601
uint32_t            num_items,                              ///< [in] Number of items to sort

device_radix_sort.cuh:608
typedef uint32_t OffsetT;

However, since our multi-GPU sorting approach also utilizes thrust::merge, we now encounter a similar issue there: #1454.

@jrhemstad
Collaborator

Closing in favor of NVIDIA/cccl#744

@jrhemstad closed this as not planned on Mar 7, 2023
@github-project-automation github-project-automation bot moved this to Done in CCCL Mar 7, 2023
5 participants