This repository has been archived by the owner on Mar 21, 2024. It is now read-only.
CUB 1.15.0 (NVIDIA HPC SDK 21.11) #391
alliepiper announced in Announcements
Summary
CUB 1.15.0 accompanies the NVIDIA HPC SDK 21.11 release. It includes a new `cub::DeviceSegmentedSort` algorithm, which demonstrates up to 5000x speedup compared to `cub::DeviceSegmentedRadixSort` when sorting a large number of small segments. A new `cub::FutureValue<T>` helper allows the `cub::DeviceScan` algorithms to lazily load the `initial_value` from a pointer. `cub::DeviceScan` also added `ScanByKey` functionality.

The new `DeviceSegmentedSort` algorithm partitions segments into size groups. Each group is processed with specialized kernels using a variety of sorting algorithms. This approach varies the number of threads allocated for sorting each segment and utilizes the GPU more efficiently.

`cub::FutureValue<T>` provides the ability to use the result of a previous kernel as a scalar input to a CUB device-scope algorithm without unnecessary synchronization. Previously, an explicit synchronization would have been necessary to obtain the intermediate result, which was passed by value into `ExclusiveScan`. This new feature enables better performance in workflows that use `cub::DeviceScan`.
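As a rough illustration of the deferred load that `cub::FutureValue<T>` enables, the host-only C++ sketch below wraps a pointer and reads it only when the scan starts. It is an analogy rather than CUB code: `FutureValueSketch`, `exclusive_sum_sketch`, and `future_value_demo` are hypothetical names, the scan runs sequentially on the CPU, and a plain assignment stands in for the earlier kernel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for cub::FutureValue<T> (not the CUB type):
// it stores only a pointer, so the value is read when get() is called,
// not when the wrapper is constructed.
template <typename T>
struct FutureValueSketch {
  const T* ptr;
  explicit FutureValueSketch(const T* p) : ptr(p) {}
  T get() const { return *ptr; }
};

// Sequential exclusive prefix sum. The initial value is loaded lazily,
// at the moment the scan begins.
template <typename T>
std::vector<T> exclusive_sum_sketch(const std::vector<T>& in,
                                    FutureValueSketch<T> init) {
  std::vector<T> out(in.size());
  T running = init.get();  // the deferred load happens here
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = running;
    running = running + in[i];
  }
  return out;
}

// The wrapper is created before the "earlier step" has written its result.
inline std::vector<int> future_value_demo() {
  int intermediate = 0;                        // result not yet computed
  FutureValueSketch<int> init(&intermediate);  // wrap the pointer only
  intermediate = 10;                           // stands in for the earlier kernel
  return exclusive_sum_sketch<int>({1, 2, 3, 4}, init);
}
```

Calling `future_value_demo()` yields `{10, 11, 13, 16}`: the scan sees the value written after the wrapper was created. Passing the value itself instead of the pointer would have captured `0`, which is why the by-value interface needed the intermediate result to be available on the host first.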
Deprecation Notices
A future version of CUB will change the `debug_synchronous` behavior of device-scope algorithms when invoked via CUDA Dynamic Parallelism (CDP).

This will only affect calls to CUB device-scope algorithms launched from device-side code with `debug_synchronous = true`. These algorithms will continue to print extra debugging information, but they will no longer synchronize after kernel launches.

Breaking Changes
- The template parameters of `cub::DispatchScan` have changed to support the new `cub::FutureValue` helper. More details under "New Features".
- #377: Remove broken `operator->()` from `cub::TransformInputIterator`, since this cannot be implemented without returning a temporary object's address. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.

New Features
- Add overloads to `cub::DeviceScan` algorithms that allow the output of a previous kernel to be used as `initial_value` without explicit synchronization. See the new `cub::FutureValue` helper for details. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.
- Add `cub::BlockRunLengthDecode` algorithm. Thanks to Elias Stehle (@elstehle) for this contribution.
- Add `cub::DeviceSegmentedSort`, an optimized version of `cub::DeviceSegmentedRadixSort` with improved load balancing and small array performance.
- Add `ScanByKey` functionality to `cub::DeviceScan`. Thanks to Xiang Gao (@zasdfgbnm) for this contribution.

Bug Fixes
- Fixes for the `cub::DeviceMergeSort` algorithms.
- Fix `-Wconversion` warnings. Thanks to Matt Stack (@matt-stack) for this contribution.
- Fix for `cub::CachingDeviceAllocator`.
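The size-group approach that the Summary describes for `cub::DeviceSegmentedSort` can be sketched on the host as follows. This is a CPU illustration under stated assumptions, not the library's implementation: the two groups, the `kSmallSegmentMax` cutoff, and the function names are hypothetical, and insertion sort and `std::sort` merely stand in for the specialized kernels.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical cutoff between "small" and "large" segments. The groups and
// thresholds used by the real algorithm are not given in these notes.
constexpr std::size_t kSmallSegmentMax = 16;

// Insertion sort: a cheap routine for very short ranges.
inline void insertion_sort(int* first, int* last) {
  for (int* i = first; i < last; ++i) {
    int key = *i;
    int* j = i;
    while (j > first && *(j - 1) > key) {
      *j = *(j - 1);
      --j;
    }
    *j = key;
  }
}

// offsets has num_segments + 1 entries; segment s covers the half-open
// range [offsets[s], offsets[s + 1]) of data.
inline void segmented_sort_sketch(std::vector<int>& data,
                                  const std::vector<std::size_t>& offsets) {
  std::vector<std::size_t> small_group;
  std::vector<std::size_t> large_group;

  // Pass 1: partition segment indices into size groups.
  for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
    const std::size_t len = offsets[s + 1] - offsets[s];
    (len <= kSmallSegmentMax ? small_group : large_group).push_back(s);
  }

  // Pass 2: process each group with the routine suited to its size.
  for (std::size_t s : small_group) {
    insertion_sort(data.data() + offsets[s], data.data() + offsets[s + 1]);
  }
  for (std::size_t s : large_group) {
    std::sort(data.begin() + static_cast<std::ptrdiff_t>(offsets[s]),
              data.begin() + static_cast<std::ptrdiff_t>(offsets[s + 1]));
  }
}
```

Grouping first means each routine only ever sees segments of the size it handles well, which mirrors the idea of varying the resources spent per segment.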
This discussion was created from the release CUB 1.15.0 (NVIDIA HPC SDK 21.11).