Full documentation for rocPRIM is available at https://codedocs.xyz/ROCmSoftwarePlatform/rocPRIM/
- New block level `radix_rank` primitive.
- New block level `radix_rank_match` primitive.
- Added a stable block sorting implementation. This can be used with `block_sort` by using the `block_sort_algorithm::stable_merge_sort` algorithm.
- Improved the performance of `block_radix_sort` and `device_radix_sort`.
- Improved the performance of `device_merge_sort`.
- Disabled GPU error messages relating to incorrect warp operation usage with Navi GPUs on Windows, due to GPU printf performance issues on Windows.
- `device_partition`, `device_unique`, and `device_reduce_by_key` now support problem sizes larger than 2^32 items.
- Device algorithms now return `hipErrorInvalidValue` if the amount of passed temporary memory is insufficient.
- Lists of sizes for tests are unified, restored scan/reduce tests for `half` and `bfloat16` values.
- Removed the documentation for the `block_sort::sort()` overload for keys and values with a dynamic size. This overload was documented but its implementation is missing. To avoid further confusion, the documentation is removed until a decision is made on implementing the function.
- Fixed the compilation failure in `device_merge` if the two key iterators don't match.
- Fixed the compilation failure in device_merge if the two key iterators don't match.
- device_merge no longer correctly supports using different types for `keys_input1` and `keys_input2` (starting from the 5.3.0 release).
- New functions `subtract_left` and `subtract_right` in `block_adjacent_difference` to apply functions on pairs of adjacent items distributed between threads in a block.
- New device level `adjacent_difference` primitives.
- Added experimental tooling for automatic kernel configuration tuning for various architectures
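
The pairing that `subtract_left` and `subtract_right` perform can be modelled on the host with plain C++. The sketch below is not rocPRIM code: the function names `subtract_left_model` and `subtract_right_model` are hypothetical, the input is an ordinary `std::vector` rather than items distributed between threads, and it assumes that the item without a neighbour (the first for the left variant, the last for the right variant) is passed through unchanged.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Host-side model (hypothetical name): combine each item with its LEFT
// neighbour. Assumption: the first item has no left neighbour and is copied.
template <class T, class Op>
std::vector<T> subtract_left_model(const std::vector<T>& in, Op op)
{
    std::vector<T> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out[i] = (i == 0) ? in[i] : op(in[i], in[i - 1]);
    }
    return out;
}

// Host-side model (hypothetical name): combine each item with its RIGHT
// neighbour. Assumption: the last item has no right neighbour and is copied.
template <class T, class Op>
std::vector<T> subtract_right_model(const std::vector<T>& in, Op op)
{
    std::vector<T> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out[i] = (i + 1 == in.size()) ? in[i] : op(in[i], in[i + 1]);
    }
    return out;
}
```

With `std::minus<int>` and the input `{1, 4, 9, 16}`, the left model yields `{1, 3, 5, 7}` and the right model yields `{-3, -5, -7, 16}`.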
- Benchmarks collect and output more detailed system information
- CMake functionality to improve build parallelism of the test suite that splits compilation units by function or by parameters.
- Reverse iterator.
- Support for problem sizes over `UINT_MAX` in device functions `inclusive_scan_by_key` and `exclusive_scan_by_key`.
- Improved the performance of warp primitives using the swizzle operation on Navi
- Improved build parallelism of the test suite by splitting up large compilation units
- `device_select` now supports problem sizes larger than 2^32 items.
- `device_segmented_radix_sort` now partitions segments into small, medium and large groups. Each segment group can be sorted by specialized kernels to improve throughput.
- Improved performance of histogram for the case of highly uneven sample distribution.
- Packages for tests and benchmark executable on all supported OSes using CPack.
- Added File/Folder Reorg Changes and Enabled Backward compatibility support using wrapper headers.
- Fixed radix sort int64_t bug introduced in [2.10.11]
- Future value
- Added device partition_three_way to partition input to three output iterators based on two predicates
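
As an illustration of what partitioning into three outputs based on two predicates means, here is a minimal host-side sketch in plain C++. It is not the rocPRIM implementation or its signature: `three_way_result`, `partition_three_way_model`, and `example_partition` are hypothetical names, and the sketch assumes the second predicate is only consulted for items the first predicate rejects.

```cpp
#include <vector>

// Hypothetical container for the three outputs of the model.
template <class T>
struct three_way_result
{
    std::vector<T> first_part;   // items accepted by the first predicate
    std::vector<T> second_part;  // items rejected by the first, accepted by the second
    std::vector<T> unselected;   // items rejected by both predicates
};

// Host-side model (hypothetical name) of a three-way partition.
template <class T, class Pred1, class Pred2>
three_way_result<T> partition_three_way_model(const std::vector<T>& input,
                                              Pred1 first_pred,
                                              Pred2 second_pred)
{
    three_way_result<T> result;
    for (const T& x : input)
    {
        if (first_pred(x))
            result.first_part.push_back(x);
        else if (second_pred(x))
            result.second_part.push_back(x);
        else
            result.unselected.push_back(x);
    }
    return result;
}

// Usage example: even numbers first, then remaining multiples of three.
inline three_way_result<int> example_partition()
{
    const std::vector<int> input{1, 2, 3, 4, 5, 6};
    return partition_three_way_model(
        input,
        [](int x) { return x % 2 == 0; },
        [](int x) { return x % 3 == 0; });
}
```

Note that 6 satisfies both predicates in the usage example and lands in the first output, because the first predicate takes precedence in this model.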
- The reduce/scan algorithm precision issues in the tests have been resolved for half types.
- The device radix sort algorithm supports indexing with 64 bit unsigned integers.
  - The indexer type is chosen based on the type argument of parameter `size`.
  - If `sizeof(size)` is not larger than 4 bytes, the indexer type is 32 bit unsigned int,
  - Else the indexer type is 64 bit unsigned int.
  - The maximum problem size is based on the compile time configuration of the algorithm according to the following formula:
    - `max_problem_size = (UINT_MAX + 1) * config::scan::block_size * config::scan::items_per_thread`.
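
The indexer selection rule and the size formula above can be sketched in standard C++. This is an illustration only, not rocPRIM's internal code: `indexer_t` and `max_problem_size` are hypothetical names, and the block size and items-per-thread values are passed as plain arguments instead of being read from an algorithm config.

```cpp
#include <climits>
#include <cstdint>
#include <type_traits>

// Hypothetical alias: pick a 32 bit indexer when the size argument's type is
// at most 4 bytes wide, otherwise a 64 bit indexer.
template <class Size>
using indexer_t =
    typename std::conditional<(sizeof(Size) <= 4), std::uint32_t, std::uint64_t>::type;

// Hypothetical helper evaluating the changelog formula in 64 bit arithmetic:
// (UINT_MAX + 1) * block_size * items_per_thread.
constexpr std::uint64_t max_problem_size(std::uint64_t block_size,
                                         std::uint64_t items_per_thread)
{
    return (static_cast<std::uint64_t>(UINT_MAX) + 1) * block_size * items_per_thread;
}
```

For example, assuming a 32 bit `unsigned int` and illustrative values of 256 threads per block with 4 items per thread, the formula gives 2^32 * 1024 = 2^42 items.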
- The flags API of `block_adjacent_difference` is now deprecated and will be removed in a future version.
- device_segmented_radix_sort unit test failing for HIP on Windows
- Enable bfloat16 tests and reduce threshold for bfloat16
- Fix device scan limit_size feature
- Non-optimized builds no longer trigger local memory limit errors
- Added scan size limit feature
- Added reduce size limit feature
- Added transform size limit feature
- Add block_load_striped and block_store_striped
- Add gather_to_blocked to gather values from other threads into a blocked arrangement
- The block sizes for device merge sort's initial block sort and its merge steps are now separate in its kernel config
  - The block sort step supports multiple items per thread
- size_limit for scan, reduce and transform can now be set in the config struct instead of a parameter
- Device_scan and device_segmented_scan: `inclusive_scan` now uses the input-type as accumulator-type, `exclusive_scan` uses initial-value-type.
  - This particularly changes behaviour of small-size input types with large-size output types (e.g. `short` input, `int` output).
  - And low-res input with high-res output (e.g. `float` input, `double` output)
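
To see why the accumulator type matters for a narrow input type with a wide output type, consider the following host-side sketch in plain C++. It is a model, not rocPRIM code: `inclusive_scan_with_acc` is a hypothetical name, and it uses unsigned 16 bit input (rather than `short`) so that the wraparound is well defined.

```cpp
#include <cstdint>
#include <vector>

// Host-side model (hypothetical name) of an inclusive sum scan whose running
// total is stored in Acc before being written out as Out.
template <class Acc, class Out, class In>
std::vector<Out> inclusive_scan_with_acc(const std::vector<In>& input)
{
    std::vector<Out> output;
    output.reserve(input.size());
    Acc running{};
    for (const In& x : input)
    {
        // The sum is narrowed to Acc on every step, so a small Acc wraps
        // around even when Out is wide enough to hold the true total.
        running = static_cast<Acc>(running + x);
        output.push_back(static_cast<Out>(running));
    }
    return output;
}
```

For the input `{30000, 30000, 30000}` with a 32 bit output, a 16 bit accumulator yields `{30000, 60000, 24464}` because 90000 wraps modulo 65536, while a 32 bit accumulator yields `{30000, 60000, 90000}`.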
- Revert old Fiji workaround, because they solved the issue at compiler side
- Update README cmake minimum version number
- Block sort supports multiple items per thread
  - Currently only power-of-two block sizes and items per thread are supported, and only for full blocks
- Bumped the minimum required version of CMake to 3.16
- Unit tests may soft hang on MI200 when running in hipMallocManaged mode.
- device_segmented_radix_sort, device_scan unit tests failing for HIP on Windows
- ReduceEmptyInput causes random failure with bfloat16
- Initial HIP on Windows support. See README for instructions on how to build and install.
- bfloat16 support added.
- Packaging split into a runtime package called rocprim and a development package called rocprim-devel. The development package depends on runtime. The runtime package suggests the development package for all supported OSes except CentOS 7 to aid in the transition. The suggests feature in packaging is introduced as a deprecated feature and will be removed in a future rocm release.
- As rocPRIM is a header-only library, the runtime package is an empty placeholder used to aid in the transition. This package is also a deprecated feature and will be removed in a future rocm release.
- Unit tests may soft hang on MI200 when running in hipMallocManaged mode.
- The warp_size() function is now deprecated; please switch to host_warp_size() and device_warp_size() for host and device references respectively.
- Code coverage tools build option
- Address sanitizer build option
- gfx1030 support added.
- Experimental HIP-CPU support; build using GCC/Clang/MSVC on Win/Linux. It is a work in progress; many algorithms are still known to fail.
- Added single tile radix sort for smaller sizes.
- Improved performance for radix sort for larger element sizes.
- The warp_size() function is now deprecated; please switch to host_warp_size() and device_warp_size() for host and device references respectively.
- Bugfix & minor performance improvement for merge_sort when input and output storage are the same.
- gfx90a support added.
- The warp_size() function is now deprecated; please switch to host_warp_size() and device_warp_size() for host and device references respectively.
- Size zero inputs are now properly handled with newer ROCm builds that no longer allow zero-size kernel grid/block dimensions
- Minimum cmake version required is now 3.10.2
- Device scan unit test currently failing due to LLVM bug.
- Texture cache iteration support has been re-enabled.
- Benchmark builds have been re-enabled.
- Unique operator no longer called on invalid elements.
- Device scan unit test currently failing due to LLVM bug.
- No new features
- Updates to DPP instructions for warp shuffle
- Benchmark builds are disabled due to compiler bug.
- Added HIP cmake dependency
- Updates to warp shuffle for gfx10
- Disable DPP functions on gfx10++
- Benchmark builds are disabled due to compiler bug.
- Fix for rocPRIM texture cache iterator
- None
- Package dependency corrected to hip-rocclr
- rocPRIM texture cache iterator functionality is broken in the runtime. It will be fixed in the next release. Please use the prior release if calling this function.
- No new features
- Point release with compilation fix.
- Improved tests with fixed and random seeds for test data
- Network interface improvements with API v3
- Switched to hip-clang as default compiler
- CMake searches for rocPRIM locally first; downloads from github if local search fails