diff --git a/README.md b/README.md index 28682a073..1b55873f7 100644 --- a/README.md +++ b/README.md @@ -44,7 +44,7 @@ int main(void) return 0; } ``` - + This code sample computes the sum of 100 random numbers in parallel: ```c++ @@ -81,4 +81,4 @@ for details. Development process ------------------- -For information on development process and branching, see [this document](doc/branching.md). +For information on development process, see [this document](doc/development_model.md). diff --git a/dependencies/cub b/dependencies/cub index d106ddb99..2a231db32 160000 --- a/dependencies/cub +++ b/dependencies/cub @@ -1 +1 @@ -Subproject commit d106ddb991a56c3df1b6d51b2409e36ba8181ce4 +Subproject commit 2a231db3226a9bfcd008bb6120bec12fe0a98cd1 diff --git a/doc/branching.md b/doc/branching.md deleted file mode 100644 index 90ca0f375..000000000 --- a/doc/branching.md +++ /dev/null @@ -1,127 +0,0 @@ -# Thrust Branching and Development Model - -The following is a description of how the Thrust development teams approaches branching and release tagging. This -is a living document that will evolve as our process evolves. - -## Thrust Version - -Thrust has historically had its own versioning system, independent of the versioning scheme of the CUDA Toolkit. -Today, Thrust is released with the CUDA Toolkit, but we currently still maintain the double versioning scheme. - -The following is a mapping from Thrust versions to CUDA Toolkit versions and vice versa. Note that some Thrust versions don't directly map to any CUDA Toolkit version. 
- -| Thrust version | CUDA version | -| ----------------- | ------------- | -| 1.9.8 | 11.0 EA | -| 1.9.7 | 10.2 | -| 1.9.6 | 10.1 Update 2 | -| 1.9.5 | 10.1 Update 1 | -| 1.9.4 | 10.1 | -| 1.9.3 | 10.0 | -| 1.9.2 | 9.2 | -| 1.9.1 | 9.1 | -| 1.9.0 | 9.0 | -| 1.8.3 | 8.0 | -| 1.8.2 | 7.5 | -| 1.8.1 | 7.0 | -| 1.8.0 | *N/A* | -| 1.7.2 | 6.5 | -| 1.7.1 | 6.0 | -| 1.7.0 | 5.5 | -| 1.6.0 | *N/A* | -| 1.5.3 | 5.0 | -| 1.5.2 | 4.2 | -| 1.5.1 | 4.1 | -| 1.5.0 | *N/A* | -| 1.4.0 | 4.0 | -| 1.3.0 | 3.2 | -| 1.2.1 | 3.1 | -| 1.2.0 | *N/A* | -| 1.1.1 | *N/A* | -| 1.1.0 | *N/A* | -| 1.0.0 | *N/A* | - -## Repositories - -As Thrust is developed both on GitHub and internally at NVIDIA, there's three main places where code lives: - - * The [public Thrust repository](https://github.com/thrust/thrust), referred to as `github` later in this - document. - * An internal GitLab repository, referred to as `gitlab` later in this document. - * An internal Perforce repository, referred to as `perforce` later in this document. - -## Branches and Tags - -The following tag names are used in the Thrust project: - - * `github/cuda-X.Y`: the tag that directly corresponds to what has been shipped in the CUDA Toolkit release X.Y. - * `github/A.B.C`: the tag that directly corresponds to a Thrust version A.B.C. - -The following branch names are used in the Thrust project: - - * `github/master`: the Source of Truth development branch of Thrust. - * `github/old-master`: the old Source of Truth branch, before unification of public and internal repositories. - * `perforce/private`: mirrored github/master, plus files necessary for internal NVIDIA testing systems. - * `gitlab/staging/cuda-X.Y`: the branch for a CUDA Toolkit release that has not been released yet. cuda-X.Y should - be tagged on this branch after the final commit freeze (see "Release branches" below). 
- * `github/maintenance/cuda-Z.W`: the continuation of gitlab/staging/cuda-Z.W, but after release of CUDA Z.W, plus - post-release fixes if any are needed (see "Old release branches" below). - * `gitlab/feature/`: feature branch for internally developed features. - * `gitlab/bug/-`: bug fix branch, where `bug-system` is `github` or `nvbug`. Permits a description - after `bug-id`. - * `gitlab/master`: same as `github/master`, but not yet published, during a freezing period (see "Feature freeze" - below). - -## Development Process Described - -### Normal development - -During regular parts of the development cycle, when we develop features on feature branches, and fix bugs on the -main branch, we can: - - * Merge internal fixes to `github/master` and to `perforce/private`. - * Merge Github contributions to `github/master` and to `perforce/private`. - -### Feature freeze - -In case where we have a new feature for a CUDA Toolkit release: just before the CUDA Toolkit feature freeze for a -new release branch, we should stop merging commits (including public contributions) to `github/master`, and move to -development on `gitlab/master`, and merge the not yet public features there. - -In those cases, we should wait until the new version of the toolkit is released before we push the new updated -`gitlab/master` to `github/master`, roughly at the same time as we push from `gitlab/staging/cuda-X.Y` to -`github/maintenance/cuda-X.Y` and tag `cuda-X.Y`, and the appropriate Thrust version tag. - -If we don't have big, not-public-before-release features landing in X.Y, however, we can avoid having a feature -freeze period. - -The reason for having a freeze period at all is: `github/master` is supposed to be the Source of Truth. We want the -history to follow the same order of commits in both Git and Perforce, and once a change is merged, we cannot rebase -things that went into `perforce/internal` on top of it. 
Therefore: since we only really commit to Perforce but not -`github/master` when we have a feature that is ready to be delivered, but is only a part of a new release and -shouldn't/can't be public yet, we have to make sure that after it is merged to `gitlab/master` (and to `perforce/internal`), -nothing new lands in `github/master` before we push the feature out. - -To avoid situations like this with bug fixes, when we fix a bug at a not crazy point in the release cycle, we -should develop it on git, merge/push it on Github, and then pull the new commit to Perforce. - -### Release branches - -These are the internal Git branches that map directly to internal CUDA release branches. These branches are primarily -developed in Git, and commits applied to them are then pushed to Perforce. - -After a CUDA Toolkit version is released, these transition to being old release branches. - -### Old release branches - -These branches represent a version that has landed in a CUDA Toolkit version, but with bugfixes for things that do -deserve being fixed on a release branch. These shouldn't be groundbreaking; the following are an acceptable set of -fixes to go into these branches, because they can remove annoyances, but shouldn't change behavior: - - * Documentation fixes and updates. - * Thrust build system changes. - * Additional examples, fixes to examples and tests. - * (Possibly:) Fixing missing headers. This one is slightly less obvious, because it makes it possible for users - of standalone Thrust to write programs that won't compile with CUDA Thrust. Determinations will be made on a - case by case basis. 
- diff --git a/doc/changelog.md b/doc/changelog.md index 85997e8ae..d51a26247 100644 --- a/doc/changelog.md +++ b/doc/changelog.md @@ -1,30 +1,261 @@ -# Thrust v1.9.8 (CUDA 11.0) +# Thrust 1.9.10 (NVIDIA HPC SDK 20.5) ## Summary -Thrust v1.9.8, which is included in the CUDA 11.0 release, removes Thrust's - internal derivative of CUB, upstreams all relevant changes too CUB, and adds - CUB as a Git submodule. +Thrust 1.9.10 is the release accompanying the NVIDIA HPC SDK 20.5. +It adds CMake support for compilation with NVC++ and a number of minor bug fixes + for NVC++. +It also adds CMake `find_package` support. + +## New Features + +- #1130: CMake `find_package` support. + This is significant because there is a legacy `FindThrust.cmake` script + authored by a third party in widespread use in the community which has a + bug in how it parses Thrust version numbers which will cause it to + incorrectly parse 1.9.10. + This script only handles the first digit of each part of the Thrust version + number correctly: for example, Thrust 17.17.17 would be interpreted as + Thrust 1.1.1701717. + You can find directions for using the new CMake `find_package` support and + migrating away from the legacy `FindThrust.cmake` [here](https://github.com/thrust/thrust/blob/master/thrust/cmake/README.md) +- #1129: Added `thrust::detail::single_device_tls_caching_allocator`, a + convenient way to get an MR caching allocator for device memory, which is + used by NVC++. + +## Other Enhancements + +- #1129: Refactored RDC handling in CMake to be a global option and not create + two targets for each example and test. + +## Bug Fixes + +- #1129: Fix the legacy `thrust::return_temporary_buffer` API to support + passing a size. + This was necessary to enable usage of Thrust caching MR allocators with + synchronous Thrust algorithms. 
+ This change has allowed NVC++'s C++17 Parallel Algorithms implementation to + switch to use Thrust caching MR allocators for device temporary storage, + which gives a 2x speedup on large multi-GPU systems such as V100 and A100 + DGX where `cudaMalloc` is very slow. +- #1128: Respect `CUDA_API_PER_THREAD_DEFAULT_STREAM`. + Thanks to Rong Ou for this contribution. +- #1131: Fix the one-policy overload of `thrust::async::copy` to not copy the + policy, resolving use-after-move issues. +- #1145: When cleaning up type names in `unittest::base_class_name`, only call + `std::string::replace` if we found the substring we are looking to replace. +- #1139: Don't use `cxx::__demangle` in NVC++. +- #1102: Don't use `thrust::detail::normal_distribution_nvcc` for Feta because + it uses `erfcinv`, a non-standard function that Feta doesn't have. + +# Thrust 1.9.9 (CUDA Toolkit 11.0) + +## Summary + +Thrust 1.9.9 adds support for NVC++, which uses Thrust to implement + GPU-accelerated C++17 Parallel Algorithms. +`thrust::zip_function` and `thrust::shuffle` were also added. +As of this release, C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are + deprecated. +Starting with the upcoming 1.10.0 release, C++03 support will be dropped + entirely. +All other deprecated platforms will be dropped in the near future. + +## Breaking Changes + +- #1082: Thrust now checks that it is compatible with the version of CUB found + in your include path, generating an error if it is not. + If you are using your own version of CUB, it may be too old. + It is recommended to simply delete your own version of CUB and use the + version of CUB that comes with Thrust. +- #1089: C++03 and C++11 are deprecated. + Using these dialects will generate a compile-time warning. + These warnings can be suppressed by defining + `THRUST_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11 + deprecation warnings) or `THRUST_IGNORE_DEPRECATED_CPP11` (to suppress C++11 + deprecation warnings). 
+ Suppression is only a short term solution. + We will be dropping support for C++03 in the 1.10.0 release and C++11 in the + near future. +- #1089: GCC < 5, Clang < 6, and MSVC < 2017 are deprecated. + Using these compilers will generate a compile-time warning. + These warnings can be suppressed by defining + `THRUST_IGNORE_DEPRECATED_COMPILER`. + Suppression is only a short term solution. + We will be dropping support for these compilers in the near future. + +## New Features + +- #1086: Support for NVC++ aka "Feta". + The most significant change is in how we use `__CUDA_ARCH__`. + Now, there are four macros that must be used: + - `THRUST_IS_DEVICE_CODE`, which should be used in an `if` statement around + device-only code. + - `THRUST_INCLUDE_DEVICE_CODE`, which should be used in an `#if` preprocessor + directive inside of the `if` statement mentioned in the prior bullet. + - `THRUST_IS_HOST_CODE`, which should be used in an `if` statement around + host-only code. + - `THRUST_INCLUDE_HOST_CODE`, which should be used in an `#if` preprocessor + directive inside of the `if` statement mentioned in the prior bullet. +- #1085: `thrust::shuffle`. + Thanks to Rory Mitchell for this contribution. +- #1029: `thrust::zip_function`, a facility for zipping functions that take N + parameters instead of a tuple of N parameters as `thrust::zip_iterator` + does. + Thanks to Ben Jude for this contribution. +- #1068: `thrust::system::cuda::managed_memory_pointer`, a universal memory + strongly typed pointer compatible with the ISO C++ Standard Library. + +## Other Enhancements + +- #1029: Thrust is now built and tested with NVCC warnings treated as errors. +- #1029: MSVC C++11 support. +- #1029: `THRUST_DEPRECATED` abstraction for generating compile-time + deprecation warning messages. +- #1029: `thrust::pointer::pointer_to(reference)`. +- #1070: Unit test for `thrust::inclusive_scan` with user-defined types. + Thanks to Conor Hoekstra for this contribution. 
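The four-macro pattern described under #1086 can be sketched as follows. This is a minimal host-only illustration, not Thrust's implementation: the `SKETCH_*` macros and the `where_am_i` function are hypothetical stand-ins introduced here so the snippet builds with a plain C++ compiler, and their values simply model a host compilation pass. In real code the `THRUST_*` macros named above would be used instead.

```c++
#include <string>

// Hypothetical stand-ins for the four Thrust macros (assumption: these are
// NOT Thrust's definitions). The values model a host-only compilation pass.
#define SKETCH_IS_DEVICE_CODE 0
#define SKETCH_INCLUDE_DEVICE_CODE 0
#define SKETCH_IS_HOST_CODE 1
#define SKETCH_INCLUDE_HOST_CODE 1

// Hypothetical helper: reports which branch of the pattern was taken.
inline std::string where_am_i()
{
  // The `if` selects the branch; the `#if` inside it decides whether the
  // branch body is compiled at all in this pass.
  if (SKETCH_IS_DEVICE_CODE) {
    #if SKETCH_INCLUDE_DEVICE_CODE
      return "device";
    #endif
  }
  if (SKETCH_IS_HOST_CODE) {
    #if SKETCH_INCLUDE_HOST_CODE
      return "host";
    #endif
  }
  return "none";
}
```

With the stand-in values above, the device branch is empty after preprocessing and `where_am_i()` takes the host branch.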
+ +## Bug Fixes + +- #1088: Allow `thrust::replace` to take functions that have non-`const` + `operator()`. +- #1094: Add missing `constexpr` to `par_t` constructors. + Thanks to Patrick Stotko for this contribution. +- #1077: Remove `__device__` from CUDA MR-based device allocators to fix + obscure "host function called from host device function" warning that occurs + when you use the new Thrust MR-based allocators. +- #1029: Remove inconsistently-used `THRUST_BEGIN`/`END_NS` macros. +- #1029: Fix C++ dialect detection on newer MSVC. +- #1029: Use `_Pragma`/`__pragma` instead of `#pragma` in macros. +- #1029: Replace raw `__cplusplus` checks with the appropriate Thrust macros. +- #1105: Add a missing `` include. +- #1103: Fix regression of `thrust::detail::temporary_allocator` with non-CUDA + back ends. +- #1111: Use Thrust's random number engine instead of `std::`s in device code. +- #1108: Get rid of a GCC 9 warning about deprecated generation of copy ctors. + +# Thrust 1.9.8-1 (NVIDIA HPC SDK 20.3) + +## Summary + +Thrust 1.9.8-1 is a variant of 1.9.8 accompanying the NVIDIA HPC SDK 20.3. +It contains modifications necessary to serve as the implementation of NVC++'s + GPU-accelerated C++17 Parallel Algorithms. + +# Thrust 1.9.8 (CUDA Toolkit 11.0 Early Access) + +## Summary + +Thrust 1.9.8, which is included in the CUDA Toolkit 11.0 release, removes + Thrust's internal derivative of CUB, upstreams all relevant changes to CUB, + and adds CUB as a Git submodule. It will now be necessary to do `git clone --recursive` when checking out Thrust, and to update the CUB submodule when pulling in new Thrust changes. Additionally, CUB is now included as a first class citizen in the CUDA toolkit. -Thrust v1.9.8 also fixes bugs preventing most Thrust algorithms from working - with more than `2^32` elements. -Now, `reduce`, `*_scan`, and related algorithms (aka most of Thrust) work with - large element counts. -`sort` remains limited to `2^32` elements for now. 
+Thrust 1.9.8 also fixes bugs preventing most Thrust algorithms from working + with more than `2^31-1` elements. +Now, `thrust::reduce`, `thrust::*_scan`, and related algorithms (aka most of + Thrust) work with large element counts. + +## Breaking Changes + +- Thrust will now use the version of CUB in your include path instead of its own + internal copy. + If you are using your own version of CUB, it may be older and incompatible + with Thrust. + It is recommended to simply delete your own version of CUB and use the + version of CUB that comes with Thrust. + +## Other Enhancements + +- Refactor Thrust and CUB to support 64-bit indices in most algorithms. + In most cases, Thrust now selects between kernels that use 32-bit indices and + 64-bit indices at runtime depending on the size of the input. + This means large element counts work, but small element counts do not have to + pay for the register usage of 64-bit indices if they are not needed. + Now, `thrust::reduce`, `thrust::*_scan`, and related algorithms (aka most of + Thrust) work with more than `2^31-1` elements. + Notably, `thrust::sort` is still limited to less than `2^31-1` elements. +- CUB is now a submodule and the internal copy of CUB has been removed. +- #1051: Stop specifying the `__launch_bounds__` minimum blocks parameter + because it messes up register allocation and increases register pressure, + and we don't actually know at compile time how many blocks we will use + (aside from single tile kernels). -# Thrust v1.9.7 (CUDA 10.2) +## Bug Fixes + +- #1020: After making a CUDA API call, always clear the global CUDA error state + by calling `cudaGetLastError`. +- #1021: Avoid calling destroy in the destructor of a Thrust vector if the + vector is empty. +- #1046: Actually throw `thrust::bad_alloc` when `thrust::system::cuda::malloc` + fails instead of just constructing a temporary and doing nothing with it. 
+- Add missing copy constructor or copy assignment operator to all classes that + GCC 9's `-Wdeprecated-copy` complains about +- Add missing move operations to `thrust::system::cuda::vector`. +- #1015: Check that the backend is CUDA before using CUDA-specifics in + `thrust::detail::temporary_allocator`. + Thanks to Hugh Winkler for this contribution. +- #1055: More correctly detect the presence of aligned/sized `new`/`delete`. +- #1043: Fix ill-formed specialization of `thrust::system::is_error_code_enum` + for `thrust::event_errc`. + Thanks to Toru Niina for this contribution. +- #1027: Add tests for `thrust::tuple_for_each` and `thrust::tuple_subset`. + Thanks to Ben Jude for this contribution. +- #1027: Use correct macro in `thrust::tuple_for_each`. + Thanks to Ben Jude for this contribution. +- #1026: Use correct MSVC version formatting in CMake. + Thanks to Ben Jude for this contribution. +- Workaround an NVCC issue with type aliases with template template arguments + containing a parameter pack. +- Remove unused functions from the CUDA backend which call slow CUDA attribute + query APIs. +- Replace `CUB_RUNTIME_FUNCTION` with `THRUST_RUNTIME_FUNCTION`. +- Correct typo in `thrust::transform` documentation. + Thanks to Eden Yefet for this contribution. + +## Known Issues + +- `thrust::sort` remains limited to `2^31-1` elements for now. + +# Thrust 1.9.7-1 (CUDA Toolkit 10.2) ## Summary -Thrust v1.9.7 is a minor release accompanying the CUDA 10.2 release. +Thrust 1.9.7-1 is a minor release accompanying the CUDA Toolkit 10.2 release + for Tegra. +It is nearly identical to 1.9.7. -# Thrust v1.9.6 (CUDA 10.1 Update 2) +# Thrust 1.9.7 (CUDA Toolkit 10.2) ## Summary -Thrust v1.9.6 is a minor release accompanying the CUDA 10.1 Update 2 release. +Thrust 1.9.7 is a minor release accompanying the CUDA Toolkit 10.2 release. + +## Bug Fixes + +- #967, NVBug 2448170: Fix the CUDA backend `thrust::for_each` so that it + supports large input sizes with 64-bit indices. 
+- NVBug 2646034: Fix incorrect dependency handling for stream acquisition in + `thrust::future` +- #968, NVBug 2612102: Fix the `thrust::mr::polymorphic_adaptor` to actually + use its template parameter. + +# Thrust 1.9.6-1 (NVIDIA HPC SDK 20.3) + +## Summary + +Thrust 1.9.6-1 is a variant of 1.9.6 accompanying the NVIDIA HPC SDK 20.3. +It contains modifications necessary to serve as the implementation of NVC++'s + GPU-accelerated C++17 Parallel Algorithms. + +# Thrust 1.9.6 (CUDA Toolkit 10.1 Update 2) + +## Summary + +Thrust 1.9.6 is a minor release accompanying the CUDA Toolkit 10.1 Update 2 + release. ## Bug Fixes @@ -34,23 +265,24 @@ Thrust v1.9.6 is a minor release accompanying the CUDA 10.1 Update 2 release. - NVBug 200488234 CUDA header files contain unicode characters which leads compiling errors on Windows - #949, #973, NVBug 2422333, NVBug 2522259, NVBug 2528822 - `thrust::detail::aligned_reinterpret_cast` must be annotated with - `__host__ __device__`. + `thrust::detail::aligned_reinterpret_cast` must be annotated with + `__host__ __device__`. - NVBug 2599629 Missing include in the OpenMP sort implementation - NVBug 200513211 Truncation warning in test code under VC142 -# Thrust v1.9.5 (CUDA 10.1 Update 1) +# Thrust 1.9.5 (CUDA Toolkit 10.1 Update 1) ## Summary -Thrust 1.9.5 is a minor release accompanying the CUDA 10.1 Update 1 release. +Thrust 1.9.5 is a minor release accompanying the CUDA Toolkit 10.1 Update 1 + release. ## Bug Fixes - NVBug 2502854: Fixed assignment of `thrust::device_vector>` between host and device. -# Thrust 1.9.4 (CUDA 10.1) +# Thrust 1.9.4 (CUDA Toolkit 10.1) ## Summary @@ -287,7 +519,7 @@ Use the new asynchronous Thrust algorithms for non-blocking behavior. - #924, NVBug 2096679, NVBug 2315990: Fix dispatch for the CUDA backend's `thrust::reduce` to use two functions (one with the pragma for disabling exec checks, one with `THRUST_RUNTIME_FUNCTION`) instead of one. 
This fixes - a regression with device compilation that started in CUDA 9.2. + a regression with device compilation that started in CUDA Toolkit 9.2. - #928, NVBug 2341455: Add missing `__host__ __device__` annotations to a `thrust::complex::operator=` to satisfy GoUDA. - NVBug 2094642: Make `thrust::vector_base::clear` not depend on the element @@ -300,7 +532,7 @@ Use the new asynchronous Thrust algorithms for non-blocking behavior. `thrust::counting_iterator` perform proper truncation. - NVBug 2455952: Refactor questionable `thrust::copy_if` unit tests. -# Thrust 1.9.3 (CUDA 10.0) +# Thrust 1.9.3 (CUDA Toolkit 10.0) ## Summary @@ -328,7 +560,7 @@ Thrust 1.9.3 unifies and integrates CUDA Thrust and GitHub Thrust. - Thanks to Francisco Facioni for contributing optimizations for `thrust::min/max_element`. -# Thrust 1.9.2 (CUDA 9.2) +# Thrust 1.9.2 (CUDA Toolkit 9.2) ## Summary @@ -367,7 +599,7 @@ Additionally, the unit test suite and framework was enhanced to increase overlooked but `deallocate` to be called with GCC <= 4.3. - NVBug 1777043: Fixed `thrust::complex` to work with `thrust::sequence`. -# Thrust 1.9.1 (CUDA 9.1) +# Thrust 1.9.1 (CUDA Toolkit 9.1) ## Summary @@ -382,7 +614,7 @@ for `thrust::reduce` based on CUB. - NVBug 1904217: Allow callables that take non-const refs to be used with `thrust::reduce` and `thrust::*_scan`. -# Thrust 1.9.0 (CUDA 9.0) +# Thrust 1.9.0 (CUDA Toolkit 9.0) ## Summary @@ -430,7 +662,7 @@ This brings a substantial performance improvement to the CUDA backend across - Thanks to Duane Merrill for developing CUB and helping to integrate it into Thrust's backend. -# Thrust 1.8.3 (CUDA 8.0) +# Thrust 1.8.3 (CUDA Toolkit 8.0) Thrust 1.8.3 is a small bug fix release. @@ -446,7 +678,7 @@ Thrust 1.8.3 is a small bug fix release. - `thrust::clear` operations on vector types no longer requires the element type to have a default constructor. -# Thrust 1.8.2 (CUDA 7.5) +# Thrust 1.8.2 (CUDA Toolkit 7.5) Thrust 1.8.2 is a small bug fix release. 
@@ -465,7 +697,7 @@ Thrust 1.8.2 is a small bug fix release. - #628: `thrust::reduce_by_key` for the CUDA backend fails for Compute Capability 5.0 devices. -# Thrust 1.8.1 (CUDA 7.0) +# Thrust 1.8.1 (CUDA Toolkit 7.0) Thrust 1.8.1 is a small bug fix release. @@ -481,53 +713,44 @@ Thrust 1.8.1 is a small bug fix release. # Thrust 1.8.0 -Summary -- Thrust 1.8.0 introduces support for algorithm invocation from CUDA __device__ code, support for CUDA streams, -- and algorithm performance improvements. Users may now invoke Thrust algorithms from CUDA __device__ code, -- providing a parallel algorithms library to CUDA programmers authoring custom kernels, as well as allowing -- Thrust programmers to nest their algorithm calls within functors. The thrust::seq execution policy -- allows users to require sequential algorithm execution in the calling thread and makes a -- sequential algorithms library available to individual CUDA threads. The .on(stream) syntax allows users to -- request a CUDA stream for kernels launched during algorithm execution. Finally, new CUDA algorithm -- implementations provide substantial performance improvements. +## Summary +Thrust 1.8.0 introduces support for algorithm invocation from CUDA device + code, support for CUDA streams, and algorithm performance improvements. +Users may now invoke Thrust algorithms from CUDA device code, providing a + parallel algorithms library to CUDA programmers authoring custom kernels, as + well as allowing Thrust programmers to nest their algorithm calls within + functors. +The `thrust::seq` execution policy allows users to require sequential algorithm + execution in the calling thread and makes a sequential algorithms library + available to individual CUDA threads. +The `.on(stream)` syntax allows users to request a CUDA stream for kernels + launched during algorithm execution. +Finally, new CUDA algorithm implementations provide substantial performance + improvements. 
 ## New Features -- Algorithms in CUDA __device__ code - Thrust algorithms may now be invoked from CUDA __device__ and __host__ __device__ functions. - - Algorithms invoked in this manner must be invoked with an execution policy as the first parameter: - - __device__ int my_device_sort(int *data, size_t n) - { - thrust::sort(thrust::device, data, data + n); - } - +- Algorithms in CUDA Device Code: + - Thrust algorithms may now be invoked from CUDA `__device__` and + `__host__ __device__` functions. + Algorithms invoked in this manner must be invoked with an execution + policy as the first parameter. The following execution policies are supported in CUDA __device__ code: - thrust::seq - thrust::cuda::par - thrust::device, when THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA - - Parallel algorithm execution may not be accelerated unless CUDA Dynamic Parallelism is available. - -- Execution Policies - CUDA Streams - The thrust::cuda::par.on(stream) syntax allows users to request that CUDA __global__ functions launched during algorithm - execution should occur on a given stream: - - // execute for_each on stream s - thrust::for_each(thrust::cuda::par.on(s), begin, end, my_functor); - - Algorithms executed with a CUDA stream in this manner may still synchronize with other streams when allocating temporary - storage or returning results to the CPU. - - thrust::seq - The thrust::seq execution policy allows users to require that an algorithm execute sequentially in the calling thread: - - // execute for_each sequentially in this thread - thrust::for_each(thrust::seq, begin, end, my_functor); - -- Other - The new thrust::complex template provides complex number support. + - `thrust::seq` + - `thrust::cuda::par` + - `thrust::device`, when `THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA`. + - Device-side algorithm execution may not be parallelized unless CUDA Dynamic + Parallelism is available. 
+- Execution Policies: + - CUDA Streams + - The `thrust::cuda::par.on(stream)` syntax allows users to request that + CUDA kernels launched during algorithm execution should occur on a given + stream. + - Algorithms executed with a CUDA stream in this manner may still + synchronize with other streams when allocating temporary storage or + returning results to the CPU. + - `thrust::seq`, which allows users to require that an algorithm execute + sequentially in the calling thread. +- `thrust::complex`, a complex number data type. ## New Examples - simple_cuda_streams demonstrates how to request a CUDA stream during algorithm execution. @@ -565,7 +788,7 @@ Acknowledgments - Thanks to Duane Merrill for contributing a faster CUDA radix sort implementation. - Thanks to Filipe Maia for contributing the implementation of thrust::complex. -# Thrust 1.7.2 (CUDA 6.5) +# Thrust 1.7.2 (CUDA Toolkit 6.5) Summary - Small bug fixes @@ -573,7 +796,7 @@ Summary ## Bug Fixes - Avoid use of std::min in generic find implementation -# Thrust 1.7.1 (CUDA 6.0) +# Thrust 1.7.1 (CUDA Toolkit 6.0) Summary - Small bug fixes @@ -583,68 +806,83 @@ Summary - Eliminate unused variable warning in CUDA reduce_by_key implementation - Avoid deriving function objects from std::unary_function and std::binary_function -# Thrust 1.7.0 (CUDA 5.5) - -Summary -- Thrust 1.7.0 introduces a new interface for controlling algorithm execution as -- well as several new algorithms and performance improvements. With this new -- interface, users may directly control how algorithms execute as well as details -- such as the allocation of temporary storage. Key/value versions of thrust::merge -- and the set operation algorithms have been added, as well stencil versions of -- partitioning algorithms. thrust::tabulate has been introduced to tabulate the -- values of functions taking integers. 
For 32b types, new CUDA merge and set -- operations provide 2-15x faster performance while a new CUDA comparison sort -- provides 1.3-4x faster performance. Finally, a new TBB reduce_by_key implementation -- provides 80% faster performance. +# Thrust 1.7.0 (CUDA Toolkit 5.5) -## Breaking Changes -- Dispatch - Custom user backend systems' tag types must now inherit from the corresponding system's execution_policy template (e.g. thrust::cuda::execution_policy) instead - of the tag struct (e.g. thrust::cuda::tag). Otherwise, algorithm specializations will silently go unfound during dispatch. - See examples/minimal_custom_backend.cu and examples/cuda/fallback_allocator.cu for usage examples. - - thrust::advance and thrust::distance are no longer dispatched based on iterator system type and thus may no longer be customized. +## Summary -- Iterators - iterator_facade and iterator_adaptor's Pointer template parameters have been eliminated. - iterator_adaptor has been moved into the thrust namespace (previously thrust::experimental::iterator_adaptor). - iterator_facade has been moved into the thrust namespace (previously thrust::experimental::iterator_facade). - iterator_core_access has been moved into the thrust namespace (previously thrust::experimental::iterator_core_access). - All iterators' nested pointer typedef (the type of the result of operator->) is now void instead of a pointer type to indicate that such expressions are currently impossible. - Floating point counting_iterators' nested difference_type typedef is now a signed integral type instead of a floating point type. +Thrust 1.7.0 introduces a new interface for controlling algorithm execution as + well as several new algorithms and performance improvements. +With this new interface, users may directly control how algorithms execute as + well as details such as the allocation of temporary storage. 
+Key/value versions of thrust::merge and the set operation algorithms have been + added, as well as stencil versions of partitioning algorithms. +thrust::tabulate has been introduced to tabulate the values of functions taking + integers. +For 32-bit types, new CUDA merge and set operations provide 2-15x faster + performance while a new CUDA comparison sort provides 1.3-4x faster + performance. +Finally, a new TBB reduce_by_key implementation provides 80% faster + performance. -- Other - normal_distribution has been moved into the thrust::random namespace (previously thrust::random::experimental::normal_distribution). - Placeholder expressions may no longer include the comma operator. +## Breaking Changes +- Dispatch: + - Custom user backend systems' tag types must now inherit from the + corresponding system's execution_policy template (e.g. + thrust::cuda::execution_policy) instead of the tag struct (e.g. + thrust::cuda::tag). Otherwise, algorithm specializations will silently go + unfound during dispatch. See examples/minimal_custom_backend.cu and + examples/cuda/fallback_allocator.cu for usage examples. + - thrust::advance and thrust::distance are no longer dispatched based on + iterator system type and thus may no longer be customized. +- Iterators: + - iterator_facade and iterator_adaptor's Pointer template parameters have + been eliminated. + - iterator_adaptor has been moved into the thrust namespace (previously + thrust::experimental::iterator_adaptor). + - iterator_facade has been moved into the thrust namespace (previously + thrust::experimental::iterator_facade). + - iterator_core_access has been moved into the thrust namespace (previously + thrust::experimental::iterator_core_access). + - All iterators' nested pointer typedef (the type of the result of + operator->) is now void instead of a pointer type to indicate that such + expressions are currently impossible. 
+ - Floating point counting_iterators' nested difference_type typedef is now a + signed integral type instead of a floating point type. +- Other: + - normal_distribution has been moved into the thrust::random namespace + (previously thrust::random::experimental::normal_distribution). + - Placeholder expressions may no longer include the comma operator. ## New Features -- Execution Policies - Users may directly control the dispatch of algorithm invocations with optional execution policy arguments. - For example, instead of wrapping raw pointers allocated by cudaMalloc with thrust::device_ptr, the thrust::device execution_policy may be passed as an argument to an algorithm invocation to enable CUDA execution. - The following execution policies are supported in this version: - - thrust::host - thrust::device - thrust::cpp::par - thrust::cuda::par - thrust::omp::par - thrust::tbb::par - -- Algorithms - free - get_temporary_buffer - malloc - merge_by_key - partition with stencil - partition_copy with stencil - return_temporary_buffer - set_difference_by_key - set_intersection_by_key - set_symmetric_difference_by_key - set_union_by_key - stable_partition with stencil - stable_partition_copy with stencil - tabulate +- Execution Policies: + - Users may directly control the dispatch of algorithm invocations with + optional execution policy arguments. + For example, instead of wrapping raw pointers allocated by cudaMalloc with + thrust::device_ptr, the thrust::device execution_policy may be passed as + an argument to an algorithm invocation to enable CUDA execution. 
+ - The following execution policies are supported in this version: + - `thrust::host` + - `thrust::device` + - `thrust::cpp::par` + - `thrust::cuda::par` + - `thrust::omp::par` + - `thrust::tbb::par` +- Algorithms: + - `thrust::merge_by_key` + - `thrust::partition` with stencil + - `thrust::partition_copy` with stencil + - `thrust::set_difference_by_key` + - `thrust::set_intersection_by_key` + - `thrust::set_symmetric_difference_by_key` + - `thrust::set_union_by_key` + - `thrust::stable_partition` with stencil + - `thrust::stable_partition_copy` with stencil + - `thrust::tabulate` +- Memory Allocation: + - `thrust::malloc` + - `thrust::free` + - `thrust::get_temporary_buffer` + - `thrust::return_temporary_buffer` ## New Examples - uninitialized_vector demonstrates how to use a custom allocator to avoid the automatic initialization of elements in thrust::device_vector. @@ -678,24 +916,25 @@ Summary - #10 fix ambiguous overloads of reinterpret_tag ## Known Issues -- g++ versions 4.3 and lower may fail to dispatch thrust::get_temporary_buffer correctly causing infinite recursion in examples such as cuda/custom_temporary_allocation. +- GCC 4.3 and lower may fail to dispatch thrust::get_temporary_buffer correctly causing infinite recursion in examples such as cuda/custom_temporary_allocation. -Acknowledgments +## Acknowledgments - Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing a faster merge implementation for CUDA. - Thanks to Sean Baxter for contributing a faster set operation implementation for CUDA. - Thanks to Cliff Woolley for contributing a correct occupancy calculation algorithm. # Thrust 1.6.0 -Summary -- Thrust v1.6.0 provides an interface for customization and extension and a new -- backend system based on the Threading Building Blocks library. With this -- new interface, programmers may customize the behavior of specific algorithms -- as well as control the allocation of temporary storage or invent entirely new -- backends.
These enhancements also allow multiple different backend systems -- such as CUDA and OpenMP to coexist within a single program. Support for TBB -- allows Thrust programs to integrate more naturally into applications which -- may already employ the TBB task scheduler. +## Summary +Thrust 1.6.0 provides an interface for customization and extension and a new + backend system based on the Threading Building Blocks library. +With this new interface, programmers may customize the behavior of specific + algorithms as well as control the allocation of temporary storage or invent + entirely new backends. +These enhancements also allow multiple different backend systems + such as CUDA and OpenMP to coexist within a single program. +Support for TBB allows Thrust programs to integrate more naturally into + applications which may already employ the TBB task scheduler. ## Breaking Changes - The header has been moved to @@ -708,25 +947,24 @@ Summary - thrust::any_space_tag has been renamed thrust::any_system_tag - thrust::iterator_space has been renamed thrust::iterator_system - ## New Features - Backend Systems - Threading Building Blocks (TBB) is now supported + - Threading Building Blocks (TBB) is now supported - Functions - for_each_n - raw_reference_cast + - `thrust::for_each_n` + - `thrust::raw_reference_cast` - Types - pointer - reference + - `thrust::pointer` + - `thrust::reference` ## New Examples -- cuda/custom_temporary_allocation -- cuda/fallback_allocator -- device_ptr -- expand -- minimal_custom_backend -- raw_reference_cast -- set_operations +- `cuda/custom_temporary_allocation` +- `cuda/fallback_allocator` +- `device_ptr` +- `expand` +- `minimal_custom_backend` +- `raw_reference_cast` +- `set_operations` ## Other Enhancements - thrust::for_each now returns the end of the input range similar to most other algorithms @@ -736,60 +974,59 @@ Summary - the safe use of different backend systems is now possible within a single binary ## Bug Fixes -- #469 min_element and 
max_element algorithms no longer require a const comparison operator +- #469 `min_element` and `max_element` algorithms no longer require a const comparison operator ## Known Issues -- cudafe++.exe may crash when parsing TBB headers on Windows. +- NVCC may crash when parsing TBB headers on Windows. -# Thrust 1.5.3 (CUDA 5.0) +# Thrust 1.5.3 (CUDA Toolkit 5.0) -Summary -- Small bug fixes +Thrust 1.5.3 is a minor bug fix release. ## Bug Fixes -- Avoid warnings about potential race due to __shared__ non-POD variable +- Avoid warnings about potential race due to `__shared__` non-POD variable -# Thrust 1.5.2 (CUDA 4.2) +# Thrust 1.5.2 (CUDA Toolkit 4.2) -Summary -- Small bug fixes +Thrust 1.5.2 is a minor bug fix release. ## Bug Fixes - Fixed warning about C-style initialization of structures -# Thrust 1.5.1 (CUDA 4.1) +# Thrust 1.5.1 (CUDA Toolkit 4.1) -Summary -- Small bug fixes +Thrust 1.5.1 is a minor bug fix release. ## Bug Fixes - Sorting data referenced by permutation_iterators on CUDA produces invalid results # Thrust 1.5.0 -Summary -- Thrust v1.5.0 provides introduces new programmer productivity and performance -- enhancements. New functionality for creating anonymous "lambda" functions has -- been added. A faster host sort provides 2-10x faster performance for sorting -- arithmetic types on (single-threaded) CPUs. A new OpenMP sort provides -- 2.5x-3.0x speedup over the host sort using a quad-core CPU. When sorting -- arithmetic types with the OpenMP backend the combined performance improvement -- is 5.9x for 32-bit integers and ranges from 3.0x (64-bit types) to 14.2x -- (8-bit types). A new CUDA reduce_by_key implementation provides 2-3x faster -- performance. +## Summary + +Thrust 1.5.0 introduces new programmer productivity and performance + enhancements. +New functionality for creating anonymous "lambda" functions has been added. +A faster host sort provides 2-10x faster performance for sorting arithmetic + types on (single-threaded) CPUs.
+A new OpenMP sort provides 2.5x-3.0x speedup over the host sort using a + quad-core CPU. +When sorting arithmetic types with the OpenMP backend the combined performance + improvement is 5.9x for 32-bit integers and ranges from 3.0x (64-bit types) to + 14.2x (8-bit types). +A new CUDA `reduce_by_key` implementation provides 2-3x faster + performance. ## Breaking Changes - device_ptr no longer unsafely converts to device_ptr without an -- explicit cast. Use the expression -- device_pointer_cast(static_cast(void_ptr.get())) -- to convert, for example, device_ptr to device_ptr. + explicit cast. + Use the expression device_pointer_cast(static_cast(void_ptr.get())) to + convert, for example, device_ptr to device_ptr. ## New Features -- Functions - stencil-less transform_if - -- Types - lambda placeholders +- Algorithms: + - Stencil-less `thrust::transform_if`. +- Lambda placeholders ## New Examples - lambda @@ -797,63 +1034,63 @@ Summary ## Other Enhancements - host sort is 2-10x faster for arithmetic types - OMP sort provides speedup over host sort -- reduce_by_key is 2-3x faster -- reduce_by_key no longer requires O(N) temporary storage +- `reduce_by_key` is 2-3x faster +- `reduce_by_key` no longer requires O(N) temporary storage - CUDA scan algorithms are 10-40% faster -- host_vector and device_vector are now documented +- `host_vector` and `device_vector` are now documented - out-of-memory exceptions now provide detailed information from CUDART - improved histogram example -- device_reference now has a specialized swap -- reduce_by_key and scan algorithms are compatible with discard_iterator - -Removed Functionality +- `device_reference` now has a specialized swap +- `reduce_by_key` and scan algorithms are compatible with `discard_iterator` ## Bug Fixes - #44 allow host_vector to compile when value_type uses __align__ -- #198 allow adjacent_difference to permit safe in-situ operation +- #44 allow `host_vector` to compile when `value_type` uses `__align__` +- #198 
allow `adjacent_difference` to permit safe in-situ operation - #303 make thrust thread-safe -- #313 avoid race conditions in device_vector::insert +- #313 avoid race conditions in `device_vector::insert` - #314 avoid unintended adl invocation when dispatching copy - #365 fix merge and set operation failures ## Known Issues - None -Acknowledgments -- Thanks to Manjunath Kudlur for contributing his Carbon library, from which the lambda functionality is derived. -- Thanks to Jean-Francois Bastien for suggesting a fix for issue 303. +## Acknowledgments +- Thanks to Manjunath Kudlur for contributing his Carbon library, from which + the lambda functionality is derived. +- Thanks to Jean-Francois Bastien for suggesting a fix for #303. -# Thrust 1.4.0 (CUDA 4.0) +# Thrust 1.4.0 (CUDA Toolkit 4.0) -Summary -- Thrust v1.4.0 provides support for CUDA 4.0 in addition to many feature -- and performance improvements. New set theoretic algorithms operating on -- sorted sequences have been added. Additionally, a new fancy iterator -- allows discarding redundant or otherwise unnecessary output from -- algorithms, conserving memory storage and bandwidth. +## Summary + +Thrust 1.4.0 provides support for CUDA Toolkit 4.0 in addition to many feature + and performance improvements. +New set theoretic algorithms operating on sorted sequences have been added. +Additionally, a new fancy iterator allows discarding redundant or otherwise + unnecessary output from algorithms, conserving memory storage and bandwidth. 
## Breaking Changes - Eliminations - thrust/is_sorted.h - thrust/utility.h - thrust/set_intersection.h - thrust/experimental/cuda/ogl_interop_allocator.h and the functionality therein - thrust::deprecated::copy_when - thrust::deprecated::absolute_value + - `thrust/is_sorted.h` + - `thrust/utility.h` + - `thrust/set_intersection.h` + - `thrust/experimental/cuda/ogl_interop_allocator.h` and the functionality therein + - `thrust::deprecated::copy_when` + - `thrust::deprecated::absolute_value` ## New Features -- Functions - copy_n - merge - set_difference - set_symmetric_difference - set_union +- Algorithms: + - `thrust::copy_n` + - `thrust::merge` + - `thrust::set_difference` + - `thrust::set_symmetric_difference` + - `thrust::set_union` - Types - discard_iterator + - `thrust::discard_iterator` -- Device support - Compute Capability 2.1 GPUs +- Device Support: + - Compute Capability 2.1 GPUs. ## New Examples - run_length_decoding @@ -892,73 +1129,72 @@ Removed Functionality - thrust::inclusive_scan, thrust::exclusive_scan, thrust::inclusive_scan_by_key, - and thrust::exclusive_scan_by_key are currently incompatible with thrust::discard_iterator. -Acknowledgments +## Acknowledgments - Thanks to David Tarjan for improving the performance of set_intersection. - Thanks to Duane Merrill for continued help with sort. - Thanks to Nathan Whitehead for help with CUDA Toolkit integration. -# Thrust 1.3.0 (CUDA 3.2) - -Summary -- Thrust v1.3.0 provides support for CUDA 3.2 in addition to many feature -- and performance enhancements. - -- Performance of the sort and sort_by_key algorithms is improved by as much -- as 3x in certain situations. The performance of stream compaction algorithms, -- such as copy_if, is improved by as much as 2x. Reduction performance is -- also improved, particularly for small input sizes. +# Thrust 1.3.0 (CUDA Toolkit 3.2) -- CUDA errors are now converted to runtime exceptions using the system_error -- interface. 
Combined with a debug mode, also new in v1.3, runtime errors -- can be located with greater precision. - -- Lastly, a few header files have been consolidated or renamed for clarity. -- See the deprecations section below for additional details. +Thrust 1.3.0 provides support for CUDA Toolkit 3.2 in addition to many feature + and performance enhancements. +Performance of the sort and sort_by_key algorithms is improved by as much as 3x + in certain situations. +The performance of stream compaction algorithms, such as copy_if, is improved + by as much as 2x. +CUDA errors are now converted to runtime exceptions using the system_error + interface. +Combined with a debug mode, also new in 1.3, runtime errors can be located with + greater precision. +Lastly, a few header files have been consolidated or renamed for clarity. +See the deprecations section below for additional details. ## Breaking Changes + - Promotions - thrust::experimental::inclusive_segmented_scan has been renamed thrust::inclusive_scan_by_key and exposes a different interface - thrust::experimental::exclusive_segmented_scan has been renamed thrust::exclusive_scan_by_key and exposes a different interface - thrust::experimental::partition_copy has been renamed thrust::partition_copy and exposes a different interface - thrust::next::gather has been renamed thrust::gather - thrust::next::gather_if has been renamed thrust::gather_if - thrust::unique_copy_by_key has been renamed thrust::unique_by_key_copy + - thrust::experimental::inclusive_segmented_scan has been renamed thrust::inclusive_scan_by_key and exposes a different interface + - thrust::experimental::exclusive_segmented_scan has been renamed thrust::exclusive_scan_by_key and exposes a different interface + - thrust::experimental::partition_copy has been renamed thrust::partition_copy and exposes a different interface + - thrust::next::gather has been renamed thrust::gather + - thrust::next::gather_if has been renamed thrust::gather_if + - 
thrust::unique_copy_by_key has been renamed thrust::unique_by_key_copy - Deprecations - thrust::copy_when has been renamed thrust::deprecated::copy_when - thrust::absolute_value has been renamed thrust::deprecated::absolute_value - The header thrust/set_intersection.h is now deprecated; use thrust/set_operations.h instead - The header thrust/utility.h is now deprecated; use thrust/swap.h instead - The header thrust/swap_ranges.h is now deprecated; use thrust/swap.h instead + - thrust::copy_when has been renamed thrust::deprecated::copy_when + - thrust::absolute_value has been renamed thrust::deprecated::absolute_value + - The header thrust/set_intersection.h is now deprecated; use thrust/set_operations.h instead + - The header thrust/utility.h is now deprecated; use thrust/swap.h instead + - The header thrust/swap_ranges.h is now deprecated; use thrust/swap.h instead - Eliminations - thrust::deprecated::gather - thrust::deprecated::gather_if - thrust/experimental/arch.h and the functions therein - thrust/sorting/merge_sort.h - thrust/sorting/radix_sort.h + - thrust::deprecated::gather + - thrust::deprecated::gather_if + - thrust/experimental/arch.h and the functions therein + - thrust/sorting/merge_sort.h + - thrust/sorting/radix_sort.h +- NVCC 2.3 is no longer supported ## New Features -- Functions - exclusive_scan_by_key - find - find_if - find_if_not - inclusive_scan_by_key - is_partitioned - is_sorted_until - mismatch - partition_point - reverse - reverse_copy - stable_partition_copy - -- Types - system_error and related types - experimental::cuda::ogl_interop_allocator - bit_and, bit_or, and bit_xor - -- Device support - gf104-based GPUs +- Algorithms: + - `thrust::exclusive_scan_by_key` + - `thrust::find` + - `thrust::find_if` + - `thrust::find_if_not` + - `thrust::inclusive_scan_by_key` + - `thrust::is_partitioned` + - `thrust::is_sorted_until` + - `thrust::mismatch` + - `thrust::partition_point` + - `thrust::reverse` + - `thrust::reverse_copy` + - 
`thrust::stable_partition_copy` + +- Types: + - `thrust::system_error` and related types. + - `thrust::experimental::cuda::ogl_interop_allocator`. + - `thrust::bit_and`, `thrust::bit_or`, and `thrust::bit_xor`. + +- Device Support: + - GF104-based GPUs. ## New Examples - opengl_interop.cu @@ -980,9 +1216,6 @@ Summary - Performance of device_vector initialized in .cpp files is substantially improved in common cases - Performance of thrust::sort_by_key on the host is substantially improved -Removed Functionality -- nvcc 2.3 is no longer supported - ## Bug Fixes - Debug device code now compiles correctly - thrust::uninitialized_copy and thrust::unintialized_fill now dispatch constructors on the device rather than the host @@ -998,10 +1231,11 @@ Acknowledgments - Thanks to Andrew Corrigan, Cliff Wooley, David Coeurjolly, Janick Martinez Esturo, John Bowers, Maxim Naumov, Michael Garland, and Ryuta Suzuki for bug reports - Thanks to Cliff Woolley for help with testing -# Thrust 1.2.1 (CUDA 3.1) +# Thrust 1.2.1 (CUDA Toolkit 3.1) -Summary -- Small fixes for compatibility with CUDA 3.1 +## Summary + +Small fixes for compatibility with CUDA Toolkit 3.1 ## Known Issues - inclusive_scan & exclusive_scan may fail with very large types @@ -1013,13 +1247,15 @@ Summary # Thrust 1.2.0 -Summary -- Thrust v1.2 introduces support for compilation to multicore CPUs -- and the Ocelot virtual machine, and several new facilities for -- pseudo-random number generation. New algorithms such as set -- intersection and segmented reduction have also been added. Lastly, -- improvements to the robustness of the CUDA backend ensure -- correctness across a broad set of (uncommon) use cases. +## Summary + +Thrust 1.2 introduces support for compilation to multicore CPUs and the Ocelot + virtual machine, and several new facilities for pseudo-random number + generation. +New algorithms such as set intersection and segmented reduction have also been + added. 
+Lastly, improvements to the robustness of the CUDA backend ensure correctness + across a broad set of (uncommon) use cases. ## Breaking Changes - thrust::gather's interface was incorrect and has been removed. @@ -1030,166 +1266,170 @@ Summary - will be promoted to thrust:: in Thrust version 1.3. For more details, - please refer to this thread: - http://groups.google.com/group/thrust-users/browse_thread/thread/f5f0583cb97b51fd - - The thrust::sorting namespace has been deprecated in favor of the -- top-level sorting functions, such as thrust::sort() and +- Top-level sorting functions, such as thrust::sort() and - thrust::sort_by_key(). +- Removed support for equal between host & device sequences +- Removed support for gather() and scatter() between host & device sequences ## New Features -- Functions - reduce_by_key - set_intersection - tie - unique_copy - unique_by_key - unique_copy_by_key - +- Algorithms: + - `thrust::reduce_by_key` + - `thrust::set_intersection` + - `thrust::unique_copy` + - `thrust::unique_by_key` + - `thrust::unique_copy_by_key` - Types - Random Number Generation - discard_block_engine - default_random_engine - linear_congruential_engine - linear_feedback_shift_engine - minstd_rand - minstd_rand0 - normal_distribution (experimental) - ranlux24 - ranlux48 - ranlux24_base - ranlux48_base - subtract_with_carry_engine - taus88 - uniform_int_distribution - uniform_real_distribution - xor_combine_engine - Functionals - project1st - project2nd - -- Fancy Iterators - permutation_iterator - reverse_iterator - -- Device support - Add support for multicore CPUs via OpenMP - Add support for Fermi-class GPUs - Add support for Ocelot virtual machine +- Random Number Generation: + - `thrust::discard_block_engine` + - `thrust::default_random_engine` + - `thrust::linear_congruential_engine` + - `thrust::linear_feedback_shift_engine` + - `thrust::subtract_with_carry_engine` + - `thrust::xor_combine_engine` + - `thrust::minstd_rand` + - `thrust::minstd_rand0` + 
- `thrust::ranlux24` + - `thrust::ranlux48` + - `thrust::ranlux24_base` + - `thrust::ranlux48_base` + - `thrust::taus88` + - `thrust::uniform_int_distribution` + - `thrust::uniform_real_distribution` + - `thrust::normal_distribution` (experimental) +- Function Objects: + - `thrust::project1st` + - `thrust::project2nd` +- `thrust::tie` +- Fancy Iterators: + - `thrust::permutation_iterator` + - `thrust::reverse_iterator` +- Vector Functions: + - `operator!=` + - `rbegin` + - `crbegin` + - `rend` + - `crend` + - `data` + - `shrink_to_fit` +- Device Support: + - Multicore CPUs via OpenMP. + - Fermi-class GPUs. + - Ocelot virtual machines. +- Support for NVCC 3.0. ## New Examples -- cpp_integration -- histogram -- mode -- monte_carlo -- monte_carlo_disjoint_sequences -- padded_grid_reduction -- permutation_iterator -- row_sum -- run_length_encoding -- segmented_scan -- stream_compaction -- summary_statistics -- transform_iterator -- word_count +- `cpp_integration` +- `histogram` +- `mode` +- `monte_carlo` +- `monte_carlo_disjoint_sequences` +- `padded_grid_reduction` +- `permutation_iterator` +- `row_sum` +- `run_length_encoding` +- `segmented_scan` +- `stream_compaction` +- `summary_statistics` +- `transform_iterator` +- `word_count` ## Other Enhancements -- vector functions operator!=, rbegin, crbegin, rend, crend, data, & shrink_to_fit -- integer sorting performance is improved when max is large but (max - min) is small and when min is negative -- performance of inclusive_scan() and exclusive_scan() is improved by 20-25% for primitive types -- support for nvcc 3.0 - -Removed Functionality -- removed support for equal between host & device sequences -- removed support for gather() and scatter() between host & device sequences +- Integer sorting performance is improved when max is large but (max - min) is + small and when min is negative +- Performance of `thrust::inclusive_scan` and `thrust::exclusive_scan` is + improved by 20-25% for primitive types. 
## Bug Fixes -- # 8 cause a compiler error if the required compiler is not found rather than a mysterious error at link time -- # 42 device_ptr & device_reference are classes rather than structs, eliminating warnings on certain platforms -- # 46 gather & scatter handle any space iterators correctly -- # 51 thrust::experimental::arch functions gracefully handle unrecognized GPUs -- # 52 avoid collisions with common user macros such as BLOCK_SIZE -- # 62 provide better documentation for device_reference -- # 68 allow built-in CUDA vector types to work with device_vector in pure C++ mode -- # 102 eliminated a race condition in device_vector::erase +- #8 cause a compiler error if the required compiler is not found rather than a mysterious error at link time +- #42 device_ptr & device_reference are classes rather than structs, eliminating warnings on certain platforms +- #46 gather & scatter handle any space iterators correctly +- #51 thrust::experimental::arch functions gracefully handle unrecognized GPUs +- #52 avoid collisions with common user macros such as BLOCK_SIZE +- #62 provide better documentation for device_reference +- #68 allow built-in CUDA vector types to work with device_vector in pure C++ mode +- #102 eliminated a race condition in device_vector::erase - various compilation warnings eliminated ## Known Issues - inclusive_scan & exclusive_scan may fail with very large types - the Microsoft compiler may fail to compile code using both sort and binary search algorithms - uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device - # 109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads - default_random_engine::discard is not accelerated with nvcc 2.3 +- inclusive_scan & exclusive_scan may fail with very large types +- the Microsoft compiler may fail to compile code using both sort and binary search algorithms +- uninitialized_fill & uninitialized_copy dispatch 
constructors on the host rather than the device +- #109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads +- default_random_engine::discard is not accelerated with nvcc 2.3 -Acknowledgments - Thanks to Gregory Diamos for contributing a CUDA implementation of set_intersection - Thanks to Ryuta Suzuki & Gregory Diamos for rigorously testing Thrust's unit tests and examples against Ocelot - Thanks to Tom Bradley for contributing an implementation of normal_distribution - Thanks to Joseph Rhoads for contributing the example summary_statistics +## Acknowledgments + +- Thanks to Gregory Diamos for contributing a CUDA implementation of set_intersection +- Thanks to Ryuta Suzuki & Gregory Diamos for rigorously testing Thrust's unit tests and examples against Ocelot +- Thanks to Tom Bradley for contributing an implementation of normal_distribution +- Thanks to Joseph Rhoads for contributing the example summary_statistics # Thrust 1.1.1 -Summary -- Small fixes for compatibility with CUDA 2.3a and Mac OSX Snow Leopard. +## Summary + +Small fixes for compatibility with CUDA Toolkit 2.3a and Mac OSX Snow Leopard. # Thrust 1.1.0 -Summary -- Thrust v1.1 introduces fancy iterators, binary search functions, and -- several specialized reduction functions. Experimental support for -- segmented scan has also been added. +## Summary + +Thrust 1.1.0 introduces fancy iterators, binary search functions, and several + specialized reduction functions. +Experimental support for segmented scans has also been added. ## Breaking Changes -- counting_iterator has been moved into the thrust namespace (previously thrust::experimental) +- `thrust::counting_iterator` has been moved into the `thrust` namespace (previously `thrust::experimental`). 
## New Features -- Functions - copy_if - lower_bound - upper_bound - vectorized lower_bound - vectorized upper_bound - equal_range - binary_search - vectorized binary_search - all_of - any_of - none_of - minmax_element - advance - inclusive_segmented_scan (experimental) - exclusive_segmented_scan (experimental) - -- Types - pair - tuple - device_malloc_allocator - -- Fancy Iterators - constant_iterator - counting_iterator - transform_iterator - zip_iterator +- Algorithms: + - `thrust::copy_if` + - `thrust::lower_bound` + - `thrust::upper_bound` + - Vectorized `thrust::lower_bound` + - Vectorized `thrust::upper_bound` + - `thrust::equal_range` + - `thrust::binary_search` + - Vectorized `thrust::binary_search` + - `thrust::all_of` + - `thrust::any_of` + - `thrust::none_of` + - `thrust::minmax_element` + - `thrust::advance` + - `thrust::inclusive_segmented_scan` (experimental) + - `thrust::exclusive_segmented_scan` (experimental) +- Types: + - `thrust::pair` + - `thrust::tuple` + - `thrust::device_malloc_allocator` +- Fancy Iterators: + - `thrust::constant_iterator` + - `thrust::counting_iterator` + - `thrust::transform_iterator` + - `thrust::zip_iterator` ## New Examples -- computing the maximum absolute difference between vectors -- computing the bounding box of a two-dimensional point set -- sorting multiple arrays together (lexicographical sorting) -- constructing a summed area table -- using zip_iterator to mimic an array of structs -- using constant_iterator to increment array values +- Computing the maximum absolute difference between vectors. +- Computing the bounding box of a two-dimensional point set. +- Sorting multiple arrays together (lexicographical sorting). +- Constructing a summed area table. +- Using `thrust::zip_iterator` to mimic an array of structs. +- Using `thrust::constant_iterator` to increment array values.
## Other Enhancements -- added pinned memory allocator (experimental) -- added more methods to host_vector & device_vector (issue #4) -- added variant of remove_if with a stencil argument (issue #29) -- scan and reduce use cudaFuncGetAttributes to determine grid size -- exceptions are reported when temporary device arrays cannot be allocated +- Added pinned memory allocator (experimental). +- Added more methods to host_vector & device_vector (issue #4). +- Added variant of remove_if with a stencil argument (issue #29). +- Scan and reduce use cudaFuncGetAttributes to determine grid size. +- Exceptions are reported when temporary device arrays cannot be allocated. ## Bug Fixes - #5 make vector work for larger data types - #9 stable_partition_copy doesn't respect OutputIterator concept semantics -- #10 scans should return OutputIterator -- #16 make algorithms work for larger data types -- #27 dispatch radix_sort even when comp=less is explicitly provided +- #5: Make vector work for larger data types +- #9: stable_partition_copy doesn't respect OutputIterator concept semantics +- #10: scans should return OutputIterator +- #16: make algorithms work for larger data types +- #27: Dispatch radix_sort even when comp=less is explicitly provided ## Known Issues - Using functors with Thrust entry points may not compile on Mac OSX with gcc @@ -1198,7 +1438,7 @@ Summary constructors on the host rather than the device. - `thrust::inclusive_scan`, `thrust::inclusive_scan_by_key`, `thrust::exclusive_scan`, and `thrust::exclusive_scan_by_key` may fail when - used with large types with the CUDA 3.1 driver. + used with large types with the CUDA Toolkit 3.1. 
# Thrust 1.0.0 diff --git a/doc/development_model.md b/doc/development_model.md new file mode 100644 index 000000000..0327f68e3 --- /dev/null +++ b/doc/development_model.md @@ -0,0 +1,113 @@ +# Thrust Branching and Development Model + +The following is a description of how the Thrust development team approaches branching and release tagging. This +is a living document that will evolve as our process evolves. + +Thrust is distributed in three ways: + + * On GitHub. + * In the NVIDIA HPC SDK. + * In the CUDA Toolkit. + +## Trunk Based Development + +Thrust uses [trunk based development](https://trunkbaseddevelopment.com). There is a single long-lived +branch called `master`. Engineers may create branches for feature development. Such branches always +merge into `master`. There are no release branches. Releases are produced by taking a snapshot of +`master` ("snapping"). After a release has been snapped from `master`, it will never be changed. + +## Repositories + +As Thrust is developed both on GitHub and internally at NVIDIA, there are three main places where code lives: + + * The Source of Truth, the [public Thrust repository](https://github.com/thrust/thrust), referred to as + `github` later in this document. + * An internal GitLab repository, referred to as `gitlab` later in this document. + * An internal Perforce repository, referred to as `perforce` later in this document. + +## Versioning + +Thrust has its own versioning system for releases, independent of the versioning scheme of the NVIDIA +HPC SDK or the CUDA Toolkit. + +Today, Thrust version numbers have a specific [semantic meaning](https://semver.org/). +Releases prior to 1.10.0 largely, but not strictly, followed these semantic meanings. + +The version number for a Thrust release uses the following format: +`MMM.mmm.ss-ppp`, where: + + * `THRUST_VERSION_MAJOR`/`MMM`: Major version, up to 3 decimal digits. It is incremented
It is incremented + when the fundamental nature of the library evolves, leading to widespread changes across the + entire library interface with no guarantee of API, ABI, or semantic compatibility with former + versions. + * `THRUST_VERISON_MINOR`/`mmm`: Minor version, up to 3 decimal digits. It is incremented when + breaking API, ABI, or semantic changes are made. + * `THRUST_VERSION_SUBMINOR`/`ss`: Subminor version, up to 2 decimal digits. It is incremented + when notable new features or bug fixes or features that are API, ABI, and semantic backwards + compatible are added. + * `THRUST_PATCH_NUMBER`/`ppp`: Patch number, up to 3 decimal digits. It is incremented if any + change in the repo whatsoever is made and no other version component has been incremented. + +The `` header defines `THRUST_*` macros for all of the version components mentioned +above. Additionally, a `THRUST_VERSION` macro is defined, which is an integer literal containing all +of the version components except for `THRUST_PATCH_NUMBER` + +## Thrust Releases + +| Thrust Release | Included In | +| ----------------- | ------------------------------ | +| 1.9.10 | NVIDIA HPC SDK 20.5 | +| 1.9.9 | CUDA Toolkit 11.0 | +| 1.9.8-1 | NVIDIA HPC SDK 20.3 | +| 1.9.8 | CUDA Toolkit 11.0 Early Access | +| 1.9.7-1 | CUDA Toolkit 10.2 for Tegra | +| 1.9.7 | CUDA Toolkit 10.2 | +| 1.9.6-1 | NVIDIA HPC SDK 20.3 | +| 1.9.6 | CUDA Toolkit 10.1 Update 2 | +| 1.9.5 | CUDA Toolkit 10.1 Update 1 | +| 1.9.4 | CUDA Toolkit 10.1 | +| 1.9.3 | CUDA Toolkit 10.0 | +| 1.9.2 | CUDA Toolkit 9.2 | +| 1.9.1 | CUDA Toolkit 9.1 | +| 1.9.0 | CUDA Toolkit 9.0 | +| 1.8.3 | CUDA Toolkit 8.0 | +| 1.8.2 | CUDA Toolkit 7.5 | +| 1.8.1 | CUDA Toolkit 7.0 | +| 1.8.0 | | +| 1.7.2 | CUDA Toolkit 6.5 | +| 1.7.1 | CUDA Toolkit 6.0 | +| 1.7.0 | CUDA Toolkit 5.5 | +| 1.6.0 | | +| 1.5.3 | CUDA Toolkit 5.0 | +| 1.5.2 | CUDA Toolkit 4.2 | +| 1.5.1 | CUDA Toolkit 4.1 | +| 1.5.0 | | +| 1.4.0 | CUDA Toolkit 4.0 | +| 1.3.0 | CUDA Toolkit 3.2 | +| 1.2.1 | 
CUDA Toolkit 3.1 | +| 1.2.0 | | +| 1.1.1 | | +| 1.1.0 | | +| 1.0.0 | | + +## Branches and Tags + +The following tag names are used in the Thrust project: + + * `github/nvhpc-X.Y`: the tag that directly corresponds to what has been shipped in the NVIDIA HPC SDK release X.Y. + * `github/cuda-X.Y`: the tag that directly corresponds to what has been shipped in the CUDA Toolkit release X.Y. + * `github/A.B.C`: the tag that directly corresponds to a Thrust version A.B.C. + +The following branch names are used in the Thrust project: + + * `github/master`: the Source of Truth development branch of Thrust. + * `github/old-master`: the old Source of Truth branch, before unification of public and internal repositories. + * `github/feature/`: feature branch for a feature under development. + * `github/bug//-`: bug fix branch, where `bug-system` is `github` or `nvidia`. + * `gitlab/master`: mirror of `github/master`. + * `perforce/private`: mirrored `github/master`, plus files necessary for internal NVIDIA testing systems. + +On the rare occasion that we cannot do work in the open, for example when developing a change specific to an +unreleased product, these branches may exist on `gitlab` instead of `github`. By default, everything should be +in the open on `github` unless there is a strong motivation for it to not be open. + diff --git a/thrust/version.h b/thrust/version.h index 06e6cfa51..84f9af141 100644 --- a/thrust/version.h +++ b/thrust/version.h @@ -47,7 +47,7 @@ * THRUST_VERSION / 100 % 1000 is the minor version. * THRUST_VERSION / 100000 is the major version. */ -#define THRUST_VERSION 100910 +#define THRUST_VERSION 101000 /*! \def THRUST_MAJOR_VERSION * \brief The preprocessor macro \p THRUST_MAJOR_VERSION encodes the