diff --git a/README.md b/README.md
index 28682a073..1b55873f7 100644
--- a/README.md
+++ b/README.md
@@ -44,7 +44,7 @@ int main(void)
   return 0;
 }
 ```
-  
+
 This code sample computes the sum of 100 random numbers in parallel:
 
 ```c++
@@ -81,4 +81,4 @@ for details.
 Development process
 -------------------
 
-For information on development process and branching, see [this document](doc/branching.md).
+For information on development process, see [this document](doc/development_model.md).
diff --git a/dependencies/cub b/dependencies/cub
index d106ddb99..2a231db32 160000
--- a/dependencies/cub
+++ b/dependencies/cub
@@ -1 +1 @@
-Subproject commit d106ddb991a56c3df1b6d51b2409e36ba8181ce4
+Subproject commit 2a231db3226a9bfcd008bb6120bec12fe0a98cd1
diff --git a/doc/branching.md b/doc/branching.md
deleted file mode 100644
index 90ca0f375..000000000
--- a/doc/branching.md
+++ /dev/null
@@ -1,127 +0,0 @@
-# Thrust Branching and Development Model
-
-The following is a description of how the Thrust development teams approaches branching and release tagging. This
-is a living document that will evolve as our process evolves.
-
-## Thrust Version
-
-Thrust has historically had its own versioning system, independent of the versioning scheme of the CUDA Toolkit.
-Today, Thrust is released with the CUDA Toolkit, but we currently still maintain the double versioning scheme.
-
-The following is a mapping from Thrust versions to CUDA Toolkit versions and vice versa. Note that some Thrust versions don't directly map to any CUDA Toolkit version.
-
-| Thrust version    | CUDA version  |
-| ----------------- | ------------- |
-| 1.9.8             | 11.0 EA       |
-| 1.9.7             | 10.2          |
-| 1.9.6             | 10.1 Update 2 |
-| 1.9.5             | 10.1 Update 1 |
-| 1.9.4             | 10.1          |
-| 1.9.3             | 10.0          |
-| 1.9.2             | 9.2           |
-| 1.9.1             | 9.1           |
-| 1.9.0             | 9.0           |
-| 1.8.3             | 8.0           |
-| 1.8.2             | 7.5           |
-| 1.8.1             | 7.0           |
-| 1.8.0             | *N/A*         |
-| 1.7.2             | 6.5           |
-| 1.7.1             | 6.0           |
-| 1.7.0             | 5.5           |
-| 1.6.0             | *N/A*         |
-| 1.5.3             | 5.0           |
-| 1.5.2             | 4.2           |
-| 1.5.1             | 4.1           |
-| 1.5.0             | *N/A*         |
-| 1.4.0             | 4.0           |
-| 1.3.0             | 3.2           |
-| 1.2.1             | 3.1           |
-| 1.2.0             | *N/A*         |
-| 1.1.1             | *N/A*         |
-| 1.1.0             | *N/A*         |
-| 1.0.0             | *N/A*         |
-
-## Repositories
-
-As Thrust is developed both on GitHub and internally at NVIDIA, there's three main places where code lives:
-
-  * The [public Thrust repository](https://github.com/thrust/thrust), referred to as `github` later in this
-    document.
-  * An internal GitLab repository, referred to as `gitlab` later in this document.
-  * An internal Perforce repository, referred to as `perforce` later in this document.
-
-## Branches and Tags
-
-The following tag names are used in the Thrust project:
-
-  * `github/cuda-X.Y`: the tag that directly corresponds to what has been shipped in the CUDA Toolkit release X.Y.
-  * `github/A.B.C`: the tag that directly corresponds to a Thrust version A.B.C.
-
-The following branch names are used in the Thrust project:
-
-  * `github/master`: the Source of Truth development branch of Thrust.
-  * `github/old-master`: the old Source of Truth branch, before unification of public and internal repositories.
-  * `perforce/private`: mirrored github/master, plus files necessary for internal NVIDIA testing systems.
-  * `gitlab/staging/cuda-X.Y`: the branch for a CUDA Toolkit release that has not been released yet. cuda-X.Y should
-    be tagged on this branch after the final commit freeze (see "Release branches" below).
-  * `github/maintenance/cuda-Z.W`: the continuation of gitlab/staging/cuda-Z.W, but after release of CUDA Z.W, plus
-    post-release fixes if any are needed (see "Old release branches" below).
-  * `gitlab/feature/<name>`: feature branch for internally developed features.
-  * `gitlab/bug/<bug-system>-<bug-id>`: bug fix branch, where `bug-system` is `github` or `nvbug`. Permits a description
-    after `bug-id`.
-  * `gitlab/master`: same as `github/master`, but not yet published, during a freezing period (see "Feature freeze"
-    below).
-
-## Development Process Described
-
-### Normal development
-
-During regular parts of the development cycle, when we develop features on feature branches, and fix bugs on the
-main branch, we can:
-
-  * Merge internal fixes to `github/master` and to `perforce/private`.
-  * Merge Github contributions to `github/master` and to `perforce/private`.
-
-### Feature freeze
-
-In case where we have a new feature for a CUDA Toolkit release: just before the CUDA Toolkit feature freeze for a
-new release branch, we should stop merging commits (including public contributions) to `github/master`, and move to
-development on `gitlab/master`, and merge the not yet public features there.
-
-In those cases, we should wait until the new version of the toolkit is released before we push the new updated
-`gitlab/master` to `github/master`, roughly at the same time as we push from `gitlab/staging/cuda-X.Y` to
-`github/maintenance/cuda-X.Y` and tag `cuda-X.Y`, and the appropriate Thrust version tag.
-
-If we don't have big, not-public-before-release features landing in X.Y, however, we can avoid having a feature
-freeze period.
-
-The reason for having a freeze period at all is: `github/master` is supposed to be the Source of Truth. We want the
-history to follow the same order of commits in both Git and Perforce, and once a change is merged, we cannot rebase
-things that went into `perforce/internal` on top of it. Therefore: since we only really commit to Perforce but not
-`github/master` when we have a feature that is ready to be delivered, but is only a part of a new release and
-shouldn't/can't be public yet, we have to make sure that after it is merged to `gitlab/master` (and to `perforce/internal`),
-nothing new lands in `github/master` before we push the feature out.
-
-To avoid situations like this with bug fixes, when we fix a bug at a not crazy point in the release cycle, we
-should develop it on git, merge/push it on Github, and then pull the new commit to Perforce.
-
-### Release branches
-
-These are the internal Git branches that map directly to internal CUDA release branches. These branches are primarily
-developed in Git, and commits applied to them are then pushed to Perforce.
-
-After a CUDA Toolkit version is released, these transition to being old release branches.
-
-### Old release branches
-
-These branches represent a version that has landed in a CUDA Toolkit version, but with bugfixes for things that do
-deserve being fixed on a release branch. These shouldn't be groundbreaking; the following are an acceptable set of
-fixes to go into these branches, because they can remove annoyances, but shouldn't change behavior:
-
-  * Documentation fixes and updates.
-  * Thrust build system changes.
-  * Additional examples, fixes to examples and tests.
-  * (Possibly:) Fixing missing headers. This one is slightly less obvious, because it makes it possible for users
-    of standalone Thrust to write programs that won't compile with CUDA Thrust. Determinations will be made on a
-    case by case basis.
-
diff --git a/doc/changelog.md b/doc/changelog.md
index 85997e8ae..d51a26247 100644
--- a/doc/changelog.md
+++ b/doc/changelog.md
@@ -1,30 +1,261 @@
-# Thrust v1.9.8 (CUDA 11.0)
+# Thrust 1.9.10 (NVIDIA HPC SDK 20.5)
 
 ## Summary
 
-Thrust v1.9.8, which is included in the CUDA 11.0 release, removes Thrust's
-  internal derivative of CUB, upstreams all relevant changes too CUB, and adds
-  CUB as a Git submodule.
+Thrust 1.9.10 is the release accompanying the NVIDIA HPC SDK 20.5.
+It adds CMake support for compilation with NVC++ and a number of minor bug fixes
+  for NVC++.
+It also adds CMake `find_package` support.
+
+## New Features
+
+- #1130: CMake `find_package` support.
+  This is significant because there is a legacy `FindThrust.cmake` script
+    authored by a third party in widespread use in the community which has a
+    bug in how it parses Thrust version numbers which will cause it to
+    incorrectly parse 1.9.10.
+  This script only handles the first digit of each part of the Thrust version
+    number correctly: for example, Thrust 17.17.17 would be interpreted as
+    Thrust 1.1.1701717.
+  You can find directions for using the new CMake `find_package` support and
+    migrating away from the legacy `FindThrust.cmake` [here](https://github.com/thrust/thrust/blob/master/thrust/cmake/README.md).
+- #1129: Added `thrust::detail::single_device_tls_caching_allocator`, a
+    convenient way to get an MR caching allocator for device memory, which is
+    used by NVC++.
+
+## Other Enhancements
+
+- #1129: Refactored RDC handling in CMake to be a global option and not create
+    two targets for each example and test.
+
+## Bug Fixes
+
+- #1129: Fix the legacy `thrust::return_temporary_buffer` API to support
+    passing a size.
+  This was necessary to enable usage of Thrust caching MR allocators with
+    synchronous Thrust algorithms.
+  This change has allowed NVC++’s C++17 Parallel Algorithms implementation to
+    switch to use Thrust caching MR allocators for device temporary storage,
+    which gives a 2x speedup on large multi-GPU systems such as V100 and A100
+    DGX where `cudaMalloc` is very slow.
+- #1128: Respect `CUDA_API_PER_THREAD_DEFAULT_STREAM`.
+  Thanks to Rong Ou for this contribution.
+- #1131: Fix the one-policy overload of `thrust::async::copy` to not copy the
+    policy, resolving use-after-move issues.
+- #1145: When cleaning up type names in `unittest::base_class_name`, only call
+    `std::string::replace` if we found the substring we are looking to replace.
+- #1139: Don't use `cxx::__demangle` in NVC++.
+- #1102: Don't use `thrust::detail::normal_distribution_nvcc` for Feta because
+    it uses `erfcinv`, a non-standard function that Feta doesn't have.
+
+# Thrust 1.9.9 (CUDA Toolkit 11.0)
+
+## Summary
+
+Thrust 1.9.9 adds support for NVC++, which uses Thrust to implement
+  GPU-accelerated C++17 Parallel Algorithms.
+`thrust::zip_function` and `thrust::shuffle` were also added.
+As of this release, C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are
+  deprecated.
+Starting with the upcoming 1.10.0 release, C++03 support will be dropped
+  entirely.
+All other deprecated platforms will be dropped in the near future.
+
+## Breaking Changes
+
+- #1082: Thrust now checks that it is compatible with the version of CUB found
+    in your include path, generating an error if it is not.
+  If you are using your own version of CUB, it may be too old.
+  It is recommended to simply delete your own version of CUB and use the
+    version of CUB that comes with Thrust.
+- #1089: C++03 and C++11 are deprecated.
+  Using these dialects will generate a compile-time warning.
+  These warnings can be suppressed by defining
+    `THRUST_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11
+    deprecation warnings) or `THRUST_IGNORE_DEPRECATED_CPP11` (to suppress C++11
+    deprecation warnings).
+  Suppression is only a short term solution.
+  We will be dropping support for C++03 in the 1.10.0 release and C++11 in the
+    near future.
+- #1089: GCC < 5, Clang < 6, and MSVC < 2017 are deprecated.
+  Using these compilers will generate a compile-time warning.
+  These warnings can be suppressed by defining
+  `THRUST_IGNORE_DEPRECATED_COMPILER`.
+  Suppression is only a short term solution.
+  We will be dropping support for these compilers in the near future.
+
+## New Features
+
+- #1086: Support for NVC++ aka "Feta".
+  The most significant change is in how we use `__CUDA_ARCH__`.
+  Now, there are four macros that must be used:
+  - `THRUST_IS_DEVICE_CODE`, which should be used in an `if` statement around
+      device-only code.
+  - `THRUST_INCLUDE_DEVICE_CODE`, which should be used in an `#if` preprocessor
+      directive inside of the `if` statement mentioned in the prior bullet.
+  - `THRUST_IS_HOST_CODE`, which should be used in an `if` statement around
+      host-only code.
+  - `THRUST_INCLUDE_HOST_CODE`, which should be used in an `#if` preprocessor
+      directive inside of the `if` statement mentioned in the prior bullet.
+- #1085: `thrust::shuffle`.
+  Thanks to Rory Mitchell for this contribution.
+- #1029: `thrust::zip_function`, a facility for zipping functions that take N
+    parameters instead of a tuple of N parameters as `thrust::zip_iterator`
+    does.
+  Thanks to Ben Jude for this contribution.
+- #1068: `thrust::system::cuda::managed_memory_pointer`, a universal memory
+    strongly typed pointer compatible with the ISO C++ Standard Library.
+
+## Other Enhancements
+
+- #1029: Thrust is now built and tested with NVCC warnings treated as errors.
+- #1029: MSVC C++11 support.
+- #1029: `THRUST_DEPRECATED` abstraction for generating compile-time
+    deprecation warning messages.
+- #1029: `thrust::pointer<T>::pointer_to(reference)`.
+- #1070: Unit test for `thrust::inclusive_scan` with user-defined types.
+  Thanks to Conor Hoekstra for this contribution.
+
+## Bug Fixes
+
+- #1088: Allow `thrust::replace` to take functions that have non-`const`
+    `operator()`.
+- #1094: Add missing `constexpr` to `par_t` constructors.
+  Thanks to Patrick Stotko for this contribution.
+- #1077: Remove `__device__` from CUDA MR-based device allocators to fix
+    obscure "host function called from host device function" warning that occurs
+    when you use the new Thrust MR-based allocators.
+- #1029: Remove inconsistently-used `THRUST_BEGIN`/`END_NS` macros.
+- #1029: Fix C++ dialect detection on newer MSVC.
+- #1029: Use `_Pragma`/`__pragma` instead of `#pragma` in macros.
+- #1029: Replace raw `__cplusplus` checks with the appropriate Thrust macros.
+- #1105: Add a missing `<math.h>` include.
+- #1103: Fix regression of `thrust::detail::temporary_allocator` with non-CUDA
+    back ends.
+- #1111: Use Thrust's random number engine instead of `std::`s in device code.
+- #1108: Get rid of a GCC 9 warning about deprecated generation of copy ctors.
+
+# Thrust 1.9.8-1 (NVIDIA HPC SDK 20.3)
+
+## Summary
+
+Thrust 1.9.8-1 is a variant of 1.9.8 accompanying the NVIDIA HPC SDK 20.3.
+It contains modifications necessary to serve as the implementation of NVC++'s
+  GPU-accelerated C++17 Parallel Algorithms.
+
+# Thrust 1.9.8 (CUDA Toolkit 11.0 Early Access)
+
+## Summary
+
+Thrust 1.9.8, which is included in the CUDA Toolkit 11.0 release, removes
+  Thrust's internal derivative of CUB, upstreams all relevant changes to CUB,
+  and adds CUB as a Git submodule.
 It will now be necessary to do `git clone --recursive` when checking out
   Thrust, and to update the CUB submodule when pulling in new Thrust changes.
 Additionally, CUB is now included as a first class citizen in the CUDA toolkit.
-Thrust v1.9.8 also fixes bugs preventing most Thrust algorithms from working
-  with more than `2^32` elements.
-Now, `reduce`, `*_scan`, and related algorithms (aka most of Thrust) work with
-  large element counts.
-`sort` remains limited to `2^32` elements for now.
+Thrust 1.9.8 also fixes bugs preventing most Thrust algorithms from working
+  with more than `2^31-1` elements.
+Now, `thrust::reduce`, `thrust::*_scan`, and related algorithms (aka most of
+  Thrust) work with large element counts.
+
+## Breaking Changes
+
+- Thrust will now use the version of CUB in your include path instead of its own
+    internal copy.
+  If you are using your own version of CUB, it may be older and incompatible
+    with Thrust.
+  It is recommended to simply delete your own version of CUB and use the
+    version of CUB that comes with Thrust.
+
+## Other Enhancements
+
+- Refactor Thrust and CUB to support 64-bit indices in most algorithms.
+  In most cases, Thrust now selects between kernels that use 32-bit indices and
+    64-bit indices at runtime depending on the size of the input.
+  This means large element counts work, but small element counts do not have to
+    pay for the register usage of 64-bit indices if they are not needed.
+  Now, `thrust::reduce`, `thrust::*_scan`, and related algorithms (aka most of
+    Thrust) work with more than `2^31-1` elements.
+  Notably, `thrust::sort` is still limited to less than `2^31-1` elements.
+- CUB is now a submodule and the internal copy of CUB has been removed.
+- #1051: Stop specifying the `__launch_bounds__` minimum blocks parameter
+    because it messes up register allocation and increases register pressure,
+    and we don't actually know at compile time how many blocks we will use
+    (aside from single tile kernels).
 
-# Thrust v1.9.7 (CUDA 10.2)
+## Bug Fixes
+
+- #1020: After making a CUDA API call, always clear the global CUDA error state
+    by calling `cudaGetLastError`.
+- #1021: Avoid calling destroy in the destructor of a Thrust vector if the
+    vector is empty.
+- #1046: Actually throw `thrust::bad_alloc` when `thrust::system::cuda::malloc`
+    fails instead of just constructing a temporary and doing nothing with it.
+- Add missing copy constructor or copy assignment operator to all classes that
+    GCC 9's `-Wdeprecated-copy` complains about.
+- Add missing move operations to `thrust::system::cuda::vector`.
+- #1015: Check that the backend is CUDA before using CUDA-specifics in
+    `thrust::detail::temporary_allocator`.
+  Thanks to Hugh Winkler for this contribution.
+- #1055: More correctly detect the presence of aligned/sized `new`/`delete`.
+- #1043: Fix ill-formed specialization of `thrust::system::is_error_code_enum`
+    for `thrust::event_errc`.
+  Thanks to Toru Niina for this contribution.
+- #1027: Add tests for `thrust::tuple_for_each` and `thrust::tuple_subset`.
+  Thanks to Ben Jude for this contribution.
+- #1027: Use correct macro in `thrust::tuple_for_each`.
+  Thanks to Ben Jude for this contribution.
+- #1026: Use correct MSVC version formatting in CMake.
+  Thanks to Ben Jude for this contribution.
+- Work around an NVCC issue with type aliases with template template arguments
+    containing a parameter pack.
+- Remove unused functions from the CUDA backend which call slow CUDA attribute
+    query APIs.
+- Replace `CUB_RUNTIME_FUNCTION` with `THRUST_RUNTIME_FUNCTION`.
+- Correct typo in `thrust::transform` documentation.
+  Thanks to Eden Yefet for this contribution.
+
+## Known Issues
+
+- `thrust::sort` remains limited to `2^31-1` elements for now.
+
+# Thrust 1.9.7-1 (CUDA Toolkit 10.2)
 
 ## Summary
 
-Thrust v1.9.7 is a minor release accompanying the CUDA 10.2 release.
+Thrust 1.9.7-1 is a minor release accompanying the CUDA Toolkit 10.2 release
+  for Tegra.
+It is nearly identical to 1.9.7.
 
-# Thrust v1.9.6 (CUDA 10.1 Update 2)
+# Thrust 1.9.7 (CUDA Toolkit 10.2)
 
 ## Summary
 
-Thrust v1.9.6 is a minor release accompanying the CUDA 10.1 Update 2 release.
+Thrust 1.9.7 is a minor release accompanying the CUDA Toolkit 10.2 release.
+
+## Bug Fixes
+
+- #967, NVBug 2448170: Fix the CUDA backend `thrust::for_each` so that it
+    supports large input sizes with 64-bit indices.
+- NVBug 2646034: Fix incorrect dependency handling for stream acquisition in
+    `thrust::future`.
+- #968, NVBug 2612102: Fix the `thrust::mr::polymorphic_adaptor` to actually
+    use its template parameter.
+
+# Thrust 1.9.6-1 (NVIDIA HPC SDK 20.3)
+
+## Summary
+
+Thrust 1.9.6-1 is a variant of 1.9.6 accompanying the NVIDIA HPC SDK 20.3.
+It contains modifications necessary to serve as the implementation of NVC++'s
+  GPU-accelerated C++17 Parallel Algorithms.
+
+# Thrust 1.9.6 (CUDA Toolkit 10.1 Update 2)
+
+## Summary
+
+Thrust 1.9.6 is a minor release accompanying the CUDA Toolkit 10.1 Update 2
+  release.
 
 ## Bug Fixes
 
@@ -34,23 +265,24 @@ Thrust v1.9.6 is a minor release accompanying the CUDA 10.1 Update 2 release.
 - NVBug 200488234 CUDA header files contain unicode characters which leads
     compiling errors on Windows
 - #949, #973, NVBug 2422333, NVBug 2522259, NVBug 2528822
-  `thrust::detail::aligned_reinterpret_cast` must be annotated with
-  `__host__ __device__`.
+    `thrust::detail::aligned_reinterpret_cast` must be annotated with
+    `__host__ __device__`.
 - NVBug 2599629 Missing include in the OpenMP sort implementation
 - NVBug 200513211 Truncation warning in test code under VC142
 
-# Thrust v1.9.5 (CUDA 10.1 Update 1)
+# Thrust 1.9.5 (CUDA Toolkit 10.1 Update 1)
 
 ## Summary
 
-Thrust 1.9.5 is a minor release accompanying the CUDA 10.1 Update 1 release.
+Thrust 1.9.5 is a minor release accompanying the CUDA Toolkit 10.1 Update 1
+  release.
 
 ## Bug Fixes
 
 - NVBug 2502854: Fixed assignment of
     `thrust::device_vector<thrust::complex<T>>` between host and device.
 
-# Thrust 1.9.4 (CUDA 10.1)
+# Thrust 1.9.4 (CUDA Toolkit 10.1)
 
 ## Summary
 
@@ -287,7 +519,7 @@ Use the new asynchronous Thrust algorithms for non-blocking behavior.
 - #924, NVBug 2096679, NVBug 2315990: Fix dispatch for the CUDA backend's
     `thrust::reduce` to use two functions (one with the pragma for disabling
     exec checks, one with `THRUST_RUNTIME_FUNCTION`) instead of one. This fixes
-    a regression with device compilation that started in CUDA 9.2.
+    a regression with device compilation that started in CUDA Toolkit 9.2.
 - #928, NVBug 2341455: Add missing `__host__ __device__` annotations to a
     `thrust::complex::operator=` to satisfy GoUDA.
 - NVBug 2094642: Make `thrust::vector_base::clear` not depend on the element
@@ -300,7 +532,7 @@ Use the new asynchronous Thrust algorithms for non-blocking behavior.
     `thrust::counting_iterator` perform proper truncation.
 - NVBug 2455952: Refactor questionable `thrust::copy_if` unit tests.
 
-# Thrust 1.9.3 (CUDA 10.0)
+# Thrust 1.9.3 (CUDA Toolkit 10.0)
 
 ## Summary
 
@@ -328,7 +560,7 @@ Thrust 1.9.3 unifies and integrates CUDA Thrust and GitHub Thrust.
 - Thanks to Francisco Facioni for contributing optimizations for
     `thrust::min/max_element`.
 
-# Thrust 1.9.2 (CUDA 9.2)
+# Thrust 1.9.2 (CUDA Toolkit 9.2)
 
 ## Summary
 
@@ -367,7 +599,7 @@ Additionally, the unit test suite and framework was enhanced to increase
     overlooked but `deallocate` to be called with GCC <= 4.3.
 - NVBug 1777043: Fixed `thrust::complex` to work with `thrust::sequence`.
 
-# Thrust 1.9.1 (CUDA 9.1)
+# Thrust 1.9.1 (CUDA Toolkit 9.1)
 
 ## Summary
 
@@ -382,7 +614,7 @@ for `thrust::reduce` based on CUB.
 - NVBug 1904217: Allow callables that take non-const refs to be used with
     `thrust::reduce` and `thrust::*_scan`.
 
-# Thrust 1.9.0 (CUDA 9.0)
+# Thrust 1.9.0 (CUDA Toolkit 9.0)
 
 ## Summary
 
@@ -430,7 +662,7 @@ This brings a substantial performance improvement to the CUDA backend across
 - Thanks to Duane Merrill for developing CUB and helping to integrate it into
     Thrust's backend.
 
-# Thrust 1.8.3 (CUDA 8.0)
+# Thrust 1.8.3 (CUDA Toolkit 8.0)
 
 Thrust 1.8.3 is a small bug fix release.
 
@@ -446,7 +678,7 @@ Thrust 1.8.3 is a small bug fix release.
 - `thrust::clear` operations on vector types no longer requires the element
     type to have a default constructor.
 
-# Thrust 1.8.2 (CUDA 7.5)
+# Thrust 1.8.2 (CUDA Toolkit 7.5)
 
 Thrust 1.8.2 is a small bug fix release.
 
@@ -465,7 +697,7 @@ Thrust 1.8.2 is a small bug fix release.
 - #628: `thrust::reduce_by_key` for the CUDA backend fails for Compute
     Capability 5.0 devices.
 
-# Thrust 1.8.1 (CUDA 7.0)
+# Thrust 1.8.1 (CUDA Toolkit 7.0)
 
 Thrust 1.8.1 is a small bug fix release.
 
@@ -481,53 +713,44 @@ Thrust 1.8.1 is a small bug fix release.
 
 # Thrust 1.8.0
 
-Summary
-- Thrust 1.8.0 introduces support for algorithm invocation from CUDA __device__ code, support for CUDA streams,
-- and algorithm performance improvements. Users may now invoke Thrust algorithms from CUDA __device__ code,
-- providing a parallel algorithms library to CUDA programmers authoring custom kernels, as well as allowing
-- Thrust programmers to nest their algorithm calls within functors. The thrust::seq execution policy
-- allows users to require sequential algorithm execution in the calling thread and makes a
-- sequential algorithms library available to individual CUDA threads. The .on(stream) syntax allows users to
-- request a CUDA stream for kernels launched during algorithm execution. Finally, new CUDA algorithm
-- implementations provide substantial performance improvements.
+## Summary
+Thrust 1.8.0 introduces support for algorithm invocation from CUDA device
+  code, support for CUDA streams, and algorithm performance improvements.
+Users may now invoke Thrust algorithms from CUDA device code, providing a
+  parallel algorithms library to CUDA programmers authoring custom kernels, as
+  well as allowing Thrust programmers to nest their algorithm calls within
+  functors.
+The `thrust::seq` execution policy allows users to require sequential algorithm
+  execution in the calling thread and makes a sequential algorithms library
+  available to individual CUDA threads.
+The `.on(stream)` syntax allows users to request a CUDA stream for kernels
+  launched during algorithm execution.
+Finally, new CUDA algorithm implementations provide substantial performance
+  improvements.
 
 ## New Features
-- Algorithms in CUDA __device__ code
-      Thrust algorithms may now be invoked from CUDA __device__ and __host__ __device__ functions.
-
-      Algorithms invoked in this manner must be invoked with an execution policy as the first parameter:
-
-      __device__ int my_device_sort(int *data, size_t n)
-      {
-        thrust::sort(thrust::device, data, data + n);
-      }
-
+- Algorithms in CUDA Device Code:
+    - Thrust algorithms may now be invoked from CUDA `__device__` and
+        `__host__ __device__` functions.
+      Algorithms invoked in this manner must be invoked with an execution
+        policy as the first parameter.
       The following execution policies are supported in CUDA __device__ code:
-        thrust::seq
-        thrust::cuda::par
-        thrust::device, when THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA
-
-      Parallel algorithm execution may not be accelerated unless CUDA Dynamic Parallelism is available.
-
-- Execution Policies
-      CUDA Streams
-        The thrust::cuda::par.on(stream) syntax allows users to request that CUDA __global__ functions launched during algorithm
-        execution should occur on a given stream:
-
-        // execute for_each on stream s
-        thrust::for_each(thrust::cuda::par.on(s), begin, end, my_functor);
-
-        Algorithms executed with a CUDA stream in this manner may still synchronize with other streams when allocating temporary
-        storage or returning results to the CPU.
-
-      thrust::seq
-        The thrust::seq execution policy allows users to require that an algorithm execute sequentially in the calling thread:
-
-        // execute for_each sequentially in this thread
-        thrust::for_each(thrust::seq, begin, end, my_functor);
-
-- Other
-      The new thrust::complex template provides complex number support.
+      - `thrust::seq`
+      - `thrust::cuda::par`
+      - `thrust::device`, when `THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA`.
+  - Device-side algorithm execution may not be parallelized unless CUDA Dynamic
+      Parallelism is available.
+- Execution Policies:
+  - CUDA Streams
+    - The `thrust::cuda::par.on(stream)` syntax allows users to request that
+        CUDA kernels launched during algorithm execution should occur on a given
+        stream.
+    - Algorithms executed with a CUDA stream in this manner may still
+        synchronize with other streams when allocating temporary storage or
+        returning results to the CPU.
+  - `thrust::seq`, which allows users to require that an algorithm execute
+      sequentially in the calling thread.
+- `thrust::complex`, a complex number data type.
 
 ## New Examples
 - simple_cuda_streams demonstrates how to request a CUDA stream during algorithm execution.
@@ -565,7 +788,7 @@ Acknowledgments
 - Thanks to Duane Merrill for contributing a faster CUDA radix sort implementation.
 - Thanks to Filipe Maia for contributing the implementation of thrust::complex.
 
-# Thrust 1.7.2 (CUDA 6.5)
+# Thrust 1.7.2 (CUDA Toolkit 6.5)
 
 Summary
 - Small bug fixes
@@ -573,7 +796,7 @@ Summary
 ## Bug Fixes
 - Avoid use of std::min in generic find implementation
 
-# Thrust 1.7.1 (CUDA 6.0)
+# Thrust 1.7.1 (CUDA Toolkit 6.0)
 
 Summary
 - Small bug fixes
@@ -583,68 +806,83 @@ Summary
 - Eliminate unused variable warning in CUDA reduce_by_key implementation
 - Avoid deriving function objects from std::unary_function and std::binary_function
 
-# Thrust 1.7.0 (CUDA 5.5)
-
-Summary
-- Thrust 1.7.0 introduces a new interface for controlling algorithm execution as
-- well as several new algorithms and performance improvements. With this new
-- interface, users may directly control how algorithms execute as well as details
-- such as the allocation of temporary storage. Key/value versions of thrust::merge
-- and the set operation algorithms have been added, as well stencil versions of
-- partitioning algorithms. thrust::tabulate has been introduced to tabulate the
-- values of functions taking integers. For 32b types, new CUDA merge and set
-- operations provide 2-15x faster performance while a new CUDA comparison sort
-- provides 1.3-4x faster performance. Finally, a new TBB reduce_by_key implementation
-- provides 80% faster performance.
+# Thrust 1.7.0 (CUDA Toolkit 5.5)
 
-## Breaking Changes
-- Dispatch
-      Custom user backend systems' tag types must now inherit from the corresponding system's execution_policy template (e.g. thrust::cuda::execution_policy) instead
-      of the tag struct (e.g. thrust::cuda::tag). Otherwise, algorithm specializations will silently go unfound during dispatch.
-      See examples/minimal_custom_backend.cu and examples/cuda/fallback_allocator.cu for usage examples.
-
-      thrust::advance and thrust::distance are no longer dispatched based on iterator system type and thus may no longer be customized.
+## Summary
 
-- Iterators
-      iterator_facade and iterator_adaptor's Pointer template parameters have been eliminated.
-      iterator_adaptor has been moved into the thrust namespace (previously thrust::experimental::iterator_adaptor).
-      iterator_facade has been moved into the thrust namespace (previously thrust::experimental::iterator_facade).
-      iterator_core_access has been moved into the thrust namespace (previously thrust::experimental::iterator_core_access).
-      All iterators' nested pointer typedef (the type of the result of operator->) is now void instead of a pointer type to indicate that such expressions are currently impossible.
-      Floating point counting_iterators' nested difference_type typedef is now a signed integral type instead of a floating point type.
+Thrust 1.7.0 introduces a new interface for controlling algorithm execution as
+  well as several new algorithms and performance improvements.
+With this new interface, users may directly control how algorithms execute as
+  well as details such as the allocation of temporary storage.
+Key/value versions of thrust::merge and the set operation algorithms have been
+  added, as well as stencil versions of partitioning algorithms.
+thrust::tabulate has been introduced to tabulate the values of functions taking
+  integers.
+For 32b types, new CUDA merge and set operations provide 2-15x faster
+  performance while a new CUDA comparison sort provides 1.3-4x faster
+  performance.
+Finally, a new TBB reduce_by_key implementation provides 80% faster
+  performance.
 
-- Other
-      normal_distribution has been moved into the thrust::random namespace (previously thrust::random::experimental::normal_distribution).
-      Placeholder expressions may no longer include the comma operator.
+## Breaking Changes
+- Dispatch:
+  - Custom user backend systems' tag types must now inherit from the
+      corresponding system's execution_policy template (e.g.
+      thrust::cuda::execution_policy) instead of the tag struct (e.g.
+      thrust::cuda::tag). Otherwise, algorithm specializations will silently go
+      unfound during dispatch. See examples/minimal_custom_backend.cu and
+      examples/cuda/fallback_allocator.cu for usage examples.
+  - thrust::advance and thrust::distance are no longer dispatched based on
+      iterator system type and thus may no longer be customized.
+- Iterators:
+  - iterator_facade and iterator_adaptor's Pointer template parameters have
+      been eliminated.
+  - iterator_adaptor has been moved into the thrust namespace (previously
+      thrust::experimental::iterator_adaptor).
+  - iterator_facade has been moved into the thrust namespace (previously
+      thrust::experimental::iterator_facade).
+  - iterator_core_access has been moved into the thrust namespace (previously
+      thrust::experimental::iterator_core_access).
+  - All iterators' nested pointer typedef (the type of the result of
+      operator->) is now void instead of a pointer type to indicate that such
+      expressions are currently impossible.
+  - Floating point counting_iterators' nested difference_type typedef is now a
+      signed integral type instead of a floating point type.
+- Other:
+  - normal_distribution has been moved into the thrust::random namespace
+      (previously thrust::random::experimental::normal_distribution).
+  - Placeholder expressions may no longer include the comma operator.
 
 ## New Features
-- Execution Policies
-      Users may directly control the dispatch of algorithm invocations with optional execution policy arguments.
-      For example, instead of wrapping raw pointers allocated by cudaMalloc with thrust::device_ptr, the thrust::device execution_policy may be passed as an argument to an algorithm invocation to enable CUDA execution.
-      The following execution policies are supported in this version:
-
-        thrust::host
-        thrust::device
-        thrust::cpp::par
-        thrust::cuda::par
-        thrust::omp::par
-        thrust::tbb::par
-
-- Algorithms
-	free
-	get_temporary_buffer
-	malloc
-        merge_by_key
-        partition with stencil
-        partition_copy with stencil
-	return_temporary_buffer
-        set_difference_by_key
-        set_intersection_by_key
-        set_symmetric_difference_by_key
-        set_union_by_key
-        stable_partition with stencil
-        stable_partition_copy with stencil
-	tabulate
+- Execution Policies:
+  - Users may directly control the dispatch of algorithm invocations with
+      optional execution policy arguments.
+    For example, instead of wrapping raw pointers allocated by cudaMalloc with
+      thrust::device_ptr, the thrust::device execution_policy may be passed as
+      an argument to an algorithm invocation to enable CUDA execution.
+  - The following execution policies are supported in this version:
+    - `thrust::host`
+    - `thrust::device`
+    - `thrust::cpp::par`
+    - `thrust::cuda::par`
+    - `thrust::omp::par`
+    - `thrust::tbb::par`
+- Algorithms:
+  - `thrust::merge_by_key`
+  - `thrust::partition` with stencil
+  - `thrust::partition_copy` with stencil
+  - `thrust::set_difference_by_key`
+  - `thrust::set_intersection_by_key`
+  - `thrust::set_symmetric_difference_by_key`
+  - `thrust::set_union_by_key`
+  - `thrust::stable_partition` with stencil
+  - `thrust::stable_partition_copy` with stencil
+  - `thrust::tabulate`
+- Memory Allocation:
+  - `thrust::malloc`
+  - `thrust::free`
+  - `thrust::get_temporary_buffer`
+  - `thrust::return_temporary_buffer`
 
 ## New Examples
 - uninitialized_vector demonstrates how to use a custom allocator to avoid the automatic initialization of elements in thrust::device_vector.
@@ -678,24 +916,25 @@ Summary
 - #10 fix ambiguous overloads of reinterpret_tag
 
 ## Known Issues
-- g++ versions 4.3 and lower may fail to dispatch thrust::get_temporary_buffer correctly causing infinite recursion in examples such as cuda/custom_temporary_allocation.
+- GCC 4.3 and earlier may fail to dispatch thrust::get_temporary_buffer correctly, causing infinite recursion in examples such as cuda/custom_temporary_allocation.
 
-Acknowledgments
+## Acknowledgments
 - Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing a faster merge implementation for CUDA.
 - Thanks to Sean Baxter for contributing a faster set operation implementation for CUDA.
 - Thanks to Cliff Woolley for contributing a correct occupancy calculation algorithm.
 
 # Thrust 1.6.0
 
-Summary
-- Thrust v1.6.0 provides an interface for customization and extension and a new
-- backend system based on the Threading Building Blocks library. With this
-- new interface, programmers may customize the behavior of specific algorithms
-- as well as control the allocation of temporary storage or invent entirely new
-- backends. These enhancements also allow multiple different backend systems
-- such as CUDA and OpenMP to coexist within a single program. Support for TBB
-- allows Thrust programs to integrate more naturally into applications which
-- may already employ the TBB task scheduler.
+## Summary
+Thrust 1.6.0 provides an interface for customization and extension and a new
+  backend system based on the Threading Building Blocks library.
+With this new interface, programmers may customize the behavior of specific
+  algorithms as well as control the allocation of temporary storage or invent
+  entirely new backends.
+These enhancements also allow multiple different backend systems
+  such as CUDA and OpenMP to coexist within a single program.
+Support for TBB allows Thrust programs to integrate more naturally into
+  applications which may already employ the TBB task scheduler.
 
 ## Breaking Changes
 - The header <thrust/experimental/cuda/pinned_allocator.h> has been moved to <thrust/system/cuda/experimental/pinned_allocator.h>
@@ -708,25 +947,24 @@ Summary
 - thrust::any_space_tag has been renamed thrust::any_system_tag
 - thrust::iterator_space has been renamed thrust::iterator_system
 
-
 ## New Features
 - Backend Systems
-        Threading Building Blocks (TBB) is now supported
+  - Threading Building Blocks (TBB) is now supported
 - Functions
-        for_each_n
-        raw_reference_cast
+  - `thrust::for_each_n`
+  - `thrust::raw_reference_cast`
 - Types
-        pointer
-        reference
+  - `thrust::pointer`
+  - `thrust::reference`
 
 ## New Examples
-- cuda/custom_temporary_allocation
-- cuda/fallback_allocator
-- device_ptr
-- expand
-- minimal_custom_backend
-- raw_reference_cast
-- set_operations
+- `cuda/custom_temporary_allocation`
+- `cuda/fallback_allocator`
+- `device_ptr`
+- `expand`
+- `minimal_custom_backend`
+- `raw_reference_cast`
+- `set_operations`
 
 ## Other Enhancements
 - thrust::for_each now returns the end of the input range similar to most other algorithms
@@ -736,60 +974,59 @@ Summary
 - the safe use of different backend systems is now possible within a single binary
 
 ## Bug Fixes
-- #469 min_element and max_element algorithms no longer require a const comparison operator
+- #469 `min_element` and `max_element` algorithms no longer require a const comparison operator
 
 ## Known Issues
-- cudafe++.exe may crash when parsing TBB headers on Windows.
+- NVCC may crash when parsing TBB headers on Windows.
 
-# Thrust 1.5.3 (CUDA 5.0)
+# Thrust 1.5.3 (CUDA Toolkit 5.0)
 
-Summary
-- Small bug fixes
+Thrust 1.5.3 is a minor bug fix release.
 
 ## Bug Fixes
-- Avoid warnings about potential race due to __shared__ non-POD variable
+- Avoid warnings about potential race due to `__shared__` non-POD variable
 
-# Thrust 1.5.2 (CUDA 4.2)
+# Thrust 1.5.2 (CUDA Toolkit 4.2)
 
-Summary
-- Small bug fixes
+Thrust 1.5.2 is a minor bug fix release.
 
 ## Bug Fixes
 - Fixed warning about C-style initialization of structures
 
-# Thrust 1.5.1 (CUDA 4.1)
+# Thrust 1.5.1 (CUDA Toolkit 4.1)
 
-Summary
-- Small bug fixes
+Thrust 1.5.1 is a minor bug fix release.
 
 ## Bug Fixes
 - Sorting data referenced by permutation_iterators on CUDA produces invalid results
 
 # Thrust 1.5.0
 
-Summary
-- Thrust v1.5.0 provides introduces new programmer productivity and performance
-- enhancements. New functionality for creating anonymous "lambda" functions has
-- been added. A faster host sort provides 2-10x faster performance for sorting
-- arithmetic types on (single-threaded) CPUs. A new OpenMP sort provides
-- 2.5x-3.0x speedup over the host sort using a quad-core CPU. When sorting
-- arithmetic types with the OpenMP backend the combined performance improvement
-- is 5.9x for 32-bit integers and ranges from 3.0x (64-bit types) to 14.2x
-- (8-bit types). A new CUDA reduce_by_key implementation provides 2-3x faster
-- performance.
+## Summary
+
+Thrust 1.5.0 introduces new programmer productivity and performance
+  enhancements.
+New functionality for creating anonymous "lambda" functions has been added.
+A faster host sort provides 2-10x faster performance for sorting arithmetic
+  types on (single-threaded) CPUs.
+A new OpenMP sort provides 2.5x-3.0x speedup over the host sort using a
+  quad-core CPU.
+When sorting arithmetic types with the OpenMP backend, the combined performance
+  improvement is 5.9x for 32-bit integers and ranges from 3.0x (64-bit types) to
+  14.2x (8-bit types).
+A new CUDA `reduce_by_key` implementation provides 2-3x faster
+  performance.
 
 ## Breaking Changes
 - device_ptr<void> no longer unsafely converts to device_ptr<T> without an
-- explicit cast. Use the expression
-- device_pointer_cast(static_cast<int*>(void_ptr.get()))
-- to convert, for example, device_ptr<void> to device_ptr<int>.
+    explicit cast.
+  Use the expression device_pointer_cast(static_cast<int*>(void_ptr.get())) to
+    convert, for example, device_ptr<void> to device_ptr<int>.
 
 ## New Features
-- Functions
-        stencil-less transform_if
-
-- Types
-        lambda placeholders
+- Algorithms:
+  - Stencil-less `thrust::transform_if`.
+- Lambda placeholders
 
 ## New Examples
 - lambda
@@ -797,63 +1034,63 @@ Summary
 ## Other Enhancements
 - host sort is 2-10x faster for arithmetic types
 - OMP sort provides speedup over host sort
-- reduce_by_key is 2-3x faster
-- reduce_by_key no longer requires O(N) temporary storage
+- `reduce_by_key` is 2-3x faster
+- `reduce_by_key` no longer requires O(N) temporary storage
 - CUDA scan algorithms are 10-40% faster
-- host_vector and device_vector are now documented
+- `host_vector` and `device_vector` are now documented
 - out-of-memory exceptions now provide detailed information from CUDART
 - improved histogram example
-- device_reference now has a specialized swap
-- reduce_by_key and scan algorithms are compatible with discard_iterator
-
-Removed Functionality
+- `device_reference` now has a specialized swap
+- `reduce_by_key` and scan algorithms are compatible with `discard_iterator`
 
 ## Bug Fixes
-     #44 allow host_vector to compile when value_type uses __align__
-- #198 allow adjacent_difference to permit safe in-situ operation
+- #44 allow `host_vector` to compile when `value_type` uses `__align__`
+- #198 allow `adjacent_difference` to permit safe in-situ operation
 - #303 make thrust thread-safe
-- #313 avoid race conditions in device_vector::insert
+- #313 avoid race conditions in `device_vector::insert`
 - #314 avoid unintended adl invocation when dispatching copy
 - #365 fix merge and set operation failures
 
 ## Known Issues
 - None
 
-Acknowledgments
-- Thanks to Manjunath Kudlur for contributing his Carbon library, from which the lambda functionality is derived.
-- Thanks to Jean-Francois Bastien for suggesting a fix for issue 303.
+## Acknowledgments
+- Thanks to Manjunath Kudlur for contributing his Carbon library, from which
+    the lambda functionality is derived.
+- Thanks to Jean-Francois Bastien for suggesting a fix for #303.
 
-# Thrust 1.4.0 (CUDA 4.0)
+# Thrust 1.4.0 (CUDA Toolkit 4.0)
 
-Summary
-- Thrust v1.4.0 provides support for CUDA 4.0 in addition to many feature
-- and performance improvements.  New set theoretic algorithms operating on
-- sorted sequences have been added.  Additionally, a new fancy iterator
-- allows discarding redundant or otherwise unnecessary output from
-- algorithms, conserving memory storage and bandwidth.
+## Summary
+
+Thrust 1.4.0 provides support for CUDA Toolkit 4.0 in addition to many feature
+  and performance improvements.
+New set theoretic algorithms operating on sorted sequences have been added.
+Additionally, a new fancy iterator allows discarding redundant or otherwise
+  unnecessary output from algorithms, conserving memory storage and bandwidth.
 
 ## Breaking Changes
 - Eliminations
-        thrust/is_sorted.h
-        thrust/utility.h
-        thrust/set_intersection.h
-        thrust/experimental/cuda/ogl_interop_allocator.h and the functionality therein
-        thrust::deprecated::copy_when
-        thrust::deprecated::absolute_value
+  - `thrust/is_sorted.h`
+  - `thrust/utility.h`
+  - `thrust/set_intersection.h`
+  - `thrust/experimental/cuda/ogl_interop_allocator.h` and the functionality therein
+  - `thrust::deprecated::copy_when`
+  - `thrust::deprecated::absolute_value`
 
 ## New Features
-- Functions
-        copy_n
-        merge
-        set_difference
-        set_symmetric_difference
-        set_union
+- Algorithms:
+  - `thrust::copy_n`
+  - `thrust::merge`
+  - `thrust::set_difference`
+  - `thrust::set_symmetric_difference`
+  - `thrust::set_union`
 
 - Types
-        discard_iterator
+  - `thrust::discard_iterator`
 
-- Device support
-        Compute Capability 2.1 GPUs
+- Device Support:
+  - Compute Capability 2.1 GPUs.
 
 ## New Examples
 - run_length_decoding
@@ -892,73 +1129,72 @@ Removed Functionality
 - thrust::inclusive_scan, thrust::exclusive_scan, thrust::inclusive_scan_by_key,
 - and thrust::exclusive_scan_by_key are currently incompatible with thrust::discard_iterator.
 
-Acknowledgments
+## Acknowledgments
 - Thanks to David Tarjan for improving the performance of set_intersection.
 - Thanks to Duane Merrill for continued help with sort.
 - Thanks to Nathan Whitehead for help with CUDA Toolkit integration.
 
-# Thrust 1.3.0 (CUDA 3.2)
-
-Summary
-- Thrust v1.3.0 provides support for CUDA 3.2 in addition to many feature
-- and performance enhancements.
-
-- Performance of the sort and sort_by_key algorithms is improved by as much
-- as 3x in certain situations.  The performance of stream compaction algorithms,
-- such as copy_if, is improved by as much as 2x.  Reduction performance is
-- also improved, particularly for small input sizes.
+# Thrust 1.3.0 (CUDA Toolkit 3.2)
 
-- CUDA errors are now converted to runtime exceptions using the system_error
-- interface.  Combined with a debug mode, also new in v1.3, runtime errors
-- can be located with greater precision.
-
-- Lastly, a few header files have been consolidated or renamed for clarity.
-- See the deprecations section below for additional details.
+Thrust 1.3.0 provides support for CUDA Toolkit 3.2 in addition to many feature
+  and performance enhancements.
+Performance of the sort and sort_by_key algorithms is improved by as much as 3x
+  in certain situations.
+The performance of stream compaction algorithms, such as copy_if, is improved
+  by as much as 2x.
 
+CUDA errors are now converted to runtime exceptions using the system_error
+  interface.
+Combined with a debug mode, also new in 1.3, runtime errors can be located with
+  greater precision.
+Lastly, a few header files have been consolidated or renamed for clarity.
+See the deprecations section below for additional details.
 
 ## Breaking Changes
+
 - Promotions
-        thrust::experimental::inclusive_segmented_scan has been renamed thrust::inclusive_scan_by_key and exposes a different interface
-        thrust::experimental::exclusive_segmented_scan has been renamed thrust::exclusive_scan_by_key and exposes a different interface
-        thrust::experimental::partition_copy has been renamed thrust::partition_copy and exposes a different interface
-        thrust::next::gather has been renamed thrust::gather
-        thrust::next::gather_if has been renamed thrust::gather_if
-        thrust::unique_copy_by_key has been renamed thrust::unique_by_key_copy
+  - thrust::experimental::inclusive_segmented_scan has been renamed thrust::inclusive_scan_by_key and exposes a different interface
+  - thrust::experimental::exclusive_segmented_scan has been renamed thrust::exclusive_scan_by_key and exposes a different interface
+  - thrust::experimental::partition_copy has been renamed thrust::partition_copy and exposes a different interface
+  - thrust::next::gather has been renamed thrust::gather
+  - thrust::next::gather_if has been renamed thrust::gather_if
+  - thrust::unique_copy_by_key has been renamed thrust::unique_by_key_copy
 - Deprecations
-        thrust::copy_when has been renamed thrust::deprecated::copy_when
-        thrust::absolute_value has been renamed thrust::deprecated::absolute_value
-        The header thrust/set_intersection.h is now deprecated; use thrust/set_operations.h instead
-        The header thrust/utility.h is now deprecated; use thrust/swap.h instead
-        The header thrust/swap_ranges.h is now deprecated; use thrust/swap.h instead
+  - thrust::copy_when has been renamed thrust::deprecated::copy_when
+  - thrust::absolute_value has been renamed thrust::deprecated::absolute_value
+  - The header thrust/set_intersection.h is now deprecated; use thrust/set_operations.h instead
+  - The header thrust/utility.h is now deprecated; use thrust/swap.h instead
+  - The header thrust/swap_ranges.h is now deprecated; use thrust/swap.h instead
 - Eliminations
-        thrust::deprecated::gather
-        thrust::deprecated::gather_if
-        thrust/experimental/arch.h and the functions therein
-        thrust/sorting/merge_sort.h
-        thrust/sorting/radix_sort.h
+  - thrust::deprecated::gather
+  - thrust::deprecated::gather_if
+  - thrust/experimental/arch.h and the functions therein
+  - thrust/sorting/merge_sort.h
+  - thrust/sorting/radix_sort.h
+- NVCC 2.3 is no longer supported
 
 ## New Features
-- Functions
-        exclusive_scan_by_key
-        find
-        find_if
-        find_if_not
-        inclusive_scan_by_key
-        is_partitioned
-        is_sorted_until
-        mismatch
-        partition_point
-        reverse
-        reverse_copy
-        stable_partition_copy
-
-- Types
-        system_error and related types
-        experimental::cuda::ogl_interop_allocator
-        bit_and, bit_or, and bit_xor
-
-- Device support
-        gf104-based GPUs
+- Algorithms:
+  - `thrust::exclusive_scan_by_key`
+  - `thrust::find`
+  - `thrust::find_if`
+  - `thrust::find_if_not`
+  - `thrust::inclusive_scan_by_key`
+  - `thrust::is_partitioned`
+  - `thrust::is_sorted_until`
+  - `thrust::mismatch`
+  - `thrust::partition_point`
+  - `thrust::reverse`
+  - `thrust::reverse_copy`
+  - `thrust::stable_partition_copy`
+
+- Types:
+  - `thrust::system_error` and related types.
+  - `thrust::experimental::cuda::ogl_interop_allocator`.
+  - `thrust::bit_and`, `thrust::bit_or`, and `thrust::bit_xor`.
+
+- Device Support:
+  - GF104-based GPUs.
 
 ## New Examples
 - opengl_interop.cu
@@ -980,9 +1216,6 @@ Summary
 - Performance of device_vector initialized in .cpp files is substantially improved in common cases
 - Performance of thrust::sort_by_key on the host is substantially improved
 
-Removed Functionality
-- nvcc 2.3 is no longer supported
-
 ## Bug Fixes
 - Debug device code now compiles correctly
 - thrust::uninitialized_copy and thrust::unintialized_fill now dispatch constructors on the device rather than the host
@@ -998,10 +1231,11 @@ Acknowledgments
 - Thanks to Andrew Corrigan, Cliff Wooley, David Coeurjolly, Janick Martinez Esturo, John Bowers, Maxim Naumov, Michael Garland, and Ryuta Suzuki for bug reports
 - Thanks to Cliff Woolley for help with testing
 
-# Thrust 1.2.1 (CUDA 3.1)
+# Thrust 1.2.1 (CUDA Toolkit 3.1)
 
-Summary
-- Small fixes for compatibility with CUDA 3.1
+## Summary
+
+Small fixes for compatibility with CUDA Toolkit 3.1.
 
 ## Known Issues
 - inclusive_scan & exclusive_scan may fail with very large types
@@ -1013,13 +1247,15 @@ Summary
 
 # Thrust 1.2.0
 
-Summary
-- Thrust v1.2 introduces support for compilation to multicore CPUs
-- and the Ocelot virtual machine, and several new facilities for
-- pseudo-random number generation.  New algorithms such as set
-- intersection and segmented reduction have also been added.  Lastly,
-- improvements to the robustness of the CUDA backend ensure
-- correctness across a broad set of (uncommon) use cases.
+## Summary
+
+Thrust 1.2 introduces support for compilation to multicore CPUs and the Ocelot
+  virtual machine, and several new facilities for pseudo-random number
+  generation.
+New algorithms such as set intersection and segmented reduction have also been
+  added.
+Lastly, improvements to the robustness of the CUDA backend ensure correctness
+  across a broad set of (uncommon) use cases.
 
 ## Breaking Changes
 - thrust::gather's interface was incorrect and has been removed.
@@ -1030,166 +1266,170 @@ Summary
 - will be promoted to thrust:: in Thrust version 1.3. For more details,
 - please refer to this thread:
 - http://groups.google.com/group/thrust-users/browse_thread/thread/f5f0583cb97b51fd
-
 - The thrust::sorting namespace has been deprecated in favor of the
-- top-level sorting functions, such as thrust::sort() and
+- Top-level sorting functions, such as thrust::sort() and
 - thrust::sort_by_key().
+- Removed support for equal between host & device sequences
+- Removed support for gather() and scatter() between host & device sequences
 
 ## New Features
-- Functions
-        reduce_by_key
-        set_intersection
-        tie
-        unique_copy
-        unique_by_key
-        unique_copy_by_key
-
+- Algorithms:
+  - `thrust::reduce_by_key`
+  - `thrust::set_intersection`
+  - `thrust::unique_copy`
+  - `thrust::unique_by_key`
+  - `thrust::unique_copy_by_key`
 - Types
-        Random Number Generation
-            discard_block_engine
-            default_random_engine
-            linear_congruential_engine
-            linear_feedback_shift_engine
-            minstd_rand
-            minstd_rand0
-            normal_distribution (experimental)
-            ranlux24
-            ranlux48
-            ranlux24_base
-            ranlux48_base
-            subtract_with_carry_engine
-            taus88
-            uniform_int_distribution
-            uniform_real_distribution
-            xor_combine_engine
-        Functionals
-            project1st
-            project2nd
-
-- Fancy Iterators
-        permutation_iterator
-        reverse_iterator
-
-- Device support
-        Add support for multicore CPUs via OpenMP
-        Add support for Fermi-class GPUs
-        Add support for Ocelot virtual machine
+- Random Number Generation:
+  - `thrust::discard_block_engine`
+  - `thrust::default_random_engine`
+  - `thrust::linear_congruential_engine`
+  - `thrust::linear_feedback_shift_engine`
+  - `thrust::subtract_with_carry_engine`
+  - `thrust::xor_combine_engine`
+  - `thrust::minstd_rand`
+  - `thrust::minstd_rand0`
+  - `thrust::ranlux24`
+  - `thrust::ranlux48`
+  - `thrust::ranlux24_base`
+  - `thrust::ranlux48_base`
+  - `thrust::taus88`
+  - `thrust::uniform_int_distribution`
+  - `thrust::uniform_real_distribution`
+  - `thrust::normal_distribution` (experimental)
+- Function Objects:
+  - `thrust::project1st`
+  - `thrust::project2nd`
+- `thrust::tie`
+- Fancy Iterators:
+  - `thrust::permutation_iterator`
+  - `thrust::reverse_iterator`
+- Vector Functions:
+  - `operator!=`
+  - `rbegin`
+  - `crbegin`
+  - `rend`
+  - `crend`
+  - `data`
+  - `shrink_to_fit`
+- Device Support:
+  - Multicore CPUs via OpenMP.
+  - Fermi-class GPUs.
+  - Ocelot virtual machines.
+- Support for NVCC 3.0.
 
 ## New Examples
-- cpp_integration
-- histogram
-- mode
-- monte_carlo
-- monte_carlo_disjoint_sequences
-- padded_grid_reduction
-- permutation_iterator
-- row_sum
-- run_length_encoding
-- segmented_scan
-- stream_compaction
-- summary_statistics
-- transform_iterator
-- word_count
+- `cpp_integration`
+- `histogram`
+- `mode`
+- `monte_carlo`
+- `monte_carlo_disjoint_sequences`
+- `padded_grid_reduction`
+- `permutation_iterator`
+- `row_sum`
+- `run_length_encoding`
+- `segmented_scan`
+- `stream_compaction`
+- `summary_statistics`
+- `transform_iterator`
+- `word_count`
 
 ## Other Enhancements
-- vector functions operator!=, rbegin, crbegin, rend, crend, data, & shrink_to_fit
-- integer sorting performance is improved when max is large but (max - min) is small and when min is negative
-- performance of inclusive_scan() and exclusive_scan() is improved by 20-25% for primitive types
-- support for nvcc 3.0
-
-Removed Functionality
-- removed support for equal between host & device sequences
-- removed support for gather() and scatter() between host & device sequences
+- Integer sorting performance is improved when max is large but (max - min) is
+  small and when min is negative
+- Performance of `thrust::inclusive_scan` and `thrust::exclusive_scan` is
+  improved by 20-25% for primitive types.
 
 ## Bug Fixes
-- # 8 cause a compiler error if the required compiler is not found rather than a mysterious error at link time
-- # 42 device_ptr & device_reference are classes rather than structs, eliminating warnings on certain platforms
-- # 46 gather & scatter handle any space iterators correctly
-- # 51 thrust::experimental::arch functions gracefully handle unrecognized GPUs
-- # 52 avoid collisions with common user macros such as BLOCK_SIZE
-- # 62 provide better documentation for device_reference
-- # 68 allow built-in CUDA vector types to work with device_vector in pure C++ mode
-- # 102 eliminated a race condition in device_vector::erase
+- #8 cause a compiler error if the required compiler is not found rather than a mysterious error at link time
+- #42 device_ptr & device_reference are classes rather than structs, eliminating warnings on certain platforms
+- #46 gather & scatter handle any space iterators correctly
+- #51 thrust::experimental::arch functions gracefully handle unrecognized GPUs
+- #52 avoid collisions with common user macros such as BLOCK_SIZE
+- #62 provide better documentation for device_reference
+- #68 allow built-in CUDA vector types to work with device_vector in pure C++ mode
+- #102 eliminated a race condition in device_vector::erase
 - various compilation warnings eliminated
 
 ## Known Issues
-   inclusive_scan & exclusive_scan may fail with very large types
-   the Microsoft compiler may fail to compile code using both sort and binary search algorithms
-   uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device
-   # 109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads
-   default_random_engine::discard is not accelerated with nvcc 2.3
+- inclusive_scan & exclusive_scan may fail with very large types
+- the Microsoft compiler may fail to compile code using both sort and binary search algorithms
+- uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device
+- #109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads
+- default_random_engine::discard is not accelerated with NVCC 2.3
 
-Acknowledgments
-   Thanks to Gregory Diamos for contributing a CUDA implementation of set_intersection
-   Thanks to Ryuta Suzuki & Gregory Diamos for rigorously testing Thrust's unit tests and examples against Ocelot
-   Thanks to Tom Bradley for contributing an implementation of normal_distribution
-   Thanks to Joseph Rhoads for contributing the example summary_statistics
+## Acknowledgments
+
+- Thanks to Gregory Diamos for contributing a CUDA implementation of set_intersection
+- Thanks to Ryuta Suzuki & Gregory Diamos for rigorously testing Thrust's unit tests and examples against Ocelot
+- Thanks to Tom Bradley for contributing an implementation of normal_distribution
+- Thanks to Joseph Rhoads for contributing the example summary_statistics
 
 # Thrust 1.1.1
 
-Summary
-- Small fixes for compatibility with CUDA 2.3a and Mac OSX Snow Leopard.
+## Summary
+
+Small fixes for compatibility with CUDA Toolkit 2.3a and Mac OSX Snow Leopard.
 
 # Thrust 1.1.0
 
-Summary
-- Thrust v1.1 introduces fancy iterators, binary search functions, and
-- several specialized reduction functions.  Experimental support for
-- segmented scan has also been added.
+## Summary
+
+Thrust 1.1.0 introduces fancy iterators, binary search functions, and several
+  specialized reduction functions.
+Experimental support for segmented scans has also been added.
 
 ## Breaking Changes
-- counting_iterator has been moved into the thrust namespace (previously thrust::experimental)
+- `thrust::counting_iterator` has been moved into the `thrust` namespace (previously `thrust::experimental`).
 
 ## New Features
-- Functions
-        copy_if
-        lower_bound
-        upper_bound
-        vectorized lower_bound
-        vectorized upper_bound
-        equal_range
-        binary_search
-        vectorized binary_search
-        all_of
-        any_of
-        none_of
-        minmax_element
-        advance
-        inclusive_segmented_scan (experimental)
-        exclusive_segmented_scan (experimental)
-
-- Types
-        pair
-        tuple
-        device_malloc_allocator
-
-- Fancy Iterators
-        constant_iterator
-        counting_iterator
-        transform_iterator
-        zip_iterator
+- Algorithms:
+  - `thrust::copy_if`
+  - `thrust::lower_bound`
+  - `thrust::upper_bound`
+  - Vectorized `thrust::lower_bound`
+  - Vectorized `thrust::upper_bound`
+  - `thrust::equal_range`
+  - `thrust::binary_search`
+  - Vectorized `thrust::binary_search`
+  - `thrust::all_of`
+  - `thrust::any_of`
+  - `thrust::none_of`
+  - `thrust::minmax_element`
+  - `thrust::advance`
+  - `thrust::inclusive_segmented_scan` (experimental)
+  - `thrust::exclusive_segmented_scan` (experimental)
+- Types:
+  - `thrust::pair`
+  - `thrust::tuple`
+  - `thrust::device_malloc_allocator`
+- Fancy Iterators:
+  - `thrust::constant_iterator`
+  - `thrust::counting_iterator`
+  - `thrust::transform_iterator`
+  - `thrust::zip_iterator`
 
 ## New Examples
-- computing the maximum absolute difference between vectors
-- computing the bounding box of a two-dimensional point set
-- sorting multiple arrays together (lexicographical sorting)
-- constructing a summed area table
-- using zip_iterator to mimic an array of structs
-- using constant_iterator to increment array values
+- Computing the maximum absolute difference between vectors.
+- Computing the bounding box of a two-dimensional point set.
+- Sorting multiple arrays together (lexicographical sorting).
+- Constructing a summed area table.
+- Using `thrust::zip_iterator` to mimic an array of structs.
+- Using `thrust::constant_iterator` to increment array values.
 
 ## Other Enhancements
-- added pinned memory allocator (experimental)
-- added more methods to host_vector & device_vector (issue #4)
-- added variant of remove_if with a stencil argument (issue #29)
-- scan and reduce use cudaFuncGetAttributes to determine grid size
-- exceptions are reported when temporary device arrays cannot be allocated
+- Added pinned memory allocator (experimental).
+- Added more methods to `thrust::host_vector` and `thrust::device_vector` (issue #4).
+- Added variant of `thrust::remove_if` with a stencil argument (issue #29).
+- Scan and reduce use `cudaFuncGetAttributes` to determine grid size.
+- Exceptions are reported when temporary device arrays cannot be allocated.
 
 ## Bug Fixes
-     #5 make vector work for larger data types
-     #9 stable_partition_copy doesn't respect OutputIterator concept semantics
-- #10 scans should return OutputIterator
-- #16 make algorithms work for larger data types
-- #27 dispatch radix_sort even when comp=less<T> is explicitly provided
+- #5: Make vector work for larger data types
+- #9: stable_partition_copy doesn't respect OutputIterator concept semantics
+- #10: Scans should return OutputIterator
+- #16: Make algorithms work for larger data types
+- #27: Dispatch radix_sort even when comp=less<T> is explicitly provided
 
 ## Known Issues
 - Using functors with Thrust entry points may not compile on Mac OSX with gcc
@@ -1198,7 +1438,7 @@ Summary
     constructors on the host rather than the device.
 - `thrust::inclusive_scan`, `thrust::inclusive_scan_by_key`,
     `thrust::exclusive_scan`, and `thrust::exclusive_scan_by_key` may fail when
-    used with large types with the CUDA 3.1 driver.
+    used with large types with the CUDA Toolkit 3.1.
 
 # Thrust 1.0.0
 
diff --git a/doc/development_model.md b/doc/development_model.md
new file mode 100644
index 000000000..0327f68e3
--- /dev/null
+++ b/doc/development_model.md
@@ -0,0 +1,113 @@
+# Thrust Branching and Development Model
+
+The following is a description of how the Thrust development team approaches branching and release tagging. This
+is a living document that will evolve as our process evolves.
+
+Thrust is distributed in three ways:
+
+   * On GitHub.
+   * In the NVIDIA HPC SDK.
+   * In the CUDA Toolkit.
+
+## Trunk Based Development
+
+Thrust uses [trunk based development](https://trunkbaseddevelopment.com). There is a single long-lived
+branch called `master`. Engineers may create branches for feature development. Such branches always
+merge into `master`. There are no release branches. Releases are produced by taking a snapshot of
+`master` ("snapping"). After a release has been snapped from `master`, it will never be changed.
+
+## Repositories
+
+As Thrust is developed both on GitHub and internally at NVIDIA, there are three main places where code lives:
+
+   * The Source of Truth, the [public Thrust repository](https://github.com/thrust/thrust), referred to as
+     `github` later in this document.
+   * An internal GitLab repository, referred to as `gitlab` later in this document.
+   * An internal Perforce repository, referred to as `perforce` later in this document.
+
+## Versioning
+
+Thrust has its own versioning system for releases, independent of the versioning scheme of the NVIDIA
+HPC SDK or the CUDA Toolkit.
+
+Today, Thrust version numbers have a specific [semantic meaning](https://semver.org/).
+Releases prior to 1.10.0 largely, but not strictly, followed these semantic meanings.
+
+The version number for a Thrust release uses the following format:
+`MMM.mmm.ss-ppp`, where:
+
+   * `THRUST_MAJOR_VERSION`/`MMM`: Major version, up to 3 decimal digits. It is incremented
+     when the fundamental nature of the library evolves, leading to widespread changes across the
+     entire library interface with no guarantee of API, ABI, or semantic compatibility with former
+     versions.
+   * `THRUST_MINOR_VERSION`/`mmm`: Minor version, up to 3 decimal digits. It is incremented when
+     breaking API, ABI, or semantic changes are made.
+   * `THRUST_SUBMINOR_VERSION`/`ss`: Subminor version, up to 2 decimal digits. It is incremented
+     when notable new features or bug fixes that are API, ABI, and semantically backwards
+     compatible are added.
+   * `THRUST_PATCH_NUMBER`/`ppp`: Patch number, up to 3 decimal digits. It is incremented if any
+     change in the repo whatsoever is made and no other version component has been incremented.
+
+The `<thrust/version.h>` header defines `THRUST_*` macros for all of the version components mentioned
+above. Additionally, a `THRUST_VERSION` macro is defined, which is an integer literal containing all
+of the version components except for `THRUST_PATCH_NUMBER`.
+
+## Thrust Releases
+
+| Thrust Release    | Included In                    |
+| ----------------- | ------------------------------ |
+| 1.9.10            | NVIDIA HPC SDK 20.5            |
+| 1.9.9             | CUDA Toolkit 11.0              |
+| 1.9.8-1           | NVIDIA HPC SDK 20.3            |
+| 1.9.8             | CUDA Toolkit 11.0 Early Access |
+| 1.9.7-1           | CUDA Toolkit 10.2 for Tegra    |
+| 1.9.7             | CUDA Toolkit 10.2              |
+| 1.9.6-1           | NVIDIA HPC SDK 20.3            |
+| 1.9.6             | CUDA Toolkit 10.1 Update 2     |
+| 1.9.5             | CUDA Toolkit 10.1 Update 1     |
+| 1.9.4             | CUDA Toolkit 10.1              |
+| 1.9.3             | CUDA Toolkit 10.0              |
+| 1.9.2             | CUDA Toolkit 9.2               |
+| 1.9.1             | CUDA Toolkit 9.1               |
+| 1.9.0             | CUDA Toolkit 9.0               |
+| 1.8.3             | CUDA Toolkit 8.0               |
+| 1.8.2             | CUDA Toolkit 7.5               |
+| 1.8.1             | CUDA Toolkit 7.0               |
+| 1.8.0             |                                |
+| 1.7.2             | CUDA Toolkit 6.5               |
+| 1.7.1             | CUDA Toolkit 6.0               |
+| 1.7.0             | CUDA Toolkit 5.5               |
+| 1.6.0             |                                |
+| 1.5.3             | CUDA Toolkit 5.0               |
+| 1.5.2             | CUDA Toolkit 4.2               |
+| 1.5.1             | CUDA Toolkit 4.1               |
+| 1.5.0             |                                |
+| 1.4.0             | CUDA Toolkit 4.0               |
+| 1.3.0             | CUDA Toolkit 3.2               |
+| 1.2.1             | CUDA Toolkit 3.1               |
+| 1.2.0             |                                |
+| 1.1.1             |                                |
+| 1.1.0             |                                |
+| 1.0.0             |                                |
+
+## Branches and Tags
+
+The following tag names are used in the Thrust project:
+
+  * `github/nvhpc-X.Y`: the tag that directly corresponds to what has been shipped in the NVIDIA HPC SDK release X.Y.
+  * `github/cuda-X.Y`: the tag that directly corresponds to what has been shipped in the CUDA Toolkit release X.Y.
+  * `github/A.B.C`: the tag that directly corresponds to a Thrust version A.B.C.
+
+The following branch names are used in the Thrust project:
+
+  * `github/master`: the Source of Truth development branch of Thrust.
+  * `github/old-master`: the old Source of Truth branch, before unification of public and internal repositories.
+  * `github/feature/<name>`: feature branch for a feature under development.
+  * `github/bug/<bug-system>/<bug-description>-<bug-id>`: bug fix branch, where `bug-system` is `github` or `nvidia`.
+  * `gitlab/master`: mirror of `github/master`.
+  * `perforce/private`: mirrored `github/master`, plus files necessary for internal NVIDIA testing systems.
+
+On the rare occasion that we cannot do work in the open, for example when developing a change specific to an
+unreleased product, these branches may exist on `gitlab` instead of `github`. By default, everything should be
+in the open on `github` unless there is a strong motivation for it to not be open.
+
diff --git a/thrust/version.h b/thrust/version.h
index 06e6cfa51..84f9af141 100644
--- a/thrust/version.h
+++ b/thrust/version.h
@@ -47,7 +47,7 @@
  *         <tt>THRUST_VERSION / 100 % 1000</tt> is the minor version.
  *         <tt>THRUST_VERSION / 100000</tt> is the major version.
  */
-#define THRUST_VERSION 100910
+#define THRUST_VERSION 101000
 
 /*! \def THRUST_MAJOR_VERSION
  *  \brief The preprocessor macro \p THRUST_MAJOR_VERSION encodes the