[ENHANCEMENT]: Optimized modulo computation #284

sleeepyjack · 2023-03-29T18:18:44Z

Is your feature request related to a problem? Please describe.

Hash map probing requires frequent modulo computation, e.g., `(hash1(key) + i * hash2(key)) % capacity` for double hashing. The built-in `%` operator can be quite slow, especially if the compiler cannot apply common optimizations such as `x & (N-1)` when the divisor is guaranteed to be a power of two.
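To make the two forms concrete, here is a minimal host-side sketch of one probing step. The function names `probe_slot` and `probe_slot_pow2` are hypothetical and not part of cuco; `h1` and `h2` stand in for `hash1(key)` and `hash2(key)`.

```cpp
#include <cstdint>

// Slot visited at probing step i under double hashing.
// The sum is reduced as a whole, hence the parentheses around it.
constexpr std::uint64_t probe_slot(std::uint64_t h1,
                                   std::uint64_t h2,
                                   std::uint64_t i,
                                   std::uint64_t capacity) noexcept
{
  return (h1 + i * h2) % capacity;
}

// Same step when capacity is known to be a power of two:
// the modulo collapses into a single bitwise AND.
constexpr std::uint64_t probe_slot_pow2(std::uint64_t h1,
                                        std::uint64_t h2,
                                        std::uint64_t i,
                                        std::uint64_t capacity) noexcept
{
  return (h1 + i * h2) & (capacity - 1);
}
```

Both functions agree whenever `capacity` is a power of two; for any other capacity only the first one is valid.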

If the hash map resides in the GPU's global memory space, we can hide most of this computation behind the expensive memory operations. For shared memory hash tables, however, the modulo computation takes a considerable amount of the total runtime.

To solve this issue, the refactor branch (PR #278) introduces the concept of `cuco::extent`, similar to `std::extent`, i.e., an abstraction over static/dynamic size types. This way, we can pass in the divisor at compile time, allowing some of the aforementioned compiler optimizations to happen. However, this approach adds considerable complexity to the design.

Describe the solution you'd like

I was recently informed of a blog article by Thomas Neumann (TU Munich; code is under MIT license; thanks to Clemens Lutz for the suggestion), who proposes a new approach: We use a list of pre-computed, equally-spaced prime numbers. For each prime, we also pre-compute two magic numbers, which allows us to transform the modulo computation into a multiply-and-shift computation.
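The general multiply-and-shift idea can be sketched as follows. This is not Neumann's code: it uses a single magic number (the rounded-up reciprocal `ceil(2^64 / d)`) instead of the two constants mentioned above, is restricted to 32-bit operands, and assumes a divisor `d >= 2`. The type name `fast_mod_u32` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: a runtime 32-bit divisor (d >= 2) plus one
// pre-computed magic number, so that no division is needed per lookup.
struct fast_mod_u32 {
  std::uint32_t divisor;
  std::uint64_t magic;  // ceil(2^64 / divisor)

  constexpr explicit fast_mod_u32(std::uint32_t d) noexcept
    : divisor{d}, magic{UINT64_MAX / d + 1}
  {
  }

  // floor(x / divisor), taken from the upper bits of the product magic * x.
  // The 64x32-bit product is split into two halves to stay within 64 bits.
  constexpr std::uint32_t div(std::uint32_t x) const noexcept
  {
    std::uint64_t const hi = (magic >> 32) * x;
    std::uint64_t const lo = (magic & 0xFFFFFFFFull) * x;
    return static_cast<std::uint32_t>((hi + (lo >> 32)) >> 32);
  }

  // x mod divisor = x - floor(x / divisor) * divisor
  constexpr std::uint32_t mod(std::uint32_t x) const noexcept
  {
    return x - div(x) * divisor;
  }
};
```

For example, `fast_mod_u32{7}.mod(100)` yields the same result as `100 % 7`. On a GPU, the upper bits of the product would presumably come from a high-multiply intrinsic such as `__umul64hi` rather than from the manual split used here.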

I have composed some isolated benchmarks for the modulo computation (effectively calling hash(threadId) mod N in a loop) for the following scenarios:

| # | Scenario | `uint32_t` [ms] | `uint64_t` [ms] |
|---|---|---|---|
| 1 | Builtin `%` operator with arbitrary runtime `N` | 162 | 799 |
| 2 | Builtin `%` operator with arbitrary constexpr `N` | 106 | 277 |
| 3 | Builtin `%` operator with runtime pow2 `N` | 164 | 796 |
| 4 | Builtin `%` operator with constexpr pow2 `N` | 85 | 145 |
| 5 | Neumann's approach with runtime prime `N` | 116 | 290 |
| 6 | Runtime pow2 `N` with optimized mod, i.e. `x & (N-1)` | 84 | 139 |

As the numbers show, Neumann's approach with a runtime extent is roughly on par with scenario 2, i.e., the builtin operator with a compile-time extent (116 ms vs. 106 ms for `uint32_t`, 290 ms vs. 277 ms for `uint64_t`). This means we can eliminate the static/dynamic extent abstraction without sacrificing much performance. We also improve performance for the dynamic extent case (e.g. cudf hash join).

I propose the following design:

// T can be e.g. uint32_t or uint64_t
template <typename T>
struct extent { /* ... */ };

template <typename T>
struct prime_extent : public extent<T> { /* ... */ };

// factory which selects the next larger prime
template <typename T>
constexpr prime_extent<T> make_prime_extent(extent<T>) noexcept;
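As an illustration of what the factory's prime selection might do, here is a minimal sketch. It swaps in plain trial division, whereas the proposal above would pick from a pre-computed list of primes. The names `is_prime` and `next_prime` are hypothetical, and reading "next larger" as "smallest prime greater than or equal to the request" is an assumption.

```cpp
#include <cstdint>

// Trial-division primality test; good enough for a sketch,
// too slow for very large capacities.
constexpr bool is_prime(std::uint64_t n) noexcept
{
  if (n < 2) { return false; }
  // d <= n / d is the overflow-safe form of d * d <= n
  for (std::uint64_t d = 2; d <= n / d; ++d) {
    if (n % d == 0) { return false; }
  }
  return true;
}

// Smallest prime >= n (assumes such a prime is representable in 64 bits).
constexpr std::uint64_t next_prime(std::uint64_t n) noexcept
{
  while (!is_prime(n)) { ++n; }
  return n;
}
```

A requested capacity of 1000 would be rounded up to 1009 under this reading, and a request that is already prime is returned unchanged.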

Additionally, I propose a power-of-two extent type for even better performance:

template <typename T>
struct pow2_extent : public extent<T> { /* ... */ };

// factory which selects the next larger power-of-two
template <typename T>
constexpr pow2_extent<T> make_pow2_extent(extent<T>) noexcept;

Each of these extent classes provides a member function `__host__ __device__ constexpr value_type mod(value_type lhs) noexcept` which implements the optimized modulo computation.
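A host-only sketch of how the power-of-two variant could look is given below. The class and factory names follow the proposal, but the member layout (`value`), the constructors, and the omission of the `__host__ __device__` qualifiers are assumptions made to keep the example compilable with a plain C++ compiler. The factory rounds up to the smallest power of two greater than or equal to the request, which is one possible reading of "next larger".

```cpp
#include <cstdint>

template <typename T>
struct extent {
  using value_type = T;
  constexpr explicit extent(T v) noexcept : value{v} {}
  T value;  // number of slots (assumed member name)
};

template <typename T>
struct pow2_extent : public extent<T> {
  using value_type = T;
  constexpr explicit pow2_extent(T v) noexcept : extent<T>{v} {}

  // Valid only because value is a power of two
  constexpr value_type mod(value_type lhs) const noexcept
  {
    return lhs & (this->value - 1);
  }
};

// Rounds the requested size up to a power of two.
// Assumes the result is representable in T; otherwise the loop never ends.
template <typename T>
constexpr pow2_extent<T> make_pow2_extent(extent<T> e) noexcept
{
  T n = 1;
  while (n < e.value) { n <<= 1; }
  return pow2_extent<T>{n};
}
```

With this sketch, `make_pow2_extent(extent<std::uint32_t>{1000u})` holds 1024 slots, and its `mod(1500u)` equals `1500u % 1024u`.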

Describe alternatives you've considered

No response

Additional context

No response

jrhemstad · 2023-04-10T15:54:54Z

For reference, I've experimented with using the fastdiv library for this in the past: https://github.com/milakov/int_fastdiv

PointKernel · 2023-08-01T22:35:36Z

To be closed after resolving tasks in #315 (comment)

sleeepyjack added the labels type: feature request, helps: rapids, and topic: performance on Mar 29, 2023

cccl-authenticator-app bot added this to CCCL Mar 29, 2023

sleeepyjack self-assigned this Apr 5, 2023

sleeepyjack mentioned this issue Apr 11, 2023

[FEA] Accelerate Bloom filtered joins NVIDIA/spark-rapids#7803

Closed

4 tasks

sleeepyjack mentioned this issue May 11, 2023

Faster modulo computation for runtime extents #304

Closed

4 tasks

sleeepyjack mentioned this issue Jun 5, 2023

Enable fast_int functionality for dynamic extent #315

Merged

PointKernel pushed a commit that referenced this issue Aug 1, 2023

Enable fast_int functionality for dynamic extent (#315)

3fe5704

Adds `fast_int` functionality to `cuco::extent`. Related to #284

sleeepyjack closed this as completed Sep 7, 2023

github-project-automation bot moved this to Done in CCCL Sep 7, 2023
