[ENHANCEMENT]: Optimized modulo computation #284
Labels: helps: rapids, topic: performance, type: feature request
Is your feature request related to a problem? Please describe.
Hash map probing requires frequent modulo computation, e.g., `(hash1(key) + i * hash2(key)) % capacity` for double hashing. The builtin `%` can be quite slow, especially if the compiler cannot apply common optimizations like `x & (N-1)` when the divisor is guaranteed to be a power of two.
If the hash map resides in the GPU's global memory space, we can hide most of this computation behind the expensive memory operations. For shared memory hash tables, however, the modulo computation takes a considerable amount of the total runtime.
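The probe computation described above can be sketched in plain C++ as follows. The function name `probe_slot` and the non-zero step adjustment are assumptions made for illustration and are not taken from cuco; `__host__ __device__` qualifiers are left out so the snippet compiles as ordinary C++.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: slot visited on the i-th probe of a double-hashing
// sequence. h1 is the start hash and h2 the step hash, both computed by the
// caller. The sum is formed in 64 bits so that h1 + i * step cannot overflow.
inline std::uint32_t probe_slot(std::uint32_t h1,
                                std::uint32_t h2,
                                std::uint32_t i,
                                std::uint32_t capacity)
{
  // Assumption: a zero step is replaced by 1, otherwise every probe would
  // land on the same slot.
  std::uint64_t const reduced = h2 % capacity;
  std::uint64_t const step    = (reduced == 0) ? 1 : reduced;
  std::uint64_t const pos     = static_cast<std::uint64_t>(h1) + static_cast<std::uint64_t>(i) * step;
  // Note the parentheses in the formula: the whole sum is reduced, not only
  // the i * h2 term.
  return static_cast<std::uint32_t>(pos % capacity);
}
```

With a prime capacity of 7, a start hash of 3 and a step hash of 2, the sequence visits slots 3, 5, 0, 2, 4, 6, 1, i.e. every slot exactly once. Each probe costs at least one `%` with a runtime divisor, which is the cost this issue is about.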
To solve this issue, the refactor branch (PR #278) introduces the concept of `cuco::extent`s, similar to `std::extent`, i.e., an abstraction over static/dynamic size types. This way, we can pass in the divisor at compile time, allowing some of the aforementioned compiler optimizations to happen. However, this approach adds a ton of complexity to the design.
Describe the solution you'd like
I was recently informed of a blog article by Thomas Neumann (TU Munich; code is under MIT license; thanks to Clemens Lutz for the suggestion), who proposes a new approach: We use a list of pre-computed, equally-spaced prime numbers. For each prime, we also pre-compute two magic numbers, which allows us to transform the modulo computation into a multiply-and-shift computation.
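To illustrate how a modulo can turn into a multiply-and-shift, here is a minimal sketch for 32-bit operands. It uses the fastmod formulation by Lemire, Kaser and Kurz as a stand-in with a single magic number. It is not Neumann's exact scheme, which pairs each pre-computed prime with two magic numbers. The names `fast_divisor`, `make_fast_divisor`, `mul_shift64` and `fast_mod` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: a divisor together with its precomputed magic.
struct fast_divisor {
  std::uint32_t d;      // divisor, must be > 0
  std::uint64_t magic;  // ceil(2^64 / d), truncated to 64 bits (0 for d == 1)
};

// Done once per divisor, e.g. when the table capacity is chosen.
inline fast_divisor make_fast_divisor(std::uint32_t d)
{
  return fast_divisor{d, UINT64_MAX / d + 1};
}

// Computes (a * b) >> 64 for a 64-bit a and a 32-bit b using only 64-bit
// arithmetic. The intermediate sum provably fits into 64 bits.
inline std::uint64_t mul_shift64(std::uint64_t a, std::uint32_t b)
{
  std::uint64_t const lo = (a & 0xFFFFFFFFull) * b;  // low half times b
  std::uint64_t const hi = (a >> 32) * b;            // high half times b
  return (hi + (lo >> 32)) >> 32;
}

// x % fd.d without a division instruction: two multiplies and a shift.
inline std::uint32_t fast_mod(std::uint32_t x, fast_divisor const& fd)
{
  std::uint64_t const low_bits = fd.magic * x;  // wraps modulo 2^64 on purpose
  return static_cast<std::uint32_t>(mul_shift64(low_bits, fd.d));
}
```

The division happens only once, in `make_fast_divisor`; every later `fast_mod` call is division-free. On the device, the `__umul64hi` intrinsic would typically replace the `mul_shift64` helper.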
I have composed some isolated benchmarks for the modulo computation (effectively calling `hash(threadId) mod N` in a loop) for the following scenarios:
1. `%` operator with arbitrary runtime `N`
2. `%` operator with arbitrary constexpr `N`
3. `%` operator with runtime pow2 `N`
4. `%` operator with constexpr pow2 `N`
5. […] `N`
6. […] `N` with optimized mod, i.e. `x & (N-1)`
As the numbers show, Neumann's approach with a runtime extent is on par with (2.), i.e., using the builtin operator with a compile-time extent. This means we can eliminate the static/dynamic extent abstraction without sacrificing performance. We also improve performance for the dynamic extent case (e.g., cudf hash join).
I propose the following design:
Additionally, I propose a power-of-two extent type for even better performance:
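A minimal sketch of what a power-of-two extent could look like is shown below. The class name `pow2_extent_sketch`, its members and the `is_pow2` helper are hypothetical and are not taken from PR #278 or the cuco API. The `__host__ __device__` qualifiers are omitted so that the snippet compiles as plain C++.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if n is a non-zero power of two.
constexpr bool is_pow2(std::uint32_t n) noexcept { return n != 0 && (n & (n - 1)) == 0; }

// Hypothetical power-of-two extent. It stores size - 1 as a bit mask so that
// the modulo reduces to a single bitwise AND.
class pow2_extent_sketch {
 public:
  using value_type = std::uint32_t;

  // Precondition (assumed, not enforced here): is_pow2(size) holds.
  constexpr explicit pow2_extent_sketch(value_type size) noexcept : mask_{size - 1} {}

  // The extent itself, recovered from the mask.
  constexpr value_type value() const noexcept { return mask_ + 1; }

  // lhs % value(), valid only because value() is a power of two.
  constexpr value_type mod(value_type lhs) const noexcept { return lhs & mask_; }

 private:
  value_type mask_;
};

static_assert(is_pow2(64), "64 is a power of two");
static_assert(pow2_extent_sketch{64}.mod(130) == 2, "130 % 64 == 2");
```

Since the mask is an ordinary data member, the same code path serves both compile-time and runtime sizes, so no separate static/dynamic abstraction is needed for this case.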
Each of these extent classes provides a member function `__host__ __device__ constexpr value_type mod(value_type lhs) noexcept` which implements the optimized modulo computation.
Describe alternatives you've considered
No response
Additional context
No response