Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENHANCEMENT]: Optimized modulo computation #284

Closed
sleeepyjack opened this issue Mar 29, 2023 · 2 comments
Closed

[ENHANCEMENT]: Optimized modulo computation #284

sleeepyjack opened this issue Mar 29, 2023 · 2 comments
Assignees
Labels
helps: rapids Helps or needed by RAPIDS topic: performance Performance related issue type: feature request New feature request

Comments

@sleeepyjack
Copy link
Collaborator

Is your feature request related to a problem? Please describe.

Hash map probing requires frequent modulo computation, e.g., hash1(key) + i * hash2(key) % capacity for double hashing. The builtin % can be quite slow, especially if the compiler cannot apply common optimizations like x & (N-1) for when the divisor is guaranteed to be a power-of-two.

If the hash map resides in the GPU's global memory space, we can hide most of this computation behind the expensive memory operations. For shared memory hash tables, however, the modulo computation takes a considerable amount of the total runtime.

To solve this issue, the refactor branch (PR #278) introduces the concept of cuco::extents, similar to std::extent, i.e., an abstraction over static/dynamic size types. This way, we can pass in the divisor at compile time, allowing for some of the aforementioned compiler optimizations to happen. However, this approach introduces a ton of complexities to the design.

Describe the solution you'd like

I was recently informed of a blog article by Thomas Neumann (TU Munich; code is under MIT license; thanks to Clemens Lutz for the suggestion), who proposes a new approach: We use a list of pre-computed, equally-spaced prime numbers. For each prime, we also pre-compute two magic numbers, which allows us to transform the modulo computation into a multiply-and-shift computation.

I have composed some isolated benchmarks for the modulo computation (effectively calling hash(threadId) mod N in a loop) for the following scenarios:

Scenario uint32_t [ms] uint64_t [ms]
Builtin % operator with arbitrary runtime N 162 799
Builtin % operator with arbitrary constexpr N 106 277
Builtin % operator with runtime pow2 N 164 796
Builtin % operator with constexpr pow2 N 85 145
Neumann's approach with runtime prime N 116 290
Runtime pow2 N with optimized mod, i.e. x & (N-1) 84 139

As the numbers show, Neumann's approach with a runtime extent is on-par with (2.), i.e. using the builtin operator with a compile-time extent. This means we can eliminate the static/dynamic extent abstraction without sacrificing performance. We also improve performance for the dynamic extent case (e.g. cudf hash join).

I propose the following design:

// T can be e.g. uint32_t or uint64_t
template <typename T>
struct extent;

template <typename T>
struct prime_extent : public extent<T>;

// factory which selects the next larger prime
template <typename T>
constexpr prime_extent<T> make_prime_extent(extent<T>) noexcept;

Additionally, I propose a power-of-two extent type for even better performance:

template <typename T>
struct pow2_extent : public extent<T>;

// factory which selects the next larger power-of-two
template <typename T>
constexpr pow2_extent<T> make_pow2_extent(extent<T>) noexcept;

Each of these extent classes provides a member function __host__ __device__ constexpr value_type mod(value_type lhs) noexcept which implements the optimized modulo computation.

Describe alternatives you've considered

No response

Additional context

No response

@sleeepyjack sleeepyjack added type: feature request New feature request helps: rapids Helps or needed by RAPIDS topic: performance Performance related issue labels Mar 29, 2023
@sleeepyjack sleeepyjack self-assigned this Apr 5, 2023
@jrhemstad
Copy link
Collaborator

For reference, I've experimented with using the fastdiv library for this in the past: https://github.com/milakov/int_fastdiv

@PointKernel
Copy link
Member

To be closed after resolving tasks in #315 (comment)

@github-project-automation github-project-automation bot moved this to Done in CCCL Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
helps: rapids Helps or needed by RAPIDS topic: performance Performance related issue type: feature request New feature request
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants