Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Refactor to eliminate redundant device aggregation logic #17032

Open
3 tasks
PointKernel opened this issue Oct 9, 2024 · 0 comments
Open
3 tasks

[FEA] Refactor to eliminate redundant device aggregation logic #17032

PointKernel opened this issue Oct 9, 2024 · 0 comments
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code.

Comments

@PointKernel
Copy link
Member

PointKernel commented Oct 9, 2024

Is your feature request related to a problem? Please describe.
Once #17031 is merged, three copies of similar device aggregator logic will exist in libcudf, and we need to address this issue

  1. https://github.com/rapidsai/cudf/blob/branch-24.12/cpp/src/groupby/hash/shared_memory_aggregator.cuh
  2. https://github.com/rapidsai/cudf/blob/branch-24.12/cpp/src/groupby/hash/global_memory_aggregator.cuh
  3. https://github.com/rapidsai/cudf/blob/branch-24.12/cpp/include/cudf/detail/aggregation/device_aggregators.cuh

We currently cannot share the same code path because the existing device aggregator only accepts column_device_view as input, and libcudf does not yet support constructing a column_device_view from shared memory.

Proposed Solution The initial plan was to extend column_device_view to allow its construction from shared memory. The ultimate goal is to create a unified aggregator that handles all types of aggregations: global-global, shared-global, and global-shared. However, after further discussions, it appears that unifying all three into a single aggregator may not be feasible. Nonetheless, there are several potential improvements we want to explore:

  • Replace the bool array used for shared memory nullability with a bitmask_type array. While preliminary tests show this can cause a 10% slowdown due to the atomic operations required by bitmasks, there's potential for optimization. The key benefit is that bitmasks save memory, allowing for more complex requests to be performed in shared memory.
  • Once we transition to using row bitmasks, it should be possible for all three aggregators to share a common source for underlying kernels.
  • Adding dictionary template instantiations results in a performance degradation of up to 5x. Profiling is needed to determine the cause. Additionally, on V100 GPUs, dictionary instantiations trigger a cudaErrorInvalidValue when querying available dynamic shared memory size using cudaOccupancyAvailableDynamicSMemPerBlock. This error seems related to the dictionary template instantiation in the aggregator, which causes a nested invocation of the type dispatcher. Notably, this error occurs on V100 but not on RTX8000.
@PointKernel PointKernel added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function and removed feature request New feature or request labels Oct 9, 2024
rapids-bot bot pushed a commit that referenced this issue Oct 18, 2024
This work is part of splitting the original bulk shared memory groupby PR #16619.

It introduces two device-side element aggregators:

- `shmem_element_aggregator`: aggregates data from global memory sources to shared memory targets,
- `gmem_element_aggregator`: aggregates from shared memory sources to global memory targets. 

These two aggregators are similar to the `elementwise_aggregator` functionality. Follow-up work is tracked via #17032.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - David Wendt (https://github.com/davidwendt)

URL: #17031
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

No branches or pull requests

1 participant