[FEA] Refactor to eliminate redundant device aggregation logic #17032
Labels
- improvement — Improvement / enhancement to an existing function
- libcudf — Affects libcudf (C++/CUDA) code
Is your feature request related to a problem? Please describe.
Once #17031 is merged, three copies of similar device aggregator logic will exist in libcudf, and we need to eliminate this redundancy.
We currently cannot share the same code path because the existing device aggregator only accepts column_device_view as input, and libcudf does not yet support constructing a column_device_view from shared memory.
Proposed Solution
The initial plan was to extend column_device_view to allow its construction from shared memory. The ultimate goal is to create a unified aggregator that handles all types of aggregations: global-global, shared-global, and global-shared. However, after further discussion, it appears that unifying all three into a single aggregator may not be feasible. Nonetheless, there are several potential improvements we want to explore:
- Replace the bool array used for shared memory nullability with a bitmask_type array. While preliminary tests show this can cause a 10% slowdown due to the atomic operations required by bitmasks, there is potential for optimization. The key benefit is that bitmasks save memory, allowing more complex requests to be performed in shared memory.
- Resolve the cudaErrorInvalidValue
when querying the available dynamic shared memory size using cudaOccupancyAvailableDynamicSMemPerBlock. This error seems related to the dictionary template instantiation in the aggregator, which causes a nested invocation of the type dispatcher. Notably, the error occurs on V100 but not on RTX8000.
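The occupancy query in the second bullet looks roughly like the following minimal reproduction sketch (shmem_agg_kernel is a placeholder for the real type-dispatched kernel, not the libcudf code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative stand-in for the shared-memory aggregation kernel; in libcudf
// the real kernel is reached through the type dispatcher.
__global__ void shmem_agg_kernel(int* out) { extern __shared__ char smem[]; out[0] = smem[0]; }

int main() {
  // Ask how much dynamic shared memory is available per block at a given
  // occupancy. Per the issue, this call returns cudaErrorInvalidValue on
  // V100 (but not RTX8000) when the kernel comes from the dictionary
  // template instantiation path.
  size_t smem_size = 0;
  cudaError_t err = cudaOccupancyAvailableDynamicSMemPerBlock(
      &smem_size, reinterpret_cast<void const*>(shmem_agg_kernel),
      /*numBlocks=*/1, /*blockSize=*/256);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  std::printf("available dynamic shared memory per block: %zu bytes\n", smem_size);
  return 0;
}
```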