[FEA] Improve cudf::gather scalability as number of columns increases #13509
Labels: 0 - Backlog, feature request, libcudf, Performance, Spark
As the number of columns increases for `cudf::gather` with the same gather map, we see the number of kernels called increase proportionally and the runtime increase linearly. We are wondering if there are better ways to group or "batch" these calls so that we perform fewer kernel invocations that each do more work, in hopes of amortizing some of the cost with many columns or deeply nested schemas.

A very simple example creates a column of 10 `int32_t` rows and adds it to a struct N times (where N is between 2 and 1024). As the column count increases by 2x, the gather kernel takes 2x longer.
A similar argument can be made for columns with nested types, such as arrays of structs (each with array members). The number of underlying cub calls can increase drastically.
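To make the batching idea concrete, here is a minimal CPU-only model in C++. It does not use libcudf or CUDA and is not how `cudf::gather` is implemented. Each call that increments the counter stands in for one kernel launch. The names `Column`, `LaunchCounter`, `gather_per_column`, and `gather_batched` are hypothetical, and the sketch assumes every column has the same fixed-width type (`int32_t`) and the same row count, which sidesteps the nested-type cases that make the real problem hard.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-ins: a "column" is a plain host vector, and
// LaunchCounter counts how many simulated kernel launches were issued.
using Column = std::vector<int32_t>;

struct LaunchCounter {
  std::size_t launches = 0;
};

// Models the behavior described in this issue: one launch per column,
// so the launch count grows linearly with the number of columns.
std::vector<Column> gather_per_column(const std::vector<Column>& cols,
                                      const std::vector<std::size_t>& map,
                                      LaunchCounter& counter) {
  std::vector<Column> out;
  out.reserve(cols.size());
  for (const Column& src : cols) {
    ++counter.launches;  // one simulated launch for this column
    Column dst(map.size());
    for (std::size_t i = 0; i < map.size(); ++i) {
      assert(map[i] < src.size());  // gather map must be in bounds
      dst[i] = src[map[i]];
    }
    out.push_back(std::move(dst));
  }
  return out;
}

// Models a batched alternative: a single launch walks a flattened
// (column, row) index space, so the launch count is independent of
// the number of columns.
std::vector<Column> gather_batched(const std::vector<Column>& cols,
                                   const std::vector<std::size_t>& map,
                                   LaunchCounter& counter) {
  std::vector<Column> out(cols.size(), Column(map.size()));
  ++counter.launches;  // one simulated launch for all columns
  const std::size_t total = cols.size() * map.size();
  for (std::size_t idx = 0; idx < total; ++idx) {
    const std::size_t c = idx / map.size();  // which column
    const std::size_t r = idx % map.size();  // which output row
    assert(map[r] < cols[c].size());         // gather map must be in bounds
    out[c][r] = cols[c][map[r]];
  }
  return out;
}
```

Calling both functions on the same columns and gather map yields identical output, but the per-column version records one launch per column while the batched version records exactly one. On a GPU the flattened index would presumably map to thread indices. Whether that actually reduces runtime depends on factors this model ignores, such as mixed column types, null masks, and nested children.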
I am filing this issue to solicit comments/patches to see how we could improve this behavior.