[FEA] Improve performance of copy_if_else for long strings #15014
Labels: feature request, Performance, Spark
Is your feature request related to a problem? Please describe.
We have a number of use cases where we do a copy_if_else on long strings. This can end up taking a long time. In the case of `from_json`, when the input strings are large it ends up being the biggest kernel by total time: larger than unsnap, which decompresses the original input string data, and larger than the tokenization kernels that tokenize the JSON data. The first line of the profile is the strings copy_if_else kernel, which took 36.4% of the total kernel time.
When I look at the strings copy_if_else code I see a single thread per string and it ends up doing a memcpy.
cudf/cpp/include/cudf/strings/detail/copy_if_else.cuh
Line 107 in 6638b52
I am not a CUDA expert, so I could be wrong about all of this, but I think we should be able to detect when the average string size is larger than some threshold and process one string per warp (or something similar) to help coalesce the memory reads and writes.
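To make the suggestion concrete, here is a minimal standalone sketch of the warp-per-string idea. It is not the libcudf implementation and does not use any cudf API. The names `use_warp_per_string`, `copy_warp_per_string`, `copy_if_else_demo`, and the 64-byte `avg_bytes_threshold` are all hypothetical, and the threshold in particular would need benchmarking. Output offsets are computed on the host here only to keep the example short, and CUDA error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <string>
#include <vector>

constexpr int warp_size = 32;
// Hypothetical cutoff (assumption); a real value would come from benchmarks.
constexpr int64_t avg_bytes_threshold = 64;

// Heuristic from the issue text: pick the strategy from the average string size.
inline bool use_warp_per_string(int64_t total_bytes, int64_t num_rows)
{
  return num_rows > 0 && total_bytes / num_rows >= avg_bytes_threshold;
}

// One warp per output row. Lane i copies bytes i, i+32, i+64, ... so that
// neighboring lanes touch neighboring bytes, which lets accesses coalesce.
__global__ void copy_warp_per_string(char const* lhs_chars, int32_t const* lhs_offsets,
                                     char const* rhs_chars, int32_t const* rhs_offsets,
                                     uint8_t const* mask, int32_t const* out_offsets,
                                     char* out_chars, int num_rows)
{
  int const row  = (blockIdx.x * blockDim.x + threadIdx.x) / warp_size;
  int const lane = threadIdx.x % warp_size;
  if (row >= num_rows) return;
  char const* src = mask[row] ? lhs_chars + lhs_offsets[row] : rhs_chars + rhs_offsets[row];
  int32_t const len = out_offsets[row + 1] - out_offsets[row];
  char* dst = out_chars + out_offsets[row];
  for (int32_t i = lane; i < len; i += warp_size) dst[i] = src[i];
}

// Flatten host strings into one chars buffer plus num_rows + 1 offsets.
static std::vector<int32_t> flatten(std::vector<std::string> const& in, std::string& chars)
{
  std::vector<int32_t> offsets{0};
  for (auto const& s : in) {
    chars += s;
    offsets.push_back(static_cast<int32_t>(chars.size()));
  }
  return offsets;
}

// Allocate device memory and copy n elements from the host.
template <typename T>
static T* to_device(T const* host, size_t n)
{
  T* d = nullptr;
  cudaMalloc(&d, n * sizeof(T));
  cudaMemcpy(d, host, n * sizeof(T), cudaMemcpyHostToDevice);
  return d;
}

// Host-side demo driver: result[i] = mask[i] ? lhs[i] : rhs[i].
std::vector<std::string> copy_if_else_demo(std::vector<std::string> const& lhs,
                                           std::vector<std::string> const& rhs,
                                           std::vector<uint8_t> const& mask)
{
  int const n = static_cast<int>(mask.size());
  std::string lc, rc;
  auto lo = flatten(lhs, lc);
  auto ro = flatten(rhs, rc);
  std::vector<int32_t> oo{0};  // output offsets from the selected row sizes
  for (int i = 0; i < n; ++i)
    oo.push_back(oo.back() + static_cast<int32_t>((mask[i] ? lhs[i] : rhs[i]).size()));
  std::string out(oo.back(), '\0');

  char* d_lc     = to_device(lc.data(), lc.size());
  char* d_rc     = to_device(rc.data(), rc.size());
  int32_t* d_lo  = to_device(lo.data(), lo.size());
  int32_t* d_ro  = to_device(ro.data(), ro.size());
  int32_t* d_oo  = to_device(oo.data(), oo.size());
  uint8_t* d_msk = to_device(mask.data(), mask.size());
  char* d_out    = nullptr;
  cudaMalloc(&d_out, out.size());

  int const block = 128;  // multiple of warp_size, so warps never straddle rows
  int const grid  = (n * warp_size + block - 1) / block;
  if (n > 0)
    copy_warp_per_string<<<grid, block>>>(d_lc, d_lo, d_rc, d_ro, d_msk, d_oo, d_out, n);
  cudaMemcpy(&out[0], d_out, out.size(), cudaMemcpyDeviceToHost);

  cudaFree(d_lc); cudaFree(d_rc); cudaFree(d_lo); cudaFree(d_ro);
  cudaFree(d_oo); cudaFree(d_msk); cudaFree(d_out);

  std::vector<std::string> result;
  for (int i = 0; i < n; ++i) result.push_back(out.substr(oo[i], oo[i + 1] - oo[i]));
  return result;
}
```

A caller could check `use_warp_per_string(total_bytes, num_rows)` and fall back to the existing thread-per-string path for short strings, where dedicating 32 lanes to a few bytes would mostly leave them idle. Copying wider words per lane once the pointers are aligned is a possible refinement, but it is left out of this sketch.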