[FEA] read_text should support removing delimiters from the output
#11625
Labels
feature request
New feature or request
improvement
Improvement / enhancement to an existing function
libcudf
Affects libcudf (C++/CUDA) code.
Python
Affects Python cuDF API.
Is your feature request related to a problem? Please describe.
At the moment, the string column generated by read_text uses the underlying input bytes verbatim without removing the delimiter entries.
Describe the solution you'd like
This could be addressed by adding a remove_delimiters bool parameter to the function. This would only require one additional transform + scan to compute the output offsets without delimiters, and one string gather to copy the bytes. Since we are currently far from peak memory bandwidth, the extra pass shouldn't matter much performance-wise.
Describe alternatives you've considered
The copy operation could also happen directly inside the multibyte_split kernel, which would avoid the need for a gather step, but would add a third scan operation to the kernel.
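
To make the proposed transform + scan + gather approach concrete, here is a minimal host-side sketch in plain Python. It is not libcudf code and does not call cuDF. The helper names split_offsets and remove_delimiters are hypothetical, and the handling of a trailing delimiter is a simplifying assumption rather than a statement about what read_text actually does.

```python
from itertools import accumulate


def split_offsets(data, delimiter):
    """Simplified model of the current output: each row keeps its trailing
    delimiter. Assumes a non-empty delimiter and non-overlapping matches."""
    offsets = [0]
    pos = data.find(delimiter)
    while pos != -1:
        offsets.append(pos + len(delimiter))
        pos = data.find(delimiter, pos + len(delimiter))
    if offsets[-1] != len(data):
        # Final row that is not terminated by a delimiter.
        offsets.append(len(data))
    return offsets


def remove_delimiters(data, offsets, delimiter):
    """Hypothetical post-processing step: drop the trailing delimiter of
    every row and return the new character buffer and row offsets."""
    n = len(delimiter)
    # Transform: output length of each row without its trailing delimiter.
    lengths = [
        (end - start) - (n if data[start:end].endswith(delimiter) else 0)
        for start, end in zip(offsets, offsets[1:])
    ]
    # Scan: exclusive prefix sum of the lengths gives the new offsets.
    new_offsets = [0, *accumulate(lengths)]
    # Gather: copy only the bytes that are kept for each row.
    chars = b"".join(
        data[start:start + length] for start, length in zip(offsets, lengths)
    )
    return chars, new_offsets


data = b"a::b::c"
delimiter = b"::"
offsets = split_offsets(data, delimiter)
chars, new_offsets = remove_delimiters(data, offsets, delimiter)
rows = [chars[s:e] for s, e in zip(new_offsets, new_offsets[1:])]
```

On a GPU the three steps would map to a per-row transform, a device-wide scan, and a strings gather. The alternative of copying inside the multibyte_split kernel would fold the gather into the kernel at the cost of an additional scan.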