Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] read_text should support removing delimiters from the output #11625

Closed
upsj opened this issue Aug 30, 2022 · 1 comment · Fixed by #11946
Closed

[FEA] read_text should support removing delimiters from the output #11625

upsj opened this issue Aug 30, 2022 · 1 comment · Fixed by #11946
Assignees
Labels
feature request New feature or request improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Comments

@upsj
Copy link
Contributor

upsj commented Aug 30, 2022

Is your feature request related to a problem? Please describe.
At the moment, the string column generated by read_text uses the underlying input bytes verbatim without removing the delimiter entries.

Describe the solution you'd like
This could be addressed by adding a remove_delimiters bool parameter to the function. This would only require one additional transform + scan to compute the output offsets without delimiters and one string gather to copy the bytes. Since we are far from peak memory bandwidth, it shouldn't matter much performance-wise.

Describe alternatives you've considered
The copy operation could also happen directly inside the multibyte_split kernel, which would avoid the need for a gather step, but add a third scan operation to the kernel.

@upsj upsj added feature request New feature or request Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Aug 30, 2022
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@upsj upsj self-assigned this Oct 18, 2022
@upsj upsj added improvement Improvement / enhancement to an existing function and removed inactive-30d Needs Triage Need team to review and classify labels Oct 18, 2022
rapids-bot bot pushed a commit that referenced this issue Oct 27, 2022
This adds a `strip_delimiters` post-processing option to `read_text`. I needed to implement some lightweight striping because a thread-per-row parallelization of the string gather gave pretty bad performance.

For consistency, I also removed the special-case handling of delimiters at the end (previously adding an empty row), to match the read_csv behavior.

Benchmark results:

```
benchmarks/MULTIBYTE_SPLIT_NVBENCH --axis size_approx[pow2]=30 --axis byte_range_percent=100 --axis T=device --axis delim_size=4
```

### [0] Tesla T4

|   T    | strip_delimiters | delim_percent |    size_approx    |   CPU Time  | Noise | Peak Memory Usage | Encoded file size |
|--------|------------------|---------------|-------------------|------------|-------|-------------------|-------------------|
| device |                0 |             1 | 2^30 = 1073741824 | 178.133 ms | 0.36% |         3.709 GiB |      1014.442 MiB |
| device |                1 |             1 | 2^30 = 1073741824 | 188.328 ms | 0.31% |         4.690 GiB |      1014.442 MiB |
| device |                0 |            25 | 2^30 = 1073741824 | 206.188 ms | 0.03% |         5.292 GiB |       953.075 MiB |
| device |                1 |            25 | 2^30 = 1073741824 | 242.534 ms | 0.50% |         5.975 GiB |       953.075 MiB |

Closes #11625

Authors:
  - Tobias Ribizel (https://github.com/upsj)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Bradley Dice (https://github.com/bdice)

URL: #11946
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant