[FEA] Avoid host-side processing in CSV reader #13797

Open
vuule opened this issue Aug 1, 2023 · 0 comments
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue


vuule commented Aug 1, 2023

As of 23.10, the CSV reader cannot take advantage of KvikIO and GPUDirect Storage because several processing steps require the input data to be present in host (system) memory. These host-side processing steps are:

  1. Decompression;
  2. Skipping the partial data row at the start of the byte range (if reading a byte range);
  3. Skipping the byte order mark (BOM) characters;
  4. Parsing the column names in the header.

Decompression (1) is most likely faster on the CPU (not verified), since all data is compressed in a single block. For compressed CSV files we may therefore choose to keep decompressing on the host side (also see #5142 and #12255). The partial-row skip (2) could readily be done on the device with thrust::find, and the BOM skip (3) could be done by a single-thread kernel. For (4), we can copy the header data from the device and keep the existing column name parsing code (also see #12582), which avoids keeping the input data on the host at the cost of a small D2H copy.
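To make items (2) and (3) concrete, here is a minimal host-side sketch of the two checks. It is not the reader's code: `skip_utf8_bom` and `first_full_row_offset` are hypothetical names, `std::find` stands in for the `thrust::find` call mentioned above, the BOM is assumed to be the 3-byte UTF-8 one, and quoted fields containing the row terminator are ignored.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of leading bytes to skip if the buffer
// starts with a UTF-8 byte order mark (EF BB BF), otherwise 0.
inline std::size_t skip_utf8_bom(std::string_view data)
{
  constexpr std::string_view bom{"\xEF\xBB\xBF", 3};
  return data.substr(0, bom.size()) == bom ? bom.size() : 0;
}

// Hypothetical helper: offset of the first complete row in a buffer that
// begins in the middle of a row (the byte range case). Returns data.size()
// when no terminator is present. Assumption: terminators inside quoted
// fields are not handled here.
inline std::size_t first_full_row_offset(std::string_view data, char terminator = '\n')
{
  auto const it = std::find(data.begin(), data.end(), terminator);
  return it == data.end() ? data.size()
                          : static_cast<std::size_t>(it - data.begin()) + 1;
}
```

On the device, the same search would run over a device buffer, and only the resulting offset (a single integer) would need to come back to the host.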

High-level proposal:
When reading compressed CSV files, read the input to the host and decompress it, then wrap the host buffer in a datasource and pass it to load_data_and_gather_row_offsets. When reading uncompressed input, just forward the source to load_data_and_gather_row_offsets. There, use device_read to load chunks into device memory. Items (2) and (3) can also be done there (BOM skipping is currently outside of load_data_and_gather_row_offsets).

This approach brings several benefits:

  • maintains "byte range" support and avoids loading data outside of the requested byte range
  • enables direct device reads for uncompressed inputs
  • allows the CSV reader to use KvikIO and increases consistency between IO formats
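
The branching in the proposal can be summarized in a small sketch. The enums and the `choose_ingest_path` function are hypothetical and only mirror the prose; they are not libcudf types, and the list of compression kinds is illustrative.

```cpp
// Illustrative compression kinds (assumption: not the libcudf enum).
enum class compression { none, gzip, bz2, zip };

// The two ingest paths described in the proposal.
enum class ingest_path {
  host_decompress_then_wrap,  // read to host, decompress, wrap buffer in a datasource
  forward_source              // pass the original source on for chunked device reads
};

// Hypothetical decision function: only uncompressed input skips the host.
constexpr ingest_path choose_ingest_path(compression c)
{
  return c == compression::none ? ingest_path::forward_source
                                : ingest_path::host_decompress_then_wrap;
}
```

Either way the downstream code sees a datasource, so the chunked loading logic would not need to know which path produced it.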
@vuule vuule added feature request New feature or request cuIO cuIO issue Performance Performance related issue labels Aug 1, 2023
@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Aug 2, 2023
@GregoryKimball GregoryKimball changed the title [FEA] Avoid host side parsing in CSV reader [FEA] Avoid host side processing in CSV reader Aug 10, 2023
@GregoryKimball GregoryKimball changed the title [FEA] Avoid host side processing in CSV reader [FEA] Avoid host-side processing in CSV reader Aug 10, 2023
@GregoryKimball GregoryKimball moved this to Story Issue in libcudf Aug 10, 2023
@GregoryKimball GregoryKimball added the 0 - Backlog In queue waiting for assignment label Aug 10, 2023
@GregoryKimball GregoryKimball removed the status in libcudf Aug 18, 2023
@GregoryKimball GregoryKimball removed this from libcudf Oct 26, 2023
rapids-bot bot pushed a commit that referenced this issue Sep 28, 2024
…_read` (#16826)

Issue #13797

The CSV reader ingests all input data with a single call to `host_read`.
This is a problem for two reasons:

1. With `cudaHostRegister`, we cannot reliably copy from the mapped region to the GPU, because a copy can end up mixing registered and unregistered areas. The reader can't know the datasource implementation details needed to avoid this issue.
2. Currently the reader performs the H2D copies manually, so there are no multi-threaded or pinned-memory optimizations. Using `device_read` has the potential to outperform manual copies.

This PR changes `read_csv` IO to perform small `host_read`s to get data such as the BOM and the first row. Most of the data is then read in chunks using `device_read` calls. We can remove further `host_read`s by moving some of the host processing to the GPU.
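
As a rough illustration of reading "in chunks", the sketch below splits a byte range into fixed-size pieces. `chunk` and `plan_chunks` are hypothetical names, the chunk size is an arbitrary parameter, and the actual `device_read` signature is deliberately not reproduced; each planned chunk would simply correspond to one such call.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical description of one read request: where it starts in the
// source and how many bytes it covers.
struct chunk {
  std::size_t offset;
  std::size_t size;
};

// Split [range_offset, range_offset + range_size) into pieces of at most
// chunk_size bytes; the last piece may be shorter. Returns an empty plan
// for an empty range or a zero chunk size.
inline std::vector<chunk> plan_chunks(std::size_t range_offset,
                                      std::size_t range_size,
                                      std::size_t chunk_size)
{
  std::vector<chunk> chunks;
  if (chunk_size == 0) { return chunks; }
  auto const end = range_offset + range_size;
  for (std::size_t pos = range_offset; pos < end; pos += chunk_size) {
    chunks.push_back({pos, std::min(chunk_size, end - pos)});
  }
  return chunks;
}
```

For example, a 10-byte range starting at offset 3 with 4-byte chunks yields requests at offsets 3, 7 and 11 with sizes 4, 4 and 2.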

No significant changes in performance. We are likely to get performance improvements from future changes, such as increasing the KvikIO thread pool size.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #16826