[FEA] Avoid host-side processing in CSV reader #13797

vuule · 2023-08-01T22:35:14Z

As of 23.10, the CSV reader cannot take advantage of kvikIO and GPUDirect Storage due to several processing steps that require the input data to be present in the system memory. These host-side processing steps include:

1. Decompression;
2. Skipping the partial data row at the start of the byte range (if reading a byte range);
3. Skipping the byte order mark (BOM) chars;
4. Parsing the column names in the header.

Decompression (1) is most likely faster on the CPU (not verified) since all data is compressed in a single block. For compressed CSV files, we may choose to keep decompression on the host side (also see #5142 and #12255). (2) could readily be processed on the device with thrust::find, and (3) could be accomplished by a single-thread kernel. For (4), we can copy the header data from the device and keep the existing column name parsing code (also see #12582), which avoids keeping input data on the host at the cost of a small D2H copy.
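To make items (2) and (3) concrete, here is a minimal host-side sketch of the logic in standard C++. It is not the libcudf implementation: `bom_length` and `first_full_row_offset` are hypothetical names, `std::find` stands in for the `thrust::find` call that would run on the device, and the sketch assumes a UTF-8 BOM and ignores row terminators inside quoted fields.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of leading bytes to skip when the buffer
// starts with the UTF-8 byte order mark (EF BB BF), otherwise 0.
inline std::size_t bom_length(std::string_view data)
{
  constexpr std::string_view utf8_bom{"\xEF\xBB\xBF", 3};
  return data.substr(0, utf8_bom.size()) == utf8_bom ? utf8_bom.size() : 0;
}

// Hypothetical helper: offset of the first complete row when a byte range
// starts in the middle of a row. Bytes up to and including the first row
// terminator are treated as the tail of a row that starts before the range.
// Simplification: terminators inside quoted fields are not handled.
inline std::size_t first_full_row_offset(std::string_view data, char terminator = '\n')
{
  auto const it = std::find(data.begin(), data.end(), terminator);
  // No terminator found: the whole buffer is a partial row.
  if (it == data.end()) { return data.size(); }
  return static_cast<std::size_t>(it - data.begin()) + 1;
}
```

Both helpers only inspect a prefix of the input, which is why they look like reasonable candidates for a tiny kernel or a single device-side search rather than a full host copy.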

High level proposal:
When reading compressed CSV files, read to host and decompress, then wrap the host buffer into a datasource and pass to load_data_and_gather_row_offsets. When reading uncompressed input, just forward the source to load_data_and_gather_row_offsets. There, use device_read to load chunks to device memory. Items (2) and (3) can be done here (BOM skipping is currently outside of load_data_and_gather_row_offsets).
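As a rough illustration of the chunked loading step, the sketch below plans the sequence of read requests that would cover a requested byte range. Everything here is an assumption made for illustration: the `chunk` struct, the `plan_chunks` function, and the convention that a `range_size` of 0 means "read to the end of the source". The actual `device_read` calls, and the handling of a row that straddles the end of the range, are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one read request covering [offset, offset + size).
struct chunk {
  std::size_t offset;
  std::size_t size;
};

// Splits the requested byte range into chunks of at most `chunk_size` bytes,
// clamped to the source size, so that no data outside the range is requested.
// Assumption: range_size == 0 means "until the end of the source".
inline std::vector<chunk> plan_chunks(std::size_t source_size,
                                      std::size_t range_offset,
                                      std::size_t range_size,
                                      std::size_t chunk_size)
{
  assert(chunk_size > 0);
  std::vector<chunk> chunks;
  if (range_offset >= source_size) { return chunks; }

  auto const available = source_size - range_offset;
  auto const total     = range_size == 0 ? available : std::min(range_size, available);

  // Each entry would map to one read issued against the source.
  for (std::size_t done = 0; done < total; done += chunk_size) {
    chunks.push_back({range_offset + done, std::min(chunk_size, total - done)});
  }
  return chunks;
}
```

For example, a 50-byte range starting at offset 10 with 20-byte chunks yields three requests: (10, 20), (30, 20) and (50, 10).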

This approach brings several benefits:

- maintains "byte range" support and avoids loading data outside of the requested byte range
- enables direct device reads for uncompressed inputs
- allows the CSV reader to use kvikIO and increases consistency between IO formats

…_read` (#16826)

Issue #13797

The CSV reader ingests all input data with a single call to `host_read`. This is a problem for a few reasons:
1. With `cudaHostRegister` we cannot reliably copy from the mapped region to the GPU without issues with mixing registered and unregistered areas. The reader can't know the datasource implementation details needed to avoid this issue.
2. Currently the reader performs the H2D copies manually, so there are no multi-threaded or pinned memory optimizations. Using `device_read` has the potential to outperform manual copies.

This PR changes `read_csv` IO to perform small `host_read`s to get data such as the BOM and the first row. Most of the data is then read in chunks using `device_read` calls. We can further remove `host_read`s by moving some of the host processing to the GPU.

No significant changes in performance. We are likely to get performance improvements from future changes like increasing the kvikIO thread pool size.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- MithunR (https://github.com/mythrocks)
- Karthikeyan (https://github.com/karthikeyann)

URL: #16826

vuule added feature request New feature or request cuIO cuIO issue Performance Performance related issue labels Aug 1, 2023

GregoryKimball added this to libcudf Aug 2, 2023

GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Aug 2, 2023

GregoryKimball added this to the CSV reader continuous improvement milestone Aug 2, 2023

GregoryKimball changed the title ~~[FEA] Avoid host side parsing in CSV reader~~ [FEA] Avoid host side processing in CSV reader Aug 10, 2023

GregoryKimball changed the title ~~[FEA] Avoid host side processing in CSV reader~~ [FEA] Avoid host-side processing in CSV reader Aug 10, 2023

GregoryKimball moved this to Story Issue in libcudf Aug 10, 2023

GregoryKimball added the 0 - Backlog In queue waiting for assignment label Aug 10, 2023

GregoryKimball mentioned this issue Aug 18, 2023

[FEA] Modernize CSV reader and expand reader options #13916

Open

GregoryKimball removed the status in libcudf Aug 18, 2023

GregoryKimball removed this from libcudf Oct 26, 2023

vuule mentioned this issue Sep 18, 2024

Rework read_csv IO to avoid reading whole input with a single host_read #16826

Merged

3 tasks

vuule commented Aug 1, 2023 • edited by GregoryKimball