
[FEA] Performance issue with the Parquet reader for very large schemas (especially when containing strings) #13024

Closed
nvdbaranec opened this issue Mar 28, 2023 · 4 comments
Assignees
Labels
cuIO · feature request · improvement · libcudf · Performance

Comments

@nvdbaranec
Contributor

For Parquet files that contain very large schemas with strings (either large numbers of columns, or large numbers of nested columns), we pay a heavy price post-processing the string data after the core decode kernel runs.

Essentially, the "decode" process for strings just emits a large array of pointer/size pairs that are then passed to other cudf functions to reconstruct the actual columns. The problem is that we do this with no batching: each output string column results in an entire cudf function call (make_strings_column), each with multiple internal kernel launches. In situations with thousands of columns, this gets very expensive.

(profile screenshot)

In the image above, the green span covers the time spent in the decode kernel plus all of the make_strings_column calls that follow. The total is dominated by the many make_strings_column calls (the red span).

Ideally, we would have some kind of batched interface to make_strings_column (a make_strings_columns?) that coalesces the work for thousands of output columns into far fewer kernel launches.
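One way such a batched interface could work (a sketch under assumptions; the names and the serial loop are illustrative, not the libcudf API) is to flatten every column's string sizes into one array with segment boundaries and compute all offset arrays in a single segmented exclusive scan, so one large launch (e.g. via thrust::exclusive_scan_by_key on device) replaces a scan launch per column:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical batched offsets computation: `sizes` holds all columns'
// string sizes concatenated, and `segment_starts` holds the index where
// each column begins. A single pass computes every column's exclusive
// scan, restarting the running sum at each column boundary.
std::vector<int32_t> segmented_offsets(std::vector<int32_t> const& sizes,
                                       std::vector<size_t> const& segment_starts) {
  std::vector<int32_t> offsets(sizes.size(), 0);
  size_t seg = 0;
  int32_t running = 0;
  for (size_t i = 0; i < sizes.size(); ++i) {
    // Restart the running sum at each column boundary.
    if (seg < segment_starts.size() && i == segment_starts[seg]) {
      running = 0;
      ++seg;
    }
    offsets[i] = running;
    running += sizes[i];
  }
  return offsets;
}
```

On the GPU the same shape of computation is one segmented-scan launch regardless of how many columns there are, which is the coalescing the comment asks for.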

On a related note, the area under the blue line represents a similar problem in preprocessing the file (thousands of calls to thrust::reduce and thrust::exclusive_scan_by_key). That part has largely been addressed by PR #12931.

@nvdbaranec nvdbaranec added feature request New feature or request Needs Triage Need team to review and classify improvement Improvement / enhancement to an existing function labels Mar 28, 2023
@mattahrens mattahrens added the Performance Performance related issue label Apr 3, 2023
@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Apr 3, 2023
rapids-bot bot pushed a commit that referenced this issue Jun 23, 2023
The current Parquet reader decodes string data into a list of {ptr, length} tuples, which are then used in a gather step by `make_strings_column`. This gather step can be time consuming, especially when there are a large number of string columns. This PR addresses this by changing the decode step to write char and offset data directly to the `column_buffer`, which can then be used directly, bypassing the gather step.

The image below compares the new approach to the old. The green arc at the top (82ms) is `gpuDecodePageData`, and the red arc (252ms) is the time spent in `make_strings_column`.  The green arc below (25ms) is `gpuDecodePageData`, the amber arc (22ms) is a new kernel that computes string sizes for each page, and the magenta arc (106ms) is the kernel that decodes string columns.
![flat_edited](https://user-images.githubusercontent.com/25541553/236529570-f2d0d8d4-b2b5-4078-93ae-5123fa489c3c.png)
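The two-phase approach described above can be sketched on the host as follows (hypothetical names; the real work happens in CUDA kernels): a first pass computes each page's total string size, a scan turns those sizes into write positions, and the decode pass then writes characters directly into the shared chars buffer, so no gather step is needed afterwards.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Host-side sketch of the two-phase string decode. `page` is a stand-in
// for encoded Parquet page data, not a cudf type.
struct page {
  std::vector<std::string> strings;
};

std::vector<char> decode_strings_direct(std::vector<page> const& pages) {
  // Phase 1: per-page sizes (the "new kernel that computes string sizes").
  std::vector<size_t> page_sizes(pages.size(), 0);
  for (size_t p = 0; p < pages.size(); ++p)
    for (auto const& s : pages[p].strings) page_sizes[p] += s.size();

  // Exclusive scan -> each page's start position in the output buffer.
  std::vector<size_t> page_starts(pages.size(), 0);
  for (size_t p = 1; p < pages.size(); ++p)
    page_starts[p] = page_starts[p - 1] + page_sizes[p - 1];

  // Phase 2: every page writes its characters at its own precomputed
  // offset. Pages are independent, which is what lets the real decode
  // kernel process them in parallel with no gather afterwards.
  std::vector<char> chars(page_starts.back() + page_sizes.back());
  for (size_t p = 0; p < pages.size(); ++p) {
    size_t pos = page_starts[p];
    for (auto const& s : pages[p].strings) {
      std::memcpy(chars.data() + pos, s.data(), s.size());
      pos += s.size();
    }
  }
  return chars;
}
```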

NVBench shows a good speed-up for strings as well. There is a jump in time for the INTEGRAL benchmark, but little to no change for the other data types. The INTEGRAL time seems to be affected by extra time spent in `malloc` allocating host memory for a `hostdevice_vector`. This `malloc` always occurs, but for some reason it takes much longer to return on this branch.

This is comparing to @nvdbaranec's branch for #13203.
```
|  data_type  |      io       |  cardinality  |  run_length  |   Ref Time |   Cmp Time |        Diff |   %Diff |  
|-------------|---------------|---------------|--------------|------------|------------|-------------|---------| 
|  INTEGRAL   | DEVICE_BUFFER |       0       |      1       |  14.288 ms |  14.729 ms |  440.423 us |   3.08% |   
|  INTEGRAL   | DEVICE_BUFFER |     1000      |      1       |  13.397 ms |  13.997 ms |  600.596 us |   4.48% |   
|  INTEGRAL   | DEVICE_BUFFER |       0       |      32      |  11.831 ms |  12.354 ms |  522.485 us |   4.42% |   
|  INTEGRAL   | DEVICE_BUFFER |     1000      |      32      |  11.335 ms |  11.854 ms |  518.791 us |   4.58% |   
|    FLOAT    | DEVICE_BUFFER |       0       |      1       |   8.681 ms |   8.715 ms |   34.846 us |   0.40% |   
|    FLOAT    | DEVICE_BUFFER |     1000      |      1       |   8.473 ms |   8.472 ms |   -0.680 us |  -0.01% |   
|    FLOAT    | DEVICE_BUFFER |       0       |      32      |   7.217 ms |   7.192 ms |  -25.311 us |  -0.35% |   
|    FLOAT    | DEVICE_BUFFER |     1000      |      32      |   7.425 ms |   7.422 ms |   -3.162 us |  -0.04% |   
|   STRING    | DEVICE_BUFFER |       0       |      1       |  50.079 ms |  42.566 ms |-7513.004 us | -15.00% |   
|   STRING    | DEVICE_BUFFER |     1000      |      1       |  16.813 ms |  14.989 ms |-1823.660 us | -10.85% |   
|   STRING    | DEVICE_BUFFER |       0       |      32      |  49.875 ms |  42.443 ms |-7432.718 us | -14.90% |   
|   STRING    | DEVICE_BUFFER |     1000      |      32      |  15.312 ms |  13.953 ms |-1358.910 us |  -8.87% |   
|    LIST     | DEVICE_BUFFER |       0       |      1       |  80.303 ms |  80.688 ms |  385.916 us |   0.48% |   
|    LIST     | DEVICE_BUFFER |     1000      |      1       |  71.921 ms |  72.356 ms |  435.153 us |   0.61% |   
|    LIST     | DEVICE_BUFFER |       0       |      32      |  61.658 ms |  62.129 ms |  471.022 us |   0.76% |   
|    LIST     | DEVICE_BUFFER |     1000      |      32      |  63.086 ms |  63.371 ms |  285.608 us |   0.45% |   
|   STRUCT    | DEVICE_BUFFER |       0       |      1       |  66.272 ms |  61.142 ms |-5130.639 us |  -7.74% |   
|   STRUCT    | DEVICE_BUFFER |     1000      |      1       |  40.217 ms |  39.328 ms | -888.781 us |  -2.21% |   
|   STRUCT    | DEVICE_BUFFER |       0       |      32      |  63.660 ms |  58.837 ms |-4822.647 us |  -7.58% |   
|   STRUCT    | DEVICE_BUFFER |     1000      |      32      |  38.080 ms |  37.104 ms | -976.133 us |  -2.56% | 
```

May address #13024 

~Depends on #13203~

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - https://github.com/nvdbaranec
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - https://github.com/nvdbaranec
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #13302
@etseidl
Contributor

etseidl commented Jun 23, 2023

@nvdbaranec I'm curious if #13302 had any impact on the file used to generate the profile above.

@GregoryKimball GregoryKimball changed the title [FEA] Performance issue with the Parquet reader for very large schemas containing strings [FEA] Performance issue with the Parquet reader for very large schemas (especially when containing strings) Feb 16, 2024
@GregoryKimball
Contributor

@vuule recently conducted some experiments using an internal stream pool to hide latencies during column buffer allocation. Perhaps evaluating string types with larger column counts would show a bigger signal.

rapids-bot bot pushed a commit that referenced this issue May 3, 2024
…5632)

Part of #13733.

Adds support for reading and writing cuDF string columns where the string data exceeds 2GB. This is accomplished by skipping the final offsets calculation in the string decoding kernel when the 2GB threshold is exceeded, and using `cudf::strings::detail::make_offsets_child_column()` instead. This could add overhead with many columns (see #13024), so it will need more benchmarking. But if many columns exceed the 2GB limit, reads will likely have to be chunked anyway to stay within the memory budget.
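The threshold check behind this large-strings path can be sketched as follows (illustrative only, not the libcudf implementation): sum the string sizes in 64-bit arithmetic, and fall back to a 64-bit offsets column when the total cannot be represented by int32 offsets.

```cpp
#include <cstdint>
#include <limits>
#include <numeric>
#include <vector>

// Hypothetical sketch of the 2GB check: int32 offsets can only address
// up to INT32_MAX bytes of character data, so when the summed string
// sizes exceed that, the offsets child column must use 64-bit entries.
bool needs_large_offsets(std::vector<int32_t> const& sizes) {
  // Accumulate in 64-bit so the sum itself cannot overflow.
  int64_t total = std::accumulate(sizes.begin(), sizes.end(), int64_t{0});
  return total > std::numeric_limits<int32_t>::max();
}
```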

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - David Wendt (https://github.com/davidwendt)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15632
@mhaseeb123 mhaseeb123 self-assigned this Sep 24, 2024
@mhaseeb123
Member

mhaseeb123 commented Sep 24, 2024

Investigated this issue by profiling the overhead of the many make_strings_column calls in read_parquet, using a custom NVTX range over the read_chunk_internal function. In the current state, 1024 calls to make_strings_column take roughly ~1.5ms total, which seems negligible compared to the decode time (roughly 1.3ms for a 64MB table across 1024 columns, 17ms for a 1GB table across 1024 string columns, and 178ms for a 12GB table across 1024 string columns). This is because, since @etseidl's #13303, make_strings_column simply uses the inline_column_buffer to trivially create the column on the host instead of launching anything on the GPU.

(NVTX profile screenshot)

From the investigation, it seems the large number of make_strings_column calls is no longer a significant performance issue in read_parquet, though using a thread pool to create these columns in parallel may help optimize further. @GregoryKimball @nvdbaranec @vuule
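The thread-pool suggestion could look roughly like this (purely illustrative; `column`, `make_column`, and the use of std::async are stand-ins, not cudf code): once column creation is host-only work, the per-column constructions can run concurrently and the results are collected in order.

```cpp
#include <cstdint>
#include <future>
#include <vector>

// Stand-in for a cheaply constructible host-side column.
struct column {
  int64_t bytes;
};

column make_column(int64_t bytes) { return column{bytes}; }  // stand-in

// Launch each column construction on its own thread (std::async here
// stands in for a real thread pool) and gather the results in order.
std::vector<column> make_columns_parallel(std::vector<int64_t> const& sizes) {
  std::vector<std::future<column>> futures;
  futures.reserve(sizes.size());
  for (auto b : sizes)
    futures.push_back(std::async(std::launch::async, make_column, b));
  std::vector<column> cols;
  cols.reserve(sizes.size());
  for (auto& f : futures) cols.push_back(f.get());
  return cols;
}
```

Whether this helps in practice depends on how much per-column host work remains after the GPU launches were eliminated.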

@mhaseeb123
Member

mhaseeb123 commented Sep 24, 2024

Closing the issue based on above results.
