
[FEA] Performance issue with the Parquet reader for very large schemas (especially when containing strings) #13024

Closed
nvdbaranec opened this issue Mar 28, 2023 · 4 comments
Assignees
Labels
cuIO · feature request · improvement · libcudf · Performance

Comments

@nvdbaranec
Contributor

For Parquet files that contain very large schemas with strings (either large numbers of columns, or large numbers of nested columns), we pay a heavy price post-processing the string data after the core decode kernel runs.

Essentially, the "decode" process for strings just emits a large array of pointer/size pairs that are then passed to other cudf functions to reconstruct the actual columns. The problem is that we do this with no batching: each output string column results in an entire cudf function call (make_strings_column), each with multiple internal kernel launches. In situations with thousands of columns, this gets very expensive.

(profile screenshot)

In the image above, the green span covers the time spent in the decode kernel plus all of the make_strings_column calls that follow. The total is dominated by the many make_strings_column calls (the red span).

Ideally, we would have some kind of batched interface to make_strings_column (a make_strings_columns?) that coalesces the work for thousands of output columns into far fewer kernel launches.
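One way such a batched interface could work (a sketch under assumptions; the names and the serial loop are illustrative, not the libcudf API) is to flatten every column's string sizes into one array with segment boundaries and compute all offset arrays in a single segmented exclusive scan, so one large launch (e.g. via thrust::exclusive_scan_by_key on device) replaces a scan launch per column:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical batched offsets computation: `sizes` holds all columns'
// string sizes concatenated, and `segment_starts` holds the index where
// each column begins. A single pass computes every column's exclusive
// scan, restarting the running sum at each column boundary.
std::vector<int32_t> segmented_offsets(std::vector<int32_t> const& sizes,
                                       std::vector<size_t> const& segment_starts) {
  std::vector<int32_t> offsets(sizes.size(), 0);
  size_t seg = 0;
  int32_t running = 0;
  for (size_t i = 0; i < sizes.size(); ++i) {
    // Restart the running sum at each column boundary.
    if (seg < segment_starts.size() && i == segment_starts[seg]) {
      running = 0;
      ++seg;
    }
    offsets[i] = running;
    running += sizes[i];
  }
  return offsets;
}
```

On the GPU the same shape of computation is one segmented-scan launch regardless of how many columns there are, which is the coalescing the comment asks for.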

On a related note, the area under the blue line represents a similar problem in preprocessing the file (thousands of calls to thrust::reduce and thrust::exclusive_scan_by_key). That part has largely been addressed by PR #12931.

@nvdbaranec nvdbaranec added feature request New feature or request Needs Triage Need team to review and classify improvement Improvement / enhancement to an existing function labels Mar 28, 2023
@mattahrens mattahrens added the Performance Performance related issue label Apr 3, 2023
@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Apr 3, 2023
rapids-bot bot pushed a commit that referenced this issue Jun 23, 2023
The current Parquet reader decodes string data into a list of {ptr, length} tuples, which are then used in a gather step by `make_strings_column`. This gather step can be time consuming, especially when there are a large number of string columns. This PR addresses this by changing the decode step to write char and offset data directly to the `column_buffer`, which can then be used directly, bypassing the gather step.

The image below compares the new approach to the old. The green arc at the top (82ms) is `gpuDecodePageData`, and the red arc (252ms) is the time spent in `make_strings_column`.  The green arc below (25ms) is `gpuDecodePageData`, the amber arc (22ms) is a new kernel that computes string sizes for each page, and the magenta arc (106ms) is the kernel that decodes string columns.
![flat_edited](https://user-images.githubusercontent.com/25541553/236529570-f2d0d8d4-b2b5-4078-93ae-5123fa489c3c.png)
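The two-phase approach described above can be sketched on the host as follows (hypothetical names; the real work happens in CUDA kernels): a first pass computes each page's total string size, a scan turns those sizes into write positions, and the decode pass then writes characters directly into the shared chars buffer, so no gather step is needed afterwards.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Host-side sketch of the two-phase string decode. `page` is a stand-in
// for encoded Parquet page data, not a cudf type.
struct page {
  std::vector<std::string> strings;
};

std::vector<char> decode_strings_direct(std::vector<page> const& pages) {
  // Phase 1: per-page sizes (the "new kernel that computes string sizes").
  std::vector<size_t> page_sizes(pages.size(), 0);
  for (size_t p = 0; p < pages.size(); ++p)
    for (auto const& s : pages[p].strings) page_sizes[p] += s.size();

  // Exclusive scan -> each page's start position in the output buffer.
  std::vector<size_t> page_starts(pages.size(), 0);
  for (size_t p = 1; p < pages.size(); ++p)
    page_starts[p] = page_starts[p - 1] + page_sizes[p - 1];

  // Phase 2: every page writes its characters at its own precomputed
  // offset. Pages are independent, which is what lets the real decode
  // kernel process them in parallel with no gather afterwards.
  std::vector<char> chars(page_starts.back() + page_sizes.back());
  for (size_t p = 0; p < pages.size(); ++p) {
    size_t pos = page_starts[p];
    for (auto const& s : pages[p].strings) {
      std::memcpy(chars.data() + pos, s.data(), s.size());
      pos += s.size();
    }
  }
  return chars;
}
```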

NVBench shows a good speed-up for strings as well. There is a jump in time for the INTEGRAL benchmark, but little to no change for the other data types. The INTEGRAL time seems to be affected by extra time spent in `malloc` allocating host memory for a `hostdevice_vector`. This `malloc` always occurs, but for some reason it takes much longer to return on this branch.

This is comparing to @nvdbaranec's branch for #13203.
```
|  data_type  |      io       |  cardinality  |  run_length  |   Ref Time |   Cmp Time |        Diff |   %Diff |  
|-------------|---------------|---------------|--------------|------------|------------|-------------|---------| 
|  INTEGRAL   | DEVICE_BUFFER |       0       |      1       |  14.288 ms |  14.729 ms |  440.423 us |   3.08% |   
|  INTEGRAL   | DEVICE_BUFFER |     1000      |      1       |  13.397 ms |  13.997 ms |  600.596 us |   4.48% |   
|  INTEGRAL   | DEVICE_BUFFER |       0       |      32      |  11.831 ms |  12.354 ms |  522.485 us |   4.42% |   
|  INTEGRAL   | DEVICE_BUFFER |     1000      |      32      |  11.335 ms |  11.854 ms |  518.791 us |   4.58% |   
|    FLOAT    | DEVICE_BUFFER |       0       |      1       |   8.681 ms |   8.715 ms |   34.846 us |   0.40% |   
|    FLOAT    | DEVICE_BUFFER |     1000      |      1       |   8.473 ms |   8.472 ms |   -0.680 us |  -0.01% |   
|    FLOAT    | DEVICE_BUFFER |       0       |      32      |   7.217 ms |   7.192 ms |  -25.311 us |  -0.35% |   
|    FLOAT    | DEVICE_BUFFER |     1000      |      32      |   7.425 ms |   7.422 ms |   -3.162 us |  -0.04% |   
|   STRING    | DEVICE_BUFFER |       0       |      1       |  50.079 ms |  42.566 ms |-7513.004 us | -15.00% |   
|   STRING    | DEVICE_BUFFER |     1000      |      1       |  16.813 ms |  14.989 ms |-1823.660 us | -10.85% |   
|   STRING    | DEVICE_BUFFER |       0       |      32      |  49.875 ms |  42.443 ms |-7432.718 us | -14.90% |   
|   STRING    | DEVICE_BUFFER |     1000      |      32      |  15.312 ms |  13.953 ms |-1358.910 us |  -8.87% |   
|    LIST     | DEVICE_BUFFER |       0       |      1       |  80.303 ms |  80.688 ms |  385.916 us |   0.48% |   
|    LIST     | DEVICE_BUFFER |     1000      |      1       |  71.921 ms |  72.356 ms |  435.153 us |   0.61% |   
|    LIST     | DEVICE_BUFFER |       0       |      32      |  61.658 ms |  62.129 ms |  471.022 us |   0.76% |   
|    LIST     | DEVICE_BUFFER |     1000      |      32      |  63.086 ms |  63.371 ms |  285.608 us |   0.45% |   
|   STRUCT    | DEVICE_BUFFER |       0       |      1       |  66.272 ms |  61.142 ms |-5130.639 us |  -7.74% |   
|   STRUCT    | DEVICE_BUFFER |     1000      |      1       |  40.217 ms |  39.328 ms | -888.781 us |  -2.21% |   
|   STRUCT    | DEVICE_BUFFER |       0       |      32      |  63.660 ms |  58.837 ms |-4822.647 us |  -7.58% |   
|   STRUCT    | DEVICE_BUFFER |     1000      |      32      |  38.080 ms |  37.104 ms | -976.133 us |  -2.56% | 
```

May address #13024 

~Depends on #13203~

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - https://github.com/nvdbaranec
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - https://github.com/nvdbaranec
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #13302
@etseidl
Contributor

etseidl commented Jun 23, 2023

@nvdbaranec I'm curious if #13302 had any impact on the file used to generate the profile above.

@GregoryKimball GregoryKimball changed the title [FEA] Performance issue with the Parquet reader for very large schemas containing strings [FEA] Performance issue with the Parquet reader for very large schemas (especially when containing strings) Feb 16, 2024
@GregoryKimball
Contributor

@vuule recently conducted some experiments using an internal stream pool to hide latencies during column buffer allocation. Perhaps evaluating string types with larger column counts would show a bigger signal.

rapids-bot bot pushed a commit that referenced this issue May 3, 2024
…5632)

Part of #13733.

Adds support for reading and writing cuDF string columns where the string data exceeds 2GB. This is accomplished by skipping the final offsets calculation in the string decoding kernel when the 2GB threshold is exceeded, and using `cudf::strings::detail::make_offsets_child_column()` instead. This could add overhead with many columns (see #13024), so it will need more benchmarking. But if many columns exceed the 2GB limit, reads will likely have to be chunked anyway to stay within the memory budget.
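The threshold check behind this large-strings path can be sketched as follows (illustrative only, not the libcudf implementation): sum the string sizes in 64-bit arithmetic, and fall back to a 64-bit offsets column when the total cannot be represented by int32 offsets.

```cpp
#include <cstdint>
#include <limits>
#include <numeric>
#include <vector>

// Hypothetical sketch of the 2GB check: int32 offsets can only address
// up to INT32_MAX bytes of character data, so when the summed string
// sizes exceed that, the offsets child column must use 64-bit entries.
bool needs_large_offsets(std::vector<int32_t> const& sizes) {
  // Accumulate in 64-bit so the sum itself cannot overflow.
  int64_t total = std::accumulate(sizes.begin(), sizes.end(), int64_t{0});
  return total > std::numeric_limits<int32_t>::max();
}
```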

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - David Wendt (https://github.com/davidwendt)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15632
@mhaseeb123 mhaseeb123 self-assigned this Sep 24, 2024
@mhaseeb123
Member

mhaseeb123 commented Sep 24, 2024

Investigated this issue by profiling the overhead of the many make_strings_column calls in read_parquet, using a custom NVTX range over the read_chunk_internal function. In the current state, 1024 calls to make_strings_column take roughly ~1.5ms total, which seems negligible compared to the decode time (roughly 1.3ms for a 64MB table across 1024 columns, 17ms for a 1GB table across 1024 string columns, and 178ms for a 12GB table across 1024 string columns). This is because, since @etseidl's #13303, make_strings_column simply uses the inline_column_buffer to trivially create the column on the host instead of launching anything on the GPU.

(NVTX profile screenshot)

From the investigation, it seems the large number of make_strings_column calls is no longer a significant performance issue in read_parquet, though using a thread pool to create these columns in parallel may help optimize further. @GregoryKimball @nvdbaranec @vuule
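The thread-pool suggestion could look roughly like this (purely illustrative; `column`, `make_column`, and the use of std::async are stand-ins, not cudf code): once column creation is host-only work, the per-column constructions can run concurrently and the results are collected in order.

```cpp
#include <cstdint>
#include <future>
#include <vector>

// Stand-in for a cheaply constructible host-side column.
struct column {
  int64_t bytes;
};

column make_column(int64_t bytes) { return column{bytes}; }  // stand-in

// Launch each column construction on its own thread (std::async here
// stands in for a real thread pool) and gather the results in order.
std::vector<column> make_columns_parallel(std::vector<int64_t> const& sizes) {
  std::vector<std::future<column>> futures;
  futures.reserve(sizes.size());
  for (auto b : sizes)
    futures.push_back(std::async(std::launch::async, make_column, b));
  std::vector<column> cols;
  cols.reserve(sizes.size());
  for (auto& f : futures) cols.push_back(f.get());
  return cols;
}
```

Whether this helps in practice depends on how much per-column host work remains after the GPU launches were eliminated.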

@mhaseeb123
Member

mhaseeb123 commented Sep 24, 2024

Closing the issue based on above results.
