[FEA] The final step of the parquet preprocess step can get very slow for highly complicated schemas. #11922

nvdbaranec · 2022-10-14T15:47:18Z

For parquet files that contain columns with lists in them, we do a preprocess step to compute column sizes (and other information). As a very last step, we iterate over the input columns and march through them by depth here:

cudf/cpp/src/io/parquet/page_data.cu

Line 1827 in e91d7d9

for (size_t idx = 0; idx < input_columns.size(); idx++) {

For very complicated schemas, this can result in a large number of calls to thrust::reduce() and thrust::exclusive_scan_by_key(). In some cases, there are so many calls that they dominate the rest of the decompress/decode work.

We should spend some time figuring out how to coalesce this into fewer (perhaps 1 each of reduce_by_key and exclusive_scan_by_key) calls.
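To make the coalescing idea concrete, here is a minimal host-only sketch of the semantics that a single `reduce_by_key` call provides. It assumes the per-page sizes for every (column, depth) pair have been flattened into one array, with a parallel key array identifying which pair each entry belongs to. The helper name `reduce_by_key_host` and the flattened layout are illustrative assumptions, not the libcudf code; on the GPU the same result would presumably come from one `thrust::reduce_by_key()` call instead of one `thrust::reduce()` per pair.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Host-only illustration of reduce_by_key semantics (hypothetical helper,
// not part of libcudf or Thrust). Consecutive equal keys form a segment,
// and each segment is summed into a single output value.
std::pair<std::vector<int>, std::vector<std::size_t>> reduce_by_key_host(
  std::vector<int> const& keys, std::vector<std::size_t> const& values)
{
  assert(keys.size() == values.size());
  std::vector<int> out_keys;
  std::vector<std::size_t> out_sums;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    // A change of key starts a new segment, i.e. a new (column, depth) pair.
    if (out_keys.empty() || out_keys.back() != keys[i]) {
      out_keys.push_back(keys[i]);
      out_sums.push_back(0);
    }
    out_sums.back() += values[i];
  }
  return {out_keys, out_sums};
}
```

With keys `{0, 0, 0, 1, 1, 2}` and sizes `{3, 4, 5, 10, 20, 7}`, one pass yields the totals `{12, 30, 7}` for keys `{0, 1, 2}`, so the number of launches no longer grows with the number of columns or nesting levels.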


nvdbaranec · 2022-10-14T15:47:27Z

@abellina

Addresses #11922

Currently in Parquet preprocessing, a `thrust::reduce()` and a `thrust::exclusive_scan_by_key()` are performed to compute the column size and offsets for each nested column. For complicated schemas this results in a large number of kernel invocations. This PR calculates the sizes and offsets of all columns in single calls to `thrust::reduce_by_key()` and `thrust::exclusive_scan_by_key()`. This change results in a speedup of around 1.3x when reading a complicated schema.

Before: ![image](https://user-images.githubusercontent.com/26264495/224823213-ae998654-274c-450a-8ad7-ea854541335e.png)

After: ![image](https://user-images.githubusercontent.com/26264495/224823108-cb91c380-5e35-4c77-a6f9-6703e321be05.png)

Authors:
- Srikar Vanavasam (https://github.com/SrikarVanavasam)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #12931
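The offsets half of that description relies on a keyed exclusive scan: a running sum that restarts whenever the key changes, so every column gets its own offsets from a single pass. Below is a minimal host-only sketch of those semantics. The helper name `exclusive_scan_by_key_host` and the flattened key/value layout are assumptions for illustration and are not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only illustration of exclusive_scan_by_key semantics (hypothetical
// helper, not part of libcudf or Thrust). Each output element is the sum of
// the preceding values within the same run of equal keys, starting from 0.
std::vector<std::size_t> exclusive_scan_by_key_host(
  std::vector<int> const& keys, std::vector<std::size_t> const& values)
{
  assert(keys.size() == values.size());
  std::vector<std::size_t> out(values.size());
  std::size_t running = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    // Restart the running sum at the first element of each key segment.
    if (i == 0 || keys[i] != keys[i - 1]) { running = 0; }
    out[i] = running;  // exclusive: the current value is not yet included
    running += values[i];
  }
  return out;
}
```

For keys `{0, 0, 0, 1, 1}` and sizes `{3, 4, 5, 10, 20}`, the result is `{0, 3, 7, 0, 10}`: the offsets for key 1 start again at zero without a separate call.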

vyasr · 2023-04-11T22:57:39Z

@SrikarVanavasam FYI GitHub supports certain keywords to automate linking issues to PRs. If you had included `Closes #11922` instead of `Addresses #11922` in the PR description for #12931, GitHub would have automatically linked and closed this issue when the PR merged. For future reference 😄

nvdbaranec added the feature request, Needs Triage, cuIO, and improvement labels Oct 14, 2022

nvdbaranec self-assigned this Oct 14, 2022

GregoryKimball added the 0 - Backlog and proposal labels and removed the feature request and Needs Triage labels Oct 21, 2022

sameerz added the Performance label Nov 1, 2022

GregoryKimball added this to the Parquet continuous improvement milestone Nov 19, 2022

mattahrens unassigned nvdbaranec Jan 27, 2023

mattahrens assigned SrikarVanavasam Feb 8, 2023

SrikarVanavasam mentioned this issue Mar 13, 2023

Compute column sizes in Parquet preprocess with single kernel #12931

Merged

GregoryKimball added the libcudf label Apr 2, 2023

SrikarVanavasam closed this as completed Apr 10, 2023
