Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] The final step of the parquet preprocess step can get very slow for highly complicated schemas. #11922

Closed
nvdbaranec opened this issue Oct 14, 2022 · 2 comments
Assignees
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue proposal Change current process or code

Comments

@nvdbaranec
Copy link
Contributor

nvdbaranec commented Oct 14, 2022

For parquet files that contain columns with lists in them, we do a preprocess step to compute column sizes (and other information). As a very last step, we iterate over the input columns and march through them by depth here:

for (size_t idx = 0; idx < input_columns.size(); idx++) {

For very complicated schemas, this can result in a large number of calls to thrust::reduce() and thrust::exclusive_scan_by_key(). In some cases, cases, so much so that it can dominate the rest of the decompress/decode work.

We should spend some time figuring out how to coalesce this into fewer (perhaps 1 each of reduce_by_key and exclusive_scan_by_key) calls.

@nvdbaranec nvdbaranec added feature request New feature or request Needs Triage Need team to review and classify cuIO cuIO issue improvement Improvement / enhancement to an existing function labels Oct 14, 2022
@nvdbaranec nvdbaranec self-assigned this Oct 14, 2022
@nvdbaranec
Copy link
Contributor Author

@abellina

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment proposal Change current process or code and removed feature request New feature or request Needs Triage Need team to review and classify labels Oct 21, 2022
@sameerz sameerz added the Performance Performance related issue label Nov 1, 2022
@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Apr 2, 2023
rapids-bot bot pushed a commit that referenced this issue Apr 7, 2023
Addresses #11922 

Currently in Parquet preprocessing a `thrust::reduce()` and `thrust::exclusive_scan_by_key()` is performed to compute the column size and offsets for each nested column. For complicated schemas this results in a large number of kernel invocations. This PR calculates the sizes and offsets of all columns in single calls to `thrust::reduce_by_key()` and `thrust::exclusive_scan_by_key()`. 

This change results in around 1.3x speedup when reading a complicated schema.
Before:
![image](https://user-images.githubusercontent.com/26264495/224823213-ae998654-274c-450a-8ad7-ea854541335e.png)

After:
![image](https://user-images.githubusercontent.com/26264495/224823108-cb91c380-5e35-4c77-a6f9-6703e321be05.png)

Authors:
  - Srikar Vanavasam (https://github.com/SrikarVanavasam)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #12931
@vyasr
Copy link
Contributor

vyasr commented Apr 11, 2023

@SrikarVanavasam FYI Github supports certain keywords to automate linking issues to PRs. If you had included Closes #11922 instead of Addresses #11922 in the PR description for #12931 Github would have automatically linked and closed this issue when the PR merged. For future reference 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue proposal Change current process or code
Projects
None yet
Development

No branches or pull requests

5 participants