[BUG] Improve ORC reader performance with list columns and high row counts #13674

Closed
GregoryKimball opened this issue Jul 7, 2023 · 0 comments · Fixed by #13708
Labels: 0 - Backlog (In queue waiting for assignment), bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf (C++/CUDA) code), Performance (Performance related issue), Spark (Functionality that helps Spark RAPIDS)

GregoryKimball commented Jul 7, 2023

Is your feature request related to a problem? Please describe.
The ORC reader shows degenerate performance on list columns when the row count is high. The Parquet reader does not have a similar issue.

Here is a cuDF-python code snippet for reproducing and profiling this issue:

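    import time

    import cudf
    import nvtx

    # Compare a small table against a tall table of single-element lists.
    for num_rows in [1_000, 50_000_000]:
        df = cudf.DataFrame({'a': [[1]] * num_rows})

        # Write the same data in both formats.
        df.to_parquet('/raid/tmp/tmp.pq')
        df.to_orc('/raid/tmp/tmp.orc')

        # Time the Parquet read inside an NVTX range so it is visible in Nsys.
        with nvtx.annotate(f'pq {num_rows}', color="blue"):
            t0 = time.time()
            df2 = cudf.read_parquet('/raid/tmp/tmp.pq')
            t1 = time.time()

        print(f'parquet (numrows {num_rows})', t1 - t0, 'seconds')

        # Time the ORC read of the same data.
        with nvtx.annotate(f'orc {num_rows}', color="green"):
            t0 = time.time()
            df2 = cudf.read_orc('/raid/tmp/tmp.orc')
            t1 = time.time()

        print(f'orc (numrows {num_rows})', t1 - t0, 'seconds')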

With cuDF 23.06 and on my Epyc-A100 workstation, this script yields the following execution times:

parquet (numrows 1000) 0.0058746337890625 seconds
orc (numrows 1000) 0.003462076187133789 seconds
parquet (numrows 50000000) 0.012792587280273438 seconds
orc (numrows 50000000) 1.6445508003234863 seconds

Note that the Parquet and ORC times are similar for 1,000 rows, but ORC is more than 100x slower than Parquet (about 1.64 s versus 0.013 s) in the 50M-row case.
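The slowdown can be checked directly from the timings printed above. The snippet below is only a small arithmetic sketch; the `timings` dictionary and the `orc_slowdown` helper are illustrative names, not part of cuDF.

```python
# Timings in seconds, copied from the output reported above.
timings = {
    1_000: {"parquet": 0.0058746337890625, "orc": 0.003462076187133789},
    50_000_000: {"parquet": 0.012792587280273438, "orc": 1.6445508003234863},
}


def orc_slowdown(num_rows):
    """Return the ORC read time divided by the Parquet read time."""
    t = timings[num_rows]
    return t["orc"] / t["parquet"]


for num_rows in timings:
    print(f"{num_rows} rows: ORC / Parquet = {orc_slowdown(num_rows):.2f}x")
```

For 1,000 rows the ratio is below 1 (ORC is slightly faster), while for 50M rows it is well above 100.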

Nsys profiling shows that the bulk of the extra time is spent in a call to generate_offsets_for_list.
[image: Nsys profile]

We missed this degenerate performance case previously because our benchmarks focus on wide tables with relatively large lists, rather than tall tables with short lists.

GregoryKimball added the 0 - Backlog, libcudf, cuIO, Performance, and Spark labels Jul 7, 2023
GregoryKimball added the bug label Jul 10, 2023
GregoryKimball changed the title from "[FEA] Improve ORC reader performance with list columns and high row counts" to "[BUG] Improve ORC reader performance with list columns and high row counts" Jul 10, 2023
vyasr self-assigned this Jul 17, 2023
rapids-bot bot pushed a commit that referenced this issue Jul 25, 2023
For list types, the ORC reader needs to generate offsets from the sizes of nested lists. This process was previously parallelized over columns. In practice, even with wide tables, we have enough rows that parallelizing over rows always makes more sense, so this PR swaps the parallelization strategy.
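To make "generating offsets from sizes" concrete, here is a minimal sequential sketch in Python. It is not the libcudf CUDA implementation and says nothing about how the work is parallelized there; it only shows the underlying prefix-sum relationship between per-row list sizes and list offsets. The function names `offsets_from_sizes` and `rows_from_offsets` are hypothetical.

```python
from itertools import accumulate


def offsets_from_sizes(sizes):
    """Turn per-row list sizes into offsets via a running sum.

    offsets[i] is where row i starts in the flattened child data and
    offsets[i + 1] is where it ends, so the result has len(sizes) + 1 entries.
    """
    return list(accumulate(sizes, initial=0))


def rows_from_offsets(child, offsets):
    """Rebuild the nested rows by slicing the flattened child data."""
    return [child[start:end] for start, end in zip(offsets, offsets[1:])]


# Tall table of single-element lists, like the reproducer in this issue.
print(offsets_from_sizes([1, 1, 1]))

# Mixed sizes, including an empty list.
offsets = offsets_from_sizes([2, 0, 3])
print(rows_from_offsets([10, 11, 20, 21, 22], offsets))
```

Because each offset depends only on a prefix sum over the sizes in its own column, the work grows with the number of rows, which is why a tall table with short lists exercises this step heavily.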

I also removed what appears to be an unnecessary stream synchronization. That likely won't affect performance in any microbenchmarks but is worthwhile in case it helps improve asynchronous execution overall.

There are still noticeable bottlenecks for deeply nested lists, but those are in the decode kernels so optimizing them is a separate task for future work.

Resolves #13674

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #13708
GregoryKimball removed this from libcudf Oct 26, 2023