[BUG] Improve ORC reader performance with list columns and high row counts #13674

GregoryKimball · 2023-07-07T20:41:22Z

Is your feature request related to a problem? Please describe.
The ORC reader shows degenerate performance on list columns when the row count is high. The Parquet reader does not have a similar issue.

Here is a cuDF-python code snippet for reproducing and profiling this issue:

    import time

    import cudf
    import nvtx

    for num_rows in [1_000, 50_000_000]:
        # Tall table of single-element lists: many rows, very short lists
        df = cudf.DataFrame({'a': [[1]]*num_rows})

        df.to_parquet('/raid/tmp/tmp.pq')
        df.to_orc('/raid/tmp/tmp.orc')

        # Time the Parquet read inside an NVTX range so it is visible in Nsys
        with nvtx.annotate(f'pq {num_rows}', color="blue"):
            t0 = time.time()
            df2 = cudf.read_parquet('/raid/tmp/tmp.pq')
            t1 = time.time()

        print(f'parquet (numrows {num_rows})', t1-t0, 'seconds')

        # Time the ORC read of the same data
        with nvtx.annotate(f'orc {num_rows}', color="green"):
            t0 = time.time()
            df2 = cudf.read_orc('/raid/tmp/tmp.orc')
            t1 = time.time()

        print(f'orc (numrows {num_rows})', t1-t0, 'seconds')

With cuDF 23.06 and on my Epyc-A100 workstation, this script yields the following execution times:

    parquet (numrows 1000) 0.0058746337890625 seconds
    orc (numrows 1000) 0.003462076187133789 seconds
    parquet (numrows 50000000) 0.012792587280273438 seconds
    orc (numrows 50000000) 1.6445508003234863 seconds

Note that the Parquet and ORC times are comparable for 1,000 rows, but ORC is more than 100x slower (about 1.64 s versus 0.013 s) in the 50M-row case.
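
As a quick check on that comparison, the reported timings can be divided directly. This is only arithmetic on the numbers printed above; the names `timings` and `orc_slowdown` are illustrative and not part of cuDF.

```python
# Timings copied from the output above, in seconds.
timings = {
    1_000: {"parquet": 0.0058746337890625, "orc": 0.003462076187133789},
    50_000_000: {"parquet": 0.012792587280273438, "orc": 1.6445508003234863},
}


def orc_slowdown(num_rows):
    """Return the ORC read time divided by the Parquet read time."""
    t = timings[num_rows]
    return t["orc"] / t["parquet"]


for num_rows in timings:
    print(f"{num_rows} rows: ORC/Parquet = {orc_slowdown(num_rows):.2f}x")
```

On these numbers the ratio is about 0.6x at 1,000 rows (ORC is slightly faster) and about 129x at 50M rows.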

Nsys profiling shows that the bulk of the extra time is spent in a call to generate_offsets_for_list.

We missed this degenerate performance case previously because our benchmarks focus on wide tables with relatively large lists, rather than tall tables with short lists.

For list types the ORC reader needs to generate offsets from the sizes of nested lists. This process was previously being parallelized over columns. In practice even with wide tables we have enough rows that parallelizing over rows always makes more sense, so this PR swaps the parallelization strategy.

I also removed what appears to be an unnecessary stream synchronization. That likely won't affect performance in any microbenchmarks but is worthwhile in case it helps improve asynchronous execution overall.

There are still noticeable bottlenecks for deeply nested lists, but those are in the decode kernels so optimizing them is a separate task for future work.

Resolves #13674

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #13708
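
To make the "offsets from sizes" step concrete: in an Arrow-style list column, the offsets are an exclusive prefix sum of the per-row list sizes, with a leading 0. The sketch below shows that relationship on the host in plain Python. It is not the libcudf kernel, and the helper name `offsets_from_sizes` is hypothetical.

```python
from itertools import accumulate


def offsets_from_sizes(sizes):
    """Turn per-row list sizes into offsets with len(sizes) + 1 entries.

    offsets[i] is where row i starts in the flat child buffer and
    offsets[i + 1] is where it ends, i.e. a running sum with a leading 0.
    """
    return list(accumulate(sizes, initial=0))


# Tall table with short lists, like the reproducer: every row is [1].
print(offsets_from_sizes([1] * 5))  # [0, 1, 2, 3, 4, 5]

# Rows of varying length, including an empty list.
child = [10, 11, 12, 13, 14, 15]
offsets = offsets_from_sizes([2, 0, 3, 1])
rows = [child[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
print(rows)  # [[10, 11], [], [12, 13, 14], [15]]
```

The scan runs along the row dimension, so the amount of work grows with the row count. That is consistent with the PR's choice to parallelize over rows rather than columns, though the actual GPU implementation is not shown here.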

GregoryKimball added the 0 - Backlog, libcudf, cuIO, Performance, and Spark labels Jul 7, 2023

GregoryKimball added this to the ORC continuous improvement milestone Jul 7, 2023

GregoryKimball added this to libcudf Jul 7, 2023

GregoryKimball added the bug label Jul 10, 2023

GregoryKimball changed the title ~~[FEA] Improve ORC reader performance with list columns and high row counts~~ [BUG] Improve ORC reader performance with list columns and high row counts Jul 10, 2023

vyasr mentioned this issue Jul 17, 2023

Optimize ORC reader performance for list data #13708 (Merged)

vyasr self-assigned this Jul 17, 2023

rapids-bot bot closed this as completed in #13708 Jul 25, 2023

GregoryKimball removed this from libcudf Oct 26, 2023
