[BUG] Improve ORC reader performance with list columns and high row counts #13674
Is your feature request related to a problem? Please describe.
We observe degenerate performance in the ORC reader when reading list columns with a high row count. The Parquet reader does not have a similar issue.
Here is a cuDF-python code snippet for reproducing and profiling this issue:
With cuDF 23.06 and on my Epyc-A100 workstation, this script yields the following execution times:
Note how the Parquet and ORC times are similar for 1000 rows, but ORC is 100x slower for the 50M row case.
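A comparison of this kind can be sketched roughly as follows. This is a hypothetical reconstruction, not the script behind the timings reported in this issue: the helper names `make_short_lists` and `time_readers`, the column name `a`, and the list length of 2 are all assumptions chosen for illustration. It relies only on `cudf.DataFrame`, `DataFrame.to_orc`, `DataFrame.to_parquet`, `cudf.read_orc` and `cudf.read_parquet`.

```python
import os
import tempfile
import time


def make_short_lists(num_rows, list_len=2):
    """Build num_rows short integer lists (a 'tall table, short lists' shape).

    Hypothetical helper; the list length is an assumption, not from the issue.
    """
    return [[i + j for j in range(list_len)] for i in range(num_rows)]


def time_readers(num_rows):
    """Write one list column to Parquet and ORC, then time reading each back.

    Hypothetical helper; needs a GPU and a working cuDF installation.
    """
    import cudf  # imported lazily so the data helper works without cuDF

    df = cudf.DataFrame({"a": make_short_lists(num_rows)})
    timings = {}
    with tempfile.TemporaryDirectory() as tmp:
        for fmt, write, read in (
            ("parquet", df.to_parquet, cudf.read_parquet),
            ("orc", df.to_orc, cudf.read_orc),
        ):
            path = os.path.join(tmp, f"lists.{fmt}")
            write(path)
            start = time.perf_counter()
            read(path)
            timings[fmt] = time.perf_counter() - start
    return timings
```

Calling `time_readers(1000)` and then `time_readers` with a much larger row count would give the two data points being compared. Building tens of millions of rows from Python lists is slow on the host side, so a real benchmark would likely generate the data differently.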
Nsys profiling shows that the bulk of the extra time is spent in a call to `generate_offsets_for_list`.
We missed this degenerate performance case previously because our benchmarks focus on wide tables with relatively large lists, rather than tall tables with short lists.
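To illustrate why the tall-and-short shape stresses offset handling, here is a minimal host-side sketch of the general idea behind list offsets: a list column stores one flat child buffer plus an offsets array with one entry per row (plus one), obtained as a running sum of the per-row lengths. This is not how `generate_offsets_for_list` is implemented in libcudf; the function names `offsets_from_lengths` and `list_at` are assumptions made for this example.

```python
from itertools import accumulate


def offsets_from_lengths(lengths):
    """Turn per-row list lengths into offsets via a running sum.

    Row i of the list column spans values[offsets[i]:offsets[i + 1]].
    Illustrative only; not the libcudf implementation.
    """
    return [0, *accumulate(lengths)]


def list_at(offsets, values, row):
    """Return the list stored at the given row of the flattened column."""
    return values[offsets[row]:offsets[row + 1]]


lengths = [2, 0, 3]                       # three rows with short lists
offsets = offsets_from_lengths(lengths)   # [0, 2, 2, 5]
values = [10, 11, 12, 13, 14]             # flattened child data
```

With 50M rows of short lists, the offsets array is nearly as large as the child data itself, so any per-row cost in producing offsets dominates. With a few rows of large lists, the same cost would be negligible, which is consistent with the benchmarks not catching it.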