[FEA] Investigate the chunked parquet reader for Polars GPU engine #16818
Labels: cudf.polars (Issues specific to cudf.polars), feature request (New feature or request), Python (Affects Python cuDF API)
Some users experience out-of-memory errors during IO when loading datasets that they expect to fit comfortably on their GPU. This is currently less significant for cudf.pandas, as we now enable a prefetch-optimized unified memory configuration by default.
Because we don't currently have a similar UVM setup for the Polars GPU engine, this is an acute pain point that blocks usage for many workflows. We've developed chunked readers for Parquet and ORC files that may be able to help in this situation.
Initial testing suggests that a properly configured chunked parquet reader may be effective at reducing peak memory requirements without significantly impacting performance.
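For context on what the chunked reader does underneath, here is a minimal sketch of the read loop, assuming pylibcudf-style bindings for libcudf's chunked Parquet reader with `has_next`/`read_chunk`; the exact module paths, class names, and parameter spellings are assumptions and may differ from what ships:

```python
# Minimal sketch of the chunked read loop, assuming pylibcudf-style bindings
# for libcudf's chunked Parquet reader (exact names/signatures may differ).
import pylibcudf as plc

source = plc.io.SourceInfo(["lineitem.parquet"])
reader = plc.io.parquet.ChunkedParquetReader(
    source,
    pass_read_limit=16_000_000_000,  # bound temporary device memory per pass
)

chunks = []
while reader.has_next():
    # Each call materializes only part of the file, keeping peak memory low.
    chunks.append(reader.read_chunk())
```

The key tradeoff is that each pass touches the file again, so a smaller `pass_read_limit` lowers peak memory at the cost of more passes.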
For example, running PDS-H q7 at SF200 immediately hits an OOM with the default Parquet reader. With a `pass_read_limit` of 16GB for the chunked reader, we can smoothly finish the query and see a speedup on an H100 vs. the CPU engine on a high-end CPU.

(Screenshots: default CPU engine on a dual-socket Intel 8480CL; default GPU engine behavior with the cuda-async memory resource; GPU engine behavior with the cuda-async memory resource and `"pass_read_limit": 16024000000`.)
We should do a full evaluation of the chunked Parquet reader on the PDS-H benchmarks to empirically assess the potential opportunity and tradeoffs for chunked IO. Starting with Parquet makes sense, as it's a more common file format in the PyData world. We can expand from there, as needed.
cc @quasiben @brandon-b-miller (offline discussion)