Is your feature request related to a problem? Please describe.
libcudf provides a chunked_parquet_reader in its public API. This reader uses two new reader options to process the data in a parquet file in sub-file units. The chunk_read_limit option limits the table size in bytes returned per read by only decoding a subset of pages per chunked read. The pass_read_limit option limits the memory used for reading and decompressing data by only decompressing a subset of pages per chunked read.
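For illustration, here is a minimal Python-flavored sketch of the read loop that libcudf's C++ chunked_parquet_reader exposes (has_next()/read_chunk()). The libcudf_bindings module, the ChunkedParquetReader class name, and the keyword arguments below are hypothetical placeholders, not an existing cuDF or pylibcudf API:

```python
# Sketch only: "libcudf_bindings" is a hypothetical placeholder for a Python
# binding over libcudf's C++ chunked_parquet_reader; cuDF does not expose this today.
import libcudf_bindings as plc  # hypothetical module

# chunk_read_limit: cap (in bytes) on the size of each table chunk returned per read.
# pass_read_limit: cap (in bytes) on memory used for reading/decompressing per pass.
reader = plc.ChunkedParquetReader(
    "data.parquet",
    chunk_read_limit=1 << 30,  # ~1 GiB output chunks
    pass_read_limit=4 << 30,   # ~4 GiB decompression budget
)

chunks = []
while reader.has_next():                # mirrors the C++ reader's has_next()
    chunks.append(reader.read_chunk())  # mirrors the C++ reader's read_chunk()
```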
The chunked parquet reader allows cuDF-python to expose two types of useful functionality:
1. An API that acts as an iterator to yield dataframe chunks. This is similar to the iter_row_groups behavior in fastparquet (illustrated after this list). This approach would let users work with parquet files that contain more than ~2.1 billion rows (see [FEA] Add 64-bit size type option at build-time for libcudf #13159 for more information about the row limit in libcudf).
a "low_memory" mode that reads the full file, but has a lower peak memory footprint thanks to the smaller sizes of intermediate allocations. This is similar to the the low_memory argument in polars. This approach would make it easier to read large parquet datasets with limited GPU memory.
Describe the solution you'd like
We should make chunked parquet reading available to cuDF-python users. Perhaps this functionality could be made available to cudf.pandas users as well.
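As a rough sketch of what the user-facing surface might look like (the keyword names chunked, chunk_read_limit, and low_memory below are hypothetical proposals, not existing cuDF parameters):

```python
import cudf

# Hypothetical iterator-style API: yield one cudf.DataFrame per chunk, so files
# with more rows than libcudf's 32-bit size_type limit can still be processed.
for chunk in cudf.read_parquet(
    "data.parquet", chunked=True, chunk_read_limit=1 << 30
):
    process(chunk)  # user-supplied per-chunk work

# Hypothetical "low_memory" mode: read the whole file into one DataFrame, but
# drive the chunked reader internally to shrink peak intermediate allocations.
df = cudf.read_parquet("data.parquet", low_memory=True)
```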
Additional context
Pandas does not seem to have a method for chunking parquet reads, and I'm not sure if pandas makes use of the iter_row_groups behavior in fastparquet as a pass-through parameter.