
[FEA] Incorporate chunked parquet reading into cuDF-python #14966

Open
GregoryKimball opened this issue Feb 4, 2024 · 0 comments
Assignees: galipremsagar
Labels: feature request (New feature or request), Python (Affects Python cuDF API)

Comments

GregoryKimball (Contributor) commented Feb 4, 2024

Is your feature request related to a problem? Please describe.
libcudf provides a chunked_parquet_reader in its public API. This reader uses new reader options to process the data in a parquet file in sub-file units. The chunk_read_limit option limits the table size in bytes to be returned per read by only decoding a subset of pages per chunked read. The pass_read_limit option limits the memory used for reading and decompressing data by only decompressing a subset of pages per chunked read.
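The effect of a chunk_read_limit-style bound can be sketched with a small stdlib-only simulation. This is purely illustrative: the real chunked_parquet_reader decodes actual parquet pages on the GPU, and the function name and page-size inputs below are hypothetical stand-ins.

```python
# Illustrative sketch only (assumption: not the libcudf API). Models how a
# chunk_read_limit-style bound groups decoded "pages" into chunks so that
# each read returns at most roughly chunk_read_limit bytes.

def chunked_read(page_sizes, chunk_read_limit):
    """Yield lists of page indices whose combined byte size stays at or
    under chunk_read_limit. A single oversized page is still emitted on
    its own, mirroring a best-effort rather than hard limit."""
    chunk, chunk_bytes = [], 0
    for i, size in enumerate(page_sizes):
        if chunk and chunk_bytes + size > chunk_read_limit:
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(i)
        chunk_bytes += size
    if chunk:
        yield chunk

# Example: pages of 40, 60, 100, and 30 bytes with a 100-byte limit.
chunks = list(chunked_read([40, 60, 100, 30], chunk_read_limit=100))
# → [[0, 1], [2], [3]]
```

pass_read_limit would apply the same idea one stage earlier, bounding how many pages are decompressed per pass rather than how many are decoded into the output table.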

The chunked parquet reader allows cuDF-python to expose two types of useful functionality:

  1. an API that acts as an iterator, yielding dataframe chunks. This is similar to the iter_row_groups behavior in fastparquet. This approach would let users work with parquet files that contain more than 2.1B rows (see [FEA] Add 64-bit size type option at build-time for libcudf #13159 for more information about the row limit in libcudf).
  2. a "low_memory" mode that reads the full file but has a lower peak memory footprint, thanks to the smaller sizes of intermediate allocations. This is similar to the low_memory argument in polars. This approach would make it easier to read large parquet datasets with limited GPU memory.
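The iterator style in point 1 could look roughly like the sketch below. The function name read_parquet_chunked is hypothetical (cuDF does not necessarily expose this name), and a plain generator over lists stands in for the GPU reader, so the shape of the loop, not the implementation, is the point.

```python
# Hypothetical API shape only (assumption: names are not real cuDF calls).
# A generator re-batches row groups into fixed-size "dataframe chunks",
# mirroring fastparquet's iter_row_groups-style streaming iteration.

def read_parquet_chunked(row_groups, rows_per_chunk):
    """Yield successive chunks of at most rows_per_chunk rows, so a
    caller never materializes the whole file at once."""
    buffer = []
    for group in row_groups:
        buffer.extend(group)
        while len(buffer) >= rows_per_chunk:
            yield buffer[:rows_per_chunk]
            buffer = buffer[rows_per_chunk:]
    if buffer:
        yield buffer  # final partial chunk

# A caller streams chunks instead of reading the full file:
groups = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks = list(read_parquet_chunked(groups, rows_per_chunk=4))
# → [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```

Because each yielded chunk can be smaller than the 2.1B-row size-type limit, iteration in this style sidesteps the per-table row cap even when the file as a whole exceeds it.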

Describe the solution you'd like
We should make chunked parquet reading available to cuDF-python users. Perhaps this functionality could be made available to cudf.pandas users as well.

Additional context
Pandas does not seem to have a method for chunking parquet reads, and I'm not sure whether pandas passes fastparquet's iter_row_groups behavior through as a parameter.

API docs references:

GregoryKimball added the feature request, 0 - Backlog, and Python labels Feb 4, 2024
GregoryKimball moved this to To be revisited in libcudf Feb 8, 2024
galipremsagar self-assigned this May 10, 2024
galipremsagar removed the 0 - Backlog label May 10, 2024
galipremsagar moved this from To be revisited to In progress in libcudf May 14, 2024
galipremsagar moved this from In progress to To be revisited in libcudf May 14, 2024
rapids-bot bot pushed a commit that referenced this issue Jun 6, 2024
Partially Addresses: #14966 

This PR implements chunked parquet bindings in python.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Thomas Li (https://github.com/lithomas1)

URL: #15728
@vyasr vyasr added this to cuDF Python Nov 5, 2024