Implement chunked Parquet reader #11867

Merged
merged 181 commits into from
Nov 18, 2022

@ttnghia ttnghia commented Oct 5, 2022

This adds a chunked Parquet reader, which reads a file iteratively. Instead of reading the input file all at once, we can read it chunk by chunk, with each chunk limited to a size small enough not to exceed the cudf internal limit (2 GB / 2 billion rows):

auto reader = cudf::io::chunked_parquet_reader(byte_limit, read_opts);
do {
    auto const chunk = reader.read_chunk();
    // Process chunk
} while (reader.has_next());
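
To make the control flow concrete, here is a toy, CPU-only sketch of the same `has_next()` / `read_chunk()` pattern. It does not use libcudf: `toy_chunked_reader`, its vector of per-row byte sizes, and `read_all_chunk_sizes` are hypothetical names made up for illustration, and the real reader's chunking logic is considerably more involved. The sketch assumes that a chunk holds whole rows and that a single row larger than the limit is still returned as a chunk of its own.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a chunked reader: NOT the libcudf implementation.
// Each "row" is represented only by its size in bytes.
class toy_chunked_reader {
 public:
  toy_chunked_reader(std::size_t byte_limit, std::vector<std::size_t> row_sizes)
    : byte_limit_{byte_limit}, row_sizes_{std::move(row_sizes)}
  {
  }

  // True while there are rows that have not been returned yet.
  [[nodiscard]] bool has_next() const { return next_row_ < row_sizes_.size(); }

  // Returns the number of rows in the next chunk. Rows are added while the
  // running byte total stays within the limit. A chunk always takes at least
  // one row, so an oversized row cannot stall the loop.
  std::size_t read_chunk()
  {
    std::size_t bytes = 0;
    std::size_t rows  = 0;
    while (next_row_ < row_sizes_.size()) {
      std::size_t const sz = row_sizes_[next_row_];
      if (rows > 0 && bytes + sz > byte_limit_) { break; }
      bytes += sz;
      ++rows;
      ++next_row_;
    }
    return rows;
  }

 private:
  std::size_t byte_limit_;
  std::vector<std::size_t> row_sizes_;
  std::size_t next_row_ = 0;
};

// Drains the reader and records how many rows each chunk contained.
std::vector<std::size_t> read_all_chunk_sizes(toy_chunked_reader& reader)
{
  std::vector<std::size_t> chunk_rows;
  while (reader.has_next()) {
    chunk_rows.push_back(reader.read_chunk());
  }
  return chunk_rows;
}
```

With five 4-byte rows and an 8-byte limit, the sketch produces chunks of 2, 2, and 1 rows. It uses a plain `while` loop rather than `do`/`while` so that an empty input yields no chunks; that is a choice made for the toy, not a statement about how the real reader behaves.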

nvdbaranec and others added 16 commits September 23, 2022 10:59
…taining a mix of nested and non-nested types would

result in incorrect row counts for the non-nested types. Also optimizes the preprocess path so that non-nested types
do not end up getting visited by the kernel.
…ists. Fixed an additional issue in the decoding where flat column types underneath

structs could end up ignoring skip_rows/num_rows.
Signed-off-by: Nghia Truong <[email protected]>
@ttnghia added the labels feature request, 2 - In Progress, libcudf, Spark, and non-breaking on Oct 5, 2022
@ttnghia self-assigned this on Oct 5, 2022
@ttnghia changed the title from "Implement chunked Parquet reader" to "Implement chunked Parquet reader [skip ci]" on Oct 5, 2022
Signed-off-by: Nghia Truong <[email protected]>

# Conflicts:
#	cpp/tests/CMakeLists.txt
@ttnghia ttnghia requested a review from vuule November 16, 2022 18:52

@vuule vuule left a comment


Looks solid, just a few more nitpick-y suggestions.
Awesome PR!

Resolved (outdated) review threads:
cpp/src/io/parquet/page_data.cu (3 threads)
cpp/src/io/parquet/reader_impl_preprocess.cu (6 threads)
Comment on lines +132 to +137
[[nodiscard]] bool has_next() const;

/**
* @copydoc cudf::io::chunked_parquet_reader::read_chunk
*/
[[nodiscard]] table_with_metadata read_chunk() const;

Not sure if read_chunk should be const but we can leave it for now.
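
The concern is easier to see with a small sketch. In C++, a `const` member function can still change state reached through a pointer member, because `const` applies to the pointer itself and not to the object it points to. The example below assumes a pimpl-style wrapper, which the excerpt above does not show; `wrapper`, `impl`, `next_index`, and `read_twice` are hypothetical names, not libcudf types.

```cpp
#include <memory>

// Hypothetical implementation object holding the mutable read position.
struct impl {
  int next_index = 0;
};

// Hypothetical wrapper: NOT the libcudf class.
class wrapper {
 public:
  wrapper() : impl_{std::make_unique<impl>()} {}

  // Declared const, yet it advances the position stored behind the pointer.
  // This compiles because const-ness does not propagate through unique_ptr.
  [[nodiscard]] int read_chunk() const { return impl_->next_index++; }

 private:
  std::unique_ptr<impl> impl_;
};

// Takes the wrapper by const reference and still observes changing results.
int read_twice(wrapper const& w)
{
  int const first  = w.read_chunk();
  int const second = w.read_chunk();
  return second - first;
}
```

Two successive calls on the same `const` object return different values, which is why a reader might reasonably expect a method like this to be non-const.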


@vuule vuule left a comment


Thanks for addressing most of the nitpicks!
We can address some remaining (riskier) clean up ideas in separate PRs.


ttnghia commented Nov 17, 2022

@gpucibot merge

@nvdbaranec

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 3fb09d1 into rapidsai:branch-22.12 Nov 18, 2022
@ttnghia ttnghia deleted the parquet_reader branch November 18, 2022 03:05
ajschmidt8 pushed a commit that referenced this pull request Nov 18, 2022
This adds JNI bindings for the chunked Parquet reader. It depends on the chunked Parquet reader implementation PR (#11867).

Authors:
   - https://github.com/nvdbaranec
   - Nghia Truong (https://github.com/ttnghia)

Approvers:
   - MithunR (https://github.com/mythrocks)
   - Robert (Bobby) Evans (https://github.com/revans2)
Labels: 3 - Ready for Review, CMake, feature request, libcudf, non-breaking, Spark
Development

Successfully merging this pull request may close these issues.

[FEA] Split batches from parquet that are too large, and try to guess better before decompressing
5 participants