Implement chunked Parquet reader #11867

ttnghia · 2022-10-05T21:22:09Z

This adds a chunked Parquet reader, which reads a file iteratively. Instead of reading the input file all at once, we can read it chunk by chunk, with each chunk limited to be small enough not to exceed the cudf internal limit (2GB / 2 billion rows):

auto reader = cudf::io::chunked_parquet_reader(byte_limit, read_opts);
do {
    auto const chunk = reader.read_chunk();
    // Process chunk
} while (reader.has_next());
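
To make the `has_next()` / `read_chunk()` contract in the loop above concrete, here is a minimal sketch using a toy reader. Everything in it is hypothetical (`toy_chunked_reader`, `rows_per_chunk`, and modelling a row only by its byte size); it is not the cudf implementation and says nothing about how cudf actually splits pages or computes sizes. It only illustrates one plausible way a byte limit can bound each chunk while the loop still makes progress.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a chunked reader; this is NOT the cudf implementation.
// Each "row" is represented only by its size in bytes.
class toy_chunked_reader {
 public:
  toy_chunked_reader(std::size_t byte_limit, std::vector<std::size_t> row_sizes)
    : byte_limit_{byte_limit}, row_sizes_{std::move(row_sizes)}
  {
    assert(byte_limit_ > 0);
  }

  // True while there are rows that have not been returned yet.
  [[nodiscard]] bool has_next() const { return next_row_ < row_sizes_.size(); }

  // Returns the sizes of the next run of rows whose total stays within byte_limit_.
  // A single row larger than the limit is still returned on its own so that the
  // loop always makes progress. Returns an empty chunk once the input is exhausted.
  [[nodiscard]] std::vector<std::size_t> read_chunk()
  {
    std::vector<std::size_t> chunk;
    std::size_t bytes = 0;
    while (next_row_ < row_sizes_.size()) {
      auto const size = row_sizes_[next_row_];
      if (!chunk.empty() && bytes + size > byte_limit_) { break; }
      chunk.push_back(size);
      bytes += size;
      ++next_row_;
    }
    return chunk;
  }

 private:
  std::size_t byte_limit_;
  std::vector<std::size_t> row_sizes_;
  std::size_t next_row_{0};
};

// Drives the reader with the same do/while shape as the loop above and
// returns the number of rows delivered in each chunk.
std::vector<std::size_t> rows_per_chunk(toy_chunked_reader reader)
{
  std::vector<std::size_t> counts;
  do {
    auto const chunk = reader.read_chunk();
    counts.push_back(chunk.size());
  } while (reader.has_next());
  return counts;
}
```

Because the loop is a `do`/`while`, `read_chunk()` is called once even for empty input, so in this sketch an empty source yields a single empty chunk rather than no iterations at all.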

…taining a mix of nested and non-nested types would result in incorrect row counts for the non-nested types. Also optimizes the preprocess path so that non-nested types do not end up getting visited by the kernel.

…ists. Fixed an additional issue in the decoding where flat column types underneath structs could end up ignoring skip_rows/num_rows.

Signed-off-by: Nghia Truong <[email protected]>

Signed-off-by: Nghia Truong <[email protected]> # Conflicts: # cpp/tests/CMakeLists.txt

vuule

Looks solid, just a few more nitpick-y suggestions.
Awesome PR!

cpp/src/io/parquet/page_data.cu

cpp/src/io/parquet/reader_impl_preprocess.cu

cpp/src/io/parquet/page_data.cu

vuule · 2022-11-16T23:46:45Z

cpp/include/cudf/io/detail/parquet.hpp

+  [[nodiscard]] bool has_next() const;
+
+  /**
+   * @copydoc cudf::io::chunked_parquet_reader::read_chunk
+   */
+  [[nodiscard]] table_with_metadata read_chunk() const;


Not sure if read_chunk should be const but we can leave it for now.
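
For context on the `const` question: a `const` member function can still advance reader state if the class forwards to an implementation object through a pointer, because constness applies to the pointer member and not to the object it points to. The sketch below uses hypothetical names (`counter_impl`, `chunked_facade`) and is only an illustration of that language rule; whether the reader in this PR is structured this way is an assumption, not something stated in the diff above.

```cpp
#include <cassert>
#include <memory>

// Hypothetical names throughout; a sketch of the pattern, not the cudf source.
class counter_impl {
 public:
  explicit counter_impl(int total) : remaining_{total} { assert(total >= 0); }

  bool has_next() const { return remaining_ > 0; }

  // Non-const: each call consumes one unit of remaining work.
  int read_chunk() { return remaining_ > 0 ? remaining_-- : 0; }

 private:
  int remaining_;
};

class chunked_facade {
 public:
  explicit chunked_facade(int total) : impl_{std::make_unique<counter_impl>(total)} {}

  [[nodiscard]] bool has_next() const { return impl_->has_next(); }

  // Compiles even though it is const: constness applies to the pointer impl_,
  // not to the counter_impl object it points to.
  [[nodiscard]] int read_chunk() const { return impl_->read_chunk(); }

 private:
  std::unique_ptr<counter_impl> impl_;
};
```

The trade-off is that `const` on the outer class then no longer signals "does not change observable state", which is the kind of concern the comment above raises.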

Signed-off-by: Nghia Truong <[email protected]>

vuule

Thanks for addressing most of the nitpicks!
We can address some remaining (riskier) clean up ideas in separate PRs.

ttnghia · 2022-11-17T23:35:01Z

@gpucibot merge

nvdbaranec · 2022-11-18T02:35:35Z

@gpucibot merge

This adds JNI for the chunked Parquet reader. It depends on the chunked Parquet reader implementation PR (#11867).

Authors:
- https://github.com/nvdbaranec
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- MithunR (https://github.com/mythrocks)
- Robert (Bobby) Evans (https://github.com/revans2)

nvdbaranec and others added 16 commits September 23, 2022 10:59

Merge branch 'branch-22.12' into reader_preprocess_fix_and_opt

f330431

Fixed an issue with the tests: input columns cannot have unsanitary l…

eadfd63

…ists. Fixed an additional issue in the decoding where flat column types underneath structs could end up ignoring skip_rows/num_rows.

Merge branch 'branch-22.12' into reader_preprocess_fix_and_opt

c4de038

Copy parquet_reader_* into chunked_parquet_reader_*

222c9fe

Signed-off-by: Nghia Truong <[email protected]>

Modify chunked_parquet_reader_options

f49cfed

Signed-off-by: Nghia Truong <[email protected]>

Exploit inheritance to extend the options and options_builder classes

dd39804

Signed-off-by: Nghia Truong <[email protected]>

Remove unnecessary variable

81bc68f

Signed-off-by: Nghia Truong <[email protected]>

Misc

f8126be

Signed-off-by: Nghia Truong <[email protected]>

Add docs

0e7692c

Signed-off-by: Nghia Truong <[email protected]>

PR feedback changes.

9f9eeb0

Merge branch 'branch-22.12' into reader_preprocess_fix_and_opt

9b3ea62

Fixed some compile errors from merging.

d2e409a

Add chunked_parquet_reader

ed41ac1

Signed-off-by: Nghia Truong <[email protected]>

Add empty implementation

be782f2

Signed-off-by: Nghia Truong <[email protected]>

Merge branch 'branch-22.12' into parquet_reader

7908b66

ttnghia added the feature request, 2 - In Progress, libcudf, Spark, and non-breaking labels Oct 5, 2022

ttnghia self-assigned this Oct 5, 2022

ttnghia changed the title ~~Implement chunked Parquet reader~~ Implement chunked Parquet reader [skip ci] Oct 5, 2022

ttnghia added 6 commits October 5, 2022 15:21

Add a destructor and close

a7175c8

Signed-off-by: Nghia Truong <[email protected]>

Update docs

63a7bd6

Signed-off-by: Nghia Truong <[email protected]>

Fix comment

16c12d9

Signed-off-by: Nghia Truong <[email protected]>

Construct chunked_parquet_reader

cd85385

Signed-off-by: Nghia Truong <[email protected]>

Add comment

5944beb

Signed-off-by: Nghia Truong <[email protected]>

Rename function and implementing

7cfa72a

Signed-off-by: Nghia Truong <[email protected]>

ttnghia added 4 commits November 16, 2022 06:47

Address some review comments

7203c67

Signed-off-by: Nghia Truong <[email protected]>

Merge branch 'branch-22.12' into parquet_reader

520448b

Fix #endif

96eed8e

Signed-off-by: Nghia Truong <[email protected]>

Merge branch 'branch-22.12' into parquet_reader

6697e3b

Signed-off-by: Nghia Truong <[email protected]> # Conflicts: # cpp/tests/CMakeLists.txt

ttnghia requested a review from vuule November 16, 2022 18:52

vuule reviewed Nov 16, 2022


nvdbaranec and others added 7 commits November 17, 2022 10:20

PR review changes. Updated some incorrect/incomplete function docs.

70f4fde

Made the logic in the row_total_size functor much more readable.

4547483

Merge branch 'branch-22.12' into parquet_reader

fe7d6d1

Fix the tests

db21bc3

Signed-off-by: Nghia Truong <[email protected]>

Variable renaming for clarity.

36043d8

Merge branch 'chunked_reader_gpu' into parquet_reader

83f0703

Merge branch 'branch-22.12' into parquet_reader

3499bda

nvdbaranec requested a review from vuule November 17, 2022 20:51

ttnghia mentioned this pull request Nov 17, 2022

Implement JNI for chunked Parquet reader #11961

Merged

vuule approved these changes Nov 17, 2022


rapids-bot bot merged commit 3fb09d1 into rapidsai:branch-22.12 Nov 18, 2022

ttnghia deleted the parquet_reader branch November 18, 2022 03:05

mythrocks mentioned this pull request Nov 21, 2022

[RELEASE] cudf v22.12 #12200

Merged

revans2 mentioned this pull request Nov 22, 2022

[FEA] Supported chunked reading of ORC files #12228

Closed

vuule mentioned this pull request Dec 6, 2022

[BUG] read_parquet performance regression on V100 #12316

Closed

etseidl mentioned this pull request Mar 17, 2023

[WIP] POC add data total size information to Parquet file metadata #12974

Closed

3 tasks

This was referenced Sep 10, 2023

Remove read_orc options num_rows and skip_rows. #11519

Closed

[FEA] Improve ORC reader filtering and performance #13882

Open

GregoryKimball mentioned this pull request Feb 26, 2024

[FEA] Add python bindings in the parquet reader for num_rows/skiprows #15144

Closed

2 tasks
