Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Parquet reader #12046

Merged
merged 172 commits into from
Nov 10, 2022
Merged

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Nov 2, 2022

This is a rather non-simple refactor of Parquet reader, no new features or changes in algorithms were made:

  • Rename some functions.
  • Moving a lot of declarations and definitions of functions/structs/classes around.
  • Extract out some functions/structs/classes and put them into new files.
  • Rewrite doxgen for some functions
  • Use aliases for member variables (to shorten their names), instead of passing them as function parameters
  • Etc.

Note that this is merely moving the current implementation around, preparing for adding chunked Parquet reader which is a fairly large implementation.

This is also a blocker for:

nvdbaranec and others added 30 commits September 23, 2022 10:59
…taining a mix of nested and non-nested types would

result in incorrect row counts for the non-nested types. Also optimizes the preprocess path so that non-nested types
do not end up getting visited by the kernel.
…ists. Fixed an additional issue in the decoding where flat column types underneath

structs could end up ignoring skip_rows/num_rows.
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>

# Conflicts:
#	cpp/src/io/parquet/page_data.cu
#	cpp/src/io/parquet/reader_impl.cu
#	cpp/src/io/parquet/reader_impl.hpp
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
@ttnghia ttnghia changed the title Refactor Parquet reader Refactor Parquet reader and writer Nov 3, 2022
@ttnghia ttnghia changed the title Refactor Parquet reader and writer Refactor Parquet reader Nov 3, 2022
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake LGTM

@karthikeyann
Copy link
Contributor

rerun tests

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits.

Looks good!

cpp/src/io/parquet/writer_impl.cu Show resolved Hide resolved
cpp/include/cudf/io/detail/parquet.hpp Show resolved Hide resolved
cpp/src/io/parquet/parquet_gpu.cuh Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.hpp Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very hard to review with all moved code, but (AFAICT) looks great!
Got some small suggestions, some are maybe not the kind of code improvements that this PR aims for.

cpp/src/io/parquet/reader_impl.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.hpp Outdated Show resolved Hide resolved
Comment on lines +55 to +56
size_t const start_row; // TODO source index
size_type const source_index;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea why the TODO is still here?

Also, I don't think these members should be const, it only does harm by preventing moves.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nvdbaranec Do you have any idea why TODO?

cpp/src/io/parquet/reader_impl_helpers.hpp Show resolved Hide resolved
/**
* @brief Function that translates Parquet datatype to cuDF type enum
*/
type_id to_type_id(SchemaElement const& schema,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like this and to_data_type should be SchemaElement members, but this is probably out of scope for this PR. Also, do we really need to_type_id when we have to_data_type?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_type_id is used to interpret parquet schema information and determine the corresponding cudf type during reading. to_data_type takes a type_id and a parquet schema to determine the eventual column logical type.

Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots to process, but looking good.

cpp/src/io/parquet/reader_impl.cpp Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.cpp Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.cpp Outdated Show resolved Hide resolved
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
@ttnghia
Copy link
Contributor Author

ttnghia commented Nov 10, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 70c7b7a into rapidsai:branch-22.12 Nov 10, 2022
@ttnghia ttnghia deleted the refactor_parquet_reader branch November 14, 2022 18:24
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuIO Reviewer labels Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants