Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use C++ to parse and filter parquet footers. #199

Merged
merged 13 commits into from
May 6, 2022

Conversation

revans2
Copy link
Collaborator

@revans2 revans2 commented Apr 25, 2022

We have seen on some parquet files where there are a lot of columns that reading the footer can be the bottleneck. This will help to fix it, but it is just a first step. It does not include predicate push down yet. Just range filtering for row groups that fall into a given split and column pruning. It still needs a lot of documentation and tests, but I wanted to get the code up sooner than later.

@revans2 revans2 marked this pull request as ready for review April 26, 2022 16:19
@revans2
Copy link
Collaborator Author

revans2 commented Apr 26, 2022

build

build/Dockerfile.centos7 Outdated Show resolved Hide resolved
build/build-in-docker Outdated Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
src/main/cpp/CMakeLists.txt Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Show resolved Hide resolved
std::vector<int> num_children_stack;
std::vector<column_pruner*> tree_stack;
tree_stack.push_back(this);
num_children_stack.push_back(schema[0].num_children);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any odd empty schemas in parquet we should guard against?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is part of the standard that there must be a root to the schema. If you want to add an extra check and throw a more clear exception I can.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice just because there doesn't seem to be bounds checking here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still pending.

src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
@revans2
Copy link
Collaborator Author

revans2 commented Apr 27, 2022

build

@revans2
Copy link
Collaborator Author

revans2 commented Apr 27, 2022

build

@revans2
Copy link
Collaborator Author

revans2 commented Apr 27, 2022

@pxLi could you take a look at the CI failures. I am in over my head with the errors at this point.

@revans2
Copy link
Collaborator Author

revans2 commented Apr 27, 2022

I am also running into some issues with the java parquet parser if there are no columns in the footer. I am going to have to make a special case bypass for the empty read schema case, where we just get out the number of rows from the matching row groups. But I still need to figure out how to fit it all together.

@pxLi
Copy link
Collaborator

pxLi commented Apr 27, 2022

build

@pxLi
Copy link
Collaborator

pxLi commented Apr 27, 2022

@pxLi could you take a look at the CI failures. I am in over my head with the errors at this point.

just realize this change includes some dockerfile change, I will adjust some internal setup to meet the requirements

ci/Dockerfile Outdated
@@ -27,7 +27,6 @@ FROM gpuci/cuda:$CUDA_VERSION-devel-centos7
RUN yum install -y centos-release-scl
RUN yum install -y devtoolset-9 rh-python38 epel-release
RUN yum install -y zlib-devel maven tar wget patch ninja-build
RUN yum -y install https://packages.endpoint.com/rhel/7/os/x86_64/endpoint-repo-1.7-1.x86_64.rpm && yum install -y git
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the CI also required git to do some work, otherwise it will fail

ci/premerge-build.sh: line 22: git: command not found

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seeing the original repo is invalid.
Our CI requires some newer git version instead of default yum pkg. Let me try find available one

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merged yum repo fix #208. and also fixed internal pipeline to cover dockerfile change cases in this PR.

Please help upmerge your branch and try re-trigger the build, thanks~

@revans2 revans2 marked this pull request as draft April 28, 2022 14:07
@revans2
Copy link
Collaborator Author

revans2 commented Apr 28, 2022

Moving back to draft. I have found a number of incorrect assumptions I made about column pruning and the structure of the data. I am going to have to re-write a lot of that code to make it generic enough to match things the way we want them.

@revans2
Copy link
Collaborator Author

revans2 commented Apr 28, 2022

build

@revans2 revans2 marked this pull request as ready for review April 28, 2022 18:59
@revans2
Copy link
Collaborator Author

revans2 commented Apr 28, 2022

build

@revans2
Copy link
Collaborator Author

revans2 commented Apr 28, 2022

After talking this over with others I am going to put this in as is and then we can improve it later on.

@revans2
Copy link
Collaborator Author

revans2 commented Apr 28, 2022

build

@revans2
Copy link
Collaborator Author

revans2 commented Apr 28, 2022

build

src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
src/main/cpp/src/NativeParquetJni.cpp Outdated Show resolved Hide resolved
@revans2
Copy link
Collaborator Author

revans2 commented May 4, 2022

@abellina I think I have addressed most if not all of your review comments. Please take another look.

Copy link
Collaborator

@abellina abellina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All nits at this point that could also be handled later.

*/
std::string unicode_to_lower(std::string const& input) {
// get the size of the wide character result
std::size_t wide_size = std::mbstowcs(nullptr, input.data(), 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::size_t wide_size = std::mbstowcs(nullptr, input.data(), 0);
std::size_t wide_size = std::mbstowcs(nullptr, input.data(), 0);

// go back up the stack/tree removing children until we hit one with more children
bool done = false;
while (!done) {
int parent_children_left = num_children_stack.back() - 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit 2 space indentation.

*/
class column_pruner {
public:
/**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit 2 space indentation.


static std::vector<parquet::format::RowGroup> filter_groups(parquet::format::FileMetaData const& meta,
int64_t part_offset, int64_t part_length) {
CUDF_FUNC_RANGE();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit indentation is off in this function

abellina
abellina previously approved these changes May 6, 2022
@revans2
Copy link
Collaborator Author

revans2 commented May 6, 2022

@jlowe could you please take another look?

build-libcudf.xml Outdated Show resolved Hide resolved
build/Dockerfile.centos7 Outdated Show resolved Hide resolved
@revans2
Copy link
Collaborator Author

revans2 commented May 6, 2022

build

@revans2 revans2 merged commit 4b8d8f8 into NVIDIA:branch-22.06 May 6, 2022
@revans2 revans2 deleted the cpp_parquet_footer_parse branch May 6, 2022 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants