Use C++ to parse and filter parquet footers. #199
Conversation
Signed-off-by: Robert (Bobby) Evans <[email protected]>
build
std::vector<int> num_children_stack;
std::vector<column_pruner*> tree_stack;
tree_stack.push_back(this);
num_children_stack.push_back(schema[0].num_children);
are there any odd empty schemas in parquet we should guard against?
It is part of the standard that there must be a root to the schema. If you want me to add an extra check and throw a clearer exception, I can.
It would be nice just because there doesn't seem to be bounds checking here.
still pending.
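For reference, a minimal sketch of the kind of guard being discussed, assuming a simplified stand-in for the thrift SchemaElement type; the exception type and message here are illustrative, not necessarily what this PR ends up throwing:

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical simplified stand-in for the thrift SchemaElement; only the
// field used by the traversal is shown here.
struct schema_element {
  int num_children;
};

// Verify the schema has the root element the Parquet spec requires before the
// traversal indexes schema[0], and fail with a clear message otherwise.
void check_schema_has_root(std::vector<schema_element> const& schema) {
  if (schema.empty()) {
    throw std::invalid_argument("malformed parquet footer: schema has no root element");
  }
}
```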
build
build
@pxLi could you take a look at the CI failures? I am in over my head with the errors at this point.
I am also running into some issues with the Java Parquet parser if there are no columns in the footer. I am going to have to make a special-case bypass for the empty read schema case, where we just get the number of rows from the matching row groups. But I still need to figure out how to fit it all together.
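A minimal sketch of what that bypass could boil down to on the native side, assuming a simplified stand-in for the thrift RowGroup type (not the actual interface this PR adds):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical simplified stand-in for parquet::format::RowGroup; only the
// field needed for the empty-read-schema case is shown.
struct row_group {
  int64_t num_rows;
};

// With no columns requested, the only thing the footer has to provide is the
// total row count of the row groups that belong to this split.
int64_t total_rows(std::vector<row_group> const& matching_groups) {
  int64_t rows = 0;
  for (auto const& rg : matching_groups) {
    rows += rg.num_rows;
  }
  return rows;
}
```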
build
Just realized this change includes some Dockerfile changes. I will adjust some internal setup to meet the requirements.
ci/Dockerfile
@@ -27,7 +27,6 @@ FROM gpuci/cuda:$CUDA_VERSION-devel-centos7
RUN yum install -y centos-release-scl
RUN yum install -y devtoolset-9 rh-python38 epel-release
RUN yum install -y zlib-devel maven tar wget patch ninja-build
RUN yum -y install https://packages.endpoint.com/rhel/7/os/x86_64/endpoint-repo-1.7-1.x86_64.rpm && yum install -y git
The CI also requires git to do some work, otherwise it will fail with:
ci/premerge-build.sh: line 22: git: command not found
Seeing that the original repo is invalid. Our CI requires a newer git version than the default yum package provides; let me try to find an available one.
Merged the yum repo fix #208, and also fixed the internal pipeline to cover the Dockerfile change cases in this PR.
Please upmerge your branch and re-trigger the build, thanks~
Moving back to draft. I have found a number of incorrect assumptions I made about column pruning and the structure of the data. I am going to have to rewrite a lot of that code to make it generic enough to match things the way we want them.
build
build
After talking this over with others, I am going to put this in as-is and then we can improve it later on.
build
build
@abellina I think I have addressed most if not all of your review comments. Please take another look.
All nits at this point that could also be handled later.
*/
std::string unicode_to_lower(std::string const& input) {
  // get the size of the wide character result
  std::size_t wide_size = std::mbstowcs(nullptr, input.data(), 0);
Suggested change (whitespace only):
std::size_t wide_size = std::mbstowcs(nullptr, input.data(), 0);
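For context, a sketch of the usual mbstowcs/towlower/wcstombs round trip this helper appears to be built on (a sketch only, dependent on the process locale, and not the PR's exact implementation):

```cpp
#include <cstdlib>   // std::mbstowcs, std::wcstombs
#include <cwctype>   // std::towlower
#include <string>

std::string unicode_to_lower_sketch(std::string const& input) {
  // get the size of the wide character result
  std::size_t wide_size = std::mbstowcs(nullptr, input.data(), 0);
  if (wide_size == static_cast<std::size_t>(-1)) {
    return input;  // invalid multibyte sequence for the current locale; give up
  }
  // widen, lower-case each wide character, then narrow back
  std::wstring wide(wide_size, L'\0');
  std::mbstowcs(&wide[0], input.data(), wide_size);
  for (auto& wc : wide) {
    wc = static_cast<wchar_t>(std::towlower(static_cast<std::wint_t>(wc)));
  }
  std::size_t narrow_size = std::wcstombs(nullptr, wide.c_str(), 0);
  std::string result(narrow_size, '\0');
  std::wcstombs(&result[0], wide.c_str(), narrow_size);
  return result;
}
```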
// go back up the stack/tree removing children until we hit one with more children
bool done = false;
while (!done) {
  int parent_children_left = num_children_stack.back() - 1;
nit: 2 space indentation.
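For readers following the traversal, this is the "walk back up" step of a pre-order walk driven by a stack of remaining-children counts; a standalone sketch of how such a step typically works (assumptions about the surrounding code, not this PR's exact logic):

```cpp
#include <vector>

// When a subtree has been fully visited, decrement the parent's remaining-children
// count; parents that reach zero are finished too and get popped, until we find an
// ancestor that still has children left to visit (or the stacks run out).
template <typename Node>
void pop_finished_ancestors(std::vector<int>& num_children_stack,
                            std::vector<Node*>& tree_stack) {
  bool done = false;
  while (!done && !num_children_stack.empty()) {
    int parent_children_left = num_children_stack.back() - 1;
    if (parent_children_left > 0) {
      num_children_stack.back() = parent_children_left;  // more siblings to visit
      done = true;
    } else {
      num_children_stack.pop_back();  // this ancestor is finished as well
      tree_stack.pop_back();
    }
  }
}
```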
*/
class column_pruner {
public:
  /**
nit: 2 space indentation.
static std::vector<parquet::format::RowGroup> filter_groups(parquet::format::FileMetaData const& meta,
                                                            int64_t part_offset, int64_t part_length) {
  CUDF_FUNC_RANGE();
nit: the indentation is off in this function.
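For context on what filter_groups is doing, the conventional way Parquet readers assign a row group to exactly one split is a midpoint test against the split's byte range; below is a sketch with a hypothetical simplified row-group struct (the real code reads these values from the thrift metadata):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical simplified view of a row group; the actual implementation pulls
// these values out of parquet::format::RowGroup / ColumnChunk metadata.
struct row_group_info {
  int64_t start_offset;     // file offset where the row group's data begins
  int64_t total_byte_size;  // total size of the row group's data, in bytes
};

// Keep a row group when its midpoint lands in [part_offset, part_offset + part_length),
// so each row group is picked up by exactly one split.
std::vector<row_group_info> filter_groups_sketch(std::vector<row_group_info> const& groups,
                                                 int64_t part_offset, int64_t part_length) {
  std::vector<row_group_info> kept;
  for (auto const& rg : groups) {
    int64_t const mid = rg.start_offset + rg.total_byte_size / 2;
    if (mid >= part_offset && mid < part_offset + part_length) {
      kept.push_back(rg);
    }
  }
  return kept;
}
```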
@jlowe could you please take another look?
build
We have seen that for some Parquet files with a lot of columns, reading the footer can be the bottleneck. This will help to fix that, but it is just a first step. It does not include predicate push down yet, only range filtering for row groups that fall into a given split, plus column pruning. It still needs a lot of documentation and tests, but I wanted to get the code up sooner rather than later.