Add in better native parquet footer implementation and remove the old one #365

revans2 · 2022-07-11T14:39:04Z

This uses the new API to provide an implementation that fixes the issues with older (legacy) parquet files.

This depends on #362

I think this will fix NVIDIA/spark-rapids#5493 because it does not reorder the columns like the old implementation did.

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 · 2022-07-11T20:35:13Z

build

revans2 · 2022-07-12T18:20:09Z

build

src/main/cpp/src/NativeParquetJni.cpp

hyperbolic2346

A few comments. I'll look at this more later, but wanted to get these down.

hyperbolic2346 · 2022-07-12T19:57:15Z

src/main/cpp/src/NativeParquetJni.cpp

+  const char * tag_name = "unknown";
+  if (tag == VALUE_TAG) {
+    tag_name = "value";
+  } else if (tag == STRUCT_TAG) {
+    tag_name = "struct";
+  } else if (tag == LIST_TAG) {
+    tag_name = "list";
+  } else if (tag == MAP_TAG) {
+    tag_name = "map";
+  }
+  return tag_name;


Suggested change

switch(tag) {
  case VALUE_TAG: return "value";
  case STRUCT_TAG: return "struct";
  case LIST_TAG: return "list";
  case MAP_TAG: return "map";
  default: return "unknown";
}

Maybe have an error on the default case? I'm not positive on the usage.

Hmm, maybe? So default: throw exception("Invalid tag");?

Get 3 C++ programmers in a room, you get 4 opinions. :p

std::string get_tag_name(const int tag) {
  static auto const names = std::vector{ "value", "struct", "list", "map" };
  return ( tag>=VALUE_TAG && tag<=MAP_TAG ) ? names[tag] : "unknown";
}

I'm kidding, of course.
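
For reference, a minimal sketch of the switch version with the throwing default that was floated above. The enum values and the choice of std::invalid_argument are assumptions made for illustration; the real tag constants are defined in NativeParquetJni.cpp and may differ.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical tag values, chosen only so the sketch compiles on its own.
enum Tag : int { VALUE_TAG = 0, STRUCT_TAG = 1, LIST_TAG = 2, MAP_TAG = 3 };

// Maps a tag to a printable name. An unrecognized tag is treated as an error
// instead of being silently reported as "unknown".
inline const char* get_tag_name(int tag) {
  switch (tag) {
    case VALUE_TAG: return "value";
    case STRUCT_TAG: return "struct";
    case LIST_TAG: return "list";
    case MAP_TAG: return "map";
    default:
      throw std::invalid_argument("Invalid tag: " + std::to_string(tag));
  }
}
```

Whether throwing is appropriate depends on how the name is used: if it only feeds log or error messages, returning "unknown" may be the safer choice.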

hyperbolic2346 · 2022-07-12T20:00:37Z

src/main/cpp/src/NativeParquetJni.cpp

+      if (normalize_case) {
+        return unicode_to_lower(elem.name);
+      } else {
+        return elem.name;


Suggested change

return normalize_case ? unicode_to_lower(elem.name) : elem.name;

hyperbolic2346 · 2022-07-12T20:02:31Z

src/main/cpp/src/NativeParquetJni.cpp

+      if (found_it != children.end()) {
+        return &(found_it->second);
+      }
+      return nullptr;


Suggested change

return found_it != children.end() ? &(found_it->second) : nullptr;

Would this be safer to not return a pointer?

Yes, but I am not sure how to do it with the iterators in a clean way. Suggestions are welcome.

The solution here would be returning a reference instead of a pointer. But by doing so, this function must throw a not found exception if the element doesn't exist. As such, you can do something like:

const column_pruner& find_child(std::string name) {
  auto found_it = children.find(name);
  if (found_it == children.end()) { throw exception(...); }
  return found_it->second;
}

if (exist_child(name)) {
  auto const& child = find_child(name);
  // process child
}

My first thought was to simply return the iterator, but then the caller has to know the .end() of children, which is an implementation detail. I'm leaning towards an optional string return value right now.

Hmm, I just realized that my solution above has duplicate computation: the exist_child() check also needs to call children.find(name).

So how about this: using std::optional<std::reference_wrapper<column_pruner>>, like:

#include <functional>
#include <optional>
....
std::optional<std::reference_wrapper<column_pruner>> find_child(std::string name) {
  auto found_it = children.find(name);
  if (found_it == children.end()) { return std::nullopt; }
  return std::optional<std::reference_wrapper<column_pruner>>{std::ref(found_it->second)};
}
...
const auto child_ref = find_child(name);
if (child_ref) { // the result is an optional, which can be checked like a boolean
  const column_pruner& child = *child_ref;
  // process child
}

std::optional does not work for references. Apparently it is a union and you cannot store a reference inside a union. At least that is the error message I got when I tried to use it.

Yes, in order to use a reference in <optional> you need to wrap it in reference_wrapper. Here is an example: https://godbolt.org/z/o3MhMfczY
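
A self-contained sketch of the reference_wrapper approach discussed above, for anyone who wants to try it locally. The child_info type and its tag field are stand-ins invented for this example (the real map holds column_pruner objects), and the sketch assumes C++17 for std::optional.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for the mapped type; the real code stores column_pruner.
struct child_info {
  int tag = 0;
};

struct column_pruner {
  std::map<std::string, child_info> children;

  using child_ref = std::optional<std::reference_wrapper<const child_info>>;

  // Returns an empty optional when no child has the given name, so the caller
  // never sees a raw pointer or the map's end() iterator.
  child_ref find_child(const std::string& name) const {
    auto const found_it = children.find(name);
    if (found_it == children.end()) { return std::nullopt; }
    return std::cref(found_it->second);
  }
};
```

The lookup runs once, and the caller tests the result like a boolean before calling get() on the wrapper to reach the child.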

ttnghia · 2022-07-12T20:09:12Z

src/main/cpp/src/NativeParquetJni.cpp

+      if (current_input_schema_index >= schema.size()) {
+        throw std::runtime_error("walked off the end of the schema some how...");


I'm not sure this bounds check is still needed, since the index will be checked again later when it is used to access the schema.
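
Whether the later access really repeats the check depends on how the element is read, which is not visible in this hunk. The sketch below contrasts the two cases; std::vector<int> stands in for the real std::vector<parquet::format::SchemaElement>, and both function names are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Explicit check with a domain-specific message, mirroring the diff above.
inline int element_at_checked(const std::vector<int>& schema, std::size_t index) {
  if (index >= schema.size()) {
    throw std::runtime_error("walked off the end of the schema");
  }
  return schema[index];  // operator[] on its own performs no bounds check
}

// Relies on std::vector::at, which throws std::out_of_range on a bad index.
inline int element_at(const std::vector<int>& schema, std::size_t index) {
  return schema.at(index);
}
```

If the later access uses at(), the explicit check mainly buys a clearer error message; if it uses operator[], the explicit check is the only thing preventing undefined behavior.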

ttnghia · 2022-07-12T20:10:35Z

src/main/cpp/src/NativeParquetJni.cpp

+    void filter_schema(std::vector<parquet::format::SchemaElement> & schema, const bool ignore_case,
+            std::size_t & current_input_schema_index, std::size_t & next_input_chunk_index,
+            std::vector<int> & chunk_map, std::vector<int> & schema_map, std::vector<int> & schema_num_children) {


A "good" practice is to split such a giant function into multiple smaller functions (like filter_schema_value, filter_schema_struct, etc.). It is not strictly required here, but it is recommended.

ttnghia · 2022-07-12T20:12:06Z

In addition to the comments I left, can we have unit tests in this repo?

hyperbolic2346

The C++ code could in general get a smattering of const. I can add some comments in places if desired. All told, this seems reasonable. I agree with Nghia's comments as well.

As for testing, we have tests for cpp and java. I would assume this would be tested via the java side only, and I believe those are set up to run in CI.

revans2 · 2022-07-14T21:19:51Z

The problem with tests is that the only way to test it is with parquet files. We could do some round trip tests and see that it works properly, but I found it was much simpler to test it in the plugin itself. I know there are some big downsides to this and if we really want to be alerted earlier if there is a regression I can add tests here.

hyperbolic2346 · 2022-07-14T21:22:29Z

I know there are some big downsides to this and if we really want to be alerted earlier if there is a regression I can add tests here.

The critical thing for me is that it is tested regularly and hopefully in an automated way. If we don't know about a breakage until the spark level, I'm not that concerned as long as testing is done.

revans2 · 2022-07-14T21:25:53Z

Just pushed another commit that addresses everything except the ongoing discussion about getting the child and splitting up the one large function. I'll take a look at them tomorrow morning but I wanted to get something posted for what I had done so far.

revans2 · 2022-07-15T16:50:44Z

build

revans2 · 2022-07-15T16:56:49Z

@hyperbolic2346 and @ttnghia could you take another look?

revans2 · 2022-07-18T16:16:38Z

build

revans2 · 2022-07-18T17:57:37Z

@hyperbolic2346 can you please take a look at the latest changes?

hyperbolic2346

Few comments, nothing major. Looking good!

src/main/cpp/src/NativeParquetJni.cpp

src/main/java/com/nvidia/spark/rapids/jni/ParquetFooter.java

revans2 · 2022-07-18T20:30:37Z

@hyperbolic2346 thanks for the new review. I think I have addressed all of your comments. Please take another look.

revans2 · 2022-07-18T20:30:42Z

build

revans2 added 2 commits July 8, 2022 13:32

Add new API that and deprecate the old one

9041069

Signed-off-by: Robert (Bobby) Evans <[email protected]>

Add in better native parquet footer implementation and remove the old…

39263a6

… one Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 added the bug Something isn't working label Jul 11, 2022

revans2 self-assigned this Jul 11, 2022

revans2 added 2 commits July 11, 2022 09:43

Addressed review comments

1396203

Merge branch 'new_api_parquet_footer' into general_native_parquet

ec87ba5

revans2 mentioned this pull request Jul 11, 2022

Adding AUTO native parquet support and legacy tests NVIDIA/spark-rapids#5983

Merged

Merge branch 'branch-22.08' into general_native_parquet

2c5fdcc

revans2 marked this pull request as ready for review July 11, 2022 20:33

Merge branch 'branch-22.08' into general_native_parquet

ce9f140

ttnghia reviewed Jul 12, 2022

ttnghia reviewed Jul 12, 2022

ttnghia reviewed Jul 12, 2022

ttnghia reviewed Jul 12, 2022

hyperbolic2346 requested changes Jul 12, 2022

ttnghia reviewed Jul 12, 2022

hyperbolic2346 reviewed Jul 12, 2022

revans2 added 2 commits July 14, 2022 09:48

Merge branch 'branch-22.08' into general_native_parquet

7ee749e

Addressed review comments

acd9be3

revans2 added 3 commits July 15, 2022 11:08

Refactoring and cleanup of the code

6af9401

Final cleanup

0d17fca

missed a delete

74ac2d3

ttnghia previously approved these changes Jul 15, 2022

Merge branch 'branch-22.08' into general_native_parquet

8d5c443

hyperbolic2346 reviewed Jul 18, 2022

Addressed more comments

955abaa

revans2 dismissed ttnghia’s stale review via 955abaa July 18, 2022 20:29

hyperbolic2346 approved these changes Jul 18, 2022

revans2 merged commit 80573b8 into NVIDIA:branch-22.08 Jul 18, 2022

revans2 deleted the general_native_parquet branch July 18, 2022 23:25
