Add in better native parquet footer implementation and remove the old one #365

Merged: 13 commits, Jul 18, 2022

Conversation

revans2
Collaborator

@revans2 revans2 commented Jul 11, 2022

This uses the new API to provide a better implementation that fixes the issues with older (legacy) parquet files.

This depends on #362

I think this will fix NVIDIA/spark-rapids#5493 because it does not reorder the columns like the old implementation did.

@revans2 revans2 added the bug Something isn't working label Jul 11, 2022
@revans2 revans2 self-assigned this Jul 11, 2022
@revans2 revans2 marked this pull request as ready for review July 11, 2022 20:33
@revans2
Collaborator Author

revans2 commented Jul 11, 2022

build

@revans2
Collaborator Author

revans2 commented Jul 12, 2022

build

Collaborator

@hyperbolic2346 hyperbolic2346 left a comment

A few comments. I'll look at this more later, but wanted to get these down.

Comment on lines 102 to 112
const char * tag_name = "unknown";
if (tag == VALUE_TAG) {
  tag_name = "value";
} else if (tag == STRUCT_TAG) {
  tag_name = "struct";
} else if (tag == LIST_TAG) {
  tag_name = "list";
} else if (tag == MAP_TAG) {
  tag_name = "map";
}
return tag_name;
Collaborator

Suggested change
- const char * tag_name = "unknown";
- if (tag == VALUE_TAG) {
-   tag_name = "value";
- } else if (tag == STRUCT_TAG) {
-   tag_name = "struct";
- } else if (tag == LIST_TAG) {
-   tag_name = "list";
- } else if (tag == MAP_TAG) {
-   tag_name = "map";
- }
- return tag_name;
+ switch(tag) {
+   case VALUE_TAG: return "value";
+   case STRUCT_TAG: return "struct";
+   case LIST_TAG: return "list";
+   case MAP_TAG: return "map";
+   default: return "unknown";
+ }

Maybe have an error on the default case? I'm not positive on the usage.

Collaborator

@ttnghia ttnghia Jul 12, 2022

Hmm, maybe? So default: throw exception("Invalid tag");?
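
For reference, a minimal sketch of the switch-based version with a throwing default case, along the lines floated above. The tag values and the choice of std::invalid_argument are assumptions for illustration; this thread does not show how the tags are defined or which exception type the project prefers.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical tag values; the real constants are defined in the PR's source.
constexpr int VALUE_TAG  = 0;
constexpr int STRUCT_TAG = 1;
constexpr int LIST_TAG   = 2;
constexpr int MAP_TAG    = 3;

// Maps a tag to a printable name and rejects anything unrecognized.
inline char const* get_tag_name(int tag) {
  switch (tag) {
    case VALUE_TAG: return "value";
    case STRUCT_TAG: return "struct";
    case LIST_TAG: return "list";
    case MAP_TAG: return "map";
    default: throw std::invalid_argument("Invalid tag: " + std::to_string(tag));
  }
}
```

Whether throwing is the right call depends on usage: if the name is only ever used while building an error message, returning "unknown" may be the friendlier choice.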

Collaborator

Get 3 C++ programmers in a room, you get 4 opinions. :p

std::string get_tag_name(const int tag)
{
  static auto const names = std::vector{ "value", "struct", "list", "map" };
  return ( tag>=VALUE_TAG && tag<=MAP_TAG ) ? names[tag] : "unknown";
}

I'm kidding, of course.

Comment on lines 161 to 164
if (normalize_case) {
  return unicode_to_lower(elem.name);
} else {
  return elem.name;
Collaborator

Suggested change
- if (normalize_case) {
-   return unicode_to_lower(elem.name);
- } else {
-   return elem.name;
+ return normalize_case ? unicode_to_lower(elem.name) : elem.name;

Comment on lines 170 to 173
if (found_it != children.end()) {
  return &(found_it->second);
}
return nullptr;
Collaborator

Suggested change
- if (found_it != children.end()) {
-   return &(found_it->second);
- }
- return nullptr;
+ return found_it != children.end() ? &(found_it->second) : nullptr;

Would it be safer not to return a pointer?

Collaborator Author

Yes, but I am not sure how to do it with the iterators in a clean way. Suggestions are welcome.

Collaborator

@ttnghia ttnghia Jul 14, 2022

The solution here would be returning a reference instead of a pointer. But by doing so, this function must throw a not found exception if the element doesn't exist. As such, you can do something like:

const column_pruner& find_child(std::string name) {
  auto found_it = children.find(name);
  if (found_it == children.end()) {
    throw exception(...);
  }
  return found_it->second;
}

if (exist_child(name)) {
  auto const& child = find_child(name);
  // process child
}

Collaborator

My first thought was to simply return the iterator, but then the caller has to know the .end() of children, which is an implementation detail. I'm leaning towards an optional string return value right now.

Collaborator

@ttnghia ttnghia Jul 14, 2022

Hmm, I just realized that my solution above duplicates computation: the exist_child() check also needs to call children.find(name).

So how about this: using std::optional<std::reference_wrapper<column_pruner>>, like:

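#include <functional>  // std::reference_wrapper, std::ref
#include <optional>
....
// The return type is spelled out: with plain `auto`, the two return statements
// would deduce different types and the function would not compile.
std::optional<std::reference_wrapper<column_pruner>> find_child(std::string name) {
  auto found_it = children.find(name);
  if (found_it == children.end()) {
    return std::nullopt;
  }
  return std::optional<std::reference_wrapper<column_pruner>>{std::ref(found_it->second)};
}

...
const auto child_ref = find_child(name);
if (child_ref) {  // the result is an optional, which can be checked like a boolean
  const column_pruner& child = child_ref->get();
  // process child
}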

Collaborator Author

std::optional does not work for references. Apparently it is a union and you cannot store a reference inside a union. At least that is the error message I got when I tried to use it.

Collaborator

Yes, in order to use a reference in <optional> you need to wrap it in reference_wrapper. Here is an example: https://godbolt.org/z/o3MhMfczY
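
In case the link goes stale, here is a self-contained sketch of the same idea. The column_pruner struct and the child_map alias are simplified stand-ins invented for this example, not the types from the PR, and find_child is written as a free function only to keep the snippet short.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>

// Simplified stand-in for the PR's column_pruner; only a tag is modeled.
struct column_pruner {
  int tag = 0;
};

// Hypothetical alias for the container that holds the children.
using child_map = std::map<std::string, column_pruner>;

// Returns a reference to the named child, or an empty optional if it is absent.
// std::optional<T&> is not allowed, so the reference is wrapped.
inline std::optional<std::reference_wrapper<column_pruner const>> find_child(
    child_map const& children, std::string const& name) {
  auto const found_it = children.find(name);
  if (found_it == children.end()) { return std::nullopt; }
  return std::cref(found_it->second);
}
```

A caller checks the optional once and then uses `result->get()` to reach the child, so the map is searched only a single time and no raw pointer leaks out of the lookup.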

src/main/cpp/src/NativeParquetJni.cpp (outdated, resolved)
Comment on lines 200 to 201
if (current_input_schema_index >= schema.size()) {
  throw std::runtime_error("walked off the end of the schema some how...");
Collaborator

I'm not sure this bounds check is still needed, since the index will be checked again later when it is used for the access.
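
To make the distinction concrete: the explicit check is only redundant if the later access goes through std::vector::at, which throws std::out_of_range, whereas operator[] performs no check at all. A small sketch of the checked form, using a plain vector of ints as a stand-in for the schema (the excerpt here does not show which form the code uses); at_rejects is a name made up for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true when reading `index` through at() is rejected as out of range.
inline bool at_rejects(std::vector<int> const& schema, std::size_t index) {
  try {
    (void)schema.at(index);  // checked access: throws std::out_of_range
    return false;
  } catch (std::out_of_range const&) {
    return true;
  }
}
```

One argument for keeping the explicit check anyway is the error message: a hand-written std::runtime_error can say what the code was doing, while the message from at() is implementation-defined.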

Comment on lines 197 to 199
void filter_schema(std::vector<parquet::format::SchemaElement> & schema, const bool ignore_case,
                   std::size_t & current_input_schema_index, std::size_t & next_input_chunk_index,
                   std::vector<int> & chunk_map, std::vector<int> & schema_map, std::vector<int> & schema_num_children) {
Collaborator

A "good" practice is to split such a giant function into multiple smaller functions (like filter_schema_value, filter_schema_struct, etc.). It's not required here, but it is recommended.
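
A rough sketch of what that split could look like. Everything here is hypothetical: the tag values, the filter_state struct that bundles the by-reference parameters, and especially the helper bodies, which are placeholders and not the PR's actual logic. Only the shape (one helper per node kind plus a thin dispatcher) is the point.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical tag values; the real definitions are not shown in this thread.
constexpr int VALUE_TAG  = 0;
constexpr int STRUCT_TAG = 1;

// Bundles the by-reference parameters of filter_schema into one object.
struct filter_state {
  std::size_t current_input_schema_index = 0;
  std::size_t next_input_chunk_index     = 0;
  std::vector<int> chunk_map;
  std::vector<int> schema_map;
  std::vector<int> schema_num_children;
};

// Placeholder body: records one leaf column and advances both cursors.
inline void filter_schema_value(filter_state& state) {
  state.schema_map.push_back(static_cast<int>(state.current_input_schema_index++));
  state.chunk_map.push_back(static_cast<int>(state.next_input_chunk_index++));
  state.schema_num_children.push_back(0);
}

// Placeholder body: records the struct node itself; child handling is omitted.
inline void filter_schema_struct(filter_state& state) {
  state.schema_map.push_back(static_cast<int>(state.current_input_schema_index++));
  state.schema_num_children.push_back(0);
}

// The top-level function shrinks to a dispatch on the node kind.
inline void filter_schema(int tag, filter_state& state) {
  switch (tag) {
    case VALUE_TAG: filter_schema_value(state); break;
    case STRUCT_TAG: filter_schema_struct(state); break;
    default: throw std::runtime_error("unsupported tag");
  }
}
```

Grouping the cursors and output maps into one struct also shortens every signature, which tends to make the per-kind helpers easier to read and to test in isolation.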

@ttnghia
Collaborator

ttnghia commented Jul 12, 2022

In addition to the comments I left, I'm not sure whether we can have unit tests in this repo?

Collaborator

@hyperbolic2346 hyperbolic2346 left a comment

The C++ code could in general use a smattering of const. I can add some comments in places if desired. All told, this seems reasonable. I agree with Nghia's comments as well.

As for testing, we have tests for cpp and java. I would assume this would be tested via the java side only, and those tests are set up to run in CI, I believe.

@revans2
Collaborator Author

revans2 commented Jul 14, 2022

The problem with tests is that the only way to test it is with parquet files. We could do some round trip tests and see that it works properly, but I found it was much simpler to test it in the plugin itself. I know there are some big downsides to this and if we really want to be alerted earlier if there is a regression I can add tests here.

@hyperbolic2346
Collaborator

I know there are some big downsides to this and if we really want to be alerted earlier if there is a regression I can add tests here.

The critical thing for me is that it is tested regularly and hopefully in an automated way. If we don't know about a breakage until the spark level, I'm not that concerned as long as testing is done.

@revans2
Collaborator Author

revans2 commented Jul 14, 2022

Just pushed another commit that addresses everything except the ongoing discussion about getting the child and splitting up the one large function. I'll take a look at them tomorrow morning but I wanted to get something posted for what I had done so far.

@revans2
Collaborator Author

revans2 commented Jul 15, 2022

build

@revans2
Collaborator Author

revans2 commented Jul 15, 2022

@hyperbolic2346 and @ttnghia could you take another look?

ttnghia
ttnghia previously approved these changes Jul 15, 2022
@revans2
Collaborator Author

revans2 commented Jul 18, 2022

build

@revans2
Collaborator Author

revans2 commented Jul 18, 2022

@hyperbolic2346 can you please take a look at the latest changes?

Collaborator

@hyperbolic2346 hyperbolic2346 left a comment

Few comments, nothing major. Looking good!

src/main/cpp/src/NativeParquetJni.cpp (outdated, resolved)
src/main/cpp/src/NativeParquetJni.cpp (outdated, resolved)
@revans2
Collaborator Author

revans2 commented Jul 18, 2022

@hyperbolic2346 thanks for the new review. I think I have addressed all of your comments. Please take another look.

@revans2
Collaborator Author

revans2 commented Jul 18, 2022

build

@revans2 revans2 merged commit 80573b8 into NVIDIA:branch-22.08 Jul 18, 2022
@revans2 revans2 deleted the general_native_parquet branch July 18, 2022 23:25
Successfully merging this pull request may close these issues.

[BUG] test_parquet_read_merge_schema failed w/ TITAN V
4 participants