-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add in better native parquet footer implementation and remove the old one #365
Conversation
Signed-off-by: Robert (Bobby) Evans <[email protected]>
… one Signed-off-by: Robert (Bobby) Evans <[email protected]>
build |
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments. I'll look at this more later, but wanted to get these down.
const char * tag_name = "unknown"; | ||
if (tag == VALUE_TAG) { | ||
tag_name = "value"; | ||
} else if (tag == STRUCT_TAG) { | ||
tag_name = "struct"; | ||
} else if (tag == LIST_TAG) { | ||
tag_name = "list"; | ||
} else if (tag == MAP_TAG) { | ||
tag_name = "map"; | ||
} | ||
return tag_name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const char * tag_name = "unknown"; | |
if (tag == VALUE_TAG) { | |
tag_name = "value"; | |
} else if (tag == STRUCT_TAG) { | |
tag_name = "struct"; | |
} else if (tag == LIST_TAG) { | |
tag_name = "list"; | |
} else if (tag == MAP_TAG) { | |
tag_name = "map"; | |
} | |
return tag_name; | |
switch(tag) { | |
case VALUE_TAG: return "value"; | |
case STRUCT_TAG: return "struct"; | |
case LIST_TAG: return "list"; | |
case MAP_TAG: return "map"; | |
default: return "unknown"; | |
} |
Maybe have an error on the default case? I'm not positive on the usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Humnn, maybe? So default: throw exception("Invalid tag");
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Get 3 C++ programmers in a room, you get 4 opinions. :p
std::string get_tag_name(const int tag)
{
static auto const names = std::vector{ "value", "struct", "list", "map" };
return ( i>=VALUE_TAG && i<=MAP_TAG ) ? names[tag] : "unknown";
}
I'm kidding, of course.
if (normalize_case) { | ||
return unicode_to_lower(elem.name); | ||
} else { | ||
return elem.name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (normalize_case) { | |
return unicode_to_lower(elem.name); | |
} else { | |
return elem.name; | |
return normalize_case ? unicode_to_lower(elem.name) : elem.name; |
if (found_it != children.end()) { | ||
return &(found_it->second); | ||
} | ||
return nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (found_it != children.end()) { | |
return &(found_it->second); | |
} | |
return nullptr; | |
return found_it != children.end() ? &(found_it->second) : nullptr; |
Would this be safer to not return a pointer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but I am not sure how to do it with the iterators in a clean way. Suggestions are welcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The solution here would be returning a reference instead of a pointer. But by doing so, this function must throw a not found exception if the element doesn't exist. As such, you can do something like:
const column_pruner& find_child(std::string name) {
auto found_it = children.find(name);
if (found_it == children.end()) {
throw exception(...);
}
return found_it->second;
}
if(exist_child(name)) {
auto const& child = find_child(name);
// process child
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my first thought was to simply return the iterator, but then you have the issue of the caller having to know the .end() of children, which is an implementation detail. I'm leaning towards an optional string return value right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Humm, I just realized that my solution above has duplicate computation: the exist_child()
check also needs to call children.find(name)
.
So how about this: using std::optional<std::reference_wrapper<column_pruner>>
, like:
#include <optional>
....
auto find_child(std::string name) {
auto found_it = children.find(name);
if (found_it == children.end()) {
return std::nullopt;
}
return std::optional<std::reference_wrapper<column_pruner>>{std::ref(found_it->second)};
}
...
//
const auto child_ref = find_child(name);
if(child_ref) { // the result is an optional, which can be checked like a boolean
const column_pruner& child = *child_ref;
// process child
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::optional does not work for references. Apparently it is a union and you cannot store a reference inside a union. At least that is the error message I got when I tried to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in order to use a reference in <optional>
you need to wrap it in reference_wrapper
. Here is an example: https://godbolt.org/z/o3MhMfczY
if (current_input_schema_index >= schema.size()) { | ||
throw std::runtime_error("walked off the end of the schema some how..."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this bound check is still relevant, as it will be checked again later when using that index to access.
void filter_schema(std::vector<parquet::format::SchemaElement> & schema, const bool ignore_case, | ||
std::size_t & current_input_schema_index, std::size_t & next_input_chunk_index, | ||
std::vector<int> & chunk_map, std::vector<int> & schema_map, std::vector<int> & schema_num_children) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A "good" practice is to split such giant function into multiple smaller functions (like filter_schema_value
, filter_schema_struct
, etc.). Not necessarily required here but it's recommended to do so.
In addition to my left comments, I have no idea if we can have unit tests in this repo? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c++ code could in general get a smattering of const. I can add some comments in places if desired. All told, this seems reasonable. I agree with Nghia's comments as well.
As for testing, we have tests for cpp and java. I would assume this would be tested via the java side only and those are setup to run in CI I believe.
The problem with tests is that the only way to test it is with parquet files. We could do some round trip tests and see that it works properly, but I found it was much simpler to test it in the plugin itself. I know there are some big downsides to this and if we really want to be alerted earlier if there is a regression I can add tests here. |
The critical thing for me is that it is tested regularly and hopefully in an automated way. If we don't know about a breakage until the spark level, I'm not that concerned as long as testing is done. |
Just pushed another commit that addresses everything except the ongoing discussion about getting the child and splitting up the one large function. I'll take a look at them tomorrow morning but I wanted to get something posted for what I had done so far. |
build |
@hyperbolic2346 and @ttnghia could you take another look |
build |
@hyperbolic2346 can you please take a look at the latest changes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few comments, nothing major. Looking good!
@hyperbolic2346 thanks for the new review I think I have addressed all of your comments. Please take another look |
build |
This uses the new API to provide a good implementation that fixes the issues older parquet files (legacy ones).
This depends on #362
I think this will fix NVIDIA/spark-rapids#5493 because it does not reorder the columns like the old implementation did.