Use C++ to parse and filter parquet footers. #199
Conversation
Signed-off-by: Robert (Bobby) Evans <[email protected]>
build
std::vector<int> num_children_stack;
std::vector<column_pruner*> tree_stack;
tree_stack.push_back(this);
num_children_stack.push_back(schema[0].num_children);
are there any odd empty schemas in parquet we should guard against?
It is part of the standard that there must be a root to the schema. If you want me to add an extra check and throw a clearer exception, I can.
It would be nice just because there doesn't seem to be bounds checking here.
still pending.
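For reference, a minimal sketch of the kind of guard being discussed, assuming a simplified stand-in for the thrift SchemaElement type; the exception type and message here are illustrative, not necessarily what this PR ends up throwing:

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical simplified stand-in for the thrift SchemaElement; only the
// field used by the traversal is shown here.
struct schema_element {
  int num_children;
};

// Verify the schema has the root element the Parquet spec requires before the
// traversal indexes schema[0], and fail with a clear message otherwise.
void check_schema_has_root(std::vector<schema_element> const& schema) {
  if (schema.empty()) {
    throw std::invalid_argument("malformed parquet footer: schema has no root element");
  }
}
```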
build
build
@pxLi could you take a look at the CI failures? I am in over my head with the errors at this point.
I am also running into some issues with the Java Parquet parser if there are no columns in the footer. I am going to have to make a special-case bypass for the empty read schema case, where we just get the number of rows from the matching row groups. But I still need to figure out how to fit it all together.
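A minimal sketch of what that bypass could boil down to on the native side, assuming a simplified stand-in for the thrift RowGroup type (not the actual interface this PR adds):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical simplified stand-in for parquet::format::RowGroup; only the
// field needed for the empty-read-schema case is shown.
struct row_group {
  int64_t num_rows;
};

// With no columns requested, the only thing the footer has to provide is the
// total row count of the row groups that belong to this split.
int64_t total_rows(std::vector<row_group> const& matching_groups) {
  int64_t rows = 0;
  for (auto const& rg : matching_groups) {
    rows += rg.num_rows;
  }
  return rows;
}
```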
build
Just realized this change includes some Dockerfile changes. I will adjust some internal setup to meet the requirements.
ci/Dockerfile
@@ -27,7 +27,6 @@ FROM gpuci/cuda:$CUDA_VERSION-devel-centos7
RUN yum install -y centos-release-scl
RUN yum install -y devtoolset-9 rh-python38 epel-release
RUN yum install -y zlib-devel maven tar wget patch ninja-build
RUN yum -y install https://packages.endpoint.com/rhel/7/os/x86_64/endpoint-repo-1.7-1.x86_64.rpm && yum install -y git
The CI also requires git to do some work, otherwise it will fail with:
ci/premerge-build.sh: line 22: git: command not found
Seeing that the original repo is invalid. Our CI requires a newer git version than the default yum package provides; let me try to find an available one.
Merged the yum repo fix #208, and also fixed the internal pipeline to cover the Dockerfile change cases in this PR.
Please upmerge your branch and re-trigger the build, thanks~
Moving back to draft. I have found a number of incorrect assumptions I made about column pruning and the structure of the data. I am going to have to rewrite a lot of that code to make it generic enough to match things the way we want them.
build
build
After talking this over with others, I am going to put this in as-is and then we can improve it later on.
build
build
@abellina I think I have addressed most if not all of your review comments. Please take another look.
All nits at this point that could also be handled later.
*/
std::string unicode_to_lower(std::string const& input) {
  // get the size of the wide character result
  std::size_t wide_size = std::mbstowcs(nullptr, input.data(), 0);
Suggested change (whitespace only):
std::size_t wide_size = std::mbstowcs(nullptr, input.data(), 0);
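For context, a sketch of the usual mbstowcs/towlower/wcstombs round trip this helper appears to be built on (a sketch only, dependent on the process locale, and not the PR's exact implementation):

```cpp
#include <cstdlib>   // std::mbstowcs, std::wcstombs
#include <cwctype>   // std::towlower
#include <string>

std::string unicode_to_lower_sketch(std::string const& input) {
  // get the size of the wide character result
  std::size_t wide_size = std::mbstowcs(nullptr, input.data(), 0);
  if (wide_size == static_cast<std::size_t>(-1)) {
    return input;  // invalid multibyte sequence for the current locale; give up
  }
  // widen, lower-case each wide character, then narrow back
  std::wstring wide(wide_size, L'\0');
  std::mbstowcs(&wide[0], input.data(), wide_size);
  for (auto& wc : wide) {
    wc = static_cast<wchar_t>(std::towlower(static_cast<std::wint_t>(wc)));
  }
  std::size_t narrow_size = std::wcstombs(nullptr, wide.c_str(), 0);
  std::string result(narrow_size, '\0');
  std::wcstombs(&result[0], wide.c_str(), narrow_size);
  return result;
}
```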
// go back up the stack/tree removing children until we hit one with more children
bool done = false;
while (!done) {
  int parent_children_left = num_children_stack.back() - 1;
nit: 2 space indentation.
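For readers following the traversal, this is the "walk back up" step of a pre-order walk driven by a stack of remaining-children counts; a standalone sketch of how such a step typically works (assumptions about the surrounding code, not this PR's exact logic):

```cpp
#include <vector>

// When a subtree has been fully visited, decrement the parent's remaining-children
// count; parents that reach zero are finished too and get popped, until we find an
// ancestor that still has children left to visit (or the stacks run out).
template <typename Node>
void pop_finished_ancestors(std::vector<int>& num_children_stack,
                            std::vector<Node*>& tree_stack) {
  bool done = false;
  while (!done && !num_children_stack.empty()) {
    int parent_children_left = num_children_stack.back() - 1;
    if (parent_children_left > 0) {
      num_children_stack.back() = parent_children_left;  // more siblings to visit
      done = true;
    } else {
      num_children_stack.pop_back();  // this ancestor is finished as well
      tree_stack.pop_back();
    }
  }
}
```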
*/
class column_pruner {
public:
  /**
nit: 2 space indentation.
static std::vector<parquet::format::RowGroup> filter_groups(parquet::format::FileMetaData const& meta,
                                                            int64_t part_offset, int64_t part_length) {
  CUDF_FUNC_RANGE();
nit: the indentation is off in this function.
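For context on what filter_groups is doing, the conventional way Parquet readers assign a row group to exactly one split is a midpoint test against the split's byte range; below is a sketch with a hypothetical simplified row-group struct (the real code reads these values from the thrift metadata):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical simplified view of a row group; the actual implementation pulls
// these values out of parquet::format::RowGroup / ColumnChunk metadata.
struct row_group_info {
  int64_t start_offset;     // file offset where the row group's data begins
  int64_t total_byte_size;  // total size of the row group's data, in bytes
};

// Keep a row group when its midpoint lands in [part_offset, part_offset + part_length),
// so each row group is picked up by exactly one split.
std::vector<row_group_info> filter_groups_sketch(std::vector<row_group_info> const& groups,
                                                 int64_t part_offset, int64_t part_length) {
  std::vector<row_group_info> kept;
  for (auto const& rg : groups) {
    int64_t const mid = rg.start_offset + rg.total_byte_size / 2;
    if (mid >= part_offset && mid < part_offset + part_length) {
      kept.push_back(rg);
    }
  }
  return kept;
}
```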
@jlowe could you please take another look?
build
We have seen that for some Parquet files with a lot of columns, reading the footer can be the bottleneck. This will help to fix that, but it is just a first step. It does not include predicate push down yet, only range filtering for row groups that fall into a given split, plus column pruning. It still needs a lot of documentation and tests, but I wanted to get the code up sooner rather than later.