-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle array of map and array of array case accurately in parquet reader #9728
Conversation
✅ Deploy Preview for meta-velox canceled.
|
@@ -328,6 +328,25 @@ std::unique_ptr<ParquetTypeWithId> ReaderBase::getParquetColumnInfo( | |||
// the "repeated info" which we are ignoring here, hence we set the | |||
// isRepeated to true eg | |||
// https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists | |||
if (schemaElement.converted_type == thrift::ConvertedType::LIST && | |||
(child->type()->kind() == TypeKind::MAP || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you share more details why we need
|| schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED
? I don't see it in original diff? https://github.com/facebookincubator/velox/pull/9278/files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh this extra check is needed to handle nested array cases. So the first check handles array(map) and the second check handles array(array). When I created the previous diff we had not yet fixed array(array) case. Let me add a test case for that as well.
21345ae
to
ba79c25
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -323,6 +323,25 @@ std::unique_ptr<ParquetTypeWithId> ReaderBase::getParquetColumnInfo( | |||
const auto& child = children[0]; | |||
auto type = child->type(); | |||
isRepeated = true; | |||
if (schemaElement.converted_type == thrift::ConvertedType::LIST && | |||
(child->type()->kind() == TypeKind::MAP || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the same question. Why did you test schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED
instead of
if (schemaElement.converted_type == thrift::ConvertedType::LIST && ((child->type()->kind() == TypeKind::MAP || (child->type()->kind() == TypeKind::ARRAY))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot check for child->type()->kind() == TypeKind::ARRAY)
and instead we need to check for schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED
because of backward compatibility rule number mentioned here where we have a leaf repeated node and its parent being an optional element of type LIST.
For example :
// List<Integer> (nullable list, non-null elements)
optional group my_list (LIST) {
repeated int32 element;
}
In such cases we create an ARRAY Type for leaf node ( repeated int32 element
) since it is repeated here in the code. So during schema parsing when we parsing the schema element ( optional group my_list (LIST)
) , even though it is an element of type LIST and its child is of Type LIST we don't want to create another ARRAY type node here.
So for above example with child->type()->kind() == TypeKind::ARRAY
as condition it would return schema as array(array(int))
which would be incorrect. But with schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED
we would get the right schema back as array(int)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hitarth Thanks for explaining. It seems the check should be saying "If the current schemaElement is thrift::ConvertedType::LIST, and (either its repetition_type is REPEATED, or its only child schemaElement is REPEATED), then this node should be a single level or ARRAY". The checking of the child element is MAP should be contained in the condition that the child schemaElement is REPEATED. Is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to check for child schemaElement being REPEATED which could have taken care of both MAP and LIST, it did work for MAP but failing for different LIST case. Hence I had to keep the exclusive child MAP check along with REPEATED check for current element
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hitarth I am wondering if we could simplify the logic a little bit, since we already did some similar checking in the if (schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED) {}
section. But we don't need to try it now.
@@ -324,6 +324,25 @@ std::unique_ptr<ParquetTypeWithId> ReaderBase::getParquetColumnInfo( | |||
const auto& child = children[0]; | |||
auto type = child->type(); | |||
isRepeated = true; | |||
if (schemaElement.converted_type == thrift::ConvertedType::LIST && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be easier to read if we move this section to just the case thrift::ConvertedType::LIST:
section and remove the schemaElement.converted_type == thrift::ConvertedType::LIST
check there, since it's only for lists. Add a [[fallthrough]];
at the end in that section so that the more generic scenario will still be handled with existing code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have rearranged the code as suggested please take a look.
@@ -323,6 +323,25 @@ std::unique_ptr<ParquetTypeWithId> ReaderBase::getParquetColumnInfo( | |||
const auto& child = children[0]; | |||
auto type = child->type(); | |||
isRepeated = true; | |||
if (schemaElement.converted_type == thrift::ConvertedType::LIST && | |||
(child->type()->kind() == TypeKind::MAP || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hitarth Thanks for explaining. It seems the check should be saying "If the current schemaElement is thrift::ConvertedType::LIST, and (either its repetition_type is REPEATED, or its only child schemaElement is REPEATED), then this node should be a single level or ARRAY". The checking of the child element is MAP should be contained in the condition that the child schemaElement is REPEATED. Is this correct?
908bc14
to
37a9b2e
Compare
isRepeated = true; | ||
// In case the child is a MAP or current element is repeated then | ||
// wrap child around additional ARRAY | ||
if (child->type()->kind() == TypeKind::MAP || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still see the check is testing if the child is a MAP here. Do you think it should not be checking the child's repeatetion type is repeated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to check for child schemaElement being REPEATED which could have taken care of both MAP and LIST, it did work for MAP but failing for different LIST case. Hence I had to keep the exclusive child MAP check along with REPEATED check for current element. I think this is because we create a special ARRAY element when it is leaf node here . In such cases we don't want to create new ARRAY element at this level, hence exclusive MAP check here. The backward compatibility rules are tricky for LIST and MAP making this piece of code harder to fathom. There is a scope to refactor this code in future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hitarth Actually, have you tried to move the the leaf node Array construction to here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I had tried to do that. But it isn't trivial and would involve changes which are not related with this PR. Hence was thinking of addressing this in future refactor.
@Yuhta can you please help merge this in ? |
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Handle array of map and array of array case accurately in parquet reader while parsing schema. This should fix #9238