Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle array of map and array of array case accurately in parquet reader #9728

Closed
wants to merge 1 commit into from

Conversation

hitarth
Copy link
Collaborator

@hitarth hitarth commented May 6, 2024

Handle array of map and array of array case accurately in parquet reader while parsing schema. This should fix #9238

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 6, 2024
Copy link

netlify bot commented May 6, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 796090a
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/66bf996a7a6e140008e4cda3

@hitarth hitarth changed the title Fix array of map kind Handle array of map case accurately in parquet reader May 6, 2024
@hitarth hitarth marked this pull request as ready for review May 6, 2024 22:19
@hitarth hitarth added the parquet label May 6, 2024
@@ -328,6 +328,25 @@ std::unique_ptr<ParquetTypeWithId> ReaderBase::getParquetColumnInfo(
// the "repeated info" which we are ignoring here, hence we set the
// isRepeated to true eg
// https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists
if (schemaElement.converted_type == thrift::ConvertedType::LIST &&
(child->type()->kind() == TypeKind::MAP ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you share more details why we need

|| schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED

? I don't see it in original diff? https://github.com/facebookincubator/velox/pull/9278/files

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh this extra check is needed to handle nested array cases. So the first check handles array(map) and the second check handles array(array). When I created the previous diff we had not yet fixed array(array) case. Let me add a test case for that as well.

@hitarth hitarth force-pushed the 9238_1 branch 2 times, most recently from 21345ae to ba79c25 Compare May 9, 2024 22:35
@hitarth hitarth changed the title Handle array of map case accurately in parquet reader Handle array of map and array of array case accurately in parquet reader May 9, 2024
@yingsu00 yingsu00 self-requested a review May 14, 2024 01:12
Copy link
Collaborator

@yingsu00 yingsu00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hitarth Can you please move the comment as @qqibrow suggested and rebase? The test failures might have been resolved.

@@ -323,6 +323,25 @@ std::unique_ptr<ParquetTypeWithId> ReaderBase::getParquetColumnInfo(
const auto& child = children[0];
auto type = child->type();
isRepeated = true;
if (schemaElement.converted_type == thrift::ConvertedType::LIST &&
(child->type()->kind() == TypeKind::MAP ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same question. Why did you test schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED instead of
if (schemaElement.converted_type == thrift::ConvertedType::LIST && ((child->type()->kind() == TypeKind::MAP || (child->type()->kind() == TypeKind::ARRAY))?

Copy link
Collaborator Author

@hitarth hitarth Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot check for child->type()->kind() == TypeKind::ARRAY) and instead we need to check for schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED because of backward compatibility rule number mentioned here where we have a leaf repeated node and its parent being an optional element of type LIST.

For example :

// List<Integer> (nullable list, non-null elements)
optional group my_list (LIST) {
  repeated int32 element;
}

In such cases we create an ARRAY Type for leaf node ( repeated int32 element ) since it is repeated here in the code. So during schema parsing when we parsing the schema element ( optional group my_list (LIST) ) , even though it is an element of type LIST and its child is of Type LIST we don't want to create another ARRAY type node here.

So for above example with child->type()->kind() == TypeKind::ARRAY as condition it would return schema as array(array(int)) which would be incorrect. But with schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED we would get the right schema back as array(int)

Copy link
Collaborator

@yingsu00 yingsu00 Jun 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hitarth Thanks for explaining. It seems the check should be saying "If the current schemaElement is thrift::ConvertedType::LIST, and (either its repetition_type is REPEATED, or its only child schemaElement is REPEATED), then this node should be a single level or ARRAY". The checking of the child element is MAP should be contained in the condition that the child schemaElement is REPEATED. Is this correct?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to check for child schemaElement being REPEATED which could have taken care of both MAP and LIST, it did work for MAP but failing for different LIST case. Hence I had to keep the exclusive child MAP check along with REPEATED check for current element

Copy link
Collaborator

@yingsu00 yingsu00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hitarth I am wondering if we could simplify the logic a little bit, since we already did some similar checking in the if (schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED) {} section. But we don't need to try it now.

@@ -324,6 +324,25 @@ std::unique_ptr<ParquetTypeWithId> ReaderBase::getParquetColumnInfo(
const auto& child = children[0];
auto type = child->type();
isRepeated = true;
if (schemaElement.converted_type == thrift::ConvertedType::LIST &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be easier to read if we move this section to just the case thrift::ConvertedType::LIST: section and remove the schemaElement.converted_type == thrift::ConvertedType::LIST check there, since it's only for lists. Add a [[fallthrough]]; at the end in that section so that the more generic scenario will still be handled with existing code

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have rearranged the code as suggested please take a look.

@@ -323,6 +323,25 @@ std::unique_ptr<ParquetTypeWithId> ReaderBase::getParquetColumnInfo(
const auto& child = children[0];
auto type = child->type();
isRepeated = true;
if (schemaElement.converted_type == thrift::ConvertedType::LIST &&
(child->type()->kind() == TypeKind::MAP ||
Copy link
Collaborator

@yingsu00 yingsu00 Jun 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hitarth Thanks for explaining. It seems the check should be saying "If the current schemaElement is thrift::ConvertedType::LIST, and (either its repetition_type is REPEATED, or its only child schemaElement is REPEATED), then this node should be a single level or ARRAY". The checking of the child element is MAP should be contained in the condition that the child schemaElement is REPEATED. Is this correct?

@hitarth hitarth force-pushed the 9238_1 branch 2 times, most recently from 908bc14 to 37a9b2e Compare June 12, 2024 20:47
isRepeated = true;
// In case the child is a MAP or current element is repeated then
// wrap child around additional ARRAY
if (child->type()->kind() == TypeKind::MAP ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see the check is testing if the child is a MAP here. Do you think it should not be checking the child's repeatetion type is repeated?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to check for child schemaElement being REPEATED which could have taken care of both MAP and LIST, it did work for MAP but failing for different LIST case. Hence I had to keep the exclusive child MAP check along with REPEATED check for current element. I think this is because we create a special ARRAY element when it is leaf node here . In such cases we don't want to create new ARRAY element at this level, hence exclusive MAP check here. The backward compatibility rules are tricky for LIST and MAP making this piece of code harder to fathom. There is a scope to refactor this code in future.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hitarth Actually, have you tried to move the the leaf node Array construction to here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I had tried to do that. But it isn't trivial and would involve changes which are not related with this PR. Hence was thinking of addressing this in future refactor.

@majetideepak majetideepak self-assigned this Jul 12, 2024
@hitarth
Copy link
Collaborator Author

hitarth commented Aug 16, 2024

@Yuhta can you please help merge this in ?

@Yuhta Yuhta added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Aug 16, 2024
@facebook-github-bot
Copy link
Contributor

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@kgpai merged this pull request in 74a3183.

Copy link

Conbench analyzed the 1 benchmark run on commit 74a31830.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged parquet ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parquet reader error: presetNullsConsumed_ == presetNullsSize_
6 participants