Handle array of map and array of array case accurately in parquet reader #9728

hitarth · 2024-05-06T21:20:36Z

Handle the array-of-map and array-of-array cases correctly in the Parquet reader while parsing the schema. This should fix #9238.

netlify · 2024-05-06T21:20:53Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`796090a`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/66bf996a7a6e140008e4cda3

qqibrow · 2024-05-09T20:13:46Z

velox/dwio/parquet/reader/ParquetReader.cpp

@@ -328,6 +328,25 @@ std::unique_ptr<ParquetTypeWithId> ReaderBase::getParquetColumnInfo(
          // the "repeated info" which we are ignoring here, hence we set the
          // isRepeated to true eg
          // https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists
+          if (schemaElement.converted_type == thrift::ConvertedType::LIST &&
+              (child->type()->kind() == TypeKind::MAP ||


Could you share more details on why we need `|| schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED`? I don't see it in the original diff: https://github.com/facebookincubator/velox/pull/9278/files

Ah, this extra check is needed to handle nested arrays. The first check handles array(map) and the second handles array(array). When I created the previous diff, we had not yet fixed the array(array) case. Let me add a test case for that as well.

velox/dwio/parquet/reader/ParquetReader.cpp

yingsu00

Hi @hitarth Can you please move the comment as @qqibrow suggested and rebase? The test failures might have been resolved.

yingsu00 · 2024-05-23T22:50:20Z

velox/dwio/parquet/reader/ParquetReader.cpp

@@ -323,6 +323,25 @@ std::unique_ptr<ParquetTypeWithId> ReaderBase::getParquetColumnInfo(
          const auto& child = children[0];
          auto type = child->type();
          isRepeated = true;
+          if (schemaElement.converted_type == thrift::ConvertedType::LIST &&
+              (child->type()->kind() == TypeKind::MAP ||


I have the same question. Why test `schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED` instead of `if (schemaElement.converted_type == thrift::ConvertedType::LIST && (child->type()->kind() == TypeKind::MAP || child->type()->kind() == TypeKind::ARRAY))`?

We cannot check for `child->type()->kind() == TypeKind::ARRAY`; instead we need to check for `schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED`. This is because of the backward-compatibility rule mentioned here, where a repeated leaf node has a parent that is an optional element of type LIST.

For example:

// List<Integer> (nullable list, non-null elements)
optional group my_list (LIST) {
  repeated int32 element;
}

In such cases we create an ARRAY type for the leaf node (`repeated int32 element`) because it is repeated (see here in the code). So when schema parsing reaches the parent element (`optional group my_list (LIST)`), we don't want to create another ARRAY node, even though the element is of type LIST and its child is already an ARRAY.

For the example above, using `child->type()->kind() == TypeKind::ARRAY` as the condition would return the schema array(array(int)), which is incorrect. With `schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED` we get the correct schema, array(int).
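To make the difference between the two conditions concrete, here is a minimal toy model of the legacy two-level list. It is only a sketch: `ToyType`, `Repetition`, `Kind`, `wrapByChildKind`, `wrapByRepetition`, and `convertListGroup` are hypothetical names invented for illustration and are not the Velox or parquet-format definitions. It assumes, as described in this thread, that the repeated leaf has already been turned into an ARRAY before the LIST group is visited, and that a LIST group which is not wrapped simply adopts its child's type.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; NOT the real Thrift or Velox types.
enum class Repetition { Required, Optional, Repeated };
enum class Kind { Integer, Array, Map };

struct ToyType {
  Kind kind;
  std::vector<std::shared_ptr<const ToyType>> children;
};
using ToyTypePtr = std::shared_ptr<const ToyType>;

ToyTypePtr integerType() {
  return std::make_shared<ToyType>(ToyType{Kind::Integer, {}});
}

ToyTypePtr arrayOf(ToyTypePtr element) {
  assert(element != nullptr);
  return std::make_shared<ToyType>(ToyType{Kind::Array, {std::move(element)}});
}

std::string toString(const ToyTypePtr& type) {
  switch (type->kind) {
    case Kind::Integer:
      return "int";
    case Kind::Array:
      return "array(" + toString(type->children[0]) + ")";
    case Kind::Map:
      return "map";
  }
  return "unknown";
}

using WrapPredicate = bool (*)(Repetition, const ToyTypePtr&);

// Alternative raised in review: wrap when the child is a MAP or an ARRAY.
bool wrapByChildKind(Repetition /*listRepetition*/, const ToyTypePtr& child) {
  return child->kind == Kind::Map || child->kind == Kind::Array;
}

// Check described in this thread: wrap when the child is a MAP or the LIST
// element itself is REPEATED.
bool wrapByRepetition(Repetition listRepetition, const ToyTypePtr& child) {
  return child->kind == Kind::Map || listRepetition == Repetition::Repeated;
}

// Toy conversion of a LIST group with a single, already converted child.
ToyTypePtr convertListGroup(
    Repetition listRepetition, ToyTypePtr child, WrapPredicate shouldWrap) {
  return shouldWrap(listRepetition, child) ? arrayOf(std::move(child)) : child;
}

// optional group my_list (LIST) { repeated int32 element; }
// The repeated leaf is modeled as already being array(int).
void demoLegacyList() {
  const ToyTypePtr leaf = arrayOf(integerType());
  assert(toString(convertListGroup(Repetition::Optional, leaf, wrapByChildKind)) ==
         "array(array(int))");
  assert(toString(convertListGroup(Repetition::Optional, leaf, wrapByRepetition)) ==
         "array(int)");
}
```

In this model the child-kind predicate wraps the leaf a second time, while the repetition-based predicate leaves the optional LIST group as array(int), which matches the behavior described above.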

@hitarth Thanks for explaining. It seems the check should say: "If the current schemaElement is thrift::ConvertedType::LIST, and either its repetition_type is REPEATED or its only child schemaElement is REPEATED, then this node should be a single level or ARRAY". The check for whether the child is a MAP should then be covered by the condition that the child schemaElement is REPEATED. Is this correct?

I tried checking whether the child schemaElement is REPEATED, which could have covered both MAP and LIST. It worked for MAP but failed for a different LIST case. Hence I had to keep the explicit child MAP check along with the REPEATED check for the current element.

yingsu00

@hitarth I am wondering if we could simplify the logic a little, since we already do similar checking in the `if (schemaElement.repetition_type == thrift::FieldRepetitionType::REPEATED) {}` section. But we don't need to try it now.

yingsu00 · 2024-06-08T02:13:55Z

velox/dwio/parquet/reader/ParquetReader.cpp

@@ -324,6 +324,25 @@ std::unique_ptr<ParquetTypeWithId> ReaderBase::getParquetColumnInfo(
          const auto& child = children[0];
          auto type = child->type();
          isRepeated = true;
+          if (schemaElement.converted_type == thrift::ConvertedType::LIST &&


I think it would be easier to read if we moved this section into the `case thrift::ConvertedType::LIST:` section and removed the `schemaElement.converted_type == thrift::ConvertedType::LIST` check, since it only applies to lists. Add a `[[fallthrough]];` at the end of that section so that the more generic scenario is still handled by the existing code.

I have rearranged the code as suggested. Please take a look.
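The shape of the suggested restructuring can be sketched as follows. This is only an illustration under stated assumptions, not the code in ParquetReader.cpp: `ToyConvertedType`, `describeGroup`, and the returned strings are hypothetical, and `[[fallthrough]]` requires C++17.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for thrift::ConvertedType; not the real enum.
enum class ToyConvertedType { List, Map, Other };

// Describes how a group node would be typed in this toy model.
std::string describeGroup(
    ToyConvertedType convertedType, bool childIsMap, bool isRepeated) {
  switch (convertedType) {
    case ToyConvertedType::List:
      // LIST-only special case. No converted-type check is needed because
      // control is already inside the LIST label.
      if (childIsMap || isRepeated) {
        return "array(child)";
      }
      [[fallthrough]]; // The remaining LIST cases share the generic path.
    case ToyConvertedType::Map:
      return "child";
    case ToyConvertedType::Other:
      break;
  }
  return "other";
}

void demoFallthrough() {
  assert(describeGroup(ToyConvertedType::List, true, false) == "array(child)");
  assert(describeGroup(ToyConvertedType::List, false, false) == "child");
  assert(describeGroup(ToyConvertedType::Map, true, true) == "child");
}
```

Keeping the special case inside the LIST label removes the redundant converted-type test, and the explicit `[[fallthrough]]` documents that reaching the generic handling is intentional.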

velox/dwio/parquet/tests/reader/ParquetReaderTest.cpp

yingsu00 · 2024-06-08T02:37:08Z

velox/dwio/parquet/reader/ParquetReader.cpp

@@ -323,6 +323,25 @@ std::unique_ptr<ParquetTypeWithId> ReaderBase::getParquetColumnInfo(
          const auto& child = children[0];
          auto type = child->type();
          isRepeated = true;
+          if (schemaElement.converted_type == thrift::ConvertedType::LIST &&
+              (child->type()->kind() == TypeKind::MAP ||


@hitarth Thanks for explaining. It seems the check should say: "If the current schemaElement is thrift::ConvertedType::LIST, and either its repetition_type is REPEATED or its only child schemaElement is REPEATED, then this node should be a single level or ARRAY". The check for whether the child is a MAP should then be covered by the condition that the child schemaElement is REPEATED. Is this correct?

yingsu00 · 2024-07-12T15:53:20Z

velox/dwio/parquet/reader/ParquetReader.cpp

+          isRepeated = true;
+          // In case the child is a MAP or current element is repeated then
+          // wrap child around additional ARRAY
+          if (child->type()->kind() == TypeKind::MAP ||


I still see that the check tests whether the child is a MAP here. Shouldn't it check whether the child's repetition type is REPEATED instead?

I tried checking whether the child schemaElement is REPEATED, which could have covered both MAP and LIST. It worked for MAP but failed for a different LIST case, so I had to keep the explicit child MAP check along with the REPEATED check for the current element. I think this is because we create a special ARRAY element when the node is a leaf (here). In such cases we don't want to create a new ARRAY element at this level, hence the explicit MAP check. The backward-compatibility rules for LIST and MAP are tricky, which makes this piece of code harder to follow. There is scope to refactor it in the future.

@hitarth Actually, have you tried moving the leaf-node ARRAY construction to here?

Yes, I tried that. But it isn't trivial and would involve changes unrelated to this PR, so I was thinking of addressing it in a future refactor.

hitarth · 2024-08-16T20:08:50Z

@Yuhta can you please help merge this in?

facebook-github-bot · 2024-08-20T17:54:46Z

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-08-21T18:33:22Z

@kgpai merged this pull request in 74a3183.

conbench-facebook · 2024-08-21T19:00:51Z

Conbench analyzed the 1 benchmark run on commit 74a31830.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

facebook-github-bot added the CLA Signed label May 6, 2024

hitarth changed the title ~~Fix array of map kind~~ Handle array of map case accurately in parquet reader May 6, 2024

This was referenced May 6, 2024

Summary of Parquet reader Issues #9560

Open

Handle list of map case accurately #9278

Closed

hitarth marked this pull request as ready for review May 6, 2024 22:19

hitarth added the parquet label May 6, 2024

qqibrow reviewed May 9, 2024

hitarth force-pushed the 9238_1 branch 2 times, most recently from 21345ae to ba79c25 Compare May 9, 2024 22:35

hitarth changed the title ~~Handle array of map case accurately in parquet reader~~ Handle array of map and array of array case accurately in parquet reader May 9, 2024

yingsu00 self-requested a review May 14, 2024 01:12

yingsu00 reviewed May 23, 2024

hitarth force-pushed the 9238_1 branch from ba79c25 to 021f343 Compare June 7, 2024 17:17

yingsu00 reviewed Jun 8, 2024

hitarth force-pushed the 9238_1 branch 2 times, most recently from 908bc14 to 37a9b2e Compare June 12, 2024 20:47

yingsu00 reviewed Jul 12, 2024

majetideepak self-assigned this Jul 12, 2024

yingsu00 approved these changes Aug 2, 2024

Fix array of map and array of array kind in parquet reader

796090a

hitarth force-pushed the 9238_1 branch from 37a9b2e to 796090a Compare August 16, 2024 18:24

Yuhta approved these changes Aug 16, 2024

Yuhta added the ready-to-merge label Aug 16, 2024

facebook-github-bot closed this in 74a3183 Aug 21, 2024

facebook-github-bot added the Merged label Aug 21, 2024
