Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41055: [C++] Support flatten for combining nested list related types #41092

Merged
merged 6 commits into from
Apr 15, 2024

Conversation

ZhangHuiGui
Copy link
Collaborator

@ZhangHuiGui ZhangHuiGui commented Apr 9, 2024

Rationale for this change

Support flatten for combining nested list related types.

What changes are included in this PR?

Add the recursively flatten function for auto detect and flatten the combining nested list types.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, user can flatten a combining nested-list or related array by use Flatten API.

Copy link

github-actions bot commented Apr 9, 2024

⚠️ GitHub issue #41055 has been automatically assigned in GitHub to PR creator.

@@ -925,10 +969,12 @@ TYPED_TEST(TestListArray, BuilderPreserveFieldName) {
TYPED_TEST(TestListArray, FlattenSimple) { this->TestFlattenSimple(); }
TYPED_TEST(TestListArray, FlattenNulls) { this->TestFlattenNulls(); }
TYPED_TEST(TestListArray, FlattenAllEmpty) { this->TestFlattenAllEmpty(); }
TYPED_TEST(TestListArray, FlattenSliced) { this->TestFlattenSliced(); }
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TEST was forgotten to be added.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 9, 2024
@ZhangHuiGui
Copy link
Collaborator Author

@felipecrv PTAL, thanks!

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this better a new API?

return FlattenListArray(checked_cast<const FixedSizeListArray&>(*varr),
memory_pool);
default:
return Status::Invalid("Unknown or unsupported arrow nested type: ",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about list view?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List view related types are in FlattenListViewArray path.
You're right, it's better to extract them into an independent function.

@mapleFU
Copy link
Member

mapleFU commented Apr 9, 2024

apache/datafusion#6995

Seems ClickHouse flatten will flat nested list, other db in this issue doesn't flatten nested list

@ZhangHuiGui
Copy link
Collaborator Author

ZhangHuiGui commented Apr 9, 2024

apache/arrow-datafusion#6995

Seems ClickHouse flatten will flat nested list, other db in this issue doesn't flatten nested list

Yes, i've thought about this just now.
I'm planing to add an option for Flatten api, let user decide which kind of flatten they want?

And also, this is actually a general requirement. If we allow users to generate such nested types as a general data platform(rather than db), we should provide corresponding flattening functions. At the same time, for some of our built-in compute functions, such as the scalar_hash kernel function we are working on recently , need to calculate hash values of list type, then the nested-array need to be completely flattened.

@mapleFU
Copy link
Member

mapleFU commented Apr 9, 2024

I don't know, I think if we want to implement that:

  1. For function, we can support option
  2. For underlying impl, a new interface

@ZhangHuiGui
Copy link
Collaborator Author

ZhangHuiGui commented Apr 9, 2024

I don't know, I think if we want to implement that:

  1. For function, we can support option
  2. For underlying impl, a new interface

For 1, adder a new FlattenRecursion function with same underlying implementation as Flatten and keep the origin function for compatible with pyarrow python interface.

@ZhangHuiGui ZhangHuiGui force-pushed the add-flatten-for-nested-lists branch from c8fc9c1 to 3d5a2ec Compare April 9, 2024 14:44
cpp/src/arrow/array/array_nested.h Outdated Show resolved Hide resolved
Comment on lines 192 to 193
/// \brief Flatten all level recursively until reach a non-list type, and return a
/// non-list type Array.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be explicit and list the types that we consider as logical lists here. It should include lists, list-views, fixed-size-lists, and map.

Screenshot 2024-04-09 at 15 06 26

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And you only have to declare it on VarLengthListLikeArray<> and FixedSizeListArray -- they can all simply call the same non-inline function that dispatches based on type->id(). That extra check won't be significant for a function that scans the entire array and allocates.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree! It's much clear.

cpp/src/arrow/array/array_nested.cc Outdated Show resolved Hide resolved
Comment on lines 380 to 381
return FlattenListArray(checked_cast<const ListArray&>(array), with_recursion,
memory_pool);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing this, you should have a while loop that successfully calls FlattenListArray. No bool with_recursion should be necessary.

The while loop means we don't stack overflow if someone passes a very deeply nested logical list array. In the future, we should come to this function and optimize it specifically for the recursive use-case. If depth==1, it's trivial to delegate to the non-recursive version, otherwise there are many interesting special cases that can be very efficient to flatten. But for now the while loop with successive flatten calls will do.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, with_recursion is unnecessary. And also, it seems that we can easily complete flatten in one cycle with depth=1.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Apr 9, 2024
@felipecrv
Copy link
Contributor

I don't know, I think if we want to implement that:

  1. For function, we can support option
  2. For underlying impl, a new interface

For 1, adder a new FlattenRecursion function with same underlying implementation as Flatten and keep the origin function for compatible with pyarrow python interface.

At the builder_nested.h level, I don't think bool parameters controlling recursion are necessary/desirable. Perhaps at the kernels API level (with the addition of an options_class for the "list_flatten" function) or at the pyarrow bindings. The default value for bool recursive on these 2 should be false so old behavior is preserved unless users opt-in to the recursive behavior explicitly.

const FunctionDoc list_flatten_doc(
"Flatten list values",
("`lists` must have a list-like type.\n"
"Return an array with the top list level flattened.\n"
"Top-level null values in `lists` do not emit anything in the input."),
{"lists"});

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 10, 2024
@ZhangHuiGui ZhangHuiGui force-pushed the add-flatten-for-nested-lists branch from addce49 to 2450c60 Compare April 10, 2024 14:53
Copy link
Contributor

@felipecrv felipecrv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking much simpler now! But some small changes needed.

cpp/src/arrow/array/array_nested.h Outdated Show resolved Hide resolved
cpp/src/arrow/array/array_nested.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/array_nested.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/array_nested.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/array_nested.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/array_nested.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/array_nested.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/array_nested.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/array_nested.cc Outdated Show resolved Hide resolved
in_array = out;
kind = in_array->type_id();
}
return std::move(in_array);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't move returned values. At this position, the compiler automatically turns the returned value into a && and passes that to the Result constructor. It took me a while to learn this!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this!

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Apr 11, 2024
@ZhangHuiGui
Copy link
Collaborator Author

@felipecrv Thank you so much for providing so many helpful suggestions to make codes cleaner!

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 11, 2024
Result<std::shared_ptr<Array>> FlattenLogicalListRecursively(const Array& in_array,
MemoryPool* memory_pool) {
std::shared_ptr<Array> array = in_array.Slice(0, in_array.length());
for (auto kind = array->type_id(); is_list(kind) || is_list_view(kind);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_list and is_list_view would exclude MAP type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Apr 11, 2024
cpp/src/arrow/array/array_list_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/array_list_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/array_list_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/array/array_list_test.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 11, 2024
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Apr 11, 2024
@felipecrv
Copy link
Contributor

I will merge this soon if no one has concerns.

@ZhangHuiGui please create an issue (Enhancement Request) about improving the implementation of this function by making it create less intermediate array values. And another one about exposing it on the kernels API and Python API (cc @jorisvandenbossche).

@ZhangHuiGui
Copy link
Collaborator Author

@felipecrv Thanks for your review!
Two independent issues have been added to facilitate our subsequent tracking.

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks nice as a first implementation ( without considering the performance )

@felipecrv felipecrv merged commit b98763a into apache:main Apr 15, 2024
37 checks passed
@felipecrv felipecrv removed the awaiting merge Awaiting merge label Apr 15, 2024
@ZhangHuiGui ZhangHuiGui deleted the add-flatten-for-nested-lists branch April 16, 2024 09:06
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit b98763a.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
…ed types (apache#41092)

### Rationale for this change
Support flatten for combining nested list related types.

### What changes are included in this PR?
Add the recursively flatten function for auto detect and flatten the combining nested list types.

### Are these changes tested?
Yes

### Are there any user-facing changes?
Yes, user can flatten a combining nested-list or related array by use `Flatten` API.

* GitHub Issue: apache#41055

Authored-by: ZhangHuiGui <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…ed types (apache#41092)

### Rationale for this change
Support flatten for combining nested list related types.

### What changes are included in this PR?
Add the recursively flatten function for auto detect and flatten the combining nested list types.

### Are these changes tested?
Yes

### Are there any user-facing changes?
Yes, user can flatten a combining nested-list or related array by use `Flatten` API.

* GitHub Issue: apache#41055

Authored-by: ZhangHuiGui <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants