GH-41055: [C++] Support flatten for combining nested list related types #41092
@@ -925,10 +969,12 @@ TYPED_TEST(TestListArray, BuilderPreserveFieldName) {
TYPED_TEST(TestListArray, FlattenSimple) { this->TestFlattenSimple(); }
TYPED_TEST(TestListArray, FlattenNulls) { this->TestFlattenNulls(); }
TYPED_TEST(TestListArray, FlattenAllEmpty) { this->TestFlattenAllEmpty(); }
TYPED_TEST(TestListArray, FlattenSliced) { this->TestFlattenSliced(); }
This test had been forgotten and was never added.
@felipecrv PTAL, thanks!
Would this be better as a new API?
cpp/src/arrow/array/array_nested.cc
Outdated
return FlattenListArray(checked_cast<const FixedSizeListArray&>(*varr),
memory_pool);
default:
return Status::Invalid("Unknown or unsupported arrow nested type: ",
what about list view?
List-view-related types are handled in the FlattenListViewArray path.
You're right, it's better to extract them into an independent function.
Seems ClickHouse |
Yes, I was just thinking about this. It is also a general requirement: if we, as a general data platform (rather than a database), allow users to generate such nested types, we should provide the corresponding flattening functions. In addition, some of our built-in compute functions, such as the scalar_hash kernel we have been working on recently, need to compute hash values of list types, which requires the nested array to be completely flattened.
I don't know, I think if we want to implement that:
|
For 1, add a new |
c8fc9c1
to
3d5a2ec
Compare
cpp/src/arrow/array/array_nested.h
Outdated
/// \brief Flatten all levels recursively until reaching a non-list type, and return a
/// non-list type Array.
And you only have to declare it on VarLengthListLikeArray<> and FixedSizeListArray -- they can all simply call the same non-inline function that dispatches based on type->id(). That extra check won't be significant for a function that scans the entire array and allocates.
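The layout suggested above can be sketched roughly as follows. This is a toy model using only the standard library, not the Arrow classes: `ToyKind`, `ToyArray`, `ToyVarLengthListLike`, `ToyFixedSizeList` and `FlattenAnyList` are hypothetical names, and the dispatch function merely reports which path it would take.

```cpp
#include <string>

// Hypothetical type ids; the real library has its own enumeration.
enum class ToyKind { kList, kLargeList, kListView, kFixedSizeList, kInt32 };

// Hypothetical base class that only remembers its type id.
class ToyArray {
 public:
  explicit ToyArray(ToyKind kind) : kind_(kind) {}
  ToyKind kind() const { return kind_; }

 private:
  ToyKind kind_;
};

// The single out-of-line function that owns the dispatch logic.
std::string FlattenAnyList(const ToyArray& array);

// Each list-like class only declares a thin member that forwards to it.
template <ToyKind K>
class ToyVarLengthListLike : public ToyArray {
 public:
  ToyVarLengthListLike() : ToyArray(K) {}
  std::string FlattenRecursively() const { return FlattenAnyList(*this); }
};

class ToyFixedSizeList : public ToyArray {
 public:
  ToyFixedSizeList() : ToyArray(ToyKind::kFixedSizeList) {}
  std::string FlattenRecursively() const { return FlattenAnyList(*this); }
};

// Dispatch on the runtime type id. For illustration it returns the name of
// the path taken instead of doing any real flattening.
std::string FlattenAnyList(const ToyArray& array) {
  switch (array.kind()) {
    case ToyKind::kList:
    case ToyKind::kLargeList:
      return "list";
    case ToyKind::kListView:
      return "list-view";
    case ToyKind::kFixedSizeList:
      return "fixed-size-list";
    default:
      return "not a list";
  }
}
```

The member functions stay one-liners, so the per-class declarations cost almost nothing and the type check lives in one place.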
Agreed! It's much clearer.
cpp/src/arrow/array/array_nested.cc
Outdated
return FlattenListArray(checked_cast<const ListArray&>(array), with_recursion,
memory_pool);
Instead of doing this, you should have a while loop that successively calls FlattenListArray. No bool with_recursion should be necessary.
The while loop means we don't stack overflow if someone passes a very deeply nested logical list array. In the future, we should come back to this function and optimize it specifically for the recursive use case. If depth==1, it's trivial to delegate to the non-recursive version; otherwise there are many interesting special cases that can be very efficient to flatten. But for now the while loop with successive flatten calls will do.
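A minimal sketch of that loop, assuming a toy nested structure rather than the Arrow classes: `ToyNode`, `MakeLeaf`, `MakeList`, `Slice`, `FlattenOnce` and `FlattenAllLevels` are hypothetical names, and nulls are ignored for brevity. Each iteration removes exactly one list level, so the nesting depth affects the number of iterations but not the call stack depth.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for a nested array (NOT the arrow::Array API). A node is either
// a leaf holding ints, or a list whose offsets index into a child node.
struct ToyNode {
  bool is_list = false;
  std::vector<int> leaf;                 // values, used when !is_list
  std::vector<int> offsets;              // length + 1 entries, used when is_list
  std::shared_ptr<const ToyNode> child;  // used when is_list
};

std::shared_ptr<const ToyNode> MakeLeaf(std::vector<int> values) {
  auto node = std::make_shared<ToyNode>();
  node->leaf = std::move(values);
  return node;
}

std::shared_ptr<const ToyNode> MakeList(std::vector<int> offsets,
                                        std::shared_ptr<const ToyNode> child) {
  auto node = std::make_shared<ToyNode>();
  node->is_list = true;
  node->offsets = std::move(offsets);
  node->child = std::move(child);
  return node;
}

// Keep only elements [start, end) of `node`.
std::shared_ptr<const ToyNode> Slice(const ToyNode& node, int start, int end) {
  if (node.is_list) {
    // A list of n elements needs n + 1 offsets, hence `end + 1`.
    return MakeList(std::vector<int>(node.offsets.begin() + start,
                                     node.offsets.begin() + end + 1),
                    node.child);
  }
  return MakeLeaf(
      std::vector<int>(node.leaf.begin() + start, node.leaf.begin() + end));
}

// Remove one list level: the result is the child restricted to the range
// covered by the first and last offsets (this also handles sliced lists).
std::shared_ptr<const ToyNode> FlattenOnce(const ToyNode& list) {
  assert(list.is_list && !list.offsets.empty() && list.child != nullptr);
  return Slice(*list.child, list.offsets.front(), list.offsets.back());
}

// Iterative version: no recursion and no `with_recursion` flag.
std::shared_ptr<const ToyNode> FlattenAllLevels(
    std::shared_ptr<const ToyNode> array) {
  while (array->is_list) {
    array = FlattenOnce(*array);
  }
  return array;
}
```

For example, a node built as `MakeList({0, 2, 3}, MakeList({0, 2, 3, 4}, MakeLeaf({1, 2, 3, 4})))` models `[[[1, 2], [3]], [[4]]]`; passing it to `FlattenAllLevels` yields a leaf holding `1, 2, 3, 4`. Each iteration materializes an intermediate node, which is the cost the comment above proposes to optimize later.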
Yes, with_recursion is unnecessary. It also seems that we can easily complete the flatten in a single iteration when depth=1.
At the arrow/cpp/src/arrow/compute/kernels/vector_nested.cc Lines 108 to 113 in dbedcfc
|
addce49
to
2450c60
Compare
Looking much simpler now! But some small changes are needed.
cpp/src/arrow/array/array_nested.cc
Outdated
in_array = out;
kind = in_array->type_id();
}
return std::move(in_array);
Don't move returned values. At this position, the compiler automatically turns the returned value into a && and passes that to the Result constructor. It took me a while to learn this!
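The rule described above can be checked with a small standard-library-only experiment. `Tracked` and `ToyResult` are hypothetical stand-ins (not `arrow::Result`), and the sketch assumes a compiler implementing C++14 or later, where a local variable named in a `return` statement is first treated as an rvalue even if the return type is a wrapper around it.

```cpp
#include <utility>

// Counts how its instances are copied or moved.
struct Tracked {
  static int copies;
  static int moves;
  Tracked() = default;
  Tracked(const Tracked&) { ++copies; }
  Tracked(Tracked&&) noexcept { ++moves; }
};
int Tracked::copies = 0;
int Tracked::moves = 0;

// Hypothetical, minimal Result-like wrapper with implicit constructors from
// both an lvalue and an rvalue of T.
template <typename T>
class ToyResult {
 public:
  ToyResult(const T& value) : value_(value) {}
  ToyResult(T&& value) : value_(std::move(value)) {}
  const T& value() const { return value_; }

 private:
  T value_;
};

ToyResult<Tracked> MakePlain() {
  Tracked local;
  // No std::move needed: `local` is treated as an rvalue here, so the
  // ToyResult(T&&) constructor is selected and the value is moved, not copied.
  return local;
}
```

After calling `MakePlain()`, `Tracked::copies` stays at 0. Writing `return std::move(local);` would select the same constructor, so the explicit move adds nothing, and some compilers can warn that it is redundant.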
Thank you so much for this!
@felipecrv Thank you so much for providing so many helpful suggestions to make the code cleaner!
Result<std::shared_ptr<Array>> FlattenLogicalListRecursively(const Array& in_array,
MemoryPool* memory_pool) {
std::shared_ptr<Array> array = in_array.Slice(0, in_array.length());
for (auto kind = array->type_id(); is_list(kind) || is_list_view(kind);
is_list and is_list_view would exclude MAP type.
Indeed.
I will merge this soon if no one has concerns. @ZhangHuiGui please create an issue (Enhancement Request) about improving the implementation of this function by making it create fewer intermediate array values. And another one about exposing it on the kernels API and Python API (cc @jorisvandenbossche).
@felipecrv Thanks for your review! |
This looks nice as a first implementation (without considering performance).
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit b98763a. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…ed types (apache#41092) ### Rationale for this change Support flatten for combining nested list related types. ### What changes are included in this PR? Add the recursively flatten function for auto detect and flatten the combining nested list types. ### Are these changes tested? Yes ### Are there any user-facing changes? Yes, user can flatten a combining nested-list or related array by use `Flatten` API. * GitHub Issue: apache#41055 Authored-by: ZhangHuiGui <[email protected]> Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
Rationale for this change
Support flatten for combining nested list related types.
What changes are included in this PR?
Add a recursive flatten function that automatically detects and flattens combinations of nested list types.
Are these changes tested?
Yes
Are there any user-facing changes?
Yes, users can flatten a combined nested-list or related array by using the `Flatten` API.