Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] lists::detail::sort_lists returns wrong result for sliced lists column containing structs entries #9213

Closed
ttnghia opened this issue Sep 10, 2021 · 1 comment · Fixed by #9218
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS

Comments

@ttnghia
Copy link
Contributor

ttnghia commented Sep 10, 2021

During working on drop_list_duplicates for structs entries, a bug shows up indicating that sliced columns of structs are not properly handled in lists::detail::sort_lists:

  • With cudf::test::print(lists_original):
List<Struct<int8_t,cudf::string_view,>>:
Length : 3
Offsets : 0, 8, 16, 24
   Struct<int8_t,cudf::string_view,>:
   Length : 24:
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 3, 2, 3, 3
      Banana, Mango, Apple, Cherry, Kiwi, Banana, Cherry, Kiwi, Bear, Duck, Cat, Dog, Panda, Bear, Cat, Panda, ÁÁÁ, ÉÉÉÉÉ, ÍÍÍÍÍ, ÁBC, XYZ, ÁÁÁ, ÁBC, XYZ
  • With auto const lists = cudf::slice(lists_original, {1, 3})[0];:
List<Struct<int8_t,cudf::string_view,>>(sliced):
Length : 2
Offsets : 0, 8, 16
   Struct<int8_t,cudf::string_view,>:
   Length : 16:
      1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 3, 2, 3, 3
      Bear, Duck, Cat, Dog, Panda, Bear, Cat, Panda, ÁÁÁ, ÉÉÉÉÉ, ÍÍÍÍÍ, ÁBC, XYZ, ÁÁÁ, ÁBC, XYZ
  • When calling detail::sort_lists(lists_column_view(lists), order::ASCENDING, null_order::AFTER);:
List<Struct<int8_t,cudf::string_view,>>:
Length : 2
Offsets : 0, 8, 16
   Struct<int8_t,cudf::string_view,>:
   Length : 16:
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2
      Banana, Banana, Apple, Cherry, Mango, Cherry, Kiwi, Kiwi, Dog, Bear, Bear, Duck, Cat, Panda, Panda, Cat

Note that sort_lists works just fine with sliced columns of basic types. I quickly look at the implementation of sort_lists and see that it calls segmented_sorted_order and gather so maybe the bug is related to one of these APIs.

@ttnghia ttnghia added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Sep 10, 2021
@ttnghia ttnghia self-assigned this Sep 10, 2021
@ttnghia
Copy link
Contributor Author

ttnghia commented Sep 10, 2021

Today I did a quick test:

    auto gather_map = IntsCol{0}.release();
    auto const gathered =
      cudf::gather(cudf::table_view{std::vector<cudf::column_view>{lists}}, gather_map->view())
        ->get_column(0);
    cudf::test::print(gathered.view());

And the result clearly shows that the bug is due to gather (it is the first list of lists_original column, not of lists column that gather is operating on):

List<Struct<int8_t,cudf::string_view,>>:
Length : 1
Offsets : 0, 8
   Struct<int8_t,cudf::string_view,>:
   Length : 8:
      1, 1, 1, 1, 1, 1, 1, 1
      Banana, Mango, Apple, Cherry, Kiwi, Banana, Cherry, Kiwi

rapids-bot bot pushed a commit that referenced this issue Sep 17, 2021
This PR fixes the `gather` API for structs columns when the input is a sliced column. Previously, `gather` calls `child_begin()` and `child_end()` to access the children column so if the input structs column is sliced then the output is incorrect.

This closes #9213, and is blocked by #9194 due to conflict work.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Mark Harris (https://github.com/harrism)

URL: #9218
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant