Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] explode_outer and explode_outer_position do not push down validity into child structs #9003

Closed
revans2 opened this issue Aug 9, 2021 · 3 comments
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Copy link
Contributor

revans2 commented Aug 9, 2021

Describe the bug
explode_outer and explode_outer_position both appear to not propagate the null value to child struct columns when the input column is an empty list or null.

If I have data like LIST<STRUCT<KEY: INT32, VALUE: INT32>> that is.

  1. []// Empty List
  2. null // Null Value for the list

And then I do a an explode_outer_position just because it shows where the problems are better than an explode_outer, followed by getting the key and value children from the struct I get something that looks like.

pos key value
null 12849 99
null 0 0

but the key and value change each time I run it because it looks like the memory for that value is not initialized and it looks like the validity of the parent struct is not pushed down into the children.

I thought that generally for structs the validity of the parent was going to always be pushed down into the children. But that is not the case here, and from reading the code it appears to actually be an issue with gather. So this problem may be a lot more wide-spread than just explode_outer*.

Steps/Code to reproduce bug
See above

Expected behavior
I would expect the validity of the parent struct to be pushed down into the child for data that we computed ourselves.

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify labels Aug 9, 2021
@beckernick beckernick added libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS and removed Needs Triage Need team to review and classify labels Aug 24, 2021
@jrhemstad
Copy link
Contributor

I thought that generally for structs the validity of the parent was going to always be pushed down into the children

From my recollection, that was only true of the struct column factories. In general, this is outside of the Arrow spec and so we don't currently guarantee this everywhere.

@jrhemstad
Copy link
Contributor

@mythrocks just spent a lot of time thinking and asking around about this. He can probably speak best about it.

@mythrocks
Copy link
Contributor

This took a little bit of effort to verify. TLDR: #9194 fixes this problem.

Repro case:

TYPED_TEST(TypedStructGatherTest, ExplodeTest)
{
  auto list_of_structs = [] {

    auto keys   = numerics<TypeParam>{0,1, 4,5, 6,7};
    auto values = numerics<TypeParam>{9,8, 5,4, 3,2};

    auto struct_col = structs{{keys, values}};
    auto list_validity = null_at(1);

    auto get_null_mask_null_at_1 = [&]{
      return numerics<TypeParam>{{1,2,3,4}, null_at(1)}.release()->release().null_mask;
    };
    return make_lists_column(4, offsets{0,2,2,4,6}.release(), struct_col.release(), 1, std::move(*get_null_mask_null_at_1()));
  }();

  std::cout << "Lists input: " << std::endl;
  print(list_of_structs->view());

  auto exploded = cudf::explode_outer(table_view{{list_of_structs->view()}}, 0);
  std::cout << "Exploded output: " << std::endl;
  print(exploded->view().column(0));
}

Output:

Lists input:
List<Struct<int32_t,int32_t,>>:
Length : 4
Offsets : 0, 2, 2, 4, 6
Null count: 1
1101
   Struct<int32_t,int32_t,>:
   Length : 6:
      0, 1, 4, 5, 6, 7
      9, 8, 5, 4, 3, 2


Exploded output:
Struct<int32_t,int32_t,>:
Length : 7:
Null count: 1
1111011
   1111011
   0, 1, NULL, 4, 5, 6, 7
   1111011
   9, 8, NULL, 5, 4, 3, 2

In short, explode_outer() on a List<Struct<Key,Value>> column whose second row is null produces a STRUCT column with the row corresponding to the null row also set to null. The nulls are pushed down to the children.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

No branches or pull requests

5 participants