
[FEA] Structs in CUDF #5700

Closed
mythrocks opened this issue Jul 15, 2020 · 17 comments · Fixed by #5807
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS

Comments

@mythrocks (Contributor) commented Jul 15, 2020

It would be good to have support for Struct in CUDF, with arbitrarily complex child fields. E.g.

struct <name:string, age:int8, gpa:float, addresses:list<string>>

This issue aims to document how Struct support might be added to CUDF, the behaviour of Struct columns in different scenarios, and possibly any related APIs. (The description of this issue is a work in progress, and will be updated based on discussion and feedback.)

Goals

  1. First class support for user-defined Struct types, having possibly multiple child field columns with possibly different data-types
  2. Support for null values in Struct columns, just as other column types
  3. Must distinguish between nulls in child columns, and nulls for the Struct
  4. Support construction via a corresponding column_factory.
  5. Support composition within other complex types. E.g.
    1. struct<name:string, address:struct<street:string, city:string, state:string, po:int8>>
    2. list<coordinates:struct<x:int8, y:int8, z:int8>>
    3. struct <name:string, age:int8, gpa:float, addresses:list<string>>
  6. Add support in cudf::gather(), to gather Struct rows.
  7. Ideally, support construction from enumerated literals, via a suitable cudf::test::column_wrapper.
    1. The following would be ideal, but the feasibility of this syntax is yet to be evaluated:
    cudf::struct_column_wrapper<string, int8_t, float, vector<string>> city_watch_members {
      {"Carrot Ironfoundersson", 23, 3.75, {"Copperhead Mountains", "Ankh-Morpork"}},
      {"Samuel Vimes",           48, null, {"Cockbill Street, Ankh-Morpork", "Scoone Avenue"}}
    };

Non-goals

  • Support for field names and field types, for self-describing columns.
    • At least for the first pass, Structs will only allow access to fields by their index.
  • Support for schema evolution.
    • While it is commonplace for Struct schemas to evolve, it is assumed that the schema resolution is done at a higher level of abstraction (above libcudf).

Background

A Struct is a nested data-type containing an ordered sequence of child members (“fields”). It is similar to a plain-old-data struct in C++.
Each struct field could have a different data type, and can be arbitrarily complex or nested. E.g.:

struct <name:string, age:int8, gpa:float, addresses:list<string>>

Each of name, age, gpa, and addresses has a different data-type. addresses is itself a nested datatype.
Struct columns might contain lists/structs, or be themselves elements of a list, or members of a struct.

The in-memory layout of Struct columns in Apache Arrow is described in the Apache Arrow Format Specification. It is expected that the in-memory layout for CUDF Structs will be near identical.

Since the cudf::column implementation already provides support for child columns, it is expected that implementing Struct fields as child columns should be straightforward.

Assumptions and Risks

  1. Child columns: The design for cudf::column already accommodates the notion of nested types via child columns. String and List column types already use child columns to store metadata (offset information) to manage the underlying data buffers. Using child columns to implement fields appears straightforward and logical.
  2. Gather: It is assumed that adding support in cudf::gather() to gather Structs will simply involve delegation to the field columns.
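
A rough illustration of assumption (2): gathering a Struct column can be modeled as applying the same gather map to the parent validity buffer and, by delegation, to each field column. This is a plain-Python sketch of the idea, not the cudf::gather() API; all names and data are illustrative.

```python
# Toy model: columns are (values, validity) pairs; a gather map is a list of
# row indices. Gathering the struct = gathering its validity + each field.

def gather(values, validity, gather_map):
    """Gather one child (field) column by row index."""
    return ([values[i] for i in gather_map],
            [validity[i] for i in gather_map])

struct_validity = [1, 1, 0, 1]
name_values = ["a", "b", "c", "d"]
name_validity = [1, 0, 1, 1]
gather_map = [3, 1]

# The struct's own validity is gathered with the same map as its fields.
new_struct_validity = [struct_validity[i] for i in gather_map]
new_name = gather(name_values, name_validity, gather_map)
```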

Alternatives considered

None, really. The availability of child-columns in cudf::column makes this the straightforward approach.

Design

The in-memory layout of Struct columns in Apache Arrow is described in the Apache Arrow Format Specification. The layout of CUDF Structs will be near identical:

  1. Children: A Struct column will store its field columns in its list of children.
    1. In CUDF, this corresponds to cudf::column::children.
    2. The number of rows in each field member column must be identical. This determines the number of rows in the Struct column.
    3. The data-type of field columns will not be stored in the Struct column. This will instead be delegated to the child columns themselves.
  2. Data: A Struct column stores no data of its own. The interpretation of a Struct row depends on collecting the corresponding rows of each child column, in their declared field order.
    1. For a CUDF Struct column, cudf::column::data will not point to any data.
  3. Validity: A Struct column will store a validity buffer that is distinct from the validity buffers of the child columns. When reading a row from a Struct column, the Struct’s validity supersedes those of its fields.
    1. If a row in a Struct column is marked null, its value is returned as null, even though the field columns might have non-null values.
    2. If a Struct entry is marked non-null, but any (or all) of its field columns are marked null, a non-null Struct row is returned, with the appropriate fields set to null.
    3. E.g. For a struct column defined as struct<name:string, score:int8>, the following rows are valid:
      1. {“Alpha”, 1} // (Non-null name and score)
      2. {“Bravo”, null} // (Null score)
      3. {null, 3} // (Null name)
      4. null // (Null Struct value)
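
The read semantics above can be sketched in plain Python. The helper name (read_struct_row) and the sample data are illustrative, not libcudf API; validity is modeled as a list of 0/1 flags.

```python
# Sketch of rule 3: the Struct's validity supersedes its fields' validity.

def read_struct_row(struct_validity, fields, row):
    """fields: list of (values, validity) pairs, one per child column."""
    if not struct_validity[row]:
        return None  # Parent validity supersedes the fields.
    return tuple(
        values[row] if validity[row] else None
        for values, validity in fields
    )

# struct<name:string, score:int8>, mirroring the four example rows above.
name = (["Alpha", "Bravo", None, "X"], [1, 1, 0, 1])
score = ([1, None, 3, 9], [1, 0, 1, 1])
struct_validity = [1, 1, 1, 0]

rows = [read_struct_row(struct_validity, [name, score], r) for r in range(4)]
```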

Memory Layout Examples

Struct<f:float, i:int>:

Data:   [ {1.0, 2},
          {4.0, 5},
          null,
          {8.0, null} ]
{
    type = struct
    validity = [1, 1, 0, 1]
    children = {
        {
            type = float
            values   = [1.0, 4.0, X, 8.0]
            validity = [  1,   1, 1,   1]
            children = null
        },
        {
            type = int
            values   = [2, 5, X, X]
            validity = [1, 1, 1, 0]
            children = null
        }
    }
}
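
The layout above can be materialized back into logical rows with a short Python sketch. X (undefined data under a null) is modeled as None, and the int child's validity is taken as [1, 1, 1, 0] (valid for the first two rows, arbitrary under the null struct, null in the last row); all names are illustrative.

```python
# Sketch: reassembling Struct<f:float, i:int> rows from the columnar buffers.

X = None  # stand-in for undefined data hidden by a null
struct_validity = [1, 1, 0, 1]
f_values, f_validity = [1.0, 4.0, X, 8.0], [1, 1, 1, 1]
i_values, i_validity = [2, 5, X, X], [1, 1, 1, 0]

rows = []
for r in range(4):
    if not struct_validity[r]:
        rows.append(None)               # parent null hides any child data
    else:
        rows.append((
            f_values[r] if f_validity[r] else None,
            i_values[r] if i_validity[r] else None,
        ))
```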

Struct<f:float, ilist:list<int>>:

Data:   [ {1.0, (1,2,3)},
          {4.0, (5,6,null)},
          null,
          {8.0, null} ]
{
    type = struct

    validity = [1, 1, 0, 1] // Third row is null.

    children = {
        {   
            type = float
            values   = [1.0, 4.0, X, 8.0]
            validity = [  1,   1, 1,   1]  
            children = null
        },  
        {   
            type = list
            values = null
            validity = [1, 1, 1, 0]
            offsets =  [0, 3, 6, 6, 6]
            children = {
                {   
                    type = int
                    values   = [ 1, 2, 3, 5, 6, X ]
                    validity = [ 1, 1, 1, 1, 1, 0 ]
                    children = null
                }   
            }
        }  
    }   
}

This example illustrates how the Struct’s validity buffer overrides that of the child column:

  1. Struct row#0 is straightforward. Both fields have valid values, and the list field contains no null ints.
  2. Struct row#1 is valid, and should return {4.0, (5,6,null)} because:
    1. float field has a valid value: 4.0
    2. list field has a valid value: Offsets 3-6 in the underlying int column.
    3. int sub-column has an invalid value at offset 5, and therefore returns (5,6,null).
  3. Struct row#2's validity indicates that the row at index 2 is null. The float field might have a valid value at index 2 (indicated here by X). But the Struct value at row 2 still returns null.
  4. Struct row#3 is similar to row#2. The list field’s validity indicates a null value at index 3, while the parent Struct does not. As a result, the Struct row at index 3 is read as {8.0,null}.

List< Struct< f:float, i:int > >:

Data: [ [{1.0, 1}, {2.0, 2}],
        [{3.0, 3}],
        [{4.0, null}],
        [null],
        null ]

{
    type = list
    values = null
    validity = [ 1, 1, 1, 1, 0 ]
    offsets  = [ 0, 2, 3, 4, 5, 5 ]
    children = {
        {
            type = struct
            values = null
            validity = [ 1, 1, 1, 1, 0 ]
            children = {
                {
                    type = float
                    values   = [ 1.0, 2.0, 3.0, 4.0, 5.0 ]
                    validity = [   1,   1,   1,   1,   1 ]
                    children = null
                },
                {   
                    type = int
                    values   = [   1,   2,   3,   X,   5 ]
                    validity = [   1,   1,   1,   0,   1 ]
                    children = null
                }
            }
        }
    }
}

This example illustrates how list<struct<int,...>> is analogous to list<int> columns. A List column’s offset vector applies to a Struct element similarly to any primitive element. For instance, when composing the first list row as offsets 0-2 of the child column, the Struct column in turn collects elements 0-2 of each of its child columns (i.e. fields). Thus, the first List row is:
[ { 1.0, 1}, {2.0, 2} ].
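
This offset-based composition can be sketched in plain Python. The sketch assumes offsets [0, 2, 3, 4, 5, 5], so that the fourth list row spans its single null struct element; helper names are illustrative, not libcudf API.

```python
# Toy model of List<Struct<f:float, i:int>>: the list's offsets select a
# contiguous range of struct elements, and each struct element is resolved
# against the struct child's own validity.

offsets = [0, 2, 3, 4, 5, 5]
list_validity = [1, 1, 1, 1, 0]
struct_validity = [1, 1, 1, 1, 0]
f_values = [1.0, 2.0, 3.0, 4.0, None]
i_values, i_validity = [1, 2, 3, None, None], [1, 1, 1, 0, 1]

def struct_elem(k):
    if not struct_validity[k]:
        return None                      # null struct element inside a list
    return (f_values[k], i_values[k] if i_validity[k] else None)

def list_row(r):
    if not list_validity[r]:
        return None                      # null list row
    return [struct_elem(k) for k in range(offsets[r], offsets[r + 1])]
```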

Open questions

  1. A naive approach to implementing a map data-type could be in terms of a list of struct< key:KeyType, value:ValueType >. A Map column might simply have a single child list column, which holds a single struct column, which in turn has two child columns: key, and value.
    1. Must check if it is acceptable that Maps could contain repeated keys (a la MultiMaps), and that look-up is O(n). These semantics would mimic those of SparkSQL, Hive, Presto, and similar Big Data systems.
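
The proposed map semantics (duplicate keys allowed, O(n) lookup) can be sketched as follows. This is a pure-Python model of one map row, not a libcudf API; which duplicate wins is itself an open question, and this sketch simply returns the first match.

```python
# Map-as-list<struct<key, value>>: one map row is just a list of (key, value)
# pairs, so lookup is a linear scan and repeated keys are representable.

def map_lookup(pairs, key):
    """pairs: one map row, a list of (key, value) tuples. O(n) scan."""
    for k, v in pairs:      # first match wins in this sketch
        if k == key:
            return v
    return None             # key absent

row = [("a", 1), ("b", 2), ("a", 3)]   # repeated key "a", MultiMap-style
```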
@mythrocks mythrocks added feature request New feature or request Needs Triage Need team to review and classify labels Jul 15, 2020
@mythrocks mythrocks self-assigned this Jul 15, 2020
@jrhemstad (Contributor)

Something I've thought about in the past is that a struct column is really just a table with it's own row-wise null mask. It seems like it would be advantageous to lean on that similarity for reducing redundant code.

@harrism harrism changed the title [WIP] [FEA] Structs in CUDF [FEA] Structs in CUDF Jul 17, 2020
@harrism (Member) commented Jul 17, 2020

@jrhemstad but that means nesting applies equally to tables as well as columns, rather than just columns. e.g. you would need to be able to have a column of tables. Or a column of lists of tables. Right?

@jrhemstad (Contributor)

@jrhemstad but that means nesting applies equally to tables as well as columns, rather than just columns. e.g. you would need to be able to have a column of tables. Or a column of lists of tables. Right?

Yeah, perhaps the similarity is shallower than I originally thought.

@mythrocks (Contributor, Author) commented Jul 21, 2020

I should probably ask this on Slack.

@jrhemstad: Might I solicit your advice on the following?

cudf::struct_column_wrapper<string, int8_t, float, vector<string>> city_watch_members {
  {"Carrot Ironfoundersson", 23, 3.75, {"Copperhead Mountains", "Ankh-Morpork"}},
  {"Samuel Vimes",           48, null, {"Cockbill Street, Ankh-Morpork", "Scoone Avenue"}}
};

My C++ skills... aren't, evidently. std::initializer_list<> requires a single type T. So what might the signature for this constructor look like? The type-list might be user-defined. I've been referring to std::tuple as a guide. I'd really appreciate your advice.

As an aside, @nvdbaranec, the third template parameter does not gel well with lists.
cudf::test::list_column_wrapper<T> takes a single type argument, regardless of the nesting level. (I found that very clever.) Should the third type argument in struct_column_wrapper be string, not vector<string>?

@jrhemstad (Contributor)

So what might the signature for this constructor look like? The type-list might be user-defined. I've been referring to std::tuple as a guide. I'd really appreciate your advice.

Drawing again on the similarities between struct columns and tables, you basically want the table_wrapper I described here: #3203

It may be easier to just construct a struct_column_wrapper from other *_column_wrappers and construct each member column individually in the same way we do for tables today in tests.

Your example is a nice goal, and should be possible in theory, but it will be quite complex to make it work in practice, especially to allow indicating individual members of the struct are null in addition to indicating the entire struct is null.

@mythrocks (Contributor, Author)

It may be easier to just construct a struct_column_wrapper from other *_column_wrappers and construct each member column individually in the same way we do for tables today in tests.

I was afraid that might be the case, @jrhemstad. I'm afraid this will imply that constructing list<struct<...>> will be verbose.

I'll try to continue. Thank you for confirming.

@mythrocks (Contributor, Author) commented Jul 22, 2020

Another question, regarding a Struct's validity/null-mask. (This pertains to something that @jlowe mentioned in a prior discussion.)

In systems like Apache Spark SQL or Apache Hive, when a Struct's members are accessed, the value is deemed null if either Struct or that specific member is null.
In CUDF, a Struct may be constructed from child columns whose validity may be set independently of the Struct's validity.

Are there any circumstances where a child column needs to be read using its original validity setting? Or should we always mask out the child columns at the indices where the parent Struct is null?

@jrhemstad (Contributor)

Or should we always mask out the child columns at the indices where the parent Struct is null?

What does Arrow do? AFAIK, the children and parent have independent null masks.

@jlowe (Member) commented Jul 22, 2020

This pertains to something that @jlowe mentioned

My comment referred to operating on a field within a struct. For example, take a struct S of three fields: x, y, z, and I want to do a SQL SELECT S.x + S.y. One way to support this is to update libcudf binops so it understands how to properly access structure fields (and somehow specify in the input parameters which field in the struct column is being referenced for each input). It would not be correct to pass the x and y columns directly to the binop function since some structure rows may be null and not reflected in the validity masks of either the x or y columns.

At least in the short term, I think it would be nice if libcudf had a method to manifest a "struct field view" that allows a structure field column to be passed to existing, struct-oblivious libcudf functions. Building such a view would build a new validity mask by merging the validity of the field column with the validity of the parent struct column (via bitwise-AND), so the libcudf operation knows which rows in the column are null. I'm hoping we're able to re-use the data (not validity) of the original field column since we'd be passing a column_view for the input which doesn't need to own memory. But even if for some reason we have to make a full copy of the field column data to pull this off, at least we can open up a lot of libcudf functionality for queries that operate on structure fields.
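
The core of the "struct field view" idea is that a field's effective validity is the bitwise AND of its own mask with the parent struct's mask. A minimal sketch, with validity modeled as Python ints holding packed bits (bit r = row r); the helper name is illustrative, not a libcudf API.

```python
# Effective validity of a struct field, as seen through a field view:
# a row is valid only if both the parent struct row and the field row are.

def field_view_validity(struct_mask: int, field_mask: int) -> int:
    return struct_mask & field_mask

# Rows 0..3; a set bit means valid.
struct_mask = 0b1011   # row 2 of the struct is null
x_mask      = 0b1110   # row 0 of field x is null

merged = field_view_validity(struct_mask, x_mask)
```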

@mythrocks (Contributor, Author)

It would not be correct to pass the x and y columns directly to the binop function since some structure rows may be null and not reflected in the validity masks of either the x or y columns.

Thanks, @jlowe. Permit me to rephrase: In the binop case (maybe even most cases), the column needs to be evaluated in the context of its parent struct column. But am I missing a counterpoint?

I think it would be nice if libcudf had a method to manifest a "struct field view" that allows a structure field column to be passed to existing, struct-oblivious libcudf functions.

This is brilliant. A struct field-view would be an elegant solution. (I'll have to figure out how to handle the column lifetimes, if I can't superimpose the struct's null-mask.)
But I wonder whether this would even be required, if there are no circumstances where the struct's underlying columns need to be evaluated independently of the struct's validity.

@mythrocks (Contributor, Author) commented Jul 22, 2020

What does Arrow do? AFAIK, the children and parent have independent null masks.

The relevant section from the Arrow spec:

Any of the child field arrays can have null values according to their respective independent validity bitmaps. This implies that for a particular struct slot the validity bitmap for the struct array might indicate a null slot when one or more of its child arrays has a non-null value in their corresponding slot. When reading the struct array the parent validity bitmap takes priority.

This is illustrated in the example above, the child arrays have valid entries for the null struct but are ‘hidden’ from the consumer by the parent array’s validity bitmap. However, when treated independently corresponding values of the children array will be non-null.

In the context of a struct row, the struct's validity takes precedence. The examples listed in this issue's description illustrate the case.

Re-reading the last sentence last night gave me pause. I can't think of a situation where a Struct's children are "treated independently". It is akin to extracting the child column of a CUDF list column.

In implementation, I am considering superimposing the struct's validity on top of that of the child columns, at construction. This obviates having to do the same on each read. I'd like to be sure this wouldn't run afoul of what the CUDF team had in mind.

@mythrocks (Contributor, Author) commented Jul 23, 2020

I have been advised to also seek guidance from @trxcllnt and @andygrove, who are more knowledgeable about Apache Arrow than I am. The question at hand is whether it would be incorrect to superimpose the parent (i.e. Struct) validity buffer over all its child columns, at construction:

// For each child of the struct
child.validity &= struct.validity;

Effect: A child column's row-value is null if either parent struct or the child column itself has that index set to null.
Advantage: The null mask does not need recomputation every time it is read. Also, all the rest of CUDF doesn't need special handling for struct fields.
Disadvantage: The child column's original null-mask is not preserved once the column is adopted into the struct's context.

This might be acceptable, since the child column is adopted by the struct, and is (AFAICT) not processed outside the struct's context.

@trxcllnt (Contributor)

Yes, in Arrow the struct parent and child nullmasks are independent. They're also independent in Lists + the other nested types (except Unions, but that was a recent change). @jrhemstad's observation that structs are a sort of table-within-a-table is more or less correct, format-wise. I vote not to deviate from Arrow in this way unless absolutely necessary, since deviations make the interop story more difficult.

@kkraus14 (Collaborator)

To also add, there's a logical and semantic difference between an element in a struct being NULL versus the elements in the child columns being NULL and Arrow captures this:

import pyarrow as pa

test =  pa.array(
    [
        {'a': 1, 'b': 2},
        None,
        {'a': None, 'b': 4},
        {'a': 5, 'b': None},
        {'a': None, 'b': None}
    ]
)

print(test[1])
# NULL

print(test[4])
# {'a': None, 'b': None}

@revans2 (Contributor) commented Jul 23, 2020

So, to be clear, we are not deviating from the Arrow spec. With the example from @kkraus14 above, the Arrow spec states that the validity looks like the following (expanded out for simplicity).

struct_validity = [VALID, INVALID, VALID  , VALID  , VALID]
a_validity =      [VALID, ???    , INVALID, VALID  , INVALID]
b_validity =      [VALID, ???    , VALID  , INVALID, INVALID]

The ??? in there are for entries that the spec says can be anything. It does not matter if they are VALID or INVALID because the parent's validity for the row says it is INVALID.

The issue is that if ??? is set to VALID, which the Arrow spec says it can be, and someone tries to pull the column a out of the struct, what should we do? Do we keep the validity as-is for column a? Do we need to merge the validity of the parent with the validity of column a? For Spark there is no ambiguity: struct.a is NULL if struct is NULL. If that is true for pandas as well (or pandas never gets to that point, because an exception would be thrown ahead of time since None.a is illegal), then we should be able to remove the ambiguity early on, which is what @mythrocks is proposing.

If we remove the ambiguity early on, then pulling a out of struct is a zero-copy operation that returns a column_view. If we decide to just live with the ambiguity, then any group that does not want it needs to resolve it themselves. This would likely involve multiple kernels: pulling the different validity masks out of the data, turning them into boolean columns so we can AND them together, and then doing a copy_if_else to finally get the fixed-up data before doing the final computation. That would mean that struct.a + struct.b would likely involve 9 cudf kernel calls instead of 1. Deeper nesting makes it even worse.

@jrhemstad (Contributor)

Has anyone tried just asking on the Arrow mailing list?

@mythrocks (Contributor, Author)

Just closing the loop here with the results of yesterday's parley.

The current plan is to superimpose (or "flatten") the struct's validity mask onto all applicable children. In this initial run, the original child masks will not be preserved. The Struct owns the children, and uses them to represent struct members.

If a use-case crops up where a user supplies the child-columns and expects them to be unmodified, we will revisit this decision:

  1. This might involve reworking the semantics of structs_column_view.child(index), which might need to return the flattened index alongside the column.
  2. This will also imply having to do one of the following:
    1. Flatten indices at access time: This involves running a bitmask_and() kernel for all children, for each column batch, per access. Performance implications should be studied.
    2. Pre-flatten indices at struct construction, but store them separately from the original validity masks of the child columns. Lifetime management for the buffers has to be worked out.

For the moment, either might be a bridge too far.

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS and removed Needs Triage Need team to review and classify labels Aug 5, 2020
mythrocks added a commit to mythrocks/cudf that referenced this issue Aug 30, 2021
Per rapidsai#5700, when a STRUCT column is constructed, the null mask of the parent
column is bitwise-ANDed with that of all its children, such that a null row
in the parent column corresponds to nulls in all its children. This is done
recursively, allowing grand-child columns to also have nulls at the same
row positions.

`superimpose_parent_nulls()` makes this functionality available for columns
that might not have been constructed through `make_struct_column()`, e.g.
with columns received directly from Arrow. It does not require that the
`column_view` is modifiable. For a STRUCT `column_view` argument, a new
equivalent instance is created, with all its children's null masks modified
to account for the parent nulls.

`superimpose_parent_nulls()` can be used for all code that assumes that the
child null masks account for the nulls in the parents (and grandparents,
ad infinitum).
rapids-bot bot pushed a commit that referenced this issue Sep 2, 2021

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Conor Hoekstra (https://github.com/codereport)

URL: #9144