[FEA] Structs in CUDF #5700
Comments
Something I've thought about in the past is that a |
@jrhemstad but that means nesting applies equally to tables as well as columns, rather than just columns. e.g. you would need to be able to have a column of tables. Or a column of lists of tables. Right? |
Yeah, perhaps the similarity is shallower than I originally thought. |
I should probably ask this on Slack. @jrhemstad , @jrhemstad: Might I solicit your advice on the following?
cudf::struct_column_wrapper<string, int8_t, float, vector<string>> city_watch_members {
  {"Carrot Ironfoundersson", 23, 3.75, {"Copperhead Mountains", "Ankh-Morpork"}},
  {"Samuel Vimes", 48, null, {"Cockbill Street, Ankh-Morpork", "Scoone Avenue"}}
};
My C++ skills... aren't, evidently. As an aside, @nvdbaranec, the third template parameter does not gel well with lists. |
Drawing again on the similarities between It may be easier to just construct a Your example is a nice goal, and should be possible in theory, but it will be quite complex to make it work in practice, especially to allow indicating individual members of the struct are null in addition to indicating the entire struct is null. |
I was afraid that might be the case, @jrhemstad. I'm afraid this will imply that constructing I'll try to continue. Thank you for confirming. |
Another question, regarding a Struct's validity/null-mask. (This pertains to something that @jlowe mentioned in a prior discussion.) In systems like Apache Spark SQL or Apache Hive, when a Struct's members are accessed, the value is deemed null if either Struct or that specific member is null. Are there any circumstances where a child column needs to be read using its original validity setting? Or should we always mask out the child columns at the indices where the parent Struct is null? |
What does Arrow do? AFAIK, the children and parent have independent null masks. |
My comment referred to operating on a field within a struct. For example, take a struct S of three fields: x, y, z, and I want to do a SQL SELECT S.x + S.y. One way to support this is to update libcudf binops so it understands how to properly access structure fields (and somehow specify in the input parameters which field in the struct column is being referenced for each input). It would not be correct to pass the x and y columns directly to the binop function since some structure rows may be null and not reflected in the validity masks of either the x or y columns. At least in the short term, I think it would be nice if libcudf had a method to manifest a "struct field view" that allows a structure field column to be passed to existing, struct-oblivious libcudf functions. Building such a view would build a new validity mask by merging the validity of the field column with the validity of the parent struct column (via bitwise-AND), so the libcudf operation knows which rows in the column are null. I'm hoping we're able to re-use the data (not validity) of the original field column since we'd be passing a column_view for the input which doesn't need to own memory. But even if for some reason we have to make a full copy of the field column data to pull this off, at least we can open up a lot of libcudf functionality for queries that operate on structure fields. |
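The "struct field view" idea above can be sketched in plain C++ as follows. This is a minimal illustration, not libcudf code: `bitmask_word`, `field_view`, `merge_validity`, and `make_struct_field_view` are hypothetical names, and the sketch assumes 32-bit mask words with bit `i` of the mask describing row `i` (least-significant bit first). The data buffer is borrowed, and only the merged mask is newly allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration; these are not libcudf types.
using bitmask_word = std::uint32_t;

// Non-owning view over one struct field: borrows both data and validity.
struct field_view {
  const float* data;             // borrowed from the original field column
  const bitmask_word* validity;  // bit set => row is valid (assumed LSB-first)
  std::size_t size;

  bool is_valid(std::size_t row) const {
    return ((validity[row / 32] >> (row % 32)) & 1u) != 0;
  }
};

// A row is valid only if both the parent struct row and the field row are valid.
std::vector<bitmask_word> merge_validity(const std::vector<bitmask_word>& parent,
                                         const std::vector<bitmask_word>& field) {
  assert(parent.size() == field.size());
  std::vector<bitmask_word> merged(parent.size());
  for (std::size_t w = 0; w < parent.size(); ++w) {
    merged[w] = parent[w] & field[w];
  }
  return merged;
}

// The caller must keep `data` and `merged_mask` alive while the view is in use.
field_view make_struct_field_view(const std::vector<float>& data,
                                  const std::vector<bitmask_word>& merged_mask) {
  assert(merged_mask.size() * 32 >= data.size());
  return field_view{data.data(), merged_mask.data(), data.size()};
}
```

The lifetime comment is the crux: the view reuses the field's data without copying, but something must own the merged mask for as long as the view exists.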
Thanks, @jlowe. Permit me to rephrase: In the
This is brilliant. A struct field-view would be an elegant solution. (I'll have to figure out how to handle the column lifetimes, if I can't superimpose the struct's null-mask.) |
The relevant section from the Arrow spec:
In the context of a struct row, the struct's validity takes precedence. The examples listed in this issue's description illustrate the case. Re-reading the last sentence last night gave me pause. I can't think of a situation where a Struct's children are "treated independently". It is akin to extracting the child column of a CUDF list column. In implementation, I am considering superimposing the struct's validity on top of that of the child columns, at construction. This obviates having to do the same on each read. I'd like to be sure this wouldn't run afoul of what the CUDF team had in mind. |
I have been advised to also seek guidance from @trxcllnt and @andygrove, who are more knowledgeable about Apache Arrow than I am. The question at hand is whether it would be incorrect to superimpose the parent (i.e. Struct) validity buffer over all its child columns, at construction:
// For each child of the struct
child.validity &= struct.validity;
Effect: A child column's row-value is null if either parent struct or the child column itself has that index set to null. This might be acceptable, since the child column is adopted by the struct, and is (AFAICT) not processed outside the struct's context. |
Yes, in Arrow the struct parent and child nullmasks are independent. They're also independent in Lists + the other nested types (except Unions, but that was a recent change). @jrhemstad's observation that structs are a sort of table-within-a-table is more or less correct, format-wise. I vote not to deviate from Arrow in this way unless absolutely necessary, since deviations make the interop story more difficult. |
To also add, there's a logical and semantic difference between an element in a struct being
|
So to be clear we are not deviating from the arrow spec. With the example from @kkraus14 above the arrow spec states that the validity looks like the following (expanded out for simplicity).
The The issue is that if If we remove the ambiguity early on then pulling |
Has anyone tried just asking on the Arrow mailing list? |
Just closing the loop here with the results of yesterday's parley. The current plan is to superimpose (or "flatten") the struct's validity mask to all applicable children. In this initial run, the original child masks will not be preserved. The Struct owns the children, and uses them to represent struct members. If a use-case crops up where a user supplies the child-columns and expects them to be unmodified, we will revisit this decision:
For the moment, either might be a bridge too far. |
Per #5700, when a STRUCT column is constructed, the null mask of the parent column is bitwise-ANDed with that of all its children, such that a null row in the parent column corresponds to nulls in all its children. This is done recursively, allowing grand-child columns to also have nulls at the same row positions.
`superimpose_parent_nulls()` makes this functionality available for columns that might not have been constructed through `make_struct_column()`, e.g. with columns received directly from Arrow. It does not require that the `column_view` is modifiable. For a STRUCT `column_view` argument, a new equivalent instance is created, with all its children's null masks modified to account for the parent nulls.
`superimpose_parent_nulls()` can be used for all code that assumes that the child null masks account for the nulls in the parents (and grandparents, ad infinitum).
Authors:
- MithunR (https://github.com/mythrocks)
Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Conor Hoekstra (https://github.com/codereport)
URL: #9144
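The recursive behaviour described in this commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_column` and `superimpose_nulls` are hypothetical names, not the libcudf API. Validity is kept as one flag per row rather than a packed bitmask, and every child is assumed to be a struct field with the same row count as its parent (list children are ignored). As in the description, the input is left unmodified and a new equivalent tree is returned.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy column for illustration; not a libcudf type.
struct toy_column {
  std::vector<bool> validity;        // one flag per row, true = valid
  std::vector<toy_column> children;  // struct fields; empty for leaf columns
};

// Returns a copy of `input` in which every descendant's validity has been
// ANDed with the validity of all of its ancestors. `input` is not modified.
toy_column superimpose_nulls(const toy_column& input,
                             const std::vector<bool>* ancestor = nullptr) {
  toy_column result;
  result.validity = input.validity;
  if (ancestor != nullptr) {
    assert(ancestor->size() == input.validity.size());
    for (std::size_t row = 0; row < result.validity.size(); ++row) {
      result.validity[row] = input.validity[row] && (*ancestor)[row];
    }
  }
  result.children.reserve(input.children.size());
  for (const toy_column& child : input.children) {
    // The already-flattened validity is passed down, so grandparent nulls
    // reach grandchildren without a separate pass.
    result.children.push_back(superimpose_nulls(child, &result.validity));
  }
  return result;
}
```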
It would be good to have support for Struct in CUDF, with arbitrarily complex child fields. E.g.
This issue aims to document how Struct support might be added to CUDF, the behaviour of Struct columns in different scenarios, and possibly any related APIs. (The description of this issue is a work in progress, and will be updated based on discussion and feedback.)
Goals
column_factory
.struct<name:string, address:struct<street:string, city:string, state:string, po:int8>>
list<coordinates:struct<x:int8, y:int8, z:int8>>
struct <name:string, age:int8, gpa:float, addresses:list<string>>
cudf::gather(), to gather Struct rows.
cudf::test::column_wrapper.
Non-goals
Background
A Struct is a nested data-type containing an ordered sequence of child members ("fields"). It is similar to a plain-old-data struct in C++.
Each struct field could have a different data type, and can be arbitrarily complex or nested. E.g.:
Each of name, age, gpa, and addresses has a different data-type. addresses is itself a nested datatype.
Struct columns might contain lists/structs, or be themselves elements of a list, or members of a struct.
The in-memory layout of Struct columns in Apache Arrow is described in the Apache Arrow Format Specification. It is expected that the in-memory layout for CUDF Structs will be near identical.
Since the cudf::column implementation already provides support for child columns, it is expected that implementing Struct fields as child columns should be straightforward.
Assumptions and Risks
cudf::column already accommodates the notion of nested types via child columns. String and List column types already use child columns to store metadata (offset information) to manage the underlying data buffers. Using child columns to implement fields appears straightforward and logical.
cudf::gather() to gather Structs will simply involve delegation to the field columns.
Alternatives considered
None, really. The availability of child-columns in cudf::column makes this the straightforward approach.
Design
The in-memory layout of Struct columns in Apache Arrow is described in the Apache Arrow Format Specification. The layout of CUDF Structs will be near identical:
children.
cudf::column::children.
Struct row depends on collecting the corresponding rows of each child column, in their declared field order.
cudf::column::data will not point to any data.
struct<name:string, score:int8>, the following rows are valid:
{"Alpha", 1} // (Non-null name and score)
{"Bravo", null} // (Null score)
{null, 3} // (Null name)
null // (Null Struct value)
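The four rows above can be modelled on the host with `std::optional`, which makes the distinction between a null field and a null struct explicit. The following is a minimal sketch, not libcudf code: `name_score`, `struct_row`, `toy_struct_column`, and `to_columns` are hypothetical names. It decomposes rows into per-field child columns plus validity flags, and it assumes the "superimposed" convention discussed in the comments, where a null struct row also marks each child row as null.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical row type for struct<name:string, score:int8>.
struct name_score {
  std::optional<std::string> name;
  std::optional<std::int8_t> score;
};
using struct_row = std::optional<name_score>;  // nullopt = null Struct value

// Hypothetical columnar form: one validity vector for the struct itself,
// plus data and validity for each field (child column).
struct toy_struct_column {
  std::vector<bool> struct_valid;
  std::vector<std::string> names;
  std::vector<bool> name_valid;
  std::vector<std::int8_t> scores;
  std::vector<bool> score_valid;
};

toy_struct_column to_columns(const std::vector<struct_row>& rows) {
  toy_struct_column col;
  for (const struct_row& row : rows) {
    const bool struct_ok = row.has_value();
    // Assumption: a null struct row makes every field null at that row.
    const bool name_ok  = struct_ok && row->name.has_value();
    const bool score_ok = struct_ok && row->score.has_value();
    col.struct_valid.push_back(struct_ok);
    col.name_valid.push_back(name_ok);
    col.names.push_back(name_ok ? *row->name : std::string{});  // placeholder if null
    col.score_valid.push_back(score_ok);
    col.scores.push_back(score_ok ? *row->score : std::int8_t{0});  // placeholder if null
  }
  return col;
}
```

Note that the children keep one slot per row even where the struct is null, so all child columns stay the same length as the parent.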
Memory Layout Examples
Struct<f:float, i:int>:
Struct<f:float, ilist:list<int>>:
This example illustrates how the Struct's validity buffer overrides that of the child column:
{4.0, (5,6,null)} because:
float field has a valid value: 4.0
list field has a valid value: Offsets 3-6 in the underlying int column.
int sub-column has an invalid value at offset 5, and therefore returns (5,6,null).
X). But the Struct value at row 2 still returns null.
list field's validity indicates a null value at index 3, while the parent Struct does not. As a result, the Struct row at index 3 is read as {8.0,null}.
List< Struct< f:float, i:int > >:
This example illustrates how list<struct<int,...>> is analogous to list<int> columns. A List column's offset vector applies to a Struct element similarly to any primitive element. For instance, when composing the first list row as offsets 0-2 of the child column, the Struct column in turn collects elements 0-2 of each of its child columns (i.e. fields). Thus, the first List row is:
[ { 1.0, 1}, {2.0, 2} ]
Open questions
map data-type could be in terms of a list of struct< key:KeyType, value:ValueType >. A Map column might simply have a single child list column, which holds a single struct column, which in turn has two child columns: key, and value.
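To make the map-as-list-of-structs idea concrete, here is a small host-side sketch. It is purely illustrative of the open question and not a proposed API: `toy_map_column` and `map_lookup` are hypothetical names, keys are fixed to `std::string` and values to `int`, offsets are assumed half-open, and validity is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical map<string, int> column stored as list<struct<key, value>>.
struct toy_map_column {
  std::vector<std::size_t> offsets;  // list level: row r spans [offsets[r], offsets[r + 1])
  std::vector<std::string> keys;     // struct child column "key"
  std::vector<int> values;           // struct child column "value"
};

// Linear scan over the entries of one map row; returns the first match.
std::optional<int> map_lookup(const toy_map_column& col,
                              std::size_t row,
                              const std::string& key) {
  assert(row + 1 < col.offsets.size());
  assert(col.keys.size() == col.values.size());
  for (std::size_t e = col.offsets[row]; e < col.offsets[row + 1]; ++e) {
    if (col.keys[e] == key) { return col.values[e]; }
  }
  return std::nullopt;
}
```

Nothing in this layout enforces key uniqueness within a row; that would have to be a convention of the Map type rather than a property of the list or struct columns.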