[Discussion] Requirements for schema/column names #3225

jrhemstad · 2019-10-26T14:33:05Z

There have been a number of requests related to adding column names, either to the columns themselves and/or to tables and their views.

libcudf internals don't use column names, so we need requirements to be driven by users that will make use of the names (cuIO/Spark/cuDF).

For those who need column names, please discuss what you would like to see for column names.

CC @kkraus14 @revans2 @jlowe @j-ieong @shwina


OlivierNV · 2019-10-26T22:42:05Z

I suppose it depends a lot on how structs are being implemented: if the relationship between columns (eg: is a column a field of a struct) is stored in a separate schema structure while data is stored in a flat array of columns, then it would make sense to store the name in the schema structure instead of the columns. However, if a column can have child columns (a tree, where fields of a struct are children of another column), then the name should be part of the column (for the name itself, I don't think we need anything besides a simple std::string).

jlowe · 2019-10-28T12:56:09Z

Spark tracks names in a schema structure that is separate from the data. Names are associated with structs (i.e.: a struct knows the names of the fields within it, but leaf types do not know nor care what their name is.)

That being said, the Spark plugin should be able to adapt pretty easily to wherever the names are placed (separate schema struct, only in struct columns, or columns know their own names). We just need some way for the loaders to convey the schema that was loaded so we can align that with the schema Spark expects.

kkraus14 · 2019-10-28T19:38:44Z

From the Python side, columns can be named outside of tables, and those names are expected to be maintained through libcudf function calls. We could maintain this ourselves, but it would be more ideal if columns had the ability to have a name outside of a separate struct.

jrhemstad · 2019-10-28T20:12:52Z

> those names are expected to be maintained through libcudf function call

Can you explain more about what you mean about "expected to be maintained"?

kkraus14 · 2019-10-28T21:31:19Z

Basically if I call something like gather against a cudf::column with name a, I'd expect the output column to have its name set to a.

harrism · 2019-10-30T04:00:58Z

What does SQL do?

harrism · 2019-10-30T04:01:46Z

This is a blocker for cuIO porting to libcudf++ currently, so we need to come to an agreement.

jlowe · 2019-10-30T14:30:07Z

> What does SQL do?

It's not specified. SQL has names associated with table columns but the SQL implementation can choose how to track it. Spark happens to track it in a separate schema.

If it's easier for the Python side, I'm fine with keeping the name of a column associated directly with a cudf::column. The Spark plugin will likely not examine these names except after calling a loader to map what was loaded to Spark's expected schema.

If we go with names in columns, the only issue will be making sure libcudf knows when to propagate column names. For example, does the column name propagate on a filter? groupby? join? All the places where the code was copying names when allocating gdf_column outputs need to be covered in the corresponding libcudf methods. Otherwise the Cython code would be responsible for propagating the name into the output column after calling the libcudf method.

jrhemstad · 2019-10-30T14:45:36Z

> If we go with names in columns, the only issue will be making sure libcudf knows when to propagate column names. For example, does the column name propagate on a filter? groupby? join? All the places where the code was copying names when allocating gdf_column outputs need to be covered in the corresponding libcudf methods. Otherwise the Cython code would be responsible for propagating the name into the output column after calling the libcudf method.

This is exactly my concern with putting the name directly inside cudf::column. It will introduce a lot of new complexity in libcudf around when and if to propagate the name from input to output. Likewise, the name propagation behavior may be different between Spark/Python. I'd really prefer this to be in the domain of user responsibility.

My vote would be for cuIO to return a simple wrapper around a cudf::column like:

struct named_column {
   std::string name;             // name reported by the cuIO reader
   std::unique_ptr<column> col;  // the loaded column, which itself carries no name
};

Then it's up to the user to do whatever they want to do with the name.
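
To make the "user responsibility" idea concrete, here is a minimal sketch of a caller that decides for itself to carry a name across an operation. The `column` type and the `gather` and `gather_keep_name` functions are hypothetical stand-ins written for this illustration, not libcudf APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Stand-in for cudf::column so the sketch compiles on its own (assumption).
struct column {
  std::vector<int> data;
};

// The wrapper from the proposal: the name sits next to the column, not in it.
struct named_column {
  std::string name;
  std::unique_ptr<column> col;
};

// Hypothetical name-agnostic compute function: it never sees a name.
std::unique_ptr<column> gather(column const& in,
                               std::vector<std::size_t> const& map) {
  auto out = std::make_unique<column>();
  out->data.reserve(map.size());
  for (std::size_t i : map) {
    assert(i < in.data.size());  // gather indices must be in range
    out->data.push_back(in.data[i]);
  }
  return out;
}

// Caller-side policy: this particular client chooses to keep the input name.
// Another client could just as well drop or rewrite it.
named_column gather_keep_name(named_column const& in,
                              std::vector<std::size_t> const& map) {
  return named_column{in.name, gather(*in.col, map)};
}
```

The propagation rule lives entirely in `gather_keep_name`, so a Python or Spark binding could substitute a different rule without touching the compute function.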

OlivierNV · 2019-10-30T15:09:36Z

I'm not sure I understand how adding a name in the column introduces new problems: don't we have the column name in the column today (in the legacy gdf_column structure)?
I definitely would prefer having an array of column names in a table schema over a named_column workaround for cudf shortcomings.

jrhemstad · 2019-10-30T15:42:58Z

> I'm not sure I understand how adding a name in the column introduces new problems: don't we have the column name in the column today (in the legacy gdf_column structure)?

Yes, but it's 100% ignored in every libcudf compute function.

It introduces new problems if we're expected to do anything with the name, like propagate it from input to output.

kkraus14 · 2019-10-30T22:03:15Z

@shwina and I had a discussion about this on the Python side and decided it's probably a better bet if we just handle this on the Python side. Columns can have arbitrary hashable Python objects as names (unfortunately) so it will likely be easier for us to just manage it on our own.

@j-ieong @OlivierNV for the io readers / writers would it suffice to just pass a std::vector<std::string> for column names in addition to the cudf::table?

jlowe · 2019-10-30T22:13:08Z

I think a string vector will be problematic when we support nested types. For example, if one of the loaded columns is named 'A' and is a structure of fields 'B', 'C', and 'D', I assume we would want a way to convey these nested names under 'A' back to the caller. There would need to be a tree-like schema descriptor returned to describe what was loaded in the general case. Or will we only be interested in top-level column names?

kkraus14 · 2019-10-30T22:18:33Z

> For example, if one of the loaded columns is named 'A' and is a structure of fields 'B', 'C', and 'D', I assume we would want a way to convey these nested names under 'A' back to the caller.

Good point, I didn't really think through this other than we'd be happy with a separate object as opposed to baking the names into the column objects.

OlivierNV · 2019-10-30T23:43:41Z

Personally, I'd vote for just adding a std::string to the new column class, and have compute functions just return unnamed columns (except perhaps for the copy constructor), where an unnamed column is just a column with an empty string as its name. (From the writer's point of view, it's better to have empty strings than duplicate made-up names.) We could always change this later if we somehow need all column names consolidated in one object, and it pretty much matches the current behavior, so one less surprise for people trying to switch from cudf-0.10 to cudf-0.11.

jrhemstad · 2019-10-30T23:51:12Z

> Personally, I'd vote for just adding a std::string to the new column class, and have compute functions just return unnamed column

I'm not a fan of this approach. If a column is going to have a name, it should be meaningful. If most columns are just going to have empty names, then it isn't meaningful and muddies the design of column/column_view for convenience rather than because it is the right thing to do.

> pretty much matches the current behavior

Very little of libcudf's current behavior is worth emulating :)

harrism · 2019-10-31T00:12:15Z

A cuIO loader could return a schema corresponding to the loaded column(s). The schema node object would just have a string name and a vector of child nodes, mirroring how children are handled in cudf::column.

One approach would be to return a separate cudf::table and a cudf::table_schema:

struct schema_node {
   std::string name;
   std::vector<schema_node*> children;  // child nodes (nested fields)
};

struct table_schema {
   std::string name;
   std::vector<schema_node> columns;  // one node per top-level column
};

Or if tables don't need names, could just return a vector of schema_node.
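
As a rough illustration of how a caller might consume such a schema, the sketch below builds the struct column 'A' with fields 'B', 'C', and 'D' from the earlier example and looks a field up by path. Holding children by value and the `find_path` and `make_example` helpers are assumptions made for this example, not part of the proposal.

```cpp
#include <string>
#include <vector>

// Name-only schema node. Children are held by value here, which is an
// assumption (the proposal above uses pointers). A std::vector of an
// incomplete element type is guaranteed to work from C++17 on.
struct schema_node {
  std::string name;
  std::vector<schema_node> children;
};

// Hypothetical helper: follow a path of field names down from `root`.
// Returns nullptr if any step of the path is missing.
schema_node const* find_path(schema_node const& root,
                             std::vector<std::string> const& path) {
  schema_node const* cur = &root;
  for (auto const& step : path) {
    schema_node const* next = nullptr;
    for (auto const& child : cur->children) {
      if (child.name == step) {
        next = &child;
        break;
      }
    }
    if (next == nullptr) return nullptr;
    cur = next;
  }
  return cur;
}

// Struct column "A" with fields "B", "C", and "D".
schema_node make_example() {
  schema_node a{"A", {}};
  for (char const* field : {"B", "C", "D"}) {
    a.children.push_back(schema_node{field, {}});
  }
  return a;
}
```

With this shape, `find_path(make_example(), {"C"})` returns the node for field 'C', and an empty path returns the root itself.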

OlivierNV · 2019-10-31T00:25:21Z

Isn't that structure eerily similar to the cudf columns that can have child columns for nesting?
(the schema is basically just the column relationship and (optional) names afaik).

The client or python side can handle fancy name additions, but I would think that read_orc/write_orc would preserve the name without asking the client to enumerate and keep track of schema separately from columns.

harrism · 2019-10-31T01:32:44Z

Yes, it is intentionally similar. The concern expressed in the discussion above is that if we put the name in the column, libcudf's computation algorithms will be required to propagate the name, and the semantics of that propagation would then need to be clearly defined in a way that is agreeable to all clients (which may or may not be possible).

I suppose we could store the name in the column but ignore it in libcudf, but if we are doing that, what's the point?

I was attempting to suggest a compromise solution: since the only APIs in cuDF that need to set the name are file loader APIs, why not define a structure that only cuIO APIs produce?

OlivierNV · 2019-10-31T01:55:53Z

I'm not sure what the benefits are of adding the extra complexity of an additional table/column class hierarchy to manage (the way I see it, the name would either need to go in the cudf table or in the cudf column).
An application that, say, wants to read a csv, do some processing, then write back a parquet would have to get the cudf table from the cuio table, do some processing to get a cudf table result, then call some other conversion to create a new cuio table from the cudf table so that it can be passed to the cuio writer (I'm not trying to make it sound ridiculous on purpose).

harrism · 2019-10-31T02:12:07Z

The advantage is not having to deal with column names inside libcudf, since both Python and Spark have said they need to track the names separately from the data.

@jlowe:

> Spark tracks names in a schema structure that is separate from the data.

@kkraus14:

> @shwina and I had a discussion about this on the Python side and decided it's probably a better bet if we just handle this on the Python side. Columns can have arbitrary hashable Python objects as names (unfortunately) so it will likely be easier for us to just manage it on our own.

OlivierNV · 2019-10-31T03:10:48Z

> both Python and Spark have said they need to track the names separately from the data.

That's nice, but the whole point of libcudf++ is to present a C++ API (at least that's my understanding), so it seems we need a minimum amount of schema support within libcudf.
The way I see it, the name is part of the schema, and the two options for the name are to have it either in the column or in the table.
However, a decision has already been made in libcudf++ to have the schema stored within the column classes themselves (children members) rather than in the table, so it seems to be simpler to stick the name there as well. Python and Spark layers can always track their own name hierarchy separately if it's easier there. FWIW, in both ORC & parquet, a dataframe is basically a single root struct column whose children are the df columns, which seems quite similar to libcudf++ (writers will typically name first-level unnamed columns as "_col0", "_col1" etc).
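
The default-naming convention mentioned above can be sketched as follows. This is only an illustration: `fill_default_names` is a hypothetical helper, and numbering by column position is an assumption about how a writer would pick the suffix.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Replace empty (unnamed) top-level column names with "_col<i>", where i is
// the column's position. Names that are already set are left untouched.
std::vector<std::string> fill_default_names(std::vector<std::string> names) {
  for (std::size_t i = 0; i < names.size(); ++i) {
    if (names[i].empty()) {
      names[i] = "_col" + std::to_string(i);
    }
  }
  return names;
}
```

For example, the input `{"a", "", "c"}` comes back as `{"a", "_col1", "c"}`.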

OlivierNV · 2019-10-31T19:17:14Z

The cudf column is not an internal type to libcudf since it is exposed through the API, so this "I don't need it therefore nobody needs it" argument does not apply.
Requiring client applications to maintain a completely separate schema hierarchy just to keep track of names is a problem. So is requiring applications that do care about column names to use a different column structure depending on whether their dataframe came from a file source or was generated from scratch.
Interoperability with arrow would presumably require column names as well (arrow columns have names).

One alternative solution that does not require column names in libcudf++ and can be easily implemented for 0.11 while remaining simple for applications migrating from cudf-0.10 would be to keep io readers/writers using the old legacy interface and provide conversion from/to libcudf++ column types (it also saves us a lot of duplicate code paths for "legacy"/"non-legacy").

jrhemstad · 2019-10-31T19:25:44Z

> One alternative solution that does not require column names in libcudf++ and can be easily implemented for 0.11 while remaining simple for applications migrating from cudf-0.10 would be to keep io readers/writers using the old legacy interface and provide conversion from/to libcudf++ column types (it also saves us a lot of duplicate code paths for "legacy"/"non-legacy").

This is 100% a non-starter.

jrhemstad · 2019-10-31T19:33:03Z

> maintain a completely separate schema hierarchy just to keep track of names is a problem.

This is literally what Arrow does.

https://arrow.apache.org/docs/cpp/classarrow_1_1_schema.html

All we are suggesting is to make a class like this:

class schema_and_table{
  Schema schema;
  cudf::table t;
};

OlivierNV · 2019-10-31T20:18:16Z

> This is literally what Arrow does.

Not quite: you'll notice that unlike libcudf++, arrow keeps the column schema hierarchy separate from the columns (Somebody in libcudf++ already made the decision to not do that and put the schema hierarchy within the columns). But even with a separate schema, arrow still stores the column names within the columns themselves.
https://arrow.apache.org/docs/cpp/classarrow_1_1_column.html
This is the arrow column constructor:
arrow::Column::Column(const std::string &name, const std::shared_ptr &data)

jrhemstad · 2019-10-31T20:27:43Z

arrow::Column was removed entirely in favor of arrow::Array and arrow::ChunkedArray

Notice that they don't have any names.

> Somebody in libcudf++ already made the decision to not do that and put the schema hierarchy within the columns

Seeing as though libcudf++ doesn't have a schema, I have no idea what you are talking about.

cudf::column is most similar to arrow::ArrayData, which has no schema, no name. Just size, type, data, nullmask, and children.

OlivierNV · 2019-10-31T20:35:29Z

> Seeing as though libcudf++ doesn't have a schema, I have no idea what you are talking about.

There is: you have columns with child columns: that tree structure is the schema.

jrhemstad · 2019-10-31T20:37:49Z

> arrow keeps the column schema hierarchy separate from the columns

> arrow::ArrayData, which has no schema, no name. Just size, type, data, nullmask, and children.

> There is: you have columns with child columns: that tree structure is the schema.

Something isn't adding up here.

OlivierNV · 2019-10-31T20:54:17Z

If the goal is to make libcudf++ cudf::column/cudf::table similar to the low-level arrow::ChunkedArray instead of arrow::Column/arrow::Table, then the decision to not have any column names in cudf has already been made, and I'm not sure what's up for discussion.

jrhemstad · 2019-10-31T21:08:09Z

> not sure what's up for discussion.

In the same way that Arrow has more than just an arrow::ArrayData class, libcudf/cuIO is allowed to have more than just cudf::table/cudf::column.

What's up for discussion is creating a new data structure something like this:

class schema_and_table{
  Schema schema;
  cudf::table t;
};

OlivierNV · 2019-10-31T21:24:27Z

I'll let people more involved with these things than I am comment, but I'm not sure what would be left in the "schema" field of that structure besides column names if nesting information is already contained within cudf::table.

harrism · 2019-11-26T23:34:15Z

I think the conclusion (after some off-github discussion) is to store the column name in cudf::column, and make it accessible from there as well as in cudf::column_view.

My reasoning is that this is the simplest way forward. libcudf does not have plans currently to make any guarantees to propagate column names through its algorithms, other than in cuIO readers and writers.

harrism · 2019-12-06T02:58:36Z

Had further discussion with @jrhemstad on this today, and he convinced me against my previous conclusion. Basically, the name is information that does not need to be carried in the columns. Analogy: if an application of cudf::column needed to always know the min and max elements in each column, we would not want to add those fields to the column itself, since they are not needed by the internals of libcudf. Instead, they would be stored in some wrapper data structure. Names are similar.

We hear the argument about the fact that the structure of nested columns already mirrors the schema, but that structure is a necessary part of operating on the data in the columns. The name isn't.

Moreover, multiple libcudf clients have told us that they need to maintain an external schema anyway, and that having the names in the column would cause them extra work to maintain consistency between two sources of truth.

We think it's best to define a schema structure that stores the names, and use this in the cuIO readers/writers.

OlivierNV · 2019-12-06T18:40:22Z

We'll just return an array of strings for the names in readers and take in an array of names for the writer. No need to require the client to maintain a third schema structure.

harrism · 2019-12-07T00:05:21Z

When we support nested columns this will not be sufficient.

mjsamoht · 2019-12-07T00:27:43Z

Yes, once we support nested columns we will have to revisit.

OlivierNV · 2019-12-07T18:09:06Z

A std::string vector will in fact be sufficient to return column names regardless of nesting support (the [unnecessary] complexity of having to handle a separate flattened schema introduced by not having names within the cudf column tree structure falls on the user/bindings).

harrism · 2019-12-08T22:07:08Z

How is it sufficient to have a flat vector of names if the columns themselves contain columns?

OlivierNV · 2019-12-08T22:22:03Z

@harrism The same way it is done in parquet/orc format, where you need to serialize a tree into a file: a 'flattened' version of the column tree, with a left-to-right traversal convention (it will have to be clearly documented once we add nesting), though this depends on how cudf will implement nesting (with a separate schema, one would expect the top-level column array in the table to already be a flattened tree, though it looks like cudf is likely to go for the tree schema representation where fields of a struct are children of the parent struct column)

mjsamoht · 2019-12-09T19:29:52Z

For example, for the following tree the column names could be stored in a vector as {"1", "2", "4", "6", "7", "5", "3"}.
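
The tree diagram referenced here is not preserved in this text. As a sketch only, the code below builds one tree shape that is consistent with the listed order (node 1 with children 2 and 3, node 2 with children 4 and 5, node 4 with children 6 and 7; this shape is an assumption) and flattens it with a pre-order, left-to-right traversal. The `name_node` type and both functions are hypothetical.

```cpp
#include <string>
#include <vector>

// Minimal named tree node used only for this illustration.
struct name_node {
  std::string name;
  std::vector<name_node> children;
};

// Pre-order traversal: visit the node first, then its children left to right.
void flatten_preorder(name_node const& node, std::vector<std::string>& out) {
  out.push_back(node.name);
  for (auto const& child : node.children) {
    flatten_preorder(child, out);
  }
}

std::vector<std::string> flatten(name_node const& root) {
  std::vector<std::string> names;
  flatten_preorder(root, names);
  return names;
}

// Assumed tree shape: 1 -> (2, 3), 2 -> (4, 5), 4 -> (6, 7).
name_node example_tree() {
  name_node n6{"6", {}}, n7{"7", {}}, n5{"5", {}}, n3{"3", {}};
  name_node n4{"4", {n6, n7}};
  name_node n2{"2", {n4, n5}};
  return name_node{"1", {n2, n3}};
}
```

Calling `flatten(example_tree())` yields the names in the order 1, 2, 4, 6, 7, 5, 3.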

mjsamoht · 2019-12-09T22:13:42Z

I don't know who uses the column names or how they are used. But does anyone have objections to storing the names in a flat vector? If so, can you please explain what the problem is with a vector?

jlowe · 2019-12-09T22:32:45Z

As long as it's clearly documented how to map the nested structure to a flat list of names (e.g.: pre-order traversal as in the example above), I'm cool with using a vector of names rather than an explicit tree-like structure.
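
A sketch of the reverse mapping under that convention: given a column tree whose shape is already known and a flat list of names in pre-order, assign each name to its node. The `col_node` type and `assign_names` function are hypothetical; the point is that the flat list is only unambiguous because the tree shape supplies the child counts.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a nested column tree; only the shape and a name slot matter.
struct col_node {
  std::string name;
  std::vector<col_node> children;
};

// Walk the tree in pre-order and hand out names in list order.
// `next` is the index of the first unused name; the return value is the
// index after the last name consumed by this subtree.
std::size_t assign_names(col_node& node,
                         std::vector<std::string> const& names,
                         std::size_t next = 0) {
  assert(next < names.size());  // the list must cover every node
  node.name = names[next++];
  for (auto& child : node.children) {
    next = assign_names(child, names, next);
  }
  return next;
}
```

A caller can compare the returned index with `names.size()` to detect a list that is longer than the tree.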

OlivierNV · 2019-12-10T00:14:51Z

The ORC specification has some nice graphics in the "Type Information" section.

jrhemstad added the feature request, Python, cuIO, Spark, and libcudf++ labels Oct 26, 2019

jrhemstad assigned j-ieong Oct 26, 2019

jrhemstad mentioned this issue Oct 26, 2019

[REVIEW] Define and implement new join APIs #3224

Merged

harrism unassigned j-ieong Dec 6, 2019

harrism assigned OlivierNV Dec 11, 2019

OlivierNV closed this as completed Dec 12, 2019

hyperbolic2346 mentioned this issue Dec 4, 2020

[FEA] Add Apache Arrow schema for parquet writing #6862

Closed
