Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discussion] Requirements for schema/column names #3225

Closed
jrhemstad opened this issue Oct 26, 2019 · 50 comments
Closed

[Discussion] Requirements for schema/column names #3225

jrhemstad opened this issue Oct 26, 2019 · 50 comments
Assignees
Labels
cuIO cuIO issue feature request New feature or request Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS

Comments

@jrhemstad
Copy link
Contributor

There have been a number of requests related to adding column names, either to the column's themselves and/or to tables and their views.

libcudf internals don't use column names, so we need requirements to be driven by users that will make use of the names (cuIO/Spark/cuDF).

For those who need column names, please discuss what you would like to see for column names.

CC @kkraus14 @revans2 @jlowe @j-ieong @shwina

@jrhemstad jrhemstad added feature request New feature or request Python Affects Python cuDF API. cuIO cuIO issue Spark Functionality that helps Spark RAPIDS libcudf++ labels Oct 26, 2019
@OlivierNV
Copy link
Contributor

I suppose it depends a lot on how structs are being implemented: if the relationship between columns (eg: is a column a field of a struct) is stored in a separate schema structure while data is stored in a flat array of columns, then it would make sense to store the name in the schema structure instead of the columns. However, if a column can have child columns (a tree, where fields of a struct are children of another column), then the name should be part of the column (for the name itself, I don't think we need anything besides a simple std::string).

@jlowe
Copy link
Member

jlowe commented Oct 28, 2019

Spark tracks names in a schema structure that is separate from the data. Names are associated with structs (i.e.: a struct knows the names of the fields within it, but leaf types do not know nor care what their name is.)

That being said, the Spark plugin should be able to adapt pretty easily to wherever the names are placed (separate schema struct, only in struct columns, or columns know their own names). We just need some way for the loaders to convey the schema that was loaded so we can align that with the schema Spark expects.

@kkraus14
Copy link
Collaborator

From the Python side, columns can be named outside of tables, and those names are expected to be maintained through libcudf function calls. We could maintain this ourselves, but it would be more ideal if columns had the ability to have a name outside of a separate struct.

@jrhemstad
Copy link
Contributor Author

those names are expected to be maintained through libcudf function call

Can you explain more about what you mean about "expected to be maintained"?

@kkraus14
Copy link
Collaborator

Basically if I call something like gather against a cudf::column with name a, I'd expect the output column to have its name set to a.

@harrism
Copy link
Member

harrism commented Oct 30, 2019

What does SQL do?

@harrism
Copy link
Member

harrism commented Oct 30, 2019

This is a blocker for cuIO porting to libcudf++ currently, so we need to come to an agreement.

@jlowe
Copy link
Member

jlowe commented Oct 30, 2019

What does SQL do?

It's not specified. SQL has names associated with table columns but the SQL implementation can choose how to track it. Spark happens to track it in a separate schema.

If it's easier for the Python side, I'm fine with keeping the name of a column associated directly with a cudf::column. The Spark plugin will likely not examine these names except after calling a loader to map what was loaded to Spark's expected schema.

If we go with names in columns, only issue will be making sure libcudf knows when to propagate column names. For example, does the column name propagate on a filter? groupby? join? All the places where the code was copying names when allocating gdf_column outputs need to be covered in the corresponding libcudf methods. Otherwise the Cython code would be responsible for propagating the name into the output column after calling the libcudf method.

@jrhemstad
Copy link
Contributor Author

If we go with names in columns, only issue will be making sure libcudf knows when to propagate column names. For example, does the column name propagate on a filter? groupby? join? All the places where the code was copying names when allocating gdf_column outputs need to be covered in the corresponding libcudf methods. Otherwise the Cython code would be responsible for propagating the name into the output column after calling the libcudf method.

This is exactly my concern with putting the name directly inside cudf::column. It will introduce a lot of new complexity in libcudf around when and if to propagate the name from input to output. Likewise, the name propagation behavior may be different between Spark/Python. I'd really prefer this to be in the domain of user responsibility.

My vote would be for cuIO to return a simple wrapper around a cudf::column like:

class named_column{
   std::string name;
   std::unique_ptr<column> col;
};

Then it's up to the user to do whatever they want to do with the name.

@OlivierNV
Copy link
Contributor

I'm not sure I understand how adding a name in the column introduces new problems: don't we have the column name in the column today (in the legacy gdf_column structure) ?
I definitely would prefer having an array of column names in a table schema over a named_column WAR for cudf shortcomings.

@jrhemstad
Copy link
Contributor Author

jrhemstad commented Oct 30, 2019

I'm not sure I understand how adding a name in the column introduces new problems: don't we have the column name in the column today (in the legacy gdf_column structure) ?

Yes, but it's 100% ignored in every libcudf compute function.

It introduces new problems if we're expected to do anything with the name, like propagate it from input to output.

@kkraus14
Copy link
Collaborator

@shwina and I had a discussion about this on the Python side and decided it's probably a better bet if we just handle this on the Python side. Columns can have arbitrary hashable Python objects as names (unfortunately) so it will likely be easier for us to just manage it on our own.

@j-ieong @OlivierNV for the io readers / writers would it suffice to just pass a std::vector<string> for column names in addition to the cudf::Table?

@jlowe
Copy link
Member

jlowe commented Oct 30, 2019

I think a string vector will be problematic when we support nested types. For example, if one of the loaded columns is named 'A' and is a structure of fields 'B', 'C', and 'D', I assume we would want a way to convey these nested names under 'A' back to the caller. There would need to be a tree-like schema descriptor returned to describe what was loaded in the general case. Or will we only be interested in top-level column names?

@kkraus14
Copy link
Collaborator

For example, if one of the loaded columns is named 'A' and is a structure of fields 'B', 'C', and 'D', I assume we would want a way to convey these nested names under 'A' back to the caller.

Good point, I didn't really think through this other than we'd be happy with a separate object as opposed to baking the names into the column objects.

@OlivierNV
Copy link
Contributor

Personally, I'd vote for just adding a std::string to the new column class, and have compute functions just return unnamed column (except perhaps for the copy constructor), where an unnamed column is just a column with an empty string as its name. (from writer point of view, it's better to have empty strings than duplicate made-up names). We could always change this later if we somehow need all column names consolidated in one object, and it pretty much matches the current behavior, so one less surprise for people trying to switch from cudf-0.10 to cudf-0.11

@jrhemstad
Copy link
Contributor Author

Personally, I'd vote for just adding a std::string to the new column class, and have compute functions just return unnamed column

I'm not a fan of this approach. If a column is going to have a name, it should be meaningful. If most columns are just going to have empty names, then it isn't meaningful and muddies the design of column/column_view for convenience rather than because it is the right thing to do.

pretty much matches the current behavior

Very little of libcudf's current behavior is worth emulating :)

@harrism
Copy link
Member

harrism commented Oct 31, 2019

A cuIO loader could return a schema corresponding to the loaded column(s). The schema node object would just have a string name and a vector of child nodes, mirroring how children are handled in cudf::column.

One approach would be to return a separate cudf::table and a cudf::table_schema:

struct table_schema {
   std::string name;
   std::vector<schema_node> columns
};

struct schema_node {
   std::string name;
   std::vector<schema_node*> children;
};

Or if tables don't need names, could just return a vector of schema_node.

@OlivierNV
Copy link
Contributor

Isn't that structure eerily similar to the cudf columns that can have child columns for nesting ?
(the schema is basically just the column relationship and (optional) names afaik).

The client or python side can handle fancy name additions, but I would think that read_orc/write_orc would preserve the name without asking the client to enumerate and keep track of schema separately from columns.

@harrism
Copy link
Member

harrism commented Oct 31, 2019

Yes, it is intentionally similar. The concern expressed in above discussion is that if we put the name in the column that libcudf's computation algorithms will be required to propagate the name, and the semantics of that propagation would then need to be clearly defined in a way that is agreeable to all clients (which may or may not be possible?).

I suppose we could store the name in the column but ignore it in libcudf, but if we are doing that, what's the point?

I was attempting to suggest a compromise solution: since the only APIs in cuDF that need to set the name are file loader APIs, why not define a structure that only cuIO APIs produce?

@OlivierNV
Copy link
Contributor

I'm not sure what are the benefits of adding the extra complexity of an additional table/column class hierarchy to manage (the way I see it, the name would either need to go in the cudf table or in cudf column).
For an application that say wants to read a csv, do some processing, then write back a parquet, it would then have to get the cudf table from the cuio table, do some processing, getting a cudf table result, then call some other conversion to create a new cuio table from a cudf table so that it can be pass to the cuio writer (I'm not trying to make it sound ridiculous on purpose).

@harrism
Copy link
Member

harrism commented Oct 31, 2019

The advantage is not having to deal with column names inside libcudf, since both Python and Spark have said they need to track the names separately from the data.

@jlowe:

Spark tracks names in a schema structure that is separate from the data.

@kkraus14:

@shwina and I had a discussion about this on the Python side and decided it's probably a better bet if we just handle this on the Python side. Columns can have arbitrary hashable Python objects as names (unfortunately) so it will likely be easier for us to just manage it on our own.

@OlivierNV
Copy link
Contributor

both Python and Spark have said they need to track the names separately from the data.

That's nice, but the whole point of libcudf++ is to present a C++ API (at least that's my understanding), so it seems we need a minimum amount of schema support within libcudf.
The way I see it, the name is part of the schema, and the two options for the name is to have it either in the column or in the table.
However, a decision has already been made in libcudf++ to have the schema stored within the column classes themselves (children members) rather than in the table, so it seems to be simpler to stick the name there as well. Python and Spark layers can always track their own name hierarchy separately if it's easier there. FWIW, in both ORC & parquet, a dataframe is basically a single root struct column whose children are the df columns, which seems quite similar to libcudf++ (writers will typically name first-level unnamed columns as "_col0", "_col1" etc).

@OlivierNV
Copy link
Contributor

The cudf column is not an internal type to libcudf since it is exposed through the API, so this "I don't need it therefore nobody needs it" argument does not apply.
Requiring client applications to maintain a completely separate schema hierarchy just to keeping track of names is a problem. So is requiring applications that do care about column names to use a different column structure depending if their dataframe came from a file source or was generated from scratch.
Interoperability with arrow would presumably require column names as well (arrow columns have names)

One alternative solution that does not require column names in libcudf++ and can be easily implemented for 0.11 while remaining simple for applications migrating from cudf-0.10 would be to keep io readers/writers using the old legacy interface and provide conversion from/to libcudf++ column types (also saves use a lot of duplicate code paths for "legacy"/"non-legacy")

@jrhemstad
Copy link
Contributor Author

One alternative solution that does not require column names in libcudf++ and can be easily implemented for 0.11 while remaining simple for applications migrating from cudf-0.10 would be to keep io readers/writers using the old legacy interface and provide conversion from/to libcudf++ column types (also saves use a lot of duplicate code paths for "legacy"/"non-legacy")

This is 100% a non-starter.

@jrhemstad
Copy link
Contributor Author

jrhemstad commented Oct 31, 2019

maintain a completely separate schema hierarchy just to keeping track of names is a problem.

This is literally what Arrow does.

https://arrow.apache.org/docs/cpp/classarrow_1_1_schema.html

All we are suggesting is to make a class like this:

class schema_and_table{
  Schema schema;
  cudf::table t;
};

@OlivierNV
Copy link
Contributor

This is literally what Arrow does.

Not quite: you'll notice that unlike libcudf++, arrow keeps the column schema hierarchy separate from the columns (Somebody in libcudf++ already made the decision to not do that and put the schema hierarchy within the columns). But even with a separate schema, arrow still stores the column names within the columns themselves.
https://arrow.apache.org/docs/cpp/classarrow_1_1_column.html
This is the arrow column constructor:
arrow::Column::Column(const std::string &name, const std::shared_ptr &data)

@jrhemstad
Copy link
Contributor Author

arrow::Column was removed entirely in favor of arrow::Array and arrow::ChunkedArray

Notice that they don't have any names.

Somebody in libcudf++ already made the decision to not do that and put the schema hierarchy within the columns

Seeing as though libucdf++ doesn't have a schema, I have no idea what you are talking about.

cudf::column is most similar to arrow::ArrayData, which has no schema, no name. Just size, type, data, nullmask, and children.

@OlivierNV
Copy link
Contributor

Seeing as though libucdf++ doesn't have a schema, I have no idea what you are talking about.

There is: you have columns with child columns: that tree structure is the schema.

@jrhemstad
Copy link
Contributor Author

jrhemstad commented Oct 31, 2019

arrow keeps the column schema hierarchy separate from the columns

arrow::ArrayData, which has no schema, no name. Just size, type, data, nullmask, and children.

There is: you have columns with child columns: that tree structure is the schema.

Something isn't adding up here.

@OlivierNV
Copy link
Contributor

If the goal is to make libcudf++ cudf::column/cudf::table similar to low-level arrow::ChunkedArray instead of arrow::column/arrow::table then the decision to not have any column names in cudf has already been made, not sure what's up for discussion.

@jrhemstad
Copy link
Contributor Author

not sure what's up for discussion.

In the same way that Arrow has more than just an arrow::ArrayData class, libcudf/cuIO is allowed to have more than just cudf::table/cudf::column.

What's up for discussion is creating a new data structure something like this:

class schema_and_table{
  Schema schema;
  cudf::table t;
};

@OlivierNV
Copy link
Contributor

I'll let people more involved with these things than I am comment, but I'm not sure what would be left in the "schema" field of that structure besides column names if nesting information is already contained within cudf::table.

@harrism
Copy link
Member

harrism commented Nov 26, 2019

I think the conclusion (after some off-github discussion) is to store the column name in cudf::column, and make it accessible from there as well as in cudf::column_view.

My reasoning is that this is the simplest way forward. libcudf does not have plans currently to make any guarantees to propagate column names through its algorithms, other than in cuIO readers and writers.

@harrism
Copy link
Member

harrism commented Dec 6, 2019

Had further discussion with @jrhemstad on this today, and he convinced me against my previous conclusion. Basically, the name is information that does not need to be carried in the columns. Analogy: if an application of cudf::column needed to always know the min and max elements in each column, we would not want to add those fields to the column itself, since they are not needed by the internals of libcudf. Instead, they would be stored in some wrapper data structure. Names are similar.

We hear the argument about the fact that the structure of nested columns already mirrors the schema, but that structure is a necessary part of operating on the data in the columns. The name isn't.

Moreover, multiple libcudf clients have told us that they need to maintain an external schema anyway, and that having the names in the column would cause them extra work to maintain consistency between two sources of truth.

We think it's best to define a schema structure that stores the names, and use this in the cuIO readers/writers.

@OlivierNV
Copy link
Contributor

We'll just return an array of strings for the names in readers and take in an array of names for the writer. No need to require the client to maintain a third schema structure.

@harrism
Copy link
Member

harrism commented Dec 7, 2019

When we support nested columns this will not be sufficient.

@mjsamoht
Copy link
Contributor

mjsamoht commented Dec 7, 2019

Yes, once we support nested columns we will have to revisit.

@OlivierNV
Copy link
Contributor

A std::string vector will in fact be sufficient to return column names regardless of nesting support (the [unnecessary] complexity of having to handle a separate flattened schema introduced by not having names within the cudf column tree structure falls on the user/bindings).

@harrism
Copy link
Member

harrism commented Dec 8, 2019

How is it sufficient to have a flat vector of names if the columns themselves contain columns?

@OlivierNV
Copy link
Contributor

OlivierNV commented Dec 8, 2019

@harrism The same way it is done in parquet/orc format, where you need to serialize a tree into a file: a 'flattened' version of the column tree, with a left-to-right traversal convention (it will have to be clearly documented once we add nesting), though this depends on how cudf will implement nesting (with a separate schema, one would expect the top-level column array in the table to already be a flattened tree, though it looks like cudf is likely to go for the tree schema representation where fields of a struct are children of the parent struct column)

@mjsamoht
Copy link
Contributor

mjsamoht commented Dec 9, 2019

For example, for the following tree the column names could be stored in a vector as {"1", "2", "4", "6", "7", "5", "3"}.

           1
          / \
         /   \
        2     3
       / \
      /   \
     4     5
    / \
   /   \
  6     7

@mjsamoht
Copy link
Contributor

mjsamoht commented Dec 9, 2019

I don't know who and how the column names are being used. But does anyone have objections to storing the names in a flat vector? If so can you please explain what the problem is with a vector?

@jlowe
Copy link
Member

jlowe commented Dec 9, 2019

As long as it's clearly documented how to map the nested structure to a flat list of names (e.g.: pre-order traversal as in the example above), I'm cool with using a vector of names rather than an explicit tree-like structure.

@OlivierNV
Copy link
Contributor

OlivierNV commented Dec 10, 2019

The ORC specification has some nice graphics in the "Type Information' section.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue feature request New feature or request Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

No branches or pull requests

7 participants