-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add Arrow and Orc file formats #169
feat: add Arrow and Orc file formats #169
Conversation
Rather than adding to the message FileOrFiles {
oneof path_type {
...
}
// Original file format enum, superseded by the file_format oneof.
reserved 5;
reserved "format";
// the index of the partition this item belongs to
uint64 partition_index = 6;
// the start position in byte to read from this item
uint64 start = 7;
// the length in byte to read from this item
uint64 length = 8;
oneof file_format {
ParquetFormat parquet = 9;
ArrowIpcFormat arrow_ipc = 10;
google.protobuf.Any extension = 31;
}
message ParquetFormat {
}
message ArrowIpcFormat {
ArrowIpcSubformat subformat = 1;
enum ArrowIpcSubformat {
ARROW_IPC_SUBFORMAT_UNSPECIFIED = 0;
ARROW_IPC_SUBFORMAT_STREAM = 1;
ARROW_IPC_SUBFORMAT_FILE = 2;
}
}
} File formats like CSV, for example, will really need more information than just "this is a CSV file" to be correctly parsed if they are added at some point. Currently the enum only has one valid value, so it's better for backwards and forwards compatibility to think about this now than later. ETA: looks like this also came up in #138. Also added |
I think your proposal is fine. That was originally kind of what I was aiming for with #138 but the response I got there seemed to suggest that "having extra stuff" is a bit of a special case and we will just use the extension for that. I don't really have any strong opinion whether we create dedicated messages or not for the simple self-describing protocols. So, barring any further input, I will adopt your suggestion and update my PR in a few days. |
There can also just be a catch-all "empty message" type somewhere. That's what I did for the validator output format. Then you can also do oneof file_format {
Empty parquet = 9;
ArrowIpcFormat arrow_ipc = 10;
google.protobuf.Any extension = 31;
} to save some clutter. The oneof variant is determined by the field index, so I'm pretty sure it doesn't matter that the message types would be the same if more self-describing formats are added. In fact, I wouldn't be surprised if protobuf defines an empty message type in its standard library. I just never bothered to look for one, since it's so easy to define your own and it doesn't cost anything (well, maybe a tiny amount of compile time I suppose). The data could indeed be attached via an advanced extension too, but it'd be rather awkward with the current structure. The nearest extension field is in |
I'm very supportive of a change like what @jvanstraten is suggesting above. I agree that many formats need specific properties. I think the less supportive part of my response in #138 was the proposal of something like a string with opaque consumption. We should work to try to keep things well specified (or use the any object at whatever level is appropriate). |
be67609
to
284c2a7
Compare
I updated this PR to a form similar to what @jvanstraten suggested. I dropped the "arrow streaming" format option since that doesn't make sense for local files anyways. I added Orc inspired by #202 . That being said, I'm not so certain we need to worry about reserving |
@jacques-n Perhaps we should disable the |
proto/substrait/algebra.proto
Outdated
FILE_FORMAT_PARQUET = 1; | ||
// The format of the files | ||
oneof file_format { | ||
google.protobuf.Empty parquet = 9; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we create empty message types for each of these. That would make more sense to people, I think. E.g. message ParquetFormat {}
Let's update format to be marked as deprecated rather than deleting (for now). We can batch up removal of deprecations when we want to bump the format version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we create empty message types for each of these. That would make more sense to people, I think. E.g. message ParquetFormat {}
Done. Please check this to make sure I understood your point.
Let's update format to be marked as deprecated rather than deleting (for now). We can batch up removal of deprecations when we want to bump the format version.
I was hoping that "breaking changes == minor" means that we no longer need to worry about batching breaking changes. Let me know if something further is needed here.
#210 is now merged. @westonpace and @jvanstraten, have you two come to a consensus on this? Getting this merged should also help us get some solution for #174 and #202 merged as well. |
I think the conensus was "yes, we need to reserve fields". For this PR in particular on the question of "do we need a legacy option?" I think the answer is "no". So the last remaining question is: Can we garbage collect old (now-unused) enums? This PR does deprecate the My interpretation of @jvanstraten 's was that yes, we can garbage collect unused enums. But I'm not entirely sure if we're thinking of the same definition of enum (e.g. protobuf enum vs substrait enum). So I'd appreciate one more review from @jvanstraten on this PR. I'll update the title / description / commits to semantic release commits. |
e5698bc
to
99ef6fa
Compare
Actually, as I was updating the commits, I also added a new default behavior in the comments to handle the case where no oneof is specified. In addition I addressed Jacques point above, adding empty message types instead of using google.protobuf.Empty. |
99ef6fa
to
da122a6
Compare
Yes, this captures my request. Thanks. Will wait on @jvanstraten confirmation of agreement to merge. |
Oh. Based on this context I guess we weren't, but I don't see why we shouldn't remove things that are no longer reachable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM other than that default behavior.
ETA: pfft, I guess I didn't submit the change request. w.r.t. the default behavior comment: a oneof should never have default behavior other than to reject the plan. With the suggested behavior, if we add a new option later and pass a plan with that option to an old consumer, it would silently incorrectly default to parquet. We'd never be able to add to the oneof without imposing a (semantic) breaking change.
BREAKING CHANGE: The `substrait/ReadRel/LocalFiles/format` field is deprecated. This will cause a hard break in compatibility. Newer consumers will not be able to read older files. Older consumers will not be able to read newer files.
da122a6
to
b03a072
Compare
I wish you weren't right but it looks like you are. This will completely break backwards compatibility then. |
In the future I suppose we could avoid the break by keeping the old field and changing the enum...
Consumers would then only check the But I do not think that change is justified at this point. |
Note: I actually upgraded to v0.6.0; it didn't make much sense to me to not just go to the latest release. I guess I'll downgrade if there was a specific reason for going to exactly v0.3.0 that I'm not aware of. Stuff that broke: - `relations.proto` and `expressions.proto` were merged into `algebra.proto` in substrait-io/substrait#136 - Breaking change in how file formats are specified in read relations: substrait-io/substrait#169 - Deprecation in specification of function arguments, switched to the new format (supporting the old one as well would be a bit more work, which I'm not sure is worthwhile at this stage): substrait-io/substrait#161 - Deprecation of `user_defined_type_reference` in `Type`, replacing it with a message that also supports nullability: substrait-io/substrait#217 Authored-by: Jeroen van Straten <[email protected]> Signed-off-by: Weston Pace <[email protected]>
Note: I actually upgraded to v0.6.0; it didn't make much sense to me to not just go to the latest release. I guess I'll downgrade if there was a specific reason for going to exactly v0.3.0 that I'm not aware of. Stuff that broke: - `relations.proto` and `expressions.proto` were merged into `algebra.proto` in substrait-io/substrait#136 - Breaking change in how file formats are specified in read relations: substrait-io/substrait#169 - Deprecation in specification of function arguments, switched to the new format (supporting the old one as well would be a bit more work, which I'm not sure is worthwhile at this stage): substrait-io/substrait#161 - Deprecation of `user_defined_type_reference` in `Type`, replacing it with a message that also supports nullability: substrait-io/substrait#217 Authored-by: Jeroen van Straten <[email protected]> Signed-off-by: Weston Pace <[email protected]>
BREAKING CHANGE: The
substrait/ReadRel/LocalFiles/format
field is deprecated. Newer consumers should intrepet older files as parquet. Older consumers should interpret newer files as FILE_FORMAT_UNSPECIFIED.