Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Arrow and Orc file formats #169

Merged

Conversation

westonpace
Copy link
Member

@westonpace westonpace commented Apr 18, 2022

BREAKING CHANGE: The substrait/ReadRel/LocalFiles/format field is deprecated. Newer consumers should intrepet older files as parquet. Older consumers should interpret newer files as FILE_FORMAT_UNSPECIFIED.

@jvanstraten
Copy link
Contributor

jvanstraten commented Apr 19, 2022

Rather than adding to the FileFormat enum, would it perhaps be more sensible to replace the enum with a oneof? Something like

message FileOrFiles {
  oneof path_type {
    ...
  }

  // Original file format enum, superseded by the file_format oneof.
  reserved 5;
  reserved "format";

  // the index of the partition this item belongs to
  uint64 partition_index = 6;

  // the start position in byte to read from this item
  uint64 start = 7;

  // the length in byte to read from this item
  uint64 length = 8;

  oneof file_format {
    ParquetFormat parquet = 9;
    ArrowIpcFormat arrow_ipc = 10;
    google.protobuf.Any extension = 31;
  }
  
  message ParquetFormat {
  }
  
  message ArrowIpcFormat {
    ArrowIpcSubformat subformat = 1;
    
    enum ArrowIpcSubformat {
      ARROW_IPC_SUBFORMAT_UNSPECIFIED = 0;
      ARROW_IPC_SUBFORMAT_STREAM = 1;
      ARROW_IPC_SUBFORMAT_FILE = 2;
    }
  }
}

File formats like CSV, for example, will really need more information than just "this is a CSV file" to be correctly parsed if they are added at some point. Currently the enum only has one valid value, so it's better for backwards and forwards compatibility to think about this now than later.

ETA: looks like this also came up in #138. Also added extension to the oneof variants.

@westonpace
Copy link
Member Author

I think your proposal is fine. That was originally kind of what I was aiming for with #138 but the response I got there seemed to suggest that "having extra stuff" is a bit of a special case and we will just use the extension for that. I don't really have any strong opinion whether we create dedicated messages or not for the simple self-describing protocols. So, barring any further input, I will adopt your suggestion and update my PR in a few days.

@jvanstraten
Copy link
Contributor

There can also just be a catch-all "empty message" type somewhere. That's what I did for the validator output format. Then you can also do

oneof file_format {
  Empty parquet = 9;
  ArrowIpcFormat arrow_ipc = 10;
  google.protobuf.Any extension = 31;
}

to save some clutter. The oneof variant is determined by the field index, so I'm pretty sure it doesn't matter that the message types would be the same if more self-describing formats are added.

In fact, I wouldn't be surprised if protobuf defines an empty message type in its standard library. I just never bothered to look for one, since it's so easy to define your own and it doesn't cost anything (well, maybe a tiny amount of compile time I suppose).

The data could indeed be attached via an advanced extension too, but it'd be rather awkward with the current structure. The nearest extension field is in LocalFiles, which is common to all FileOrFiles elements, while each FileOrFiles message has its own format field. Then again, it probably doesn't make sense to specify both a CSV file and a Parquet file or whatever in a single read relation, but then I'd argue that the file format should have been part of LocalFiles rather than FileOrFiles. It's not really clear to me why the structure was originally defined this way.

@jacques-n
Copy link
Contributor

I'm very supportive of a change like what @jvanstraten is suggesting above. I agree that many formats need specific properties.

I think the less supportive part of my response in #138 was the proposal of something like a string with opaque consumption. We should work to try to keep things well specified (or use the any object at whatever level is appropriate).

@westonpace
Copy link
Member Author

I updated this PR to a form similar to what @jvanstraten suggested. I dropped the "arrow streaming" format option since that doesn't make sense for local files anyways. I added Orc inspired by #202 .

That being said, I'm not so certain we need to worry about reserving 5/format given the early stage and the lack of support for backwards compatibility here anyways.

@cpcloud
Copy link
Contributor

cpcloud commented May 18, 2022

@jacques-n Perhaps we should disable the buf + BREAKING CHANGE detection in CI until we've hit 1.0?

FILE_FORMAT_PARQUET = 1;
// The format of the files
oneof file_format {
google.protobuf.Empty parquet = 9;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we create empty message types for each of these. That would make more sense to people, I think. E.g. message ParquetFormat {}

Let's update format to be marked as deprecated rather than deleting (for now). We can batch up removal of deprecations when we want to bump the format version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we create empty message types for each of these. That would make more sense to people, I think. E.g. message ParquetFormat {}

Done. Please check this to make sure I understood your point.

Let's update format to be marked as deprecated rather than deleting (for now). We can batch up removal of deprecations when we want to bump the format version.

I was hoping that "breaking changes == minor" means that we no longer need to worry about batching breaking changes. Let me know if something further is needed here.

@jacques-n
Copy link
Contributor

#210 is now merged. @westonpace and @jvanstraten, have you two come to a consensus on this? Getting this merged should also help us get some solution for #174 and #202 merged as well.

@westonpace
Copy link
Member Author

I think the conensus was "yes, we need to reserve fields". For this PR in particular on the question of "do we need a legacy option?" I think the answer is "no".

So the last remaining question is:

Can we garbage collect old (now-unused) enums?

This PR does deprecate the FileFormat enum. I believe there was a bit of discussion on Slack here: https://substrait.slack.com/archives/C02D7CTQXHD/p1653693433051879

My interpretation of @jvanstraten 's was that yes, we can garbage collect unused enums. But I'm not entirely sure if we're thinking of the same definition of enum (e.g. protobuf enum vs substrait enum). So I'd appreciate one more review from @jvanstraten on this PR.

I'll update the title / description / commits to semantic release commits.

@westonpace westonpace requested a review from jvanstraten May 30, 2022 17:10
@westonpace westonpace force-pushed the feature/add-arrow-ipc-formats branch 2 times, most recently from e5698bc to 99ef6fa Compare May 30, 2022 17:30
@westonpace westonpace changed the title Add Arrow IPC (file and streaming) as file formats feat: add Arrow and Orc file formats May 30, 2022
@westonpace
Copy link
Member Author

Actually, as I was updating the commits, I also added a new default behavior in the comments to handle the case where no oneof is specified.

In addition I addressed Jacques point above, adding empty message types instead of using google.protobuf.Empty.

@westonpace westonpace force-pushed the feature/add-arrow-ipc-formats branch from 99ef6fa to da122a6 Compare May 30, 2022 17:36
@jacques-n
Copy link
Contributor

Yes, this captures my request. Thanks. Will wait on @jvanstraten confirmation of agreement to merge.

@jvanstraten
Copy link
Contributor

My interpretation of @jvanstraten 's was that yes, we can garbage collect unused enums. But I'm not entirely sure if we're thinking of the same definition of enum (e.g. protobuf enum vs substrait enum).

Oh. Based on this context I guess we weren't, but I don't see why we shouldn't remove things that are no longer reachable.

Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than that default behavior.

ETA: pfft, I guess I didn't submit the change request. w.r.t. the default behavior comment: a oneof should never have default behavior other than to reject the plan. With the suggested behavior, if we add a new option later and pass a plan with that option to an old consumer, it would silently incorrectly default to parquet. We'd never be able to add to the oneof without imposing a (semantic) breaking change.

BREAKING CHANGE: The `substrait/ReadRel/LocalFiles/format` field is deprecated.
This will cause a hard break in compatibility.
Newer consumers will not be able to read older files.
Older consumers will not be able to read newer files.
@westonpace westonpace force-pushed the feature/add-arrow-ipc-formats branch from da122a6 to b03a072 Compare June 3, 2022 01:59
@westonpace
Copy link
Member Author

ETA: pfft, I guess I didn't submit the change request. w.r.t. the default behavior comment: a oneof should never have default behavior other than to reject the plan. With the suggested behavior, if we add a new option later and pass a plan with that option to an old consumer, it would silently incorrectly default to parquet. We'd never be able to add to the oneof without imposing a (semantic) breaking change.

I wish you weren't right but it looks like you are. This will completely break backwards compatibility then.

@westonpace westonpace requested a review from jvanstraten June 3, 2022 02:02
@westonpace
Copy link
Member Author

In the future I suppose we could avoid the break by keeping the old field and changing the enum...

      enum FileFormat {
        FILE_FORMAT_UNSPECIFIED = 0;
        FILE_FORMAT_PARQUET = 1;
        FILE_FORMAT_FUTURE = 2;
      }  

Consumers would then only check the oneof if FILE_FORMAT_FUTURE

But I do not think that change is justified at this point.

@jacques-n jacques-n merged commit 43be00a into substrait-io:main Jun 5, 2022
westonpace pushed a commit to apache/arrow that referenced this pull request Jul 5, 2022
Note: I actually upgraded to v0.6.0; it didn't make much sense to me to not just go to the latest release. I guess I'll downgrade if there was a specific reason for going to exactly v0.3.0 that I'm not aware of.

Stuff that broke:
 - `relations.proto` and `expressions.proto` were merged into `algebra.proto` in substrait-io/substrait#136
 - Breaking change in how file formats are specified in read relations: substrait-io/substrait#169
 - Deprecation in specification of function arguments, switched to the new format (supporting the old one as well would be a bit more work, which I'm not sure is worthwhile at this stage): substrait-io/substrait#161
 - Deprecation of `user_defined_type_reference` in `Type`, replacing it with a message that also supports nullability: substrait-io/substrait#217

Authored-by: Jeroen van Straten <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
drin pushed a commit to drin/arrow that referenced this pull request Jul 5, 2022
Note: I actually upgraded to v0.6.0; it didn't make much sense to me to not just go to the latest release. I guess I'll downgrade if there was a specific reason for going to exactly v0.3.0 that I'm not aware of.

Stuff that broke:
 - `relations.proto` and `expressions.proto` were merged into `algebra.proto` in substrait-io/substrait#136
 - Breaking change in how file formats are specified in read relations: substrait-io/substrait#169
 - Deprecation in specification of function arguments, switched to the new format (supporting the old one as well would be a bit more work, which I'm not sure is worthwhile at this stage): substrait-io/substrait#161
 - Deprecation of `user_defined_type_reference` in `Type`, replacing it with a message that also supports nullability: substrait-io/substrait#217

Authored-by: Jeroen van Straten <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants