Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a datafusion-proto crate for datafusion protobuf serialization #1887

Merged
merged 4 commits into from
Mar 8, 2022

Conversation

carols10cents
Copy link
Contributor

@carols10cents carols10cents commented Feb 25, 2022

Which issue does this PR close?

Closes #1832.

Rationale for this change

It would be nice for other projects (such as influxdata/influxdb_iox) to be able to serialize DataFusion types as protobuf without needing to depend on Ballista.

What changes are included in this PR?

This PR extracts a new crate, datafusion-serialization, that can serialize and deserialize DataFusion types as protocol buffers.

Ballista now depends on datafusion-serialization. However, prost isn't able to provide a way to share/import .proto files between crates, so ballista/rust/core/proto/ballista.proto doesn't have a line that says import "datafusion.proto". This is a problem for other crates that would want to depend on datafusion-serialization too. Some solutions I considered and ruled out:

  • Have the build.rs file of ballista (and any other crate that wants to depend on datafusion-serialization) download datafusion.proto from GitHub. This seems fragile and makes building ballista dependent on network access. This would also likely introduce mismatched version problems.
  • Symlink datafusion-serialization/proto/datafusion.proto into ballista/rust/core/proto, which would work for ballista but not for any crate outside of this repo.
  • Manually copy datafusion-serialization/proto/datafusion.proto into ballista/rust/core/proto, which is a chore no one wants to do and would probably also cause mismatched version problems.

Not liking any of these solutions, I decided the best way is that fields in ballista or other crate protos that want to contain datafusion types should serialize them as google::protobuf::Any, then depend on the datafusion-serialization crate to handle the actual interpretation of the bytes as the Rust types.

Unfortunately, prost doesn't have built in support for this, and the workaround crates I've looked at seem to provide more functionality than is strictly needed here. So I implemented this using a TypeUrl trait and some functions. It's a little messy because I can't use TryFrom/TryInto because of the orphan rule, so a bunch of the ballista from/to proto code needed to be updated.

If there are other solutions I haven't thought of, I'd love to hear them!

There's also plenty of further refactoring that could be done, but this PR is going to be a big review in any case. 😅

Are there any user-facing changes?

I'm not sure how strictly backwards compatibility is considered for ballista.proto-- I changed the types of a bunch of the fields, which isn't backwards compatible. If you'd rather I reserve the existing field names and pick a new name for the fields with the new type, let me know and I'm happy to do so. But as-is, it would be a user-facing change for anyone using the Ballista protobuf definitions.

@tustvold
Copy link
Contributor

tustvold commented Feb 25, 2022

However, prost isn't able to provide a way to share/import .proto files between crates

This link appears to just link back to this issue? FWIW it is fairly common for API specifications be they protobuf, OpenAPI, etc... to simply be manually vended into the client repositories. It's kind of gross, but it works. I guess it also gives you some notion of the version of the API that client is using... These proto specs shouldn't change in backwards incompatible ways, and so if your client is a bit out of date, it shouldn't matter by design.

types should serialize them as google::protobuf::Any

My experience with protobuf::Any even in languages that purport to support it, e.g. Golang and python, is not great. The fact protobuf isn't self-describing makes interacting with these opaque blobs extremely frustrating, and it is extremely common for things to simply not work.

To give an example of this, IOx's storage gRPC API has a protobuf::Any bundled in it, which prevents using tools like grpcui or grpcurl with it. In the end we had to add a custom CLI command to call these APIs in no small part because the ecosystem's support for protobuf::Any is so inconsistent.

I dunno, perhaps there is no other option, but using protobuf::Any for a field that has a defined concrete type feels like a very large hammer...

@carols10cents
Copy link
Contributor Author

This link appears to just link back to this issue?

Oops, sorry, forgot to fill in that link. I've fixed it now-- meant to link to tokio-rs/prost#422.

FWIW it is fairly common for API specifications be they protobuf, OpenAPI, etc... to simply be manually vended into the client repositories. It's kind of gross, but it works.

Yeeeeep, this also feels gross to me, but I'm happy to change to that if maintainers would like.

@andygrove
Copy link
Member

I like the idea of moving this to a separate crate. Would it be worth shortening the crate name to datafusion-serde? Another option would be to name it datafusion-proto.

@carols10cents carols10cents force-pushed the extract-serialization branch 2 times, most recently from 0b5920e to 1a956d1 Compare February 25, 2022 17:30
@carols10cents
Copy link
Contributor Author

Would it be worth shortening the crate name to datafusion-serde? Another option would be to name it datafusion-proto.

I would be worried that datafusion-serde would cause people to assume the serde crate is involved. datafusion-proto seems ok!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thank you for this PR @carols10cents. It is an epic piece of work and I think this feature will make DataFusion even more useful as a foundation for analytic systems. As clever as the Any approach is, as I think it fairly non standard in the protobuf realm and will cause non trivial confusion and impedance mismatches with people who try to use it.

Therefore, after some thought, I suggest we go with the "copy the .proto files approach, because "it is the least bad of the non ideal alternatives"

Specifically, my rationale is being that 1) vendoring/copying the API is a common design pattern for proto based APIs, 2) the proto format is designed to handle mismatches / upgrades somewhat gracefully, and 3) we can use CI checks to verify the files don't get out of sync.

I think datafusion-proto sounds like a good crate name.

What do you think @carols10cents?

@EricJoy2048 EricJoy2048 mentioned this pull request Mar 2, 2022
@carols10cents
Copy link
Contributor Author

That all sounds great! For this repo, would a symlink suffice? I think cargo package will still put the file in the ballista crate even if it's a symlink in the repo; I will verify.

@alamb
Copy link
Contributor

alamb commented Mar 2, 2022

I think a symlink for this repo would be great

@alamb
Copy link
Contributor

alamb commented Mar 7, 2022

FYI the https://github.com/datafusion-contrib/datafusion-substrait repo from @andygrove may be related to this (as in maybe it eventually removes protobuf serialization).

Perhaps to plan for that eventually we could keep the serialization API operating on an opaque format (like fn serialize(expr: Expr) -> Vec<u8>) 🤔

@carols10cents
Copy link
Contributor Author

@alamb @tustvold Ok, I think this is ready for re-review, all the Any stuff is gone :)

FYI the datafusion-contrib/datafusion-substrait repo from @andygrove may be related to this (as in maybe it eventually removes protobuf serialization).

Perhaps to plan for that eventually we could keep the serialization API operating on an opaque format (like fn serialize(expr: Expr) -> Vec<u8>) 🤔

That should be pretty easy to add - would you like me to do that in this PR or in a future PR?

@alamb
Copy link
Contributor

alamb commented Mar 7, 2022

That should be pretty easy to add - would you like me to do that in this PR or in a future PR?

A future PR is good in my opinion

@alamb alamb changed the title Create a new datafusion-serialization crate for datafusion protobuf serialization Create a datafusion-proto crate for datafusion protobuf serialization Mar 7, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me @carols10cents -- thank you ❤️

I wonder if it would make sense to create a PR in the https://github.com/influxdata/influxdb_iox repository that uses this PR / new crate as a proof of concept to how it would be used?

Copy link
Member

@jimexist jimexist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you!

@jimexist jimexist merged commit 2549500 into apache:master Mar 8, 2022
@carols10cents carols10cents deleted the extract-serialization branch March 9, 2022 20:25
@carols10cents
Copy link
Contributor Author

I wonder if it would make sense to create a PR in the influxdata/influxdb_iox repository that uses this PR / new crate as a proof of concept to how it would be used?

It does make sense, done! It seems to work pretty well except for with pbjson....

@alamb
Copy link
Contributor

alamb commented Mar 9, 2022

We'll just keep iterating I think

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extract datafusion protobuf serialization into its own crate
6 participants