Create a datafusion-proto crate for datafusion protobuf serialization #1887
This link appears to just link back to this issue? FWIW it is fairly common for API specifications, be they protobuf, OpenAPI, etc., to simply be manually vendored into the client repositories. It's kind of gross, but it works. I guess it also gives you some notion of the version of the API that the client is using... These proto specs shouldn't change in backwards-incompatible ways, so if your client is a bit out of date, it shouldn't matter by design.
My experience with To give an example of this, IOx's storage gRPC API has a I dunno, perhaps there is no other option, but using |
Oops, sorry, forgot to fill in that link. I've fixed it now-- meant to link to tokio-rs/prost#422.
Yeeeeep, this also feels gross to me, but I'm happy to change to that if maintainers would like. |
I like the idea of moving this to a separate crate. Would it be worth shortening the crate name to |
I would be worried that |
First of all, thank you for this PR @carols10cents. It is an epic piece of work and I think this feature will make DataFusion even more useful as a foundation for analytic systems. As clever as the Any approach is, I think it is fairly non-standard in the protobuf realm and will cause non-trivial confusion and impedance mismatches for people who try to use it.
Therefore, after some thought, I suggest we go with the "copy the .proto files" approach, because it is "the least bad of the non-ideal alternatives".
Specifically, my rationale is that 1) vendoring/copying the API is a common design pattern for proto-based APIs, 2) the proto format is designed to handle mismatches / upgrades somewhat gracefully, and 3) we can use CI checks to verify the files don't get out of sync.
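For point 3, a minimal sketch of what such a sync check could look like, assuming the vendored copy is expected to be byte-identical to the canonical file. The function name `proto_copies_match` and the command-line usage are hypothetical, not something that exists in this repository:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns Ok(true) when the two files have byte-identical contents.
/// (Hypothetical helper for illustration only.)
fn proto_copies_match(canonical: &Path, vendored: &Path) -> io::Result<bool> {
    Ok(fs::read(canonical)? == fs::read(vendored)?)
}

fn main() -> io::Result<()> {
    // Expects exactly two paths: the canonical .proto and the vendored copy.
    let args: Vec<String> = std::env::args().skip(1).collect();
    if args.len() != 2 {
        eprintln!("usage: check_proto_sync <canonical.proto> <vendored.proto>");
        std::process::exit(2);
    }
    if !proto_copies_match(Path::new(&args[0]), Path::new(&args[1]))? {
        eprintln!("{} is out of sync with {}", args[1], args[0]);
        // A non-zero exit status is what would fail the CI job.
        std::process::exit(1);
    }
    Ok(())
}
```

A CI job could run this against each vendored copy and fail the build on a non-zero exit status; a plain `diff` in a shell step would do the same job.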
I think datafusion-proto sounds like a good crate name.
What do you think @carols10cents?
That all sounds great! For this repo, would a symlink suffice? I think |
I think a symlink for this repo would be great |
FYI the https://github.com/datafusion-contrib/datafusion-substrait repo from @andygrove may be related to this (as in maybe it eventually removes protobuf serialization). Perhaps to plan for that eventually we could keep the serialization API operating on an opaque format (like |
@alamb @tustvold Ok, I think this is ready for re-review, all the
That should be pretty easy to add - would you like me to do that in this PR or in a future PR? |
A future PR is good in my opinion |
Makes sense to me @carols10cents -- thank you ❤️
I wonder if it would make sense to create a PR in the https://github.com/influxdata/influxdb_iox repository that uses this PR / new crate as a proof of concept of how it would be used?
thank you!
It does make sense, done! It seems to work pretty well except for with pbjson.... |
We'll just keep iterating I think |
Which issue does this PR close?
Closes #1832.
Rationale for this change
It would be nice for other projects (such as influxdata/influxdb_iox) to be able to serialize DataFusion types as protobuf without needing to depend on Ballista.
What changes are included in this PR?
This PR extracts a new crate, datafusion-serialization, that can serialize and deserialize DataFusion types as protocol buffers.
Ballista now depends on datafusion-serialization. However, prost isn't able to provide a way to share/import .proto files between crates, so ballista/rust/core/proto/ballista.proto doesn't have a line that says import "datafusion.proto". This is a problem for other crates that would want to depend on datafusion-serialization too. Some solutions I considered and ruled out:
- Having ballista (and any other crate that wants to depend on datafusion-serialization) download datafusion.proto from GitHub. This seems fragile and makes building ballista dependent on network access. This would also likely introduce mismatched version problems.
- Symlinking datafusion-serialization/proto/datafusion.proto into ballista/rust/core/proto, which would work for ballista but not for any crate outside of this repo.
- Manually copying datafusion-serialization/proto/datafusion.proto into ballista/rust/core/proto, which is a chore no one wants to do and would probably also cause mismatched version problems.
Not liking any of these solutions, I decided the best way is that fields in ballista or other crate protos that want to contain datafusion types should serialize them as google::protobuf::Any, then depend on the datafusion-serialization crate to handle the actual interpretation of the bytes as the Rust types.
Unfortunately, prost doesn't have built-in support for this, and the workaround crates I've looked at seem to provide more functionality than is strictly needed here. So I implemented this using a TypeUrl trait and some functions. It's a little messy because I can't use TryFrom/TryInto because of the orphan rule, so a bunch of the ballista from/to proto code needed to be updated.
If there are other solutions I haven't thought of, I'd love to hear them!
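To illustrate the general shape of the Any approach, here is a minimal, self-contained sketch. It is not the code in this PR: the `Any` struct is a local stand-in mirroring the two fields of the protobuf Any message, the `Codec` trait stands in for the role prost plays, and `pack`, `unpack`, `ColumnRef`, and the type URL string are all hypothetical names chosen for the example. The toy encoding is plain UTF-8 bytes, not the protobuf wire format.

```rust
/// Local stand-in with the same two fields as the protobuf Any message.
#[derive(Debug, Clone, PartialEq)]
struct Any {
    type_url: String,
    value: Vec<u8>,
}

/// Associates a message type with the URL that identifies it inside an Any.
/// (Illustrative; the trait in the PR may look different.)
trait TypeUrl {
    const TYPE_URL: &'static str;
}

/// Hypothetical stand-in for a protobuf codec.
trait Codec: Sized {
    fn encode(&self) -> Vec<u8>;
    fn decode(bytes: &[u8]) -> Option<Self>;
}

/// Free functions rather than TryFrom/TryInto impls: when both the trait and
/// the types live in other crates, the orphan rule forbids implementing them.
fn pack<T: TypeUrl + Codec>(msg: &T) -> Any {
    Any {
        type_url: T::TYPE_URL.to_string(),
        value: msg.encode(),
    }
}

/// Returns None when the type URL does not match or the bytes fail to decode.
fn unpack<T: TypeUrl + Codec>(any: &Any) -> Option<T> {
    if any.type_url != T::TYPE_URL {
        return None;
    }
    T::decode(&any.value)
}

/// Toy message type used only for this example.
#[derive(Debug, PartialEq)]
struct ColumnRef {
    name: String,
}

impl TypeUrl for ColumnRef {
    const TYPE_URL: &'static str = "type.googleapis.com/example.ColumnRef";
}

impl Codec for ColumnRef {
    // Toy encoding: the raw UTF-8 bytes of the name.
    fn encode(&self) -> Vec<u8> {
        self.name.as_bytes().to_vec()
    }
    fn decode(bytes: &[u8]) -> Option<Self> {
        String::from_utf8(bytes.to_vec())
            .ok()
            .map(|name| ColumnRef { name })
    }
}

fn main() {
    let original = ColumnRef { name: "c1".to_string() };
    let any = pack(&original);
    let decoded: ColumnRef = unpack(&any).expect("type URL should match");
    assert_eq!(decoded, original);
}
```

The type URL check is what lets the receiving crate refuse bytes that were packed from a different message type, at the cost of the containing .proto file no longer documenting what the field holds.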
There's also plenty of further refactoring that could be done, but this PR is going to be a big review in any case. 😅
Are there any user-facing changes?
I'm not sure how strictly backwards compatibility is considered for ballista.proto-- I changed the types of a bunch of the fields, which isn't backwards compatible. If you'd rather I reserve the existing field names and pick a new name for the fields with the new type, let me know and I'm happy to do so. But as-is, it would be a user-facing change for anyone using the Ballista protobuf definitions.