Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][FlightRPC] Investigate using generic callback stub to bypass Protobuf serialization #37937

Open
lidavidm opened this issue Sep 28, 2023 · 3 comments

Comments

@lidavidm
Copy link
Member

Describe the enhancement requested

Flight in C++ bypass the Protobuf serializer by specializing a gRPC template and doing what is almost certainly an illegal cast to trick gRPC into using our specialization. However, gRPC supports a "generic" API that lets you call methods by name and get back the gRPC byte buffers, which should be a safe, officially sanctioned way of doing what we want. The API linked there is only applicable to async gRPC, so it would only help our new async implementation.

This has a few other benefits:

  • The gRPC template we specialize technically lets us return (de)serialization errors, but in practice gRPC crashes if you error. This new API would let us handle the error gracefully in Flight code.
  • We could more easily pass in other arguments, like a memory allocator, to the (de)serialization code. (gRPC still controls memory allocation, though.)
  • We could implement more complex deserialization code (e.g. optionally trying to align buffers for [C++][FlightRPC] Flight generates misaligned buffers #32276).

Component(s)

C++, FlightRPC

@pitrou
Copy link
Member

pitrou commented Oct 17, 2023

IIUC, the idea is to use gRPC async APIs to handle serialization and deserialization ourselves ?

@lidavidm
Copy link
Member Author

Yes:

  • Right now, we do a cast that is definitely UB to try to hook into the gRPC serialization machinery
  • gRPC offers a "generic" stub that gives us the raw byte buffers directly without that hack, but it is only available for the async API
  • The gRPC serialization machinery is known to be broken (e.g. if you raise an error gRPC will just abort instead) so being able to bypass it entirely is a benefit

So when we implement the async version of DoGet etc we should just try to use the generic stub and avoid all this in the first place (requires some refactoring, though)

@toddtomkinson
Copy link

Hello! I'm working on implementing a few of the Flight RPC endpoints into our existing gRPC server (where we use the new async/callback api exclusively) and came across this ticket. I have everything working, but would love to be able to utilize the the gRPC zero-copy serialization in my server (right now I'm returning populated FlightData objects, which requires a copy of the data in the IpcPayload objects).

I could reuse the existing functions if, instead of hiding all of the gRPC server internals, arrow exposed the functions in flight/transport/grpc/serialization_internal.{h,cc}--specifically the FlightDataSerialize function. My plan was to build a ServerWriteReactor implementation that my application could add RecordBatches to, that would in turn translate each RecordBatch into a series of FlightPayload objects (similar to what is done in the RecordBatchStream) and in turn to a series of ByteBuffers (using FlightDataSerialize). The reactor would send the ByteBuffers as soon as they were available and the previous send was complete.

If there isn't already work in process to implement this ticket I'd be happy to take a stab at exposing the serialization internals and then exposing a reactor implementation as described above. With those in place it would be much easier for projects to implement Flight RPC interfaces into existing gRPC servers, while still taking advantage of the zero-copy serialization.

Thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants