
grpc: reduce allocations on Codec.{Unm,M}arshal codepaths #1478

Closed
wants to merge 1 commit

Conversation

irfansharif
Contributor

irfansharif commented Aug 25, 2017

The previous use of proto.Buffer was slightly incorrect: proto.Buffer is
only useful when {un,}marshalling proto messages into and from the same
buffer, and the current grpc.Codec interface does not have enough
specificity to make good use of this pattern.

In the previous usage we retrieved a cachedProtoBuffer/proto.Buffer from
a sync.Pool for every {Unm,M}arshal call, initialized/set a completely
new buffer, and {un,}marshalled into and from it. There was effectively
no reuse of an allocated buffer across invocations of {Unm,M}arshal.
Moreover, by routing every proto.{Unm,M}arshal call through a
proto.Buffer as an intermediary, we effectively allocated 2x the bytes
when 1x was sufficient. This is due to an internal implementation detail
of proto.Buffer: for every proto.(Marshaler).Marshal() call that returns
a byte slice, proto.Buffer copies it over into its own internal buffer
(which we explicitly initialize/set ourselves). This is shown in the
profiles below and not included in the commit message.

For types that satisfy the {Unm,M}arshaler interfaces, however, we can
simply {un,}marshal them directly, without needing to go through a
proto.Buffer. This is what the default proto.{Unm,M}arshal does:

// Marshal takes the protocol buffer
// and encodes it into the wire format, returning the data.
func Marshal(pb Message) ([]byte, error) {
  // Can the object marshal itself?
  if m, ok := pb.(Marshaler); ok {
    return m.Marshal()
  }

  p := NewBuffer(nil)
  err := p.Marshal(pb)
  // ...

We fall back to proto.Buffer only when {Unm,M}arshaler is not satisfied.
The "right" fix for our use of sync.Pool for proto.Buffers at this level
should probably live in golang/protobuf instead; that patch wasn't
accepted upstream in light of a reworked internal version they're
looking to export.
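
For context, a minimal sketch of the codec shape being described (illustrative only, not the exact diff in this PR; the method set matches the grpc.Codec interface):

package grpcproto // illustrative package name

import "github.com/golang/protobuf/proto"

// protoCodec marshals directly when the message provides its own
// Marshal/Unmarshal helpers, and otherwise delegates to proto.{Unm,M}arshal,
// which only allocates a proto.Buffer on that slower path.
type protoCodec struct{}

func (protoCodec) Marshal(v interface{}) ([]byte, error) {
  if m, ok := v.(proto.Marshaler); ok {
    return m.Marshal() // fast path: one allocation, no extra copy
  }
  return proto.Marshal(v.(proto.Message))
}

func (protoCodec) Unmarshal(data []byte, v interface{}) error {
  // proto.Unmarshal resets the message and takes the Unmarshaler fast
  // path itself, so there's nothing to pool here.
  return proto.Unmarshal(data, v.(proto.Message))
}

func (protoCodec) String() string { return "proto" }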

+cc @bdarnell / @tamird

@irfansharif
Contributor Author

With pprof on the heap profile (sample_index=alloc_space) for a short run of https://github.com/irfansharif/pinger, before:

         .          .     57:func (p protoCodec) marshal(v interface{}, cb *cachedProtoBuffer) ([]byte, error) {
         .          .     58:   protoMsg := v.(proto.Message)
  532.29MB   532.29MB     59:   newSlice := make([]byte, 0, cb.lastMarshaledSize)
         .          .     60:
         .          .     61:   cb.SetBuf(newSlice)
         .          .     62:   cb.Reset()
         .   556.81MB     63:   if err := cb.Marshal(protoMsg); err != nil {
         .          .     64:           return nil, err
         .          .     65:   }

The misuse of proto.Buffer is evident here:

         .          .    261:func (p *Buffer) Marshal(pb Message) error {
         .          .    262:   // Can the object marshal itself?
         .          .    263:   if m, ok := pb.(Marshaler); ok {
         .   551.80MB    264:           data, err := m.Marshal()
       5MB        5MB    265:           p.buf = append(p.buf, data...)
         .          .    266:           return err
         .          .    267:   }

Here p.buf is the slice set in cachedProtoBuffer.SetBuf. Not only are we allocating more than necessary, we're also copying unnecessarily.

@irfansharif
Contributor Author

irfansharif commented Aug 25, 2017

[image: pprof SVG graph of the heap profile]

This is the SVG view of the same profile, showing the dual allocation described above.
NB: the Unmarshal code path technically isn't affected, given we don't allocate buffers for it, but the change saves the unnecessary sync.Pool.{Get,Put} operations and the small overhead that comes with them. The earlier revision was essentially re-doing what proto.Unmarshal already does internally.
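
For reference, the proto.Unmarshal path being duplicated looks roughly like this (paraphrased from the golang/protobuf of that era, not an exact quote):

// Unmarshal resets pb and parses the wire-format data in buf into it.
func Unmarshal(buf []byte, pb Message) error {
  pb.Reset()
  return UnmarshalMerge(buf, pb)
}

func UnmarshalMerge(buf []byte, pb Message) error {
  // Can the object unmarshal itself?
  if u, ok := pb.(Unmarshaler); ok {
    return u.Unmarshal(buf)
  }
  return NewBuffer(buf).Unmarshal(pb) // allocates a proto.Buffer per call
}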

@apolcyn
Contributor

apolcyn commented Aug 25, 2017

What benchmark are these measurements from?

I'd note that the goal of the change that started using cached proto.Buffers was not to save byte slice allocations, but rather to save proto.Buffer allocations.

Measuring total allocations by the benchmark in https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5636470266134528&widget=952298825&container=846542667 (using small messages), over a roughly 30 second period, showed about 16GB of total allocations on the server, with a little more than 8GB due to allocation of proto.Buffer structs (in https://github.com/golang/protobuf/blob/master/proto/encode.go#L235 and https://github.com/golang/protobuf/blob/master/proto/decode.go#L412).

One thing I'm wondering, though, is whether protos generated from an updated generator would have an effect here.

@irfansharif
Contributor Author

irfansharif commented Aug 26, 2017

What benchmark are these measurements from?

Included above, but it's running the simplest ping server, https://github.com/irfansharif/pinger, for a 30s duration on my local MacBook Pro.
To replicate, run ./pinger -server -t grpc -s 512 and ./pinger -d 45s -t grpc -p 64 -c 512 -s 512 separately (the second is the client), then curl -k -o heap 'http://localhost:6060/debug/pprof/heap'; pprof -alloc_space pinger heap, where pinger is the binary.

Note that for this run proto.Buffer is not allocated, given the incoming message satisfies proto.Marshaler, which presumably holds for the majority of requests given they come from generated .pb.go files. (I'm actually not sure when it would not.)
Ditto for satisfied proto.Unmarshaler interfaces: proto.Buffer isn't allocated because we don't hit those code paths.

irfansharif added a commit to irfansharif/gogo-protobuf that referenced this pull request Aug 26, 2017
Lower GC pressure whenever temporary Buffer types are allocated.
Evidently this is an issue grpc/grpc-go ran into in the past.

+cc grpc/grpc-go#1478.

Additionally remove TODOs left regarding exactly this.
irfansharif added a commit to irfansharif/protobuf that referenced this pull request Aug 26, 2017
Lower GC pressure whenever temporary Buffer types are allocated;
evidently this is an issue grpc/grpc-go ran into in the past. Additionally,
remove the TODOs left regarding exactly this.

+cc grpc/grpc-go#1478.
@irfansharif
Contributor Author

irfansharif commented Aug 26, 2017

@apolcyn: I wrote up golang/protobuf#418, which I think is a more appropriate fix for the issue you mentioned above (though I'm still unsure how to replicate it within the context of grpc-go).
Also, for the benchmarks you linked above, are those nightly master runs? I'm not sure how to navigate the dashboard or what to infer from it; are the benchmarking programs publicly accessible?

EDIT: gah, I only just realized that generated proto messages don't always have {Unm,M}arshal methods; I was using gogo/protobuf and had these autogenerated (+cc golang/protobuf#280). I still think golang/protobuf#418 is the right way to solve this in conjunction with this changeset. Presumably this is what you meant by "if the protos generated from an updated generator would have an effect here".

@apolcyn
Contributor

apolcyn commented Aug 28, 2017

Also for the benchmarks you linked above, are those nightly master runs? I'm not sure how to navigate it and what/how to infer, are the benchmarking programs publicly accessible?

Yes, those are run about daily. Running them manually is slightly involved, but there are some docs on it in https://github.com/grpc/grpc/blob/master/tools/run_tests/performance/README.md.

I still think golang/protobuf#418 is the right way to solve this in conjunction with this changeset. Presumably this is what you meant by "if the protos generated from an updated generator would have an effect here".

I see, I think that change in protobuf can make sense. Also yes, that is what I meant to say: the current proto.Buffer caching saves allocs when generated protos don't satisfy those interfaces.

@irfansharif
Contributor Author

golang/protobuf#418 was unfortunately rejected, seemingly because they're exporting a reworked internal version and want to avoid merge conflicts. This PR still benefits the group of users providing {Unm,M}arshal methods (who are also unaffected by the proto.Buffer allocations, since they don't hit those codepaths), at the cost of proto.Buffer allocations for users not doing so. We could of course copy over small parts of golang/protobuf (effectively re-doing golang/protobuf#418 here) to address both use cases, but I'd first like to know if that's acceptable here.

In any case it's worth noting that with the current state of things across here and golang/protobuf, for every marshaled byte through grpc we allocate twice as much as necessary and explicitly copy once more than necessary (within these codepaths).

@MakMukhi
Contributor

@irfansharif Thanks for reading through the code, understanding it, and finding optimization hot-spots. However, we do wish you had talked to us a little before writing out all the code in your 3 PRs. We're working on optimization quite actively and are aware of these optimizations. Some of your changes, albeit good, introduce issues; for instance, in your buffer re-use PR you're making a copy of data to add it to headers. Also, some of these changes will be rendered obsolete once we're done with the code restructuring that's currently underway.
Unfortunately we won't be able to review the code for a couple of weeks either. I apologize for that.
It's always a good idea to drop in a quick note to discuss design before making changes.
Thanks for your time,

@irfansharif
Contributor Author

irfansharif commented Aug 29, 2017

@MakMukhi: it's no issue haha; like I mentioned in #1482, I was just chomping through some of our grpc bottlenecks within cockroachdb/cockroach and these seemed like easy enough pickings to upstream. Nothing ventured, nothing gained!

We're working on optimization quite actively and are aware of these optimizations

I'm glad to hear it. I should have consulted earlier but wanted to see how far I could get over the weekend. I have a host of other PRs in a similar vein that I'll hold off on in the interim.

Also, some of these changes will be rendered obsolete once we're done with the code restructuring that's underway currently.

Are the plans for this publicly visible? I did look for indications of it (mailing lists and here) but didn't come across any. In any case, if there's anything here I can help with, do let me know; I'm looking to upstream as many of our results as possible. Feel free to close this PR if needed.

@irfansharif
Contributor Author

irfansharif commented Aug 29, 2017

@MakMukhi, @apolcyn: PTAL here, reworked in light of golang/protobuf#418 not getting in. We should have the best of both worlds: no extra proto.Buffer allocs, and no extra copying for users providing {Unm,M}arshal helpers. Commit message updated accordingly.
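
Roughly the shape of the rework (a sketch; the pool handling and names here are illustrative, not necessarily the exact code in the updated commit):

package grpcproto // illustrative, same caveats as the sketch further up

import (
  "sync"

  "github.com/golang/protobuf/proto"
)

type protoCodec struct{}

// bufferPool exists only to avoid reallocating the proto.Buffer struct itself
// for messages without their own Marshal/Unmarshal helpers; the byte slices
// are not reused across calls.
var bufferPool = sync.Pool{
  New: func() interface{} { return &proto.Buffer{} },
}

func (protoCodec) Marshal(v interface{}) ([]byte, error) {
  if m, ok := v.(proto.Marshaler); ok {
    return m.Marshal() // no proto.Buffer, no extra copy
  }
  b := bufferPool.Get().(*proto.Buffer)
  b.SetBuf(nil) // never write into a slice we previously handed out
  err := b.Marshal(v.(proto.Message))
  out := b.Bytes()
  bufferPool.Put(b)
  return out, err
}

func (protoCodec) Unmarshal(data []byte, v interface{}) error {
  pm := v.(proto.Message)
  pm.Reset()
  if u, ok := v.(proto.Unmarshaler); ok {
    return u.Unmarshal(data)
  }
  b := bufferPool.Get().(*proto.Buffer)
  b.SetBuf(data)
  err := b.Unmarshal(pm)
  b.SetBuf(nil) // don't pin the caller's data in the pool
  bufferPool.Put(b)
  return err
}

func (protoCodec) String() string { return "proto" }

The slow path still pays the 1x slice allocation inside proto.Buffer, but the Buffer struct itself is pooled; the fast path touches neither.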

dfawley added the Type: Performance label Aug 31, 2017
@MakMukhi
Contributor

MakMukhi commented Sep 28, 2017

@irfansharif This PR looks good. Can you run our benchmarks on this to verify that it doesn't have any detrimental effect on the normal case (where the proto object doesn't implement Marshal or Unmarshal)?
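
Not the project's benchmark suite, but as a self-contained sanity check of allocs/op, something along these lines could be run against the sketch above (stubMsg is a hypothetical stand-in for a generated message that provides its own Marshal; measuring the normal case needs a real generated message without such helpers):

package grpcproto

import "testing"

// stubMsg is hypothetical: it satisfies proto.Message and proto.Marshaler,
// standing in for a gogo/protobuf-style generated type.
type stubMsg struct{ payload []byte }

func (m *stubMsg) Reset()         { m.payload = nil }
func (m *stubMsg) String() string { return "stubMsg" }
func (*stubMsg) ProtoMessage()    {}
func (m *stubMsg) Marshal() ([]byte, error) {
  return append([]byte(nil), m.payload...), nil
}

func BenchmarkProtoCodecMarshalFastPath(b *testing.B) {
  c := protoCodec{}
  msg := &stubMsg{payload: make([]byte, 512)}
  b.ReportAllocs()
  for i := 0; i < b.N; i++ {
    if _, err := c.Marshal(msg); err != nil {
      b.Fatal(err)
    }
  }
}

Running go test -run=NONE -bench=ProtoCodec -benchmem -count=10 before and after the patch (and comparing with benchstat) would show whether either path regresses.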

@MakMukhi
Contributor

Ping.

@thelinuxfoundation

Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement.

After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.

Regards,
The Linux Foundation CLA GitHub bot

dfawley closed this Oct 26, 2017
lock bot locked as resolved and limited conversation to collaborators Jan 18, 2019