
grpc: reduce allocations on Codec.{Unm,M}arshal codepaths #1478

Closed
wants to merge 1 commit

Conversation

irfansharif
Contributor

irfansharif commented Aug 25, 2017

The previous use of proto.Buffer was slightly incorrect: proto.Buffer is
only useful when {un,}marshalling proto messages into and from the same
buffer, and the current grpc.Codec interface does not have enough
specificity to make good use of this pattern.

In the previous usage we retrieved a cachedProtoBuffer/proto.Buffer from
a sync.Pool for every {Unm,M}arshal call, initialized/set a completely
new buffer, and {un,}marshalled into and from it. There was effectively
no reuse of an allocated buffer across invocations of {Unm,M}arshal.
Moreover, by routing every proto.{Unm,M}arshal call through a
proto.Buffer as an intermediary, we effectively allocated 2x the bytes
when 1x was sufficient. This is due to an internal implementation detail
of proto.Buffer: for every proto.(Marshaler).Marshal() call that returns
a byte slice, proto.Buffer copies it over into its own internal buffer
(which we explicitly initialize/set ourselves). This is shown in the
profiles below and not included in the commit message.

For types that satisfy the {Unm,M}arshaler interfaces, however, we can
simply {un,}marshal them directly, without needing to go through a
proto.Buffer. This is what the default proto.{Unm,M}arshal does:

// Marshal takes the protocol buffer
// and encodes it into the wire format, returning the data.
func Marshal(pb Message) ([]byte, error) {
  // Can the object marshal itself?
  if m, ok := pb.(Marshaler); ok {
    return m.Marshal()
  }

  p := NewBuffer(nil)
  err := p.Marshal(pb)
  // ...

We fall back to proto.Buffer only when {Unm,M}arshaler is not satisfied.
The "right" fix for our use of sync.Pool for proto.Buffers at this level
should probably live in golang/protobuf instead; that patch wasn't
accepted upstream in light of a reworked internal version they're
looking to export.
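
For context, a minimal sketch of the codec shape being described (illustrative only, not the exact diff in this PR; the method set matches the grpc.Codec interface):

package grpcproto // illustrative package name

import "github.com/golang/protobuf/proto"

// protoCodec marshals directly when the message provides its own
// Marshal/Unmarshal helpers, and otherwise delegates to proto.{Unm,M}arshal,
// which only allocates a proto.Buffer on that slower path.
type protoCodec struct{}

func (protoCodec) Marshal(v interface{}) ([]byte, error) {
  if m, ok := v.(proto.Marshaler); ok {
    return m.Marshal() // fast path: one allocation, no extra copy
  }
  return proto.Marshal(v.(proto.Message))
}

func (protoCodec) Unmarshal(data []byte, v interface{}) error {
  // proto.Unmarshal resets the message and takes the Unmarshaler fast
  // path itself, so there's nothing to pool here.
  return proto.Unmarshal(data, v.(proto.Message))
}

func (protoCodec) String() string { return "proto" }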

+cc @bdarnell / @tamird

@irfansharif
Contributor Author

With pprof on the heap profile (sample_index=alloc_space) for a short run of https://github.com/irfansharif/pinger, before:

         .          .     57:func (p protoCodec) marshal(v interface{}, cb *cachedProtoBuffer) ([]byte, error) {
         .          .     58:   protoMsg := v.(proto.Message)
  532.29MB   532.29MB     59:   newSlice := make([]byte, 0, cb.lastMarshaledSize)
         .          .     60:
         .          .     61:   cb.SetBuf(newSlice)
         .          .     62:   cb.Reset()
         .   556.81MB     63:   if err := cb.Marshal(protoMsg); err != nil {
         .          .     64:           return nil, err
         .          .     65:   }

The misuse of proto.Buffer is evident here:

         .          .    261:func (p *Buffer) Marshal(pb Message) error {
         .          .    262:   // Can the object marshal itself?
         .          .    263:   if m, ok := pb.(Marshaler); ok {
         .   551.80MB    264:           data, err := m.Marshal()
       5MB        5MB    265:           p.buf = append(p.buf, data...)
         .          .    266:           return err
         .          .    267:   }

Here p.buf is the slice set in cachedProtoBuffer.SetBuf. Not only are we allocating more than necessary, we're also copying unnecessarily.

@irfansharif
Contributor Author

irfansharif commented Aug 25, 2017

[image: pprof SVG graph of the heap profile]

This is the SVG view of the same profile, showing the dual allocation described above.
NB: the Unmarshal code path technically isn't affected, given we don't allocate buffers for it, but the change saves the unnecessary sync.Pool.{Get,Put} operations and the small overhead that comes with them. The earlier revision was essentially re-doing what proto.Unmarshal already does internally.
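
For reference, the proto.Unmarshal path being duplicated looks roughly like this (paraphrased from the golang/protobuf of that era, not an exact quote):

// Unmarshal resets pb and parses the wire-format data in buf into it.
func Unmarshal(buf []byte, pb Message) error {
  pb.Reset()
  return UnmarshalMerge(buf, pb)
}

func UnmarshalMerge(buf []byte, pb Message) error {
  // Can the object unmarshal itself?
  if u, ok := pb.(Unmarshaler); ok {
    return u.Unmarshal(buf)
  }
  return NewBuffer(buf).Unmarshal(pb) // allocates a proto.Buffer per call
}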

@apolcyn
Contributor

apolcyn commented Aug 25, 2017

What benchmark are these measurements from?

I'd note that the goal of the change that started using cached proto.Buffers was not to save byte slice allocations, but rather to save proto.Buffer allocations.

Measuring total allocations by the benchmark in https://performance-dot-grpc-testing.appspot.com/explore?dashboard=5636470266134528&widget=952298825&container=846542667 (using small messages), over a roughly 30 second period, showed about 16GB of total allocations on the server, with a little more than 8GB due to allocation of proto.Buffer structs (in https://github.com/golang/protobuf/blob/master/proto/encode.go#L235 and https://github.com/golang/protobuf/blob/master/proto/decode.go#L412).

One thing I'm wondering, though, is whether protos generated from an updated generator would have an effect here.

@irfansharif
Contributor Author

irfansharif commented Aug 26, 2017

What benchmark are these measurements from?

Included above, but it's running the simplest ping server, https://github.com/irfansharif/pinger, for a 30s duration on my local MacBook Pro.
To replicate, run ./pinger -server -t grpc -s 512 and ./pinger -d 45s -t grpc -p 64 -c 512 -s 512 separately (the second is the client), then curl -k -o heap 'http://localhost:6060/debug/pprof/heap'; pprof -alloc_space pinger heap, where pinger is the binary.

Note that for this run proto.Buffer is not allocated, given the incoming message satisfies proto.Marshaler, which presumably holds for the majority of requests given they come from generated .pb.go files. (I'm actually not sure when it would not.)
Ditto for satisfied proto.Unmarshaler interfaces: proto.Buffer isn't allocated because we don't hit those code paths.

irfansharif added a commit to irfansharif/gogo-protobuf that referenced this pull request Aug 26, 2017
Lower GC pressure whenever temporary Buffer types are allocated.
Evidently this is an issue grpc/grpc-go ran into in the past.

+cc grpc/grpc-go#1478.

Additionally remove TODOs left regarding exactly this.
irfansharif added a commit to irfansharif/protobuf that referenced this pull request Aug 26, 2017
Lower GC pressure whenever temporary Buffer types are allocated;
evidently this is an issue grpc/grpc-go ran into in the past. Additionally,
remove the TODOs left regarding exactly this.

+cc grpc/grpc-go#1478.
@irfansharif
Contributor Author

irfansharif commented Aug 26, 2017

@apolcyn: I wrote up golang/protobuf#418, which I think is a more appropriate fix for the issue you mentioned above (though I'm still unsure how to replicate it within the context of grpc-go).
Also, for the benchmarks you linked above, are those nightly master runs? I'm not sure how to navigate the dashboard or what to infer from it; are the benchmarking programs publicly accessible?

EDIT: gah, I only just realized that generated proto messages don't always have {Unm,M}arshal methods; I was using gogo/protobuf and had these autogenerated (+cc golang/protobuf#280). I still think golang/protobuf#418 is the right way to solve this in conjunction with this changeset. Presumably this is what you meant by "if the protos generated from an updated generator would have an effect here".

@apolcyn
Contributor

apolcyn commented Aug 28, 2017

Also for the benchmarks you linked above, are those nightly master runs? I'm not sure how to navigate it and what/how to infer, are the benchmarking programs publicly accessible?

Yes, those are run about daily. Running them manually is slightly involved, but there are some docs on it in https://github.com/grpc/grpc/blob/master/tools/run_tests/performance/README.md.

I still think golang/protobuf#418 is the right way to solve this in conjunction with this changeset. Presumably this is what you meant by "if the protos generated from an updated generator would have an effect here".

I see, I think that change in protobuf can make sense. Also yes, that is what I meant to say: the current proto.Buffer caching saves allocs when generated protos don't satisfy those interfaces.

@irfansharif
Contributor Author

golang/protobuf#418 was unfortunately rejected, seemingly because they're exporting a reworked internal version and want to avoid merge conflicts. This PR still benefits the group of users providing {Unm,M}arshal methods (who are also unaffected by the proto.Buffer allocations, since they don't hit those codepaths), at the cost of proto.Buffer allocations for users not doing so. We could of course copy over small parts of golang/protobuf (effectively re-doing golang/protobuf#418 here) to address both use cases, but I'd first like to know if that's acceptable here.

In any case it's worth noting that with the current state of things across here and golang/protobuf, for every marshaled byte through grpc we allocate twice as much as necessary and explicitly copy once more than necessary (within these codepaths).

@MakMukhi
Contributor

@irfansharif Thanks for reading through the code, understanding it, and finding optimization hot-spots. However, we do wish you had talked to us a little before writing out all the code in your 3 PRs. We're working on optimization quite actively and are aware of these optimizations. Some of your changes, albeit good, introduce issues; for instance, in your buffer re-use PR you're making a copy of data to add it to headers. Also, some of these changes will be rendered obsolete once we're done with the code restructuring that's currently underway.
Unfortunately we won't be able to review the code for a couple of weeks either. I apologize for that.
It's always a good idea to drop in a quick note to discuss design before making changes.
Thanks for your time,

@irfansharif
Contributor Author

irfansharif commented Aug 29, 2017

@MakMukhi: it's no issue haha; like I mentioned in #1482, I was just chomping through some of our grpc bottlenecks within cockroachdb/cockroach and these seemed like easy enough pickings to upstream. Nothing ventured, nothing gained!

We're working on optimization quite actively and are aware of these optimizations

I'm glad to hear it. I should have consulted earlier but wanted to see how far I could get over the weekend. I have a host of other PRs in a similar vein that I'll hold off on in the interim.

Also, some of these changes will be rendered obsolete once we're done with the code restructuring that's underway currently.

Are the plans for this publicly visible? I did look for indications of it (mailing lists and here) but didn't come across any. In any case, if there's anything here I can help with, do let me know; I'm looking to upstream as many of our results as possible. Feel free to close this PR if needed.

@irfansharif
Contributor Author

irfansharif commented Aug 29, 2017

@MakMukhi, @apolcyn: PTAL here, reworked in light of golang/protobuf#418 not getting in. We should have the best of both worlds: no extra proto.Buffer allocs, and no extra copying for users providing {Unm,M}arshal helpers. Commit message updated accordingly.
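
Roughly the shape of the rework (a sketch; the pool handling and names here are illustrative, not necessarily the exact code in the updated commit):

package grpcproto // illustrative, same caveats as the sketch further up

import (
  "sync"

  "github.com/golang/protobuf/proto"
)

type protoCodec struct{}

// bufferPool exists only to avoid reallocating the proto.Buffer struct itself
// for messages without their own Marshal/Unmarshal helpers; the byte slices
// are not reused across calls.
var bufferPool = sync.Pool{
  New: func() interface{} { return &proto.Buffer{} },
}

func (protoCodec) Marshal(v interface{}) ([]byte, error) {
  if m, ok := v.(proto.Marshaler); ok {
    return m.Marshal() // no proto.Buffer, no extra copy
  }
  b := bufferPool.Get().(*proto.Buffer)
  b.SetBuf(nil) // never write into a slice we previously handed out
  err := b.Marshal(v.(proto.Message))
  out := b.Bytes()
  bufferPool.Put(b)
  return out, err
}

func (protoCodec) Unmarshal(data []byte, v interface{}) error {
  pm := v.(proto.Message)
  pm.Reset()
  if u, ok := v.(proto.Unmarshaler); ok {
    return u.Unmarshal(data)
  }
  b := bufferPool.Get().(*proto.Buffer)
  b.SetBuf(data)
  err := b.Unmarshal(pm)
  b.SetBuf(nil) // don't pin the caller's data in the pool
  bufferPool.Put(b)
  return err
}

func (protoCodec) String() string { return "proto" }

The slow path still pays the 1x slice allocation inside proto.Buffer, but the Buffer struct itself is pooled; the fast path touches neither.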

dfawley added the Type: Performance label Aug 31, 2017
@MakMukhi
Contributor

MakMukhi commented Sep 28, 2017

@irfansharif This PR looks good. Can you run our benchmarks on this to verify that it doesn't have any detrimental effect on the normal case (where the proto object doesn't implement Marshal or Unmarshal)?
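
Not the project's benchmark suite, but as a self-contained sanity check of allocs/op, something along these lines could be run against the sketch above (stubMsg is a hypothetical stand-in for a generated message that provides its own Marshal; measuring the normal case needs a real generated message without such helpers):

package grpcproto

import "testing"

// stubMsg is hypothetical: it satisfies proto.Message and proto.Marshaler,
// standing in for a gogo/protobuf-style generated type.
type stubMsg struct{ payload []byte }

func (m *stubMsg) Reset()         { m.payload = nil }
func (m *stubMsg) String() string { return "stubMsg" }
func (*stubMsg) ProtoMessage()    {}
func (m *stubMsg) Marshal() ([]byte, error) {
  return append([]byte(nil), m.payload...), nil
}

func BenchmarkProtoCodecMarshalFastPath(b *testing.B) {
  c := protoCodec{}
  msg := &stubMsg{payload: make([]byte, 512)}
  b.ReportAllocs()
  for i := 0; i < b.N; i++ {
    if _, err := c.Marshal(msg); err != nil {
      b.Fatal(err)
    }
  }
}

Running go test -run=NONE -bench=ProtoCodec -benchmem -count=10 before and after the patch (and comparing with benchstat) would show whether either path regresses.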

@MakMukhi
Contributor

Ping.

@thelinuxfoundation

Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement.

After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.

Regards,
The Linux Foundation CLA GitHub bot

dfawley closed this Oct 26, 2017
lock bot locked as resolved and limited conversation to collaborators Jan 18, 2019