
Unmarshal allocates twice the memory needed for repeated fields #436

Closed
gunnihinn opened this issue Jul 24, 2018 · 13 comments


@gunnihinn
Contributor

The generated code that deserializes a repeated field is essentially:

xs := make([]type, 0)
for _, x := range things {
    xs = append(xs, x)
}

When the Go runtime needs to grow the backing array of a slice, it uses a doubling strategy to do so. If we have n elements to append to an empty slice, and k is such that 2^k < n <= 2^{k+1}, then this means we usually end up allocating space for at least

1 + 2 + 4 + ... + 2^{k+1} = 2^{k+2} - 1

elements in total across the successive regrowths, while making a single allocation of n elements would have been enough.

When we deserialize a protobuf message with a packed repeated field, we know how many elements we're going to deserialize and can allocate all the memory needed in one go. Here is a demonstration of a simple program that does this. Profiling its memory use with the default generated code, and also with a single ad-hoc patch that performs this allocation, shows that what we describe above really does happen.

I work for Booking.com, and we're very interested in making this deserialization code more clever about its allocations if possible and acceptable to the maintainers. We use this library in our implementation of a Graphite server, and Unmarshal is responsible for the vast majority of its memory use (around 20 GB when things are calm), which is unsurprising as the server spends most of its time deserializing very large arrays of doubles. Getting rid of this behaviour would cut our memory use by about half.

Sample results from running the above program:

## Commit 36a0ed7 (current behaviour):

$ ./protobuf-alloc
$ go tool pprof protobuf-alloc memory.prof
File: protobuf-alloc
Type: inuse_space
Time: Jul 24, 2018 at 12:01pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top                                                                                                                                                                  
Showing nodes accounting for 5175.25MB, 100% of 5175.25MB total                                                                                                              
      flat  flat%   sum%        cum   cum%
 2871.24MB 55.48% 55.48%  2871.24MB 55.48%  github.com/gunnihinn/protobuf-alloc/foo.(*Foo).Unmarshal
    2048MB 39.57% 95.05%  5175.25MB   100%  main.main
  256.01MB  4.95%   100%   256.01MB  4.95%  github.com/gunnihinn/protobuf-alloc/foo.(*Foo).Marshal
         0     0%   100%  5175.25MB   100%  runtime.main

## Commit 97163c0 (with ad-hoc allocation patch):

$ ./protobuf-alloc
$ go tool pprof protobuf-alloc memory.prof                                                                                         
File: protobuf-alloc                                                                                                                                                         
Type: inuse_space
Time: Jul 24, 2018 at 12:03pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top                                                                                                                                                                  
Showing nodes accounting for 2304.01MB, 100% of 2304.01MB total                                                                                                              
      flat  flat%   sum%        cum   cum%
    2048MB 88.89% 88.89%  2304.01MB   100%  main.main
  256.01MB 11.11%   100%   256.01MB 11.11%  github.com/gunnihinn/protobuf-alloc/foo.(*Foo).Marshal
         0     0%   100%  2304.01MB   100%  runtime.main
@awalterschulze
Member

awalterschulze commented Jul 24, 2018 via email

@gunnihinn
Contributor Author

I'm sorry, I was not precise. The generated code doesn't perform any explicit allocations, it just works with what it's given. When we write something like

f := &foo.Foo{}
f.Unmarshal(blob)

then the fields in f will have their default values, which in the case of slices have length zero. Thus, in effect, Unmarshal will behave as in the first code block in the issue report.

You could say that users have enough information to figure out how much space they'll need for repeated fields once they have both the byte blob and the protobuf definition, and should set up the struct they're going to unmarshal into accordingly. If that's the recommended use of the library, we'll follow it. However, it seems likely that most people use the library the way we do, and in that case we could check whether they've initialized their repeated fields, and do it for them if not. Most of them might not care, but a few people, like us, will end up caring a lot.

We're happy to write the patches to do this and take them through your review process. We just want to make sure this is something you'd want in the first place before doing it. :)

@awalterschulze
Member

awalterschulze commented Jul 24, 2018 via email

@gunnihinn
Contributor Author

Cool. We'll try to get something to you in the next week or two.

gunnihinn pushed a commit to gunnihinn/protobuf that referenced this issue Jul 25, 2018
This is a regression test for issue 436:

    gogo#436
gunnihinn pushed a commit to gunnihinn/protobuf that referenced this issue Jul 25, 2018
Suppose we have the protobuf definition

    message Foo {
        repeated type Stuff = 1;
    }

that we deserialize in a fairly common way:

    f := &Foo{}
    f.Unmarshal(blob)

Before the call to Unmarshal, `f.Stuff` will be a slice of length 0, so
the Unmarshal operation will more or less be:

    for _, x := range xs {
        f.Stuff = append(f.Stuff, x)
    }

If we don't know how many elements we're going to deserialize
beforehand, this is the best we can do.

Suppose, however, that we know that we're going to deserialize n
elements. If k is such that 2^k < n <= 2^{k+1}, then the Go runtime's
exponential doubling strategy for resizing the arrays that back slices
will cause us to allocate memory for at least

    1 + 2 + ... + 2^{k+1} = 2^{k+2} - 1

elements, which is usually more than double what we actually need.

When we deserialize packed fields, we know how many bytes we're going to
deserialize before we start the default append loop. If we furthermore
know how many elements those bytes correspond to, which we do when the
protobuf wire type corresponding to `type` has fixed length [1], we can
prepend the default append loop with

    f.Stuff = make([]type, 0, n)

and ask for exactly the memory we're going to use.

This results in considerable memory savings, between 50 and 80 percent,
compared with the default strategy. These savings are important to
people who use protobuf to communicate things like time series between
services, which consist almost entirely of large arrays of floats and
doubles.

This fixes gogo#436.

It's conceivable to implement similar things for packed types of
non-fixed length. They're encoded with varints, and we _could_ run
through the byte stream we're going to deserialize and count how many
bytes don't have the most significant bit set, but the performance
implications of that seem less predictable than of the simple division
we can perform here.

[1] https://developers.google.com/protocol-buffers/docs/encoding#structure
@tomwilkie
Contributor

I'm thinking about a similar optimisation - but for nested fields. Our proto is basically:

message WriteRequest {
  repeated TimeSeries timeseries = 1 [(gogoproto.nullable) = false];
}

message TimeSeries {
  repeated LabelPair labels = 1 [(gogoproto.nullable) = false];
  // Sorted by time, oldest sample first.
  repeated Sample samples   = 2 [(gogoproto.nullable) = false];
}

message LabelPair {
  bytes name  = 1 [(gogoproto.customtype) = "github.com/weaveworks/cortex/pkg/util/wire.Bytes", (gogoproto.nullable) = false];
  bytes value = 2 [(gogoproto.customtype) = "github.com/weaveworks/cortex/pkg/util/wire.Bytes", (gogoproto.nullable) = false];
}

message Sample {
  double value       = 1;
  int64 timestamp_ms = 2;
}

Some 30% of the allocations tend to happen when unmarshalling samples and labels, and we tend to know how big these arrays will be (~100 samples and ~10 labels). I wonder if we could provide this number in an extension/gadget/annotation (not sure of the right term)?

@gunnihinn
Contributor Author

Not a maintainer, just a random dude, but if you know in advance how many things live in the slices you're going to deserialize, I think you can allocate space for them in the struct you deserialize them into before calling Unmarshal:

wr := &WriteRequest{
    Timeseries: []TimeSeries{{
        Labels:  make([]LabelPair, 0, n),
        Samples: make([]Sample, 0, m),
    }},
}
wr.Unmarshal(blob)

From over here, your case seems a bit different than the one we're working on (I got two junior developers at work interested in treating the case of packed varints this morning). When dealing with packed fields, we know how many of the next bytes in the stream are relevant to the repeated field, and can decide based on that whether to try to count how many elements they correspond to. We don't have that byte count for non-packed fields, so it seems less obvious whether to scan ahead to try to count elements or not. We're going to err on the side of not in our PRs.

If we're able to handle the case of varints on our end, and get that PR and the one that treats packed fixed-length types merged, then you could profit from that by changing the protobuf definition to

message TimeSeries {
  repeated LabelPair labels = 1;
  //repeated Sample samples  = 2;
  repeated double values        = 3;
  repeated int64  timestamps_ms = 4;
}

so values and timestamps_ms would be packed primitives.

If that doesn't seem right to you, maybe your problem would get more attention from the maintainers in its own issue?

@tomwilkie
Contributor

I think you can allocate space for them in the struct you deserialize them into before calling Unmarshal

Thanks for the response! I tried that - but these are nested repeated fields. The Unmarshal code puts an empty TimeSeries in the []TimeSeries, wiping any chance I have to preallocate Labels and Samples.

changing the protobuf definition

I wish I could! But this is an external API, so it's pretty hard to go through such a change.

@tomwilkie
Contributor

I did it in an even more hacked up way (with a custom type): grafana/cortex@ef1728b#diff-8ddb008bd1159258872969a8f392d3d7L27

@awalterschulze
Member

awalterschulze commented Jul 31, 2018

@tomwilkie I am struggling to decide whether that is a hack or just the correct solution.

@awalterschulze
Member

I think we can close this issue as the pull request has been merged, or do you think there is more we can do here?

@tomwilkie
Contributor

Yeah my comments should be a separate issue ideally.

@gunnihinn
Contributor Author

gunnihinn commented Aug 1, 2018

I think we can close this issue as the pull request has been merged, or do you think there is more we can do here?

I still think it's worth doing the same for packed fields of varints. I managed to interest some junior developers over here in that as a nice project. We're having a look together, but I don't have a deadline for when their patches will be ready. I'll see to it that this gets picked up one way or the other, so you can close the issue if you want to.

@awalterschulze
Member

Ok cool, then lets keep the issue open :)

mansimarkaur pushed a commit to mansimarkaur/protobuf that referenced this issue Sep 10, 2018
mansimarkaur pushed a commit to mansimarkaur/protobuf that referenced this issue Sep 12, 2018
jmarais pushed a commit that referenced this issue Sep 14, 2018
* Adds a test to verify Unmarshal's memory usage for Varint

Validates #436

* Allocates the exact memory needed for packed repeated varints

Resolves #436

* Add benchmarks for Issue436. We'll test the change against a range of array sizes as well as different varint ranges.
The before/after tests were made using the same code against fork/master branches.

* Run 'make' with Go 1.11 and protobuf 3.5.1

It seems that combination is somehow special to the CI tooling.