
Unmarshal allocates twice the memory needed for repeated fields #436

Closed
gunnihinn opened this issue Jul 24, 2018 · 13 comments


@gunnihinn
Contributor

The generated code that deserializes a repeated field is essentially:

xs := make([]type, 0)
for _, x := range things {
    xs = append(xs, x)
}

When the Go runtime needs to grow the backing array of a slice, it uses a doubling strategy to do so. If we have n elements to append to an empty slice, and k is such that 2^k < n <= 2^{k+1}, then this means we usually end up allocating space for at least

1 + 2 + 4 + ... + 2^{k+1} = 2^{k+2} - 1

elements in total across the successive regrowths, while making a single allocation of n elements would have been enough.

When we deserialize a protobuf message with a packed repeated field, we know how many elements we're going to deserialize and can allocate all the memory needed in one go. Here is a demonstration of a simple program that does this. Profiling its memory use with the default generated code, and also with a single ad-hoc patch that performs this allocation, shows that what we describe above really does happen.

I work for Booking.com, and we're very interested in making this deserialization code more clever about its allocations if possible and acceptable to the maintainers. We use this library in our implementation of a Graphite server, and Unmarshal is responsible for the vast majority of its memory use (around 20 GB when things are calm), which is unsurprising as the server spends most of its time deserializing very large arrays of doubles. Getting rid of this behaviour would cut our memory use by about half.

Sample results from running the above program:

## Commit 36a0ed7 (current behaviour):

$ ./protobuf-alloc
$ go tool pprof protobuf-alloc memory.prof
File: protobuf-alloc
Type: inuse_space
Time: Jul 24, 2018 at 12:01pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top                                                                                                                                                                  
Showing nodes accounting for 5175.25MB, 100% of 5175.25MB total                                                                                                              
      flat  flat%   sum%        cum   cum%
 2871.24MB 55.48% 55.48%  2871.24MB 55.48%  github.com/gunnihinn/protobuf-alloc/foo.(*Foo).Unmarshal
    2048MB 39.57% 95.05%  5175.25MB   100%  main.main
  256.01MB  4.95%   100%   256.01MB  4.95%  github.com/gunnihinn/protobuf-alloc/foo.(*Foo).Marshal
         0     0%   100%  5175.25MB   100%  runtime.main

## Commit 97163c0 (with ad-hoc allocation patch):

$ ./protobuf-alloc
$ go tool pprof protobuf-alloc memory.prof                                                                                         
File: protobuf-alloc                                                                                                                                                         
Type: inuse_space
Time: Jul 24, 2018 at 12:03pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top                                                                                                                                                                  
Showing nodes accounting for 2304.01MB, 100% of 2304.01MB total                                                                                                              
      flat  flat%   sum%        cum   cum%
    2048MB 88.89% 88.89%  2304.01MB   100%  main.main
  256.01MB 11.11%   100%   256.01MB 11.11%  github.com/gunnihinn/protobuf-alloc/foo.(*Foo).Marshal
         0     0%   100%  2304.01MB   100%  runtime.main
@awalterschulze
Member

awalterschulze commented Jul 24, 2018 via email

@gunnihinn
Contributor Author

I'm sorry, I was not precise. The generated code doesn't perform any explicit allocations, it just works with what it's given. When we write something like

f := &foo.Foo{}
f.Unmarshal(blob)

then the fields in f will have their default values, which in the case of slices have length zero. Thus, in effect, Unmarshal will behave as in the first code block in the issue report.

You could say that users have enough information to figure out how much space they'll need for repeated fields once they have both the byte blob and the protobuf definition, and should set up the struct they're going to unmarshal into accordingly. If that's the recommended use of the library, we'll follow it. However, it seems likely that most people use the library the way we do, and in that case we could check whether they've initialized their repeated fields, and do it for them if not. Most of them might not care, but a few people, like us, will end up caring a lot.

We're happy to write the patches to do this and take them through your review process. We just want to make sure this is something you'd want in the first place before doing it. :)

@awalterschulze
Member

awalterschulze commented Jul 24, 2018 via email

@gunnihinn
Contributor Author

Cool. We'll try to get something to you in the next week or two.

gunnihinn pushed a commit to gunnihinn/protobuf that referenced this issue Jul 25, 2018
This is a regression test for issue 436:

    gogo#436
gunnihinn pushed a commit to gunnihinn/protobuf that referenced this issue Jul 25, 2018
Suppose we have the protobuf definition

    message Foo {
        repeated type Stuff = 1;
    }

that we deserialize in a fairly common way:

    f := &Foo{}
    f.Unmarshal(blob)

Before the call to Unmarshal, `f.Stuff` will be a slice of length 0, so
the Unmarshal operation will more or less be:

    for _, x := range xs {
        f.Stuff = append(f.Stuff, x)
    }

If we don't know how many elements we're going to deserialize
beforehand, this is the best we can do.

Suppose, however, that we know that we're going to deserialize n
elements. If k is such that 2^k < n <= 2^{k+1}, then the Go runtime's
exponential doubling strategy for resizing the arrays that back slices
will cause us to allocate memory for at least

    1 + 2 + ... + 2^{k+1} = 2^{k+2} - 1

elements, which is usually more than double what we actually need.

When we deserialize packed fields, we know how many bytes we're going to
deserialize before we start the default append loop. If we furthermore
know how many elements those bytes correspond to, which we do when the
protobuf wire type corresponding to `type` has fixed length [1], we can
prepend the default append loop with

    f.Stuff = make([]type, 0, n)

and ask for exactly the memory we're going to use.

This results in considerable memory savings, between 50 and 80 percent,
compared with the default strategy. These savings are important to
people who use protobuf to communicate things like time series between
services, which consist almost entirely of large arrays of floats and
doubles.

This fixes gogo#436.

It's conceivable to implement similar things for packed types of
non-fixed length. They're encoded with varints, and we _could_ run
through the byte stream we're going to deserialize and count how many
bytes don't have the most significant bit set, but the performance
implications of that seem less predictable than of the simple division
we can perform here.

[1] https://developers.google.com/protocol-buffers/docs/encoding#structure
@tomwilkie
Contributor

I'm thinking about a similar optimisation - but for nested fields. Our proto is basically:

message WriteRequest {
  repeated TimeSeries timeseries = 1 [(gogoproto.nullable) = false];
}

message TimeSeries {
  repeated LabelPair labels = 1 [(gogoproto.nullable) = false];
  // Sorted by time, oldest sample first.
  repeated Sample samples   = 2 [(gogoproto.nullable) = false];
}

message LabelPair {
  bytes name  = 1 [(gogoproto.customtype) = "github.com/weaveworks/cortex/pkg/util/wire.Bytes", (gogoproto.nullable) = false];
  bytes value = 2 [(gogoproto.customtype) = "github.com/weaveworks/cortex/pkg/util/wire.Bytes", (gogoproto.nullable) = false];
}

message Sample {
  double value       = 1;
  int64 timestamp_ms = 2;
}

Some 30% of the allocations tend to happen when unmarshalling samples and labels, and we tend to know how big these arrays will be (~100 samples and ~10 labels). I wonder if we could provide this number in an extension/gadget/annotation (not sure of the right term)?

@gunnihinn
Contributor Author

Not a maintainer, just a random dude, but if you know in advance how many things live in the slices you're going to deserialize, I think you can allocate space for them in the struct you deserialize them into before calling Unmarshal:

wr := &WriteRequest{
    Timeseries: []TimeSeries{{
        Labels:  make([]LabelPair, 0, n),
        Samples: make([]Sample, 0, m),
    }},
}
wr.Unmarshal(blob)

From over here, your case seems a bit different than the one we're working on (I got two junior developers at work interested in treating the case of packed varints this morning). When dealing with packed fields, we know how many of the next bytes in the stream are relevant to the repeated field, and can decide based on that whether to try to count how many elements they correspond to. We don't have that byte count for non-packed fields, so it seems less obvious whether to scan ahead to try to count elements or not. We're going to err on the side of not in our PRs.

If we're able to handle the case of varints on our end, and get that PR and the one that treats packed fixed-length types merged, then you could profit from that by changing the protobuf definition to

message TimeSeries {
  repeated LabelPair labels = 1;
  //repeated Sample samples  = 2;
  repeated double values        = 3;
  repeated int64  timestamps_ms = 4;
}

so values and timestamps_ms would be packed primitives.

If that doesn't seem right to you, maybe your problem would get more attention from the maintainers in its own issue?

@tomwilkie
Contributor

I think you can allocate space for them in the struct you deserialize them into before calling Unmarshal

Thanks for the response! I tried that - but these are nested repeated fields. The Unmarshal code puts an empty TimeSeries in the []TimeSeries, wiping any chance I have to preallocate Labels and Samples.

changing the protobuf definition

I wish I could! But this is an external API, so it's pretty hard to go through such a change.

@tomwilkie
Contributor

I did it in an even more hacked up way (with a custom type): grafana/cortex@ef1728b#diff-8ddb008bd1159258872969a8f392d3d7L27

@awalterschulze
Member

awalterschulze commented Jul 31, 2018

@tomwilkie I am struggling to decide whether that is a hack or just the correct solution.

@awalterschulze
Member

I think we can close this issue as the pull request has been merged, or do you think there is more we can do here?

@tomwilkie
Contributor

Yeah my comments should be a separate issue ideally.

@gunnihinn
Contributor Author

gunnihinn commented Aug 1, 2018

I think we can close this issue as the pull request has been merged, or do you think there is more we can do here?

I still think it's worth doing the same for packed fields of varints. I managed to interest some junior developers over here in that as a nice project. We're having a look together, but I don't have a deadline for when their patches will be ready. I'll see to it that this gets picked up one way or the other, so you can close the issue if you want to.

@awalterschulze
Member

Ok cool, then lets keep the issue open :)

mansimarkaur pushed a commit to mansimarkaur/protobuf that referenced this issue Sep 10, 2018
mansimarkaur pushed a commit to mansimarkaur/protobuf that referenced this issue Sep 12, 2018
jmarais pushed a commit that referenced this issue Sep 14, 2018
* Adds a test to verify Unmarshal's memory usage for Varint

Validates #436

* Allocates the exact memory needed for packed repeated varints

Resolves #436

* Add benchmarks for Issue436. We'll test the change against a range of array sizes as well as different varint ranges.
The before/after tests were made using the same code against fork/master branches.

* Run 'make' with Go 1.11 and protobuf 3.5.1

It seems that combination is somehow special to the CI tooling.