Unmarshal allocates twice the memory needed for repeated fields #436
Can you please show the original generated code that sets the slice cap to zero? I can't find it in your example.

On Tue, 24 Jul 2018, 12:32, Gunnar Þór Magnússon wrote:
The generated code that deserializes a repeated field is essentially:

```go
xs := make([]type, 0)
for _, x := range things {
	xs = append(xs, x)
}
```
When the Go runtime needs to grow the backing array of a slice, it uses a doubling strategy to do so. If we have *n* elements to append to an empty slice, and *k* is such that *2^k < n ≤ 2^{k+1}*, then this means we usually end up allocating space for at least

1 + 2 + 4 + ... + 2^{k+1} = 2^{k+2} - 1

elements, while making a single allocation of *n* elements would have been enough.
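The doubling can be observed directly. Here is a minimal standalone sketch (not the generated code) that appends n elements one at a time and sums the capacity of every backing array the runtime creates along the way:

```go
package main

import "fmt"

// totalAllocated appends n elements one at a time to an empty slice and
// returns the total number of elements allocated across every backing
// array the runtime created while growing the slice.
func totalAllocated(n int) int {
	var xs []float64
	total := 0
	for i := 0; i < n; i++ {
		before := cap(xs)
		xs = append(xs, float64(i))
		if cap(xs) != before {
			total += cap(xs) // a new, larger backing array was allocated
		}
	}
	return total
}

func main() {
	// Appending 1000 elements allocates space for well over n elements in
	// total, versus exactly n with make([]float64, 0, n).
	fmt.Println(totalAllocated(1000))
}
```

The exact total depends on the Go version's growth policy, but it is always substantially more than n.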
When we deserialize a protobuf message with a packed repeated field, we
know how many elements we're going to deserialize and can allocate all the
memory needed in one go. Here
<https://github.com/gunnihinn/protobuf-alloc> is a demonstration of a
simple program that does this. Profiling its memory use with the default
generated code, and also with a single ad-hoc patch that performs this
allocation, shows that what we describe above really does happen.
I work for Booking.com, and we're very interested in making this
deserialization code more clever about its allocations if possible and
acceptable to the maintainers. We use this library in our implementation of
a Graphite server <https://github.com/go-graphite/carbonapi>, and
Unmarshal is responsible for the vast majority of its memory use (around 20
GB when things are calm), which is unsurprising as the server spends most
of its time deserializing very large arrays of doubles. Getting rid of this
behaviour would cut our memory use by about half.
Sample results from running the above program <https://github.com/gunnihinn/protobuf-alloc>:

## Commit 36a0ed7 (current behaviour):

```
$ ./protobuf-alloc
$ go tool pprof protobuf-alloc memory.prof
File: protobuf-alloc
Type: inuse_space
Time: Jul 24, 2018 at 12:01pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 5175.25MB, 100% of 5175.25MB total
      flat  flat%   sum%        cum   cum%
 2871.24MB 55.48% 55.48%  2871.24MB 55.48%  github.com/gunnihinn/protobuf-alloc/foo.(*Foo).Unmarshal
    2048MB 39.57% 95.05%  5175.25MB   100%  main.main
  256.01MB  4.95%   100%   256.01MB  4.95%  github.com/gunnihinn/protobuf-alloc/foo.(*Foo).Marshal
         0     0%   100%  5175.25MB   100%  runtime.main
```

## Commit 97163c0 (with ad-hoc allocation patch):

```
$ ./protobuf-alloc
$ go tool pprof protobuf-alloc memory.prof
File: protobuf-alloc
Type: inuse_space
Time: Jul 24, 2018 at 12:03pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 2304.01MB, 100% of 2304.01MB total
      flat  flat%   sum%        cum   cum%
    2048MB 88.89% 88.89%  2304.01MB   100%  main.main
  256.01MB 11.11%   100%   256.01MB 11.11%  github.com/gunnihinn/protobuf-alloc/foo.(*Foo).Marshal
         0     0%   100%  2304.01MB   100%  runtime.main
```
I'm sorry, I was not precise. The generated code doesn't perform any explicit allocations; it just works with what it's given. When we write something like `f := &foo.Foo{}` followed by `f.Unmarshal(blob)`, the fields in `f` will have their default values, which in the case of slices means length zero. Thus, in effect, Unmarshal behaves as in the first code block in the issue report.

You could say that users have enough information to figure out how much space they'll need for repeated fields by the time they have both the byte blob and the protobuf definition, and should set up the struct they're going to unmarshal into accordingly. If that's the recommended use of the library, we'll follow it. However, it seems likely that most people are using the library like we are, and in that case we *could* check if they've initialized their repeated fields, and do it for them if not. Most of them might not care, but a few people, like us, will end up caring a lot.

We're happy to write the patches to do this and take them through your review process. We just want to make sure this is something you'd want in the first place before doing it. :)
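The check described here could take roughly this shape; the helper below is a sketch with hypothetical names, where `n` would be derived from the packed field's byte length:

```go
package main

import "fmt"

// ensureCapacity makes sure dst can hold n more elements without further
// allocations, while leaving caller-provided capacity alone.
func ensureCapacity(dst []float64, n int) []float64 {
	if cap(dst)-len(dst) >= n {
		return dst // the user already sized the slice; respect it
	}
	grown := make([]float64, len(dst), len(dst)+n)
	copy(grown, dst)
	return grown
}

func main() {
	var xs []float64 // zero-value field, as after &foo.Foo{}
	xs = ensureCapacity(xs, 1000)
	fmt.Println(len(xs), cap(xs)) // 0 1000
}
```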
Cool. Yes I think this optimization can work for packed repeated fields,
which includes proto3 native fields.
If you want to contribute a patch that would be great 😀
On Tue, 24 Jul 2018, 17:52, Gunnar Þór Magnússon wrote:
I'm sorry, I was not precise. The generated code doesn't perform any
explicit allocations, it just works with what it's given. When we write
something like
f := &foo.Foo{}
f.Unmarshal(blob)
then the fields in f will have their default values, which in the case of
slices have length zero. Thus, in effect, Unmarshal will behave as in the
first code block in the issue report.
You could say that the users have enough information to figure out how
much space they'll need for repeated fields by the time they have both the
byte blob and the protobuf definition, and should set up the struct they're
going to unmarshal into accordingly. If that's the recommended use of the
library, we'll follow that. However, it seems likely that most people are
using the library like we are, and in that case we *could* check if
they've initialized their repeated fields, and do it for them if not. Most
of them might not care, but a few people, like us, will end up caring a lot.
We're happy to write the patches to do this and take them through your
review process. We just want to make sure this is something you'd want in
the first place before doing it. :)
Cool. We'll try to get something to you in the next week or two.
This is a regression test for issue 436: gogo#436

Suppose we have the protobuf definition

```proto
message Foo {
    repeated type Stuff = 1;
}
```

that we deserialize in a fairly common way:

```go
f := &Foo{}
f.Unmarshal(blob)
```

Before the call to Unmarshal, `f.Stuff` will be a slice of length 0, so the Unmarshal operation will more or less be:

```go
for _, x := range xs {
    f.Stuff = append(f.Stuff, x)
}
```

If we don't know how many elements we're going to deserialize beforehand, this is the best we can do. Suppose, however, that we know we're going to deserialize n elements. If k is such that 2^k < n <= 2^{k+1}, then the Go runtime's exponential doubling strategy for resizing the arrays that back slices will cause us to allocate memory for at least 1 + 2 + ... + 2^{k+1} = 2^{k+2} - 1 elements, which is usually more than double what we actually need.

When we deserialize packed fields, we know how many bytes we're going to deserialize before we start the default append loop. If we furthermore know how many elements those bytes correspond to, which we do when the protobuf wire type corresponding to `type` has fixed length [1], we can prepend the default append loop with

```go
f.Stuff = make([]type, 0, n)
```

and ask for exactly the memory we're going to use. This results in considerable memory savings, between 50 and 80 percent, compared with the default strategy. These savings are important to people who use protobuf to communicate things like time series between services, which consist almost entirely of large arrays of floats and doubles.

This fixes gogo#436.

It's conceivable to implement similar things for packed types of non-fixed length. They're encoded as varints, and we *could* run through the byte stream we're going to deserialize and count how many bytes don't have the most significant bit set, but the performance implications of that seem less predictable than those of the simple division we can perform here.

[1] https://developers.google.com/protocol-buffers/docs/encoding#structure
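The varint-counting idea mentioned at the end can be sketched in a few lines (a standalone illustration, not the generated code): in protobuf varint encoding, the last byte of each value is the one without the continuation bit set, so counting such bytes counts elements.

```go
package main

import "fmt"

// countVarints returns the number of varint-encoded values in a packed
// field payload: each value ends at the first byte whose most significant
// (continuation) bit is clear.
func countVarints(buf []byte) int {
	n := 0
	for _, b := range buf {
		if b&0x80 == 0 {
			n++
		}
	}
	return n
}

func main() {
	// Packed payload of three varints: 1, 300 (0xAC 0x02), and 2.
	buf := []byte{0x01, 0xAC, 0x02, 0x02}
	fmt.Println(countVarints(buf)) // 3
}
```

As the commit message notes, this requires a full pass over the payload, which is why the fixed-length case is treated first.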
I'm thinking about a similar optimisation, but for nested fields. Our proto is basically:

Some 30% of the allocations tend to be when unmarshalling.
Not a maintainer, just a random dude, but if you know in advance how many things live in the slices you're going to deserialize, I think you can allocate space for them in the struct you deserialize them into before calling Unmarshal:
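The idea can be sketched like this (a hypothetical `Foo` message type and a stand-in for the generated append loop, since the actual snippet isn't shown):

```go
package main

import "fmt"

// Foo stands in for a generated message type with a repeated field:
//   message Foo { repeated double values = 1; }
type Foo struct {
	Values []float64
}

// fillValues mimics how generated Unmarshal code fills a repeated field:
// it appends to whatever slice the struct already holds.
func (f *Foo) fillValues(decoded []float64) {
	for _, v := range decoded {
		f.Values = append(f.Values, v)
	}
}

func main() {
	decoded := make([]float64, 1000) // pretend these were just decoded

	// Preallocate before "unmarshalling" so append never reallocates.
	f := &Foo{Values: make([]float64, 0, len(decoded))}
	before := cap(f.Values)
	f.fillValues(decoded)
	fmt.Println(cap(f.Values) == before) // true: the backing array never grew
}
```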
From over here, your case seems a bit different than the one we're working on (I got two junior developers at work interested in treating the case of packed varints this morning). When dealing with packed fields, we know how many of the next bytes in the stream are relevant to the repeated field, and can decide based on that whether to try to count how many elements they correspond to. We don't have that byte count for non-packed fields, so it seems less obvious whether to scan ahead to try to count elements or not. We're going to err on the side of not in our PRs. If we're able to handle the case of varints on our end, and get that PR and the one that treats packed fixed-length types merged, then you could profit from that by changing the protobuf definition to
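something along these lines (the field names here are guesses based on the surrounding comment; in proto3, repeated scalar numeric fields are packed by default):

```proto
syntax = "proto3";

message TimeSeries {
  // Packed encoding: the wire format carries the total byte length of
  // each field up front.
  repeated double values = 1;
  repeated int64 timestamps_ms = 2;
}
```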
so `Values` and `TimestampsMs` would be packed primitives. If that doesn't seem right to you, maybe your problem would get more attention from the maintainers in its own issue?
Thanks for the response! I tried that, but these are nested.

I wish I could! But this is an external API, so it's pretty hard to go through with such a change.
I did it in an even more hacked-up way (with a custom type): grafana/cortex@ef1728b#diff-8ddb008bd1159258872969a8f392d3d7L27
@tomwilkie I am struggling to decide whether that is a hack or just the correct solution.

I think we can close this issue as the pull request has been merged, or do you think there is more we can do here?

Yeah, my comments should ideally be a separate issue.

I still think it's worth it to do the same for packed fields of varints. I managed to interest some junior developers over here in that as a nice project. We're having a look together, but I don't have a deadline for when their patches will be ready. I'll see to it that it gets picked up one way or another, so you can close the issue if you want to.

Ok cool, then let's keep the issue open :)
* Adds a test to verify Unmarshal's memory usage for varints. Validates #436.
* Allocates the exact memory needed for packed repeated varints. Resolves #436.
* Adds benchmarks for issue 436. We test the change against a range of array sizes as well as different varint ranges. The before/after tests were made using the same code against the fork/master branches.
* Run 'make' with Go 1.11 and protobuf 3.5.1. It seems that combination is somehow special to the CI tooling.