Serialization and deserialization of large messages can be slow #166
Comments
Morning! So, following up on the other PR.

Problem Statement

The rough problem statement is: when working with a Twirp server in Go that handles very large ProtoBuf messages, there's significant overhead in the serialization and deserialization code. This overhead comes from two main sources: twirp/example/service.twirp.go Lines 269 to 280 in c2eb369
The serialization code is the hottest in both memory and CPU profiles. This is because we're using Let's take a look at a couple memory profiles. Here
The most interesting thing here is that, because of the way The result of that churn can clearly be seen in the CPU profile graph:
GC + memory allocation just overwhelms the actual parsing time of the ProtoBuf objects in synthetic benchmarks. In a production service benchmark that performs operations on the buffers, GC + memory allocation is no longer number 1, but it also dominates most of the top 10.

Possible Solutions

Let's enumerate the possible solutions which I've considered and which @spenczar proposed in the previous PR.
I think that's a good summary of where we are at! Are we missing something? From my point of view: large messages are clearly a performance issue because of the (re)allocations. The least painful way to improve this bottleneck is memory pooling. The biggest open questions are whether we should make this memory pooling opt-in, whether it should be always enabled by default, how we are going to do the pooling, and what kind of objects we are going to pool. I would love to hear your thoughts on this! |
Thanks, terrific write-up and just what I was hoping for. I'd like to set some additional boundaries on this problem:
I have an additional solution to consider: Content-Length Header: Since the issue seems to be mostly around the way This would mean there's just one allocation per unmarshal.

However, I'd be very wary of reaching too quickly for

First, it can have non-linear and unexpected negative consequences for garbage collection performance. golang/go#22950 is a good summary of the problem, and golang/go#14812 (comment) describes a very similar application of pools that ends up reducing a service's throughput. You gain a little from reducing allocations but may pay a lot on GC assist CPU usage, and can pay out the nose due to the way pool cleanup is implemented. This is a long-term issue that has been known since 2013 and doesn't look likely to be resolved any time very soon.

Second, variable-sized workloads can be very bad for sync.Pool's performance, but different RPCs will have different sized requests and responses (and, indeed, clients may send wildly differently-sized messages for application-specific reasons). @dsnet has summarized the problem in golang/go#23199; since the pool returns a random object, a few large requests can "poison" a pool and make the heap get really big (and introduce GC pressure that slows down your application!). However, we might be able to address this by using many fixed-size buffers in the pool, and pulling them out as needed (like,

Third, I am worried about lock contention on high-RPS services. Today,

These aren't necessarily dealbreakers, but they're worrying. I think we need to have profiles and/or benchmarks of different load patterns before we can say whether memory pools are actually a net win for most bottlenecked services. |
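For illustration, the fixed-size-buffer idea from the comment above could look roughly like the sketch below. It is not code from Twirp or from this thread: `BucketPool`, `bucketFor`, and the size classes in `bucketSizes` are hypothetical names and values chosen for the example, and real size classes would have to come from profiling. The sketch keeps one `sync.Pool` per size class, so a buffer that was allocated for a large message is never handed to a caller that asked for a small one, and buffers that no longer match a class are dropped instead of pooled.

```go
package main

import (
	"fmt"
	"sync"
)

// bucketSizes are hypothetical size classes, picked only for this example.
var bucketSizes = []int{4 << 10, 64 << 10, 1 << 20, 16 << 20}

// BucketPool keeps one sync.Pool per size class.
type BucketPool struct {
	pools []sync.Pool
}

// NewBucketPool returns a pool with one empty sync.Pool per size class.
func NewBucketPool() *BucketPool {
	return &BucketPool{pools: make([]sync.Pool, len(bucketSizes))}
}

// bucketFor returns the index of the smallest class that fits n bytes,
// or -1 if n is larger than every class.
func bucketFor(n int) int {
	for i, size := range bucketSizes {
		if n <= size {
			return i
		}
	}
	return -1
}

// Get returns a zero-length slice whose capacity is at least n.
func (p *BucketPool) Get(n int) []byte {
	i := bucketFor(n)
	if i < 0 {
		// Too large for any class: allocate directly and never pool it.
		return make([]byte, 0, n)
	}
	if v := p.pools[i].Get(); v != nil {
		return (*v.(*[]byte))[:0]
	}
	return make([]byte, 0, bucketSizes[i])
}

// Put hands buf back to the class whose size equals its capacity.
// Buffers with any other capacity (for example, ones that were grown
// by append) are left for the garbage collector.
func (p *BucketPool) Put(buf []byte) {
	for i, size := range bucketSizes {
		if cap(buf) == size {
			buf = buf[:0]
			p.pools[i].Put(&buf)
			return
		}
	}
}

func main() {
	pool := NewBucketPool()
	buf := pool.Get(10_000) // served from the 64 KiB class
	fmt.Println(len(buf), cap(buf)) // prints "0 65536"
	pool.Put(buf)
}
```

A sketch like this only addresses the "poisoning" concern. It does nothing about the GC-assist and lock-contention concerns, which would still need benchmarks under realistic load.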
Even a max of 10MiB is difficult, since it means that an adversary only needs to send 100 small requests, each claiming to be 10MiB, causing the server to allocate 1GiB. An important principle of DDoS protection is that the attacker has to spend resources at least proportional to what they are causing you to waste. The implementation of Go protobufs is unique in that we operate on a single I should note that protobufs for C++ and Java operate on scatter/gather-like APIs for buffers. |
Yeah, @dsnet, you have me convinced that trusting Content-Length is probably a dead-end. Doesn't this problem exist to some extent for any protobuf-based protocol, since strings and bytes and embedded messages are length-prefixed? How does the github.com/golang/protobuf/proto unmarshaler avoid allocating 1GiB when a (possibly untrusted) message promises a 1GiB string, but never actually delivers it? golang/protobuf#609 sounds like a nice strategy, but if @dsnet doesn't want to move forward on it until @dfawley's suggestion in golang/protobuf#609 (comment) is what I had in mind when I suggested a streaming deserializer. Man, we don't have a lot of options here. The most promising option might be to use pools with fixed (large-ish) buffer sizes and go from there. |
The fact that protobuf takes a
It's possible that we move forward without it, or push more heavily to get it added in Go1.13 or Go1.14 (no promises). Alternative APIs that don't involve |
Back on this thread. Sorry, I've been busy! 😅 I am not particularly opposed to It seems like our only realistic options are waiting for golang/protobuf#609 (which may take a while), or writing a smarter self-adjusting memory pool that is enabled by default and that works well with small messages. I think a smarter pool is doable and will provide solid benefits right away. @spenczar: would you like me to submit something like that as a PR? |
@vmg A smarter pool could work. I'd be interested in a design for one, yes. If you think it's easiest to talk over a design in PR form, that's fine by me; discussion in this issue is good too. |
What I had in mind is something similar to the pool that I think we would need to base it off this code because we would want to pool What do you think? |
Sorry for the long silence here. That code looks alright, but it's fairly complicated. It adds a bunch of atomic calls, which could require synchronization during high-throughput work. I think we should go carefully. Can we gather real-world, production benchmarks from a version of Twirp that uses that pool before committing to it? |
This issue is stale because it has been open 60 days with no activity. Remove stale label / comment or this will be closed in 5 days |
As mentioned in #165, very large messages can slow down a Twirp server due to serialization and deserialization costs.
More detail is needed here: what are the profiling hotspots? It sounds like byte slice allocation is a big one; if so, memory pooling could be useful.