ResteasyReactiveOutputStream could make better use of the Netty (direct) pooled allocator #32546
Comments
/cc @FroMage (resteasy-reactive), @Sgitario (resteasy-reactive), @stuartwdouglas (resteasy-reactive)
Thanks for writing this up @franz1981. What do you propose we do?
@geoand I'm preparing a PoC, but the ideas that come to mind are:
You can probably help me understand how other streaming cases would use this stream to append chunks, if that can happen.
I've learnt a bit more how […]. It means that we cannot rely on the fact that […]
In almost all cases, we know the full object which will be written (like in the case of Jackson, where we obviously know the POJO).
Also, the user can just directly use this stream. Something I did notice you mentioned was this: 'Performing an allocation of 8K and assuming a I/O caller thread (aka the event loop)'. This is a blocking stream; it should not really be used by I/O threads, as it can block them?
@stuartwdouglas
@geoand @stuartwdouglas Proposal is at https://github.com/franz1981/quarkus/tree/append_buffer
The idea is to have Jackson able to use a "special" buffer that is aware that the caller is already batchy and doesn't try to be "smart". The append buffer I've designed can work in 3 different ways:
I would use […]. It's important for […]. I didn't yet implement any mechanism to release components in case of errors, but I would likely implement that in the append buffer itself, for simplicity - the original code was very well designed to avoid leaks, and I see no tests (I would add some, to be sure to keep this nice feature). Feedback is welcome before sending a PR.
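Since the linked branch carries the real implementation, here is only a heavily simplified, hypothetical sketch of the general idea (class and method names, modes and defaults are made up, not the actual code from the branch): exact-sized chunks for batchy callers such as Jackson, a minimum chunk size otherwise, composition into a single buffer for one write, and a release path for error cases.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.CompositeByteBuf;

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch only: not the append buffer from the linked branch.
final class SketchAppendBuffer {
    private final ByteBufAllocator allocator;
    private final boolean exactChunks;  // trust "batchy" callers (e.g. Jackson) to size their writes
    private final int minChunkSize;     // lower bound applied when callers write many tiny pieces
    private final Deque<ByteBuf> chunks = new ArrayDeque<>();

    SketchAppendBuffer(ByteBufAllocator allocator, boolean exactChunks, int minChunkSize) {
        this.allocator = allocator;
        this.exactChunks = exactChunks;
        this.minChunkSize = minChunkSize;
    }

    void append(byte[] data, int off, int len) {
        ByteBuf tail = chunks.peekLast();
        if (tail == null || tail.writableBytes() < len) {
            // Exact mode: allocate precisely what was requested and let the Netty size
            // classes normalize it; otherwise enforce a minimum chunk size for batching.
            // (For simplicity this sketch never splits a write across two chunks.)
            int capacity = exactChunks ? len : Math.max(len, minChunkSize);
            tail = allocator.directBuffer(capacity);
            chunks.addLast(tail);
        }
        tail.writeBytes(data, off, len);
    }

    // Hand everything to Netty as a single (possibly composite) buffer: one writev call.
    ByteBuf take() {
        if (chunks.size() == 1) {
            return chunks.pollFirst();
        }
        CompositeByteBuf composite = allocator.compositeBuffer(chunks.size());
        ByteBuf chunk;
        while ((chunk = chunks.pollFirst()) != null) {
            composite.addComponent(true, chunk);
        }
        return composite;
    }

    // Release any pending components, e.g. when the response fails mid-write.
    void releaseOnError() {
        ByteBuf chunk;
        while ((chunk = chunks.pollFirst()) != null) {
            chunk.release();
        }
    }
}
```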
Can I propose a slightly different approach?
Most uses will be a single exact write, so `withExactChunks` is the correct approach, and even with multiple larger writes it will work fine, as long as the writes are then merged into a composite buffer so there is only a single writev call. You might also want to consider a high/low water mark approach to max buffer size. E.g. once the buffer hits 8k we flush it, but if we have a single large write we can allocate up to 16k or more, because we know it will immediately be sent to Netty. It would also be good to know what the actual optimal chunk size is these days. When I was doing Undertow I got the best performance using 16kb write calls, but that was a long time ago, and all the hardware and software involved is now massively obsolete. So, to try and sum up all those points, it would look something like:
Obviously all these numbers are made up, and should be replaced with numbers from hard evidence rather than my gut feeling.
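A minimal sketch of how such a high/low water mark rule could look (names and thresholds are placeholders, in line with the caveat above that the numbers need hard evidence):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;

// Sketch of the high/low water mark idea; all thresholds are placeholders.
final class WatermarkedOutput {
    private static final int LOW_WATER_MARK = 8 * 1024;   // flush once this much is buffered
    private static final int HIGH_WATER_MARK = 16 * 1024; // a single large write may use up to this

    private final ByteBufAllocator allocator = ByteBufAllocator.DEFAULT;
    private ByteBuf buffer;

    void write(byte[] data, int off, int len) {
        if (buffer == null) {
            // Size the buffer for what we actually have, capped at the high water mark:
            // a large write can use the bigger size because it is flushed immediately below.
            buffer = allocator.directBuffer(Math.min(Math.max(len, LOW_WATER_MARK), HIGH_WATER_MARK));
        }
        int toWrite = Math.min(len, buffer.writableBytes());
        buffer.writeBytes(data, off, toWrite);
        if (buffer.readableBytes() >= LOW_WATER_MARK) {
            flush();
        }
        if (toWrite < len) {
            // Spill the remainder into a fresh buffer (recursion is fine for a sketch).
            write(data, off + toWrite, len - toWrite);
        }
    }

    void flush() {
        if (buffer != null) {
            sendToNetty(buffer);
            buffer = null;
        }
    }

    private void sendToNetty(ByteBuf buf) {
        // Placeholder: the real stream would hand this to the Vert.x/Netty channel.
        buf.release();
    }
}
```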
On second thoughts, the 'high water/low water' approach might not be the best idea, especially if your batch size is properly calibrated. It could result in a little bit of overflow being sent in its own packet instead of being batched.
I would propose testing both. BTW, @franz1981 you probably also want to run the tests in […]
I would really like to get something along these lines in, however, because, as I understand from what Franz has said about his preliminary investigations, this will lead to non-trivial gains in terms of memory usage.
To summarize, I see a mix of different things we would like to get right here:
In addition, I would add the behaviour of HTTP 1.1 (didn't check for HTTP 2) at https://github.com/netty/netty/blob/9f10a284712bbd7f4f8ded7711a50bfaff38f567/codec-http/src/main/java/io/netty/handler/codec/http/HttpObjectEncoder.java#L334. In short, if the previously allocated headers buffer (Netty uses statistics based on past data points to size it) is big enough to host the provided content (i.e. what Quarkus is passing), it will write the content there instead of passing it to the transport as-is, i.e. potentially as a composite buffer. The point made by @stuartwdouglas is fair: right now the buffer is using a bi-modal logic where you can disable the limits on the single writes (as long as they fit into the overall total capacity offered - that's ~8K by default), but I can slightly change it in order to still have the notion of total capacity (which influences when an […]). The sole concern here (meaning that I've got to bang my head on this a bit more) is that batching for a minimum of […]
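For readers who don't want to open the Netty link, here is a rough, hypothetical illustration of the fast path described above - this is not Netty's actual `HttpObjectEncoder` code, just the shape of the decision:

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;

// Rough illustration only; see Netty's HttpObjectEncoder for the real logic.
final class HeadersFastPathSketch {
    void encode(ChannelHandlerContext ctx, ByteBuf headers, ByteBuf content) {
        // 'headers' was sized from a running estimate of previously sent header sizes,
        // so it often has spare writable bytes once the status line and headers are in.
        if (headers.writableBytes() >= content.readableBytes()) {
            // Small payload: copy it into the headers buffer and pass a single buffer down.
            headers.writeBytes(content);
            content.release();
            ctx.write(headers);
        } else {
            // Otherwise the content (possibly a composite buffer) reaches the transport as-is.
            ctx.write(headers);
            ctx.write(content);
        }
    }
}
```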
I've added a further commit to modify the behavior as suggested by @stuartwdouglas at franz1981@f428f93 (same branch as before, but a new commit): right now it is using a magic value of […]. I am now checking how the Netty size classes are composed, and searching https://www.kernel.org/doc/html/latest/index.html to see if I can find anything interesting re […]
Something else to consider, if we want absolute maximum performance, is that the first write will also have the headers, while subsequent writes won't (although they may have chunking data). Does Netty handle this sort of batching optimisation? Maybe a Netty handler is actually the correct place, because you can just deal with fixed-size buffers without needing to know about headers etc. My gut feeling is that you would want to batch close to the MTU size, so that every writev call maps to a single packet.
The Netty HTTP 1.1 encoder is smart enough to estimate the HTTP header size based on statistics of the previously sent ones, allocating a single buffer in one go - and if the (non-chunked) payload is small enough, it can usually be written directly into that buffer (thanks to the padding the Netty allocator applies for alignment).
That's more a problem of the overall capacity than of the single chunk (that's the hidden detail that can cause composite buffers to be used), whose cost is mostly due to:
if we batch enough, even by exceeding the MTU, we guarantee decent packet utilization (we amortize the per-packet TCP header cost by putting in enough payload), even with fragmentation.
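As a back-of-the-envelope check of that amortization argument, here is a tiny sketch (assuming a 1500-byte MTU and ~40 bytes of IPv4+TCP headers per packet; both are assumptions, not measurements):

```java
// Payload efficiency of different write sizes, assuming MTU = 1500 and ~40 B of IPv4+TCP headers.
public class PacketUtilization {
    public static void main(String[] args) {
        int mtu = 1500;
        int headers = 40;                         // assumed IPv4 + TCP header size, no options
        int maxPayloadPerPacket = mtu - headers;  // 1460 bytes of payload per full packet

        for (int writeSize : new int[] {256, 1460, 8 * 1024, 16 * 1024}) {
            int packets = (writeSize + maxPayloadPerPacket - 1) / maxPayloadPerPacket;
            double efficiency = (double) writeSize / (writeSize + (long) packets * headers);
            System.out.printf("write=%6d B -> %2d packet(s), payload efficiency ~%.1f%%%n",
                    writeSize, packets, efficiency * 100);
        }
    }
}
```

Under those assumptions, even 8K or 16K writes keep the per-packet header overhead around 3%, which is the "decent packet utilization even with fragmentation" point above.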
I don't want to again define a "wrong" `minChunkSize` that just makes allocations inefficient by not correctly using the Netty TLABs: it is important to remember that missed TLAB allocations would fall back to the shared arenas of the allocator, impacting (if unlucky) other threads (I/O and not) and causing a small scalability issue. This was the first reason to just use whatever the requested capacity was and let Netty "do its thing" - that said, I agree that we should find a proper value to prevent obviously bad things from happening (sub-100 B or 1K allocations).
Description
`quarkus.resteasy-reactive.output-buffer-size` (see its documentation) seems to control not only when responses should carry the `Content-Length` header versus being chunked, but also the initial, eagerly allocated capacity of the buffer used to stream content over the connection, per response; see `quarkus/independent-projects/resteasy-reactive/server/vertx/src/main/java/org/jboss/resteasy/reactive/server/vertx/ResteasyReactiveOutputStream.java`, line 208 at f6851e3.
This "eager" behavior, despite seems optimal while simple (because it allows to perform optimistic batching) can be problematic for the Netty allocator, due to how it works under the hood.
Adding more notes below, to explain the effects/consequences.
NOTE: the default buffer size is 8191 bytes.

E.g. performing an allocation of 8K, assuming an I/O caller thread (aka the event loop) and a default-configured Netty scenario, is going to use the Netty `PoolThreadCache` at https://github.com/netty/netty/blob/c353f4fea52559d09b3811492c92a38aa1887501/buffer/src/main/java/io/netty/buffer/PoolThreadCache.java#L299-L303, i.e. a so-called small direct pooled allocation.

The cache itself is organized to have a few `MemoryRegionCache`s (in an array), chosen based on the normalized required capacity; in this case it's going to use the one at `sizeIdx = 31`, obtained by [SizeClasses::size2SizeIdx](https://github.com/netty/netty/blob/eb3feb479826949090a9a55c782722ece9b42e50/buffer/src/main/java/io/netty/buffer/SizeClasses.java#L317).

Every `MemoryRegionCache` has a finite number of pooled (and thread-local) buffers, i.e. `io.netty.allocator.smallCacheSize`, which is 256 by default. Such pooled, thread-local instances are not filled into the `MemoryRegionCache` until they are used/allocated the very first time (on demand). It means that, in the worst case, given that we always allocate 8K chunks, we risk an (8K * 256) memory footprint per event loop thread while idle, if a previous run has caused them to be fully allocated (maybe because of 256 concurrent and slow connections running on the same event loop thread).

In short: the problem is not the allocation cost, because in the happy path it is a cached, already-performed, thread-local allocation; it is a matter of idle memory footprint and poor utilization of the existing caching behavior provided by Netty. If we could use differently sized allocations here, we could make use of all the different `sizeIdx` entries that Netty provides, possibly reducing the idle footprint as well, by retaining just the effectively used ones (or close to what the effective usage has been). In addition, using many `MemoryRegionCache` entries reduces the chances of un-happy paths, because we are not bounded by just a single `MemoryRegionCache` capacity (which is fairly low, as said: just 256 buffers).
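To make the idle-footprint arithmetic concrete, here is a minimal, hypothetical sketch (class name and driver code are mine; it assumes default allocator settings, i.e. `io.netty.allocator.smallCacheSize = 256`) that walks a single event loop through the worst case described above:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.channel.nio.NioEventLoopGroup;

public class IdleCacheFootprint {
    public static void main(String[] args) throws Exception {
        NioEventLoopGroup group = new NioEventLoopGroup(1);
        PooledByteBufAllocator alloc = PooledByteBufAllocator.DEFAULT;
        int smallCacheSize = PooledByteBufAllocator.defaultSmallCacheSize(); // 256 by default

        group.submit(() -> {
            // Simulate 256 concurrent slow responses on one event loop: each one allocates
            // the same 8191-byte buffer, i.e. the very same "small" size class (sizeIdx = 31).
            ByteBuf[] inFlight = new ByteBuf[smallCacheSize];
            for (int i = 0; i < inFlight.length; i++) {
                inFlight[i] = alloc.directBuffer(8191);
            }
            // On release, each buffer is parked in this thread's PoolThreadCache entry for
            // that single size class instead of being handed back to the shared arena.
            for (ByteBuf buf : inFlight) {
                buf.release();
            }
        }).sync();

        // Worst-case idle footprint retained by that one (now idle) event loop thread.
        System.out.printf("~%d KiB can stay cached per event loop thread while idle%n",
                8L * smallCacheSize);

        group.shutdownGracefully().sync();
    }
}
```

With default settings that is roughly 2 MiB of direct memory that can stay parked per event loop thread, even when the server is idle.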
Implementation ideas
https://github.com/franz1981/quarkus/tree/append_buffer