Transfer network bytes to smaller buffer #62673
Conversation
Currently we read in 64KB blocks from the network. When TLS is not enabled, these bytes are normally passed all the way to the application layer (some exceptions: compression). For the HTTP layer this means that these bytes can live throughout the entire lifecycle of an indexing request. The problem is that if the reads from the socket are small, 64KB buffers can be consumed by 1KB or smaller reads. If the socket buffer or TCP buffer sizes are small, this leads to massive memory waste. It has been identified as a major source of OOMs on coordinating nodes, as Elasticsearch easily exhausts the heap for these network bytes. This commit resolves the problem by placing a handler after the TLS handler to copy these bytes to a more appropriately sized buffer as necessary. This comes after TLS because TLS is a framing layer which often resolves this problem for us (the 64KB buffer will be decoded into a more appropriately sized buffer). However, this extra handler will solve it for the non-TLS pipelines.
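To make the intent concrete, here is a minimal sketch of such a right-sizing inbound handler (hypothetical class name and 1KB/2x thresholds for illustration, not the exact code in this PR): when a small read arrives in an oversized pooled buffer, copy the readable bytes into a buffer sized to fit and let the oversized one be released.

import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.MessageToMessageDecoder;
import java.util.List;

public class RightSizingHandler extends MessageToMessageDecoder<ByteBuf> {

    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf buf, List<Object> out) {
        int readableBytes = buf.readableBytes();
        // Only bother when the backing buffer is much larger than the data it holds.
        if (buf.capacity() > 1024 && buf.capacity() >= 2 * readableBytes) {
            ByteBuf copy = ctx.alloc().heapBuffer(readableBytes, readableBytes);
            copy.writeBytes(buf);
            out.add(copy);
            // the base decoder releases the original (oversized) input buffer for us
        } else {
            out.add(buf.retain()); // pass through; retain since the base decoder releases its input
        }
    }
}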
Pinging @elastic/es-distributed (:Distributed/Network)
Some smaller comments and a question.
Generally +1 on this, especially for searches, where we seem to be wasting a ton of bytes here.
modules/transport-netty4/src/main/java/org/elasticsearch/transport/NettyByteBufSizer.java
// copy.
int estimatedSize = buf.maxFastWritableBytes() + buf.writerIndex();
if (estimatedSize > 1024 && buf.maxFastWritableBytes() >= buf.readableBytes()) {
    ByteBuf newBuffer = ctx.alloc().heapBuffer(buf.readableBytes());
Might give us more optimally sized buffers if we set the max capacity as well, via ctx.alloc().heapBuffer(length, length)?
The max capacity does not impact the allocation size. It is essentially a limit for future expansion (reallocations) of the buffer.
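For illustration (my understanding of the Netty allocator API, not something from this PR), the second argument to heapBuffer(initialCapacity, maxCapacity) only limits how far the buffer may later grow; the initial allocation is the same either way:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocator;

public class MaxCapacityDemo {
    public static void main(String[] args) {
        ByteBufAllocator alloc = PooledByteBufAllocator.DEFAULT;
        ByteBuf capped = alloc.heapBuffer(1024, 1024); // may never grow past 1024 bytes
        ByteBuf uncapped = alloc.heapBuffer(1024);     // may later be expanded up to Integer.MAX_VALUE
        System.out.println(capped.capacity());   // 1024
        System.out.println(uncapped.capacity()); // 1024 as well: same initial allocation
        capped.release();
        uncapped.release();
    }
}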
public class NettyByteBufSizer extends MessageToMessageDecoder<ByteBuf> {

    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf buf, List<Object> out) {
Should we really do this in general? It seems to only make sense for REST handlers that don't copy the buffers to unpooled anyway (search and bulk only as of right now). Maybe we should just copy those requests to new pooled buffers of appropriate size and leave the rest of them alone, since we're releasing them on the IO thread right away anyway?
As discussed in the meeting, this is valuable as large messages can be aggregated for a period of time, hurting the memory ratios without this change.
This is the simpler PR that I think we preferred.
As discussed, I'm +1 on this solution over the alternative. Just one question.
// twice as big as necessary to contain the data. If that is the case, allocate a new buffer and
// copy.
int estimatedSize = maxFastWritableBytes + buf.writerIndex();
if (estimatedSize > 1024 && maxFastWritableBytes >= readableBytes) {
Couldn't we, instead of rolling our own copying here, just use a call to ByteBuf#capacity
and make Netty resize things? I.e. just do something like:
@Override
protected void decode(ChannelHandlerContext ctx, ByteBuf buf, List<Object> out) {
    final int readableBytes = buf.readableBytes();
    out.add(buf.discardReadBytes().capacity(readableBytes).retain());
}
At least in some quick experimentation with the debugger, that method seems to accomplish a similar if not the same thing as what we do here, but with less copying in case the reader index is 0 (which it probably is most of the time?). But maybe I'm missing something?
This seems fine. I guarded against buffers smaller than 1024 because capacity will always copy small arrays, which seems unnecessary.
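Putting the suggestion and the guard together, the handler might look roughly like this (a sketch for illustration, not necessarily the exact merged code):

import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.MessageToMessageDecoder;
import java.util.List;

public class NettyByteBufSizerSketch extends MessageToMessageDecoder<ByteBuf> {

    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf buf, List<Object> out) {
        int readableBytes = buf.readableBytes();
        if (buf.capacity() >= 1024) {
            // discardReadBytes() shifts the readable bytes to index 0, then capacity()
            // trims (reallocating and copying if needed) to exactly the readable length.
            buf = buf.discardReadBytes().capacity(readableBytes);
        }
        out.add(buf.retain()); // retain because the base decoder releases its input
    }
}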
LGTM, but let us have @original-brownbear's approval too before merging.
Could we maybe add a test just testing the logic in NettyByteBufSizer
in isolation, verifying that we get all the bytes out in a reasonably sized buffer?
I did, but then removed it when I took Armin's suggestion. If we allow Netty to control the resizing, it mutates the buffer in place and there is nothing to assert on. We depend on Netty internally reallocating and copying the bytes, as opposed to just adjusting indexes.
LGTM, thanks Tim!
Still needs backport to 7.9.x.