
Transfer network bytes to smaller buffer #62673

Merged: 10 commits merged into elastic:master on Sep 30, 2020

Conversation

@Tim-Brooks (Contributor)

Currently we read in 64KB blocks from the network. When TLS is not
enabled, these bytes are normally passed all the way to the application
layer (some exceptions: compression). For the HTTP layer this means that
these bytes can live throughout the entire lifecycle of an indexing
request.

The problem is that if the reads from the socket are small, 64KB buffers can end up holding 1KB or smaller reads. If the socket buffer or TCP buffer sizes are small, this leads to massive memory waste. It has been identified as a major source of OOMs on coordinating nodes, as Elasticsearch easily exhausts the heap for these network bytes.

This commit resolves the problem by placing a handler after the TLS
handler to copy these bytes to a more appropriate buffer size as
necessary. This comes after TLS, because TLS is a framing layer which
often resolves this problem for us (the 64KB buffer will be decoded
into a more appropriate buffer size). However, this extra handler will
solve it for the non-TLS pipelines.
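
A minimal sketch of the kind of inbound handler described above, assuming Netty's MessageToMessageDecoder as the base class (the merged handler may differ in detail): when the backing 64KB read buffer is much larger than its readable contents, the bytes are copied into a right-sized heap buffer so the oversized read buffer can be released immediately. In a TLS pipeline this handler sits after the SslHandler, since TLS decoding usually produces right-sized buffers already.

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.handler.codec.MessageToMessageDecoder;

    import java.util.List;

    // Illustrative sketch only; not the exact handler merged in this PR.
    public class CopyingByteBufSizer extends MessageToMessageDecoder<ByteBuf> {

        @Override
        protected void decode(ChannelHandlerContext ctx, ByteBuf buf, List<Object> out) {
            int readableBytes = buf.readableBytes();
            // Only bother when the backing buffer is substantially larger than its contents,
            // e.g. a 1KB read sitting in a 64KB read buffer.
            if (buf.capacity() > 1024 && buf.capacity() >= 2 * readableBytes) {
                ByteBuf copy = ctx.alloc().heapBuffer(readableBytes);
                copy.writeBytes(buf, buf.readerIndex(), readableBytes);
                out.add(copy);
                // The oversized input buffer is released by MessageToMessageDecoder after decode().
            } else {
                // Already reasonably sized: pass the buffer through unchanged.
                out.add(buf.retain());
            }
        }
    }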

@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Network)

elasticmachine added the Team:Distributed (Obsolete) label on Sep 18, 2020.
@original-brownbear (Member) left a comment

Some smaller comments and a question.
Generally +1 on this; especially for searches, we seem to be wasting a ton of bytes here.

// copy.
int estimatedSize = buf.maxFastWritableBytes() + buf.writerIndex();
if (estimatedSize > 1024 && buf.maxFastWritableBytes() >= buf.readableBytes()) {
ByteBuf newBuffer = ctx.alloc().heapBuffer(buf.readableBytes());
@original-brownbear (Member)

Might give us more optimally sized buffers if we set the max capacity as well via ctx.alloc().heapBuffer(length, length)?

@Tim-Brooks (Contributor, Author)

The max capacity does not impact the allocation size. It is essentially a limit for future expansion (reallocations) of the buffer.
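
As a small, hypothetical demo of this point (not code from the PR): with Netty's pooled allocator, the second argument to heapBuffer only caps future growth, while the initial allocation is driven by the first argument either way.

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.ByteBufAllocator;
    import io.netty.buffer.PooledByteBufAllocator;

    public class MaxCapacityDemo {
        public static void main(String[] args) {
            ByteBufAllocator alloc = PooledByteBufAllocator.DEFAULT;

            // Same initial capacity in both cases; only the allowed growth differs.
            ByteBuf unbounded = alloc.heapBuffer(1024);        // maxCapacity defaults to Integer.MAX_VALUE
            ByteBuf bounded = alloc.heapBuffer(1024, 1024);    // can never grow beyond 1024 bytes

            System.out.println(unbounded.capacity() + " / " + unbounded.maxCapacity()); // expected: 1024 / 2147483647
            System.out.println(bounded.capacity() + " / " + bounded.maxCapacity());     // expected: 1024 / 1024

            unbounded.release();
            bounded.release();
        }
    }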

public class NettyByteBufSizer extends MessageToMessageDecoder<ByteBuf> {

@Override
protected void decode(ChannelHandlerContext ctx, ByteBuf buf, List<Object> out) {
@original-brownbear (Member)

Should we really do this in general? It seems to only make sense for REST handlers that don't copy the buffers to unpooled anyway (search and bulk only as of right now). Maybe we should just copy those requests to new pooled buffers of appropriate size and leave the rest of them alone, since we're releasing them on the IO thread right away anyway?

@Tim-Brooks (Contributor, Author)

As discussed in the meeting, this is valuable as large messages can be aggregated for a period of time, hurting the memory ratios without this change.

@Tim-Brooks (Contributor, Author)

This is the simpler PR that I think we preferred.

@original-brownbear (Member) left a comment

As discussed, I'm +1 on this solution over the alternative. Just one question.

// twice as big as necessary to contain the data. If that is the case, allocate a new buffer and
// copy.
int estimatedSize = maxFastWritableBytes + buf.writerIndex();
if (estimatedSize > 1024 && maxFastWritableBytes >= readableBytes) {
@original-brownbear (Member) commented on Sep 30, 2020

Couldn't we, instead of rolling our own copying here, just use a call to ByteBuf#capacity and make Netty resize things? I.e. just do something like:

    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf buf, List<Object> out) {
        final int readableBytes = buf.readableBytes();
        out.add(buf.discardReadBytes().capacity(readableBytes).retain()); 
    }

At least in some quick experimentation with the debugger, that method seems to accomplish a similar, if not the same, thing as what we do here, but with less copying when the reader index is 0 (which it probably is most of the time?). But maybe I'm missing something?

@Tim-Brooks (Contributor, Author)

This seems fine. I guarded against less than 1024 because capacity() will always copy small arrays, which seems unnecessary.
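
For illustration, a minimal sketch of how the capacity()-based resize and the 1024-byte guard could fit together in a MessageToMessageDecoder; the class and its name here are hypothetical and not necessarily the exact code that was merged:

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelHandler;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.handler.codec.MessageToMessageDecoder;

    import java.util.List;

    @ChannelHandler.Sharable
    public class ByteBufSizerSketch extends MessageToMessageDecoder<ByteBuf> {

        @Override
        protected void decode(ChannelHandlerContext ctx, ByteBuf buf, List<Object> out) {
            final int readableBytes = buf.readableBytes();
            if (buf.capacity() >= 1024) {
                // Drop already-consumed bytes and ask Netty to shrink the capacity down to
                // what is actually readable. For pooled buffers a large shrink typically
                // reallocates into a smaller pooled buffer and copies the readable bytes over.
                buf.discardReadBytes().capacity(readableBytes);
            }
            // MessageToMessageDecoder releases the input after decode(), so retain the
            // buffer before passing it down the pipeline.
            out.add(buf.retain());
        }
    }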

@henningandersen (Contributor) left a comment

LGTM, but let us have @original-brownbear's approval too before merging.

Could we maybe add a test just testing the logic in NettyByteBufSizer in isolation, verifying that we get all the bytes out in a reasonably sized buffer?

@Tim-Brooks (Contributor, Author)

> Could we maybe add a test just testing the logic in NettyByteBufSizer in isolation, verifying that we get all the bytes out in a reasonably sized buffer?

I did, but then removed it when I took Armin's suggestion. If we allow Netty to control the resizing, it mutates the buffer in place and there is nothing to assert on. We depend on Netty internally reallocating and copying the bytes, as opposed to just adjusting indexes.
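
For context, an isolated test along the lines Henning suggested could look roughly like the sketch below, reusing the hypothetical ByteBufSizerSketch handler from above together with Netty's EmbeddedChannel. As noted, it can only assert on the observable indexes and capacity, not on whether Netty actually reallocated and copied the backing array.

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.PooledByteBufAllocator;
    import io.netty.channel.embedded.EmbeddedChannel;

    public class ByteBufSizerSketchTest {

        public static void main(String[] args) {
            EmbeddedChannel channel = new EmbeddedChannel(new ByteBufSizerSketch());

            // Simulate a small read that arrived in a large (64KB) pooled buffer.
            ByteBuf largeBuffer = PooledByteBufAllocator.DEFAULT.heapBuffer(64 * 1024);
            byte[] payload = new byte[1024];
            largeBuffer.writeBytes(payload);

            channel.writeInbound(largeBuffer);
            ByteBuf resized = channel.readInbound();

            // The readable bytes are preserved and the capacity has been trimmed to match.
            assert resized.readableBytes() == payload.length;
            assert resized.capacity() == payload.length;

            resized.release();
            channel.finishAndReleaseAll();
        }
    }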

@original-brownbear (Member) left a comment

LGTM, thanks Tim!

Tim-Brooks merged commit 1547bd6 into elastic:master on Sep 30, 2020
Tim-Brooks added a commit that referenced this pull request on Oct 1, 2020

@Tim-Brooks (Contributor, Author)

Still needs a backport to 7.9.x.

Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this pull request on Oct 5, 2020

Labels: :Distributed Coordination/Network (Http and internode communication implementations), >non-issue, Team:Distributed (Obsolete), v7.9.3, v7.10.0, v8.0.0-alpha1

5 participants