
Transfer network bytes to smaller buffer #62673

Merged: 10 commits merged into elastic:master on Sep 30, 2020

Conversation

@Tim-Brooks (Contributor)

Currently we read in 64KB blocks from the network. When TLS is not
enabled, these bytes are normally passed all the way to the application
layer (some exceptions: compression). For the HTTP layer this means that
these bytes can live throughout the entire lifecycle of an indexing
request.

The problem is that if the reads from the socket are small, 64KB buffers can end up holding 1KB or smaller reads. If the socket buffer or TCP buffer sizes are small, this leads to massive memory waste. It has been identified as a major source of OOMs on coordinating nodes, as Elasticsearch easily exhausts the heap for these network bytes.

This commit resolves the problem by placing a handler after the TLS
handler to copy these bytes to a more appropriate buffer size as
necessary. This comes after TLS, because TLS is a framing layer which
often resolves this problem for us (the 64KB buffer will be decoded
into a more appropriate buffer size). However, this extra handler will
solve it for the non-TLS pipelines.
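
A minimal sketch of the kind of inbound handler described above, assuming Netty's MessageToMessageDecoder as the base class (the merged handler may differ in detail): when the backing 64KB read buffer is much larger than its readable contents, the bytes are copied into a right-sized heap buffer so the oversized read buffer can be released immediately. In a TLS pipeline this handler sits after the SslHandler, since TLS decoding usually produces right-sized buffers already.

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.handler.codec.MessageToMessageDecoder;

    import java.util.List;

    // Illustrative sketch only; not the exact handler merged in this PR.
    public class CopyingByteBufSizer extends MessageToMessageDecoder<ByteBuf> {

        @Override
        protected void decode(ChannelHandlerContext ctx, ByteBuf buf, List<Object> out) {
            int readableBytes = buf.readableBytes();
            // Only bother when the backing buffer is substantially larger than its contents,
            // e.g. a 1KB read sitting in a 64KB read buffer.
            if (buf.capacity() > 1024 && buf.capacity() >= 2 * readableBytes) {
                ByteBuf copy = ctx.alloc().heapBuffer(readableBytes);
                copy.writeBytes(buf, buf.readerIndex(), readableBytes);
                out.add(copy);
                // The oversized input buffer is released by MessageToMessageDecoder after decode().
            } else {
                // Already reasonably sized: pass the buffer through unchanged.
                out.add(buf.retain());
            }
        }
    }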

@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Network)

elasticmachine added the Team:Distributed (Obsolete) label on Sep 18, 2020.
@original-brownbear (Member) left a comment

Some smaller comments and a question.
Generally +1 on this; especially for searches, we seem to be wasting a ton of bytes here.

// copy.
int estimatedSize = buf.maxFastWritableBytes() + buf.writerIndex();
if (estimatedSize > 1024 && buf.maxFastWritableBytes() >= buf.readableBytes()) {
ByteBuf newBuffer = ctx.alloc().heapBuffer(buf.readableBytes());
@original-brownbear (Member)

Might give us more optimally sized buffers if we set the max capacity as well via ctx.alloc().heapBuffer(length, length)?

@Tim-Brooks (Contributor, Author)

The max capacity does not impact the allocation size. It is essentially a limit for future expansion (reallocations) of the buffer.
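
As a small, hypothetical demo of this point (not code from the PR): with Netty's pooled allocator, the second argument to heapBuffer only caps future growth, while the initial allocation is driven by the first argument either way.

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.ByteBufAllocator;
    import io.netty.buffer.PooledByteBufAllocator;

    public class MaxCapacityDemo {
        public static void main(String[] args) {
            ByteBufAllocator alloc = PooledByteBufAllocator.DEFAULT;

            // Same initial capacity in both cases; only the allowed growth differs.
            ByteBuf unbounded = alloc.heapBuffer(1024);        // maxCapacity defaults to Integer.MAX_VALUE
            ByteBuf bounded = alloc.heapBuffer(1024, 1024);    // can never grow beyond 1024 bytes

            System.out.println(unbounded.capacity() + " / " + unbounded.maxCapacity()); // expected: 1024 / 2147483647
            System.out.println(bounded.capacity() + " / " + bounded.maxCapacity());     // expected: 1024 / 1024

            unbounded.release();
            bounded.release();
        }
    }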

public class NettyByteBufSizer extends MessageToMessageDecoder<ByteBuf> {

@Override
protected void decode(ChannelHandlerContext ctx, ByteBuf buf, List<Object> out) {
@original-brownbear (Member)

Should we really do this in general? It seems to only make sense for REST handlers that don't copy the buffers to unpooled anyway (search and bulk only as of right now). Maybe we should just copy those requests to new pooled buffers of appropriate size and leave the rest of them alone, since we're releasing them on the IO thread right away anyway?

@Tim-Brooks (Contributor, Author)

As discussed in the meeting, this is valuable as large messages can be aggregated for a period of time, hurting the memory ratios without this change.

@Tim-Brooks (Contributor, Author)

This is the simpler PR that I think we preferred.

@original-brownbear (Member) left a comment

As discussed, I'm +1 on this solution over the alternative. Just one question.

// twice as big as necessary to contain the data. If that is the case, allocate a new buffer and
// copy.
int estimatedSize = maxFastWritableBytes + buf.writerIndex();
if (estimatedSize > 1024 && maxFastWritableBytes >= readableBytes) {
@original-brownbear (Member) commented on Sep 30, 2020

Couldn't we, instead of rolling our own copying here, just use a call to ByteBuf#capacity and make Netty resize things? I.e. just do something like:

    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf buf, List<Object> out) {
        final int readableBytes = buf.readableBytes();
        out.add(buf.discardReadBytes().capacity(readableBytes).retain()); 
    }

At least in some quick experimentation with the debugger, that method seems to accomplish a similar, if not the same, thing as what we do here, but with less copying when the reader index is 0 (which it probably is most of the time?). But maybe I'm missing something?

@Tim-Brooks (Contributor, Author)

This seems fine. I guarded against less than 1024 because capacity() will always copy small arrays, which seems unnecessary.
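
For illustration, a minimal sketch of how the capacity()-based resize and the 1024-byte guard could fit together in a MessageToMessageDecoder; the class and its name here are hypothetical and not necessarily the exact code that was merged:

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelHandler;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.handler.codec.MessageToMessageDecoder;

    import java.util.List;

    @ChannelHandler.Sharable
    public class ByteBufSizerSketch extends MessageToMessageDecoder<ByteBuf> {

        @Override
        protected void decode(ChannelHandlerContext ctx, ByteBuf buf, List<Object> out) {
            final int readableBytes = buf.readableBytes();
            if (buf.capacity() >= 1024) {
                // Drop already-consumed bytes and ask Netty to shrink the capacity down to
                // what is actually readable. For pooled buffers a large shrink typically
                // reallocates into a smaller pooled buffer and copies the readable bytes over.
                buf.discardReadBytes().capacity(readableBytes);
            }
            // MessageToMessageDecoder releases the input after decode(), so retain the
            // buffer before passing it down the pipeline.
            out.add(buf.retain());
        }
    }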

@henningandersen (Contributor) left a comment

LGTM, but let us have @original-brownbear's approval too before merging.

Could we maybe add a test just testing the logic in NettyByteBufSizer in isolation, verifying that we get all the bytes out in a reasonably sized buffer?

@Tim-Brooks (Contributor, Author)

> Could we maybe add a test just testing the logic in NettyByteBufSizer in isolation, verifying that we get all the bytes out in a reasonably sized buffer?

I did, but then removed it when I took Armin's suggestion. If we allow Netty to control the resizing, it mutates the buffer in place and there is nothing to assert on. We depend on Netty internally reallocating and copying the bytes, as opposed to just adjusting indexes.
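
For context, an isolated test along the lines Henning suggested could look roughly like the sketch below, reusing the hypothetical ByteBufSizerSketch handler from above together with Netty's EmbeddedChannel. As noted, it can only assert on the observable indexes and capacity, not on whether Netty actually reallocated and copied the backing array.

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.PooledByteBufAllocator;
    import io.netty.channel.embedded.EmbeddedChannel;

    public class ByteBufSizerSketchTest {

        public static void main(String[] args) {
            EmbeddedChannel channel = new EmbeddedChannel(new ByteBufSizerSketch());

            // Simulate a small read that arrived in a large (64KB) pooled buffer.
            ByteBuf largeBuffer = PooledByteBufAllocator.DEFAULT.heapBuffer(64 * 1024);
            byte[] payload = new byte[1024];
            largeBuffer.writeBytes(payload);

            channel.writeInbound(largeBuffer);
            ByteBuf resized = channel.readInbound();

            // The readable bytes are preserved and the capacity has been trimmed to match.
            assert resized.readableBytes() == payload.length;
            assert resized.capacity() == payload.length;

            resized.release();
            channel.finishAndReleaseAll();
        }
    }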

@original-brownbear (Member) left a comment

LGTM, thanks Tim!

Tim-Brooks merged commit 1547bd6 into elastic:master on Sep 30, 2020
Tim-Brooks added a commit that referenced this pull request on Oct 1, 2020

@Tim-Brooks (Contributor, Author)

Still needs a backport to 7.9.x.

Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this pull request on Oct 5, 2020

Labels: :Distributed Coordination/Network (Http and internode communication implementations), >non-issue, Team:Distributed (Obsolete), v7.9.3, v7.10.0, v8.0.0-alpha1

5 participants