
Default network buffer size causes higher GC pressure than necessary #23185

Closed
danielmitterdorfer opened this issue Feb 15, 2017 · 16 comments
Labels
:Distributed Coordination/Network Http and internode communication implementations >enhancement v6.0.0-alpha1

Comments

@danielmitterdorfer
Member

danielmitterdorfer commented Feb 15, 2017

Netty4HttpServerTransport uses the settings http.netty.receive_predictor_min and http.netty.receive_predictor_max to provide a properly configured RecvByteBufAllocator implementation to Netty. Their default values are controlled by the setting transport.netty.receive_predictor_size, which varies between 64kb and 512kb (per allocated buffer) depending on available memory.

The aforementioned allocator is responsible for allocating memory buffers when handling incoming network packets; Netty allocates one buffer per network packet.

We ran comparative benchmarks with the nyc_taxis track, once locally (i.e. via loopback) and once distributed (i.e. via Ethernet), and analyzed the allocation behavior of Elasticsearch.

| Network connection type | Bytes allocated outside of TLABs on the network layer |
|---|---|
| Loopback | ~ 78 GB |
| Ethernet | ~ 2.13 TB |

Note: On this particular machine transport.netty.receive_predictor_size was 512kb.

The root cause seems to be related to the MTU, which differs greatly between loopback and regular network devices (65536 vs. 1500 bytes). A smaller MTU leads to more network packets for the same amount of data while the buffer size per packet stays the same, thus leading to more GC pressure.
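A rough back-of-the-envelope calculation (assuming, simplistically, one receive buffer per MTU-sized packet) illustrates the scale: the same payload needs about 65536 / 1500 ≈ 44 times as many packets over Ethernet as over loopback, and therefore roughly 44 times as many receive buffers of up to 512kb each. That is the same order of magnitude as the roughly 27x difference in allocated bytes (~ 78 GB vs. ~ 2.13 TB) shown above.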

In a custom build of Elasticsearch we set http.netty.receive_predictor_min to 5kb and http.netty.receive_predictor_max to 64kb and got comparable allocation behavior between local and distributed benchmarks.

Note: Our analysis focused only on Netty4HttpServerTransport for a single Elasticsearch node. We expect that Netty4Transport exhibits similar behavior, and we should change the buffer size there too.
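For readers who are not familiar with Netty's receive predictor, here is a minimal, hypothetical sketch of how such an allocator is handed to a Netty server bootstrap. It is not the actual Netty4HttpServerTransport wiring; the class name and the simplified bootstrap are illustrative, and the 5kB/64kB values are taken from the custom build mentioned above.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.AdaptiveRecvByteBufAllocator;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public final class ReceivePredictorSketch {

    static ServerBootstrap httpServerBootstrap() {
        // Adaptive receive predictor: Netty picks the next read buffer size between
        // min (5kB) and max (64kB) based on how much data recent reads delivered.
        // A fixed-size alternative would be new FixedRecvByteBufAllocator(64 * 1024).
        AdaptiveRecvByteBufAllocator receivePredictor =
            new AdaptiveRecvByteBufAllocator(5 * 1024, 5 * 1024, 64 * 1024);

        return new ServerBootstrap()
            .group(new NioEventLoopGroup())
            .channel(NioServerSocketChannel.class)
            // One receive buffer of (at most) the predicted size is allocated per read.
            .childOption(ChannelOption.RCVBUF_ALLOCATOR, receivePredictor)
            .childHandler(new ChannelInitializer<SocketChannel>() {
                @Override
                protected void initChannel(SocketChannel ch) {
                    // HTTP pipeline setup omitted.
                }
            });
    }
}
```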

@danielmitterdorfer danielmitterdorfer added :Distributed Coordination/Network Http and internode communication implementations v6.0.0-alpha1 labels Feb 15, 2017
@danielmitterdorfer danielmitterdorfer self-assigned this Feb 15, 2017
@danielmitterdorfer
Member Author

danielmitterdorfer commented Feb 15, 2017

To find a "good" new default value, I ran our whole benchmark suite against an instrumented build that records the number of bytes read on the network layer.

Each of the graphs below shows a histogram of the number of bytes read during network packet handling for the respective benchmark.

[Attached histograms: geonames_alloc, geopoint_alloc, nyc_alloc, percolator_alloc, pmc_alloc]

If we settle on just a single value, I'd tend to set the default buffer size to 32kb as this should satisfy most allocation requests while not wasting too much memory. I will run further benchmarks to verify it.

@jpountz
Contributor

jpountz commented Feb 21, 2017

Should we move forward on this? It looks like low-hanging fruit that has the potential to make things significantly better for most users.

@danielmitterdorfer
Member Author

@jpountz I totally agree. I ran a lot of benchmarks over the last few days (until today) and came to the conclusion that 32kb is indeed a good value. I just wanted to ensure that we don't have unwanted side effects.

Expect a PR within the next hour.

danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Feb 21, 2017
Previously we calculated Netty's receive predictor size for HTTP and transport
traffic based on available memory and worker nodes. This resulted in a receive
predictor size between 64kb and 512kb. In our benchmarks this led to increased
GC pressure.

With this commit we set Netty's receive predictor size to 32kb. This value is in
a sweet spot between heap memory waste (-> GC pressure) and effect on request
metrics (achieved throughput and latency numbers).

Closes elastic#23185
danielmitterdorfer added a commit that referenced this issue Feb 21, 2017
Previously we calculated Netty's receive predictor size for HTTP and transport
traffic based on available memory and worker nodes. This resulted in a receive
predictor size between 64kb and 512kb. In our benchmarks this led to increased
GC pressure.

With this commit we set Netty's receive predictor size to 32kb. This value is in
a sweet spot between heap memory waste (-> GC pressure) and effect on request
metrics (achieved throughput and latency numbers).

Closes #23185
danielmitterdorfer added a commit that referenced this issue Feb 21, 2017
Previously we calculated Netty's receive predictor size for HTTP and transport
traffic based on available memory and worker nodes. This resulted in a receive
predictor size between 64kb and 512kb. In our benchmarks this led to increased
GC pressure.

With this commit we set Netty's receive predictor size to 32kb. This value is in
a sweet spot between heap memory waste (-> GC pressure) and effect on request
metrics (achieved throughput and latency numbers).

Closes #23185
@jasontedor
Member

This is really great work @danielmitterdorfer, thank you for digging into this.

@danielmitterdorfer
Member Author

Thanks @jasontedor. I also checked our benchmarks and in almost all cases we see an upward trajectory. However, we still have one problematic case: outside-of-TLAB allocations for smaller heap sizes (2GB) in the full-text benchmark increased significantly, causing a drop in indexing throughput. I am currently investigating what's causing this and running further tests. So: more digging ahead.

@danielmitterdorfer
Member Author

danielmitterdorfer commented Feb 24, 2017

The investigation is not yet finished but I want to present a few intermediate results.

I ran the following benchmark:

```
esrally --pipeline=benchmark-only --track=pmc --challenge=append-no-conflicts-index-only --target-host=192.168.2.2:9200
```

and varied heap size (-Xms, -Xmx) between 1GB and 16GB as well as http.netty.receive_predictor_size from 16kb to 64kb.
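To illustrate, a single trial in this sweep boils down to two settings. The sketch below shows one of the combinations listed above; the file layout mirrors the standard jvm.options and elasticsearch.yml files referenced later in this thread, and the concrete values are just an example.

```
# jvm.options (heap size for this trial)
-Xms2g
-Xmx2g

# elasticsearch.yml (receive predictor size for this trial)
http.netty.receive_predictor_size: 32kb
```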

Target machine specs:

  • CPU: Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz
  • 64GB RAM
  • SSD
  • JDK 1.8.0_112-b15
  • OS Arch Linux with Kernel 4.8.7-1-ARCH
  • 1 GBit Ethernet

I also attached Java Flight Recorder and produced GC logs for every trial run.
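For reference, on the Oracle JDK 8 build listed above, flight recordings with allocation profiling and the accompanying GC logs can be produced with flags along these lines (the exact invocation is an assumption; it is not spelled out in this issue):

```
# jvm.options: enable Java Flight Recorder (allocation profiling) and GC logging on JDK 8
-XX:+UnlockCommercialFeatures
-XX:+FlightRecorder
-XX:StartFlightRecording=settings=profile,filename=trial.jfr
-Xloggc:gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
```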

Here is a summary of the allocation statistics:

Note: object count, maximum object size, minimum object size, average object size, allocation rate for objects are metrics for allocations that happen outside of TLABs.

| heap size [GB] | receive predictor size [kB] | median indexing throughput [docs/s] | object count | maximum object size | minimum object size | average object size | total memory allocated | allocation rate for objects | TLAB count | maximum TLAB size | minimum TLAB size | average TLAB size | total memory allocated for TLABs | allocation rate for TLABs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 64 | 419 | 64262780 | 46.57 MB | 16 bytes | 3.07 kB | 188.15 GB | 135.44 MB/s | 1080267 | 30.87 MB | 152 bytes | 280.91 kB | 289.40 GB | 208.33 MB/s |
| 2 | 32 | OOMEd | 36108455 | 34.29 MB | 16 bytes | 1.89 kB | 65.11 GB | 113.81 MB/s | 238169 | 31.48 MB | 56 bytes | 475.63 kB | 108.03 GB | 188.83 MB/s |
| 2 | 64 | 1176 | 726691 | 46.57 MB | 16 bytes | 228.74 kB | 158.52 GB | 298.95 MB/s | 689016 | 32.65 MB | 232 bytes | 433.96 kB | 285.15 GB | 537.75 MB/s |
| 4 | 16 | 1195 | 683635 | 46.57 MB | 16 bytes | 182.84 kB | 119.20 GB | 231.08 MB/s | 681134 | 28.87 MB | 968 bytes | 444.24 kB | 288.57 GB | 559.40 MB/s |
| 4 | 32 | 1228 | 683848 | 46.57 MB | 16 bytes | 185.89 kB | 121.24 GB | 239.21 MB/s | 707107 | 34.79 MB | 328 bytes | 425.68 kB | 287.06 GB | 566.40 MB/s |
| 4 | 64 | 1223 | 758859 | 46.57 MB | 16 bytes | 224.56 kB | 162.52 GB | 319.36 MB/s | 786162 | 26.51 MB | 1.13 kB | 376.33 kB | 282.15 GB | 554.45 MB/s |
| 8 | 16 | 1215 | 682482 | 46.57 MB | 16 bytes | 184.81 kB | 120.28 GB | 231.39 MB/s | 686732 | 30.01 MB | 672 bytes | 441.18 kB | 288.94 GB | 555.84 MB/s |
| 8 | 32 | 1219 | 681496 | 46.57 MB | 16 bytes | 184.78 kB | 120.09 GB | 232.39 MB/s | 674428 | 29.07 MB | 776 bytes | 449.35 kB | 289.01 GB | 559.25 MB/s |
| 8 | 64 | 1185 | 778077 | 46.57 MB | 16 bytes | 216.13 kB | 160.38 GB | 308.89 MB/s | 790786 | 32.90 MB | 728 bytes | 377.13 kB | 284.42 GB | 547.80 MB/s |
| 16 | 16 | 1210 | 685745 | 46.57 MB | 16 bytes | 184.82 kB | 120.87 GB | 230.90 MB/s | 689624 | 27.57 MB | 1.18 kB | 436.05 kB | 286.78 GB | 547.86 MB/s |
| 16 | 32 | 1199 | 687613 | 46.57 MB | 16 bytes | 182.69 kB | 119.80 GB | 230.36 MB/s | 695341 | 28.34 MB | 264 bytes | 437.61 kB | 290.19 GB | 557.98 MB/s |
| 16 | 64 | 1187 | 732125 | 46.57 MB | 16 bytes | 224.57 kB | 156.80 GB | 301.79 MB/s | 733220 | 25.59 MB | 328 bytes | 400.81 kB | 280.27 GB | 539.44 MB/s |

Note: data for the OOMEd run (obviously) cover only the period until the OOME occurred and are thus not complete.

It's evident that the current receive predictor size of 32kb leads to an excessive increase in objects allocated outside of TLABs (which is far more expensive than allocation within a TLAB) for smaller heap sizes of 1GB and 2GB, and that the average object size outside of TLABs is very small in those cases.

I have a vague theory at the moment why this happens: when a request arrives, Netty places it into a buffer and accumulates these buffers (in a CompositeByteBuf) until all data of an HTTP request have been received. The PMC track is a full-text benchmark, so bulk requests are rather large. We therefore allocate a lot of smaller buffers which occasionally get moved into a larger buffer (the CompositeByteBuf). The JVM sizes TLABs adaptively based on recent allocation requests, and with a smaller receive predictor size we seem to have an allocation pattern (on smaller heaps) that tricks the JVM into making TLABs too small (I did one experiment running on a 2GB heap with -XX:MinTLABSize=65536 instead of the default -XX:MinTLABSize=2048 and got much saner results). I am not yet sure why this only happens with small heap sizes; I think it is related to the size of the young generation (or, to be more precise, eden), but I have not tested this yet.
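The MinTLABSize experiment mentioned above amounts to a single extra JVM flag on top of the 2GB heap trial. Sketched as jvm.options entries (the 65536 value is the one quoted above; everything else is unchanged from the trial setup):

```
# jvm.options: 2GB heap trial with a forced minimum TLAB size
-Xms2g
-Xmx2g
-XX:MinTLABSize=65536
```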

Next steps

If necessary, I may do further analysis of TLAB sizes based on the raw data from the GC logs, but first I will continue the experiments with our other tracks and also with a mixed-workload track that models a more realistic use case.

Unfortunately, my original testing was based only on 4GB heaps, where a 32kb receive predictor size works really well for all our benchmarks. Based on these results, which cover a much broader range of heap sizes, I think the pragmatic choice is 64kb.

@jasontedor
Member

Okay, thanks for the continued investigation @danielmitterdorfer. I think that, given what you're discovering here, I should recant my previous assessment: this change should stay out of 5.3 and bake a little longer. Are you in agreement with that?

@danielmitterdorfer
Member Author

Yes @jasontedor, I completely agree. It doesn't make sense to backport before we have a clear picture of the problem and a solution that works for all cases.

@danielmitterdorfer
Member Author

danielmitterdorfer commented Mar 1, 2017

I also ran the nyc_taxis track once more and the patterns we see are similar.

Note: object count, average object size, allocation rate for objects are metrics for allocations that happen outside of TLABs.

| heap size [GB] | receive predictor size [kB] | median indexing throughput [docs/s] | object count | average object size | allocation rate for objects | TLAB count | maximum TLAB size | minimum TLAB size | average TLAB size | allocation rate for TLABs |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 64 | 42890 | 28689613 | 13.76 kB | 41.60 MB/s | 10472293 | 16.46 MB | 56 bytes | 434.13 kB | 479.15 MB/s |
| 2 | 32 | 56280 | 1697214 | 113.27 kB | 60.79 MB/s | 6087343 | 16.34 MB | 184 bytes | 739.82 kB | 1.39 GB/s |
| 2 | 64 | 54460 | 1814895 | 259.57 kB | 142.51 MB/s | 5808729 | 16.66 MB | 312 bytes | 779.47 kB | 1.34 GB/s |
| 4 | 16 | 59011 | 1517389 | 148.79 kB | 73.82 MB/s | 5916418 | 16.26 MB | 144 bytes | 752.56 kB | 1.42 GB/s |
| 4 | 32 | 57455 | 1580485 | 161.06 kB | 82.04 MB/s | 5937891 | 16.40 MB | 72 bytes | 762.93 kB | 1.43 GB/s |
| 4 | 64 | 56280 | 1615653 | 368.84 kB | 188.13 MB/s | 5634187 | 17.19 MB | 176 bytes | 800.91 kB | 1.39 GB/s |
| 8 | 16 | 60045 | 1402857 | 161.45 kB | 68.64 MB/s | 5672148 | 16.40 MB | 56 bytes | 765.59 kB | 1.29 GB/s |
| 8 | 32 | 61187 | 1460676 | 152.84 kB | 75.56 MB/s | 5824227 | 16.79 MB | 72 bytes | 757.33 kB | 1.46 GB/s |
| 8 | 64 | 57908 | 1507110 | 439.99 kB | 216.89 MB/s | 5329799 | 17.07 MB | 56 bytes | 832.73 kB | 1.42 GB/s |
| 16 | 16 | 60075 | 1365818 | 184.84 kB | 85.07 MB/s | 5745567 | 17.14 MB | 104 bytes | 762.81 kB | 1.44 GB/s |
| 16 | 32 | 59246 | 1420437 | 156.58 kB | 73.64 MB/s | 5863681 | 17.12 MB | 96 bytes | 753.00 kB | 1.43 GB/s |
| 16 | 64 | 58860 | 1524189 | 405.60 kB | 179.12 MB/s | 5499390 | 16.71 MB | 112 bytes | 812.92 kB | 1.26 GB/s |

Next up are more benchmarks with a mixed workload (indexing + querying).

I am also reopening the issue as we'll definitely iterate on the value that we set by default.

danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Mar 11, 2017
With this commit we change the default receive predictor size for Netty
from 32kB to 64kB as our testing has shown that this leads to fewer
allocations on smaller heaps such as the default out-of-the-box
configuration, and this value also works reasonably well for larger
heaps.

Closes elastic#23185
danielmitterdorfer added a commit that referenced this issue Mar 12, 2017
With this commit we change the default receive predictor size for Netty
from 32kB to 64kB as our testing has shown that this leads to fewer
allocations on smaller heaps such as the default out-of-the-box
configuration, and this value also works reasonably well for larger
heaps.

Closes #23185
danielmitterdorfer added a commit that referenced this issue Mar 12, 2017
With this commit we change the default receive predictor size for Netty
from 32kB to 64kB as our testing has shown that this leads to fewer
allocations on smaller heaps such as the default out-of-the-box
configuration, and this value also works reasonably well for larger
heaps.

Closes #23185
@danielmitterdorfer
Member Author

danielmitterdorfer commented Mar 13, 2017

I ran further benchmarks, including a mixed workload with queries. The tl;dr is that the main contributor to these problems is bulk indexing (and not so much querying). They also confirmed that a receive predictor size of 64kB is a sensible default value. Preparing the results will take a couple of days, but I'll update the ticket.

In parallel I am also improving my understanding of why a smaller receive predictor size (32kB) causes much more trouble for some workloads (PMC) on smaller heaps (2GB), but it will take further experiments to isolate the cause.

I could not observe significant differences in TLAB sizes across receive predictor size settings. However, with smaller receive predictor sizes Elasticsearch seems to fill TLABs faster during request processing. During indexing, Lucene needs to allocate an excessive number of small objects outside of TLABs - 3.1 million totaling a measly 81MB - so the JVM is swamped with tiny allocation requests outside of TLABs (which require synchronization across all Java threads). For comparison: the top three outside-of-TLAB allocation paths in Netty account for roughly 1,700 allocation requests totaling almost 30GB. With larger receive predictor sizes, TLABs do not fill up as quickly, leaving room for Lucene to allocate memory within TLABs and leading to vastly improved performance.

@danielmitterdorfer danielmitterdorfer removed their assignment Mar 13, 2017
@jpountz
Contributor

jpountz commented Mar 14, 2017

Interesting, do you have insights into what these Lucene allocations are? I'm wondering whether those are allocations that actually happen in Lucene or objects that we allocate for Lucene from within Elasticsearch, such as field instances.

@danielmitterdorfer
Member Author

@jpountz Sure, I have flight recordings of all my experiments with allocation profiles enabled. Maybe I was not clear enough in my previous comment: I think that Lucene is the victim, not the culprit. If there is enough memory headroom, Lucene will happily allocate from within TLABs. The 3.1 million small objects are btw allocated in org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$PendingTerm.<init>(BytesRef, BlockTermState). But once again: I think this is just a symptom and by no means the cause.

@danielmitterdorfer
Member Author

danielmitterdorfer commented Mar 28, 2017

As I've written in my previous comment, there is not enough memory headroom for Lucene during indexing. As the other major contributor to memory allocations is Netty, this led me to investigate the impact of Netty's recycler again (see also #22452). Unless explicitly noted otherwise, this discussion applies to a heap size of 2GB (i.e. -Xms2g, -Xmx2g), which is currently the default heap size for Elasticsearch.

The following screenshot shows heap usage (blue) and GC pauses (red) during a roughly one-minute window of a benchmark with the PMC track, using the following settings:

  • jvm.options: -Dio.netty.allocator.type=pooled (default value)
  • elasticsearch.yml: http.netty.receive_predictor_size: 32kb

[Screenshot: heap_usage_32k_pooled_1_min]

Note: The node OOMEd during the benchmark.

For the current defaults, heap usage looks as follows:

  • jvm.options: -Dio.netty.allocator.type=pooled (default value)
  • elasticsearch.yml: http.netty.receive_predictor_size: 64kb (default value)

[Screenshot: heap_usage_64k_pooled_1_min]

Median indexing throughput: 1176 docs/s

By disabling Netty's buffer pooling, we get much saner results:

  • jvm.options: -Dio.netty.allocator.type=unpooled
  • elasticsearch.yml: http.netty.receive_predictor_size: 32kb

[Screenshot: heap_usage_32k_unpooled_1_min]

Median indexing throughput: 1191 docs/s

We see similar behavior for the current default receive_predictor_size of 64kb:

  • jvm.options: -Dio.netty.allocator.type=unpooled
  • elasticsearch.yml: http.netty.receive_predictor_size: 64kb

[Screenshot: heap_usage_64k_unpooled_1_min]

Median indexing throughput: 1208 docs/s

So we can identify Netty's pooled allocator as the cause of high memory usage with smaller heaps (specifically: 2 GB). In the current configuration it leaves little headroom for further allocations. While it may make sense to investigate further in that direction, I think this warrants a new ticket as the purpose here was to find out what caused TLAB exhaustion.

I still need to wrap my head around why smaller buffers cause more memory pressure. The number of allocations and the allocation sizes look very similar.

@jasontedor
Member

> So we can identify Netty's recycler as the cause of high memory usage with smaller heaps (specifically: 2 GB).

I'm confused, the recycler is disabled now and I do not see that you've enabled it. Did you mean buffer pool?

@danielmitterdorfer
Member Author

Yes, I meant the buffer pool and have corrected my comment above now. Thanks for double-checking!

@jasontedor
Member

> Yes, I meant the buffer pool and have corrected my comment above now.

Okay, now it makes sense; thank you.

danielmitterdorfer added a commit that referenced this issue Mar 29, 2017
Previously we calculated Netty's receive predictor size for HTTP and transport
traffic based on available memory and worker nodes. This resulted in a receive
predictor size between 64kb and 512kb. In our benchmarks this led to increased
GC pressure.

With this commit we set Netty's receive predictor size to 32kb. This value is in
a sweet spot between heap memory waste (-> GC pressure) and effect on request
metrics (achieved throughput and latency numbers).

Closes #23185
danielmitterdorfer added a commit that referenced this issue Mar 29, 2017
With this commit we change the default receive predictor size for Netty
from 32kB to 64kB as our testing has shown that this leads to fewer
allocations on smaller heaps such as the default out-of-the-box
configuration, and this value also works reasonably well for larger
heaps.

Closes #23185
danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Aug 11, 2017
With this commit we simplify our network layer by only allowing a fixed receive
predictor size to be defined instead of a minimum and maximum value. This also
means that the following (previously undocumented) settings are removed:

* http.netty.receive_predictor_min
* http.netty.receive_predictor_max

Using an adaptive sizing policy in the receive predictor is a very low-level
optimization. The implications for allocation behavior are extremely hard to grasp
(see our previous work in elastic#23185) and adaptive sizing would only be beneficial
anyway if the message protocol allows very different message sizes (at the network
level).

To determine whether these settings are beneficial, we ran the PMC and
nyc_taxis benchmarks from our macrobenchmark suite with various heap
settings (1GB, 2GB, 4GB, 8GB, 16GB). In one scenario we used the fixed
receive predictor size (`http.netty.receive_predictor`) with 16kB, 32kB
and 64kB. We contrasted this with `http.netty.receive_predictor_min` = 16kB and
`http.netty.receive_predictor_max` = 64kB. The results (specifically indexing
throughput) were identical (accounting for natural run-to-run variance).

In summary, these settings offer no benefit but only add complexity.
danielmitterdorfer added a commit that referenced this issue Oct 10, 2017
With this commit we simplify our network layer by only allowing a fixed receive
predictor size to be defined instead of a minimum and maximum value. This also
means that the following (previously undocumented) settings are removed:

* http.netty.receive_predictor_min
* http.netty.receive_predictor_max

Using an adaptive sizing policy in the receive predictor is a very low-level
optimization. The implications for allocation behavior are extremely hard to grasp
(see our previous work in #23185) and adaptive sizing does not provide a lot of
benefits (see the benchmarks in #26165 for more details).
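For readers unfamiliar with the Netty side of this change: the former min/max pair of settings corresponds to Netty's adaptive receive buffer allocator, while the single fixed size corresponds to a fixed allocator. A hedged sketch of the difference using plain Netty APIs (not the Elasticsearch code itself; the class and method names are illustrative):

```java
import io.netty.channel.AdaptiveRecvByteBufAllocator;
import io.netty.channel.FixedRecvByteBufAllocator;
import io.netty.channel.RecvByteBufAllocator;

public final class PredictorComparison {

    // Adaptive: the per-read buffer size floats between min and max, starting at
    // the initial guess, based on how much data recent reads actually produced.
    static RecvByteBufAllocator adaptive(int minBytes, int initialBytes, int maxBytes) {
        return new AdaptiveRecvByteBufAllocator(minBytes, initialBytes, maxBytes);
    }

    // Fixed: every read allocates a buffer of exactly this size, which is what the
    // single receive predictor size setting amounts to after this change.
    static RecvByteBufAllocator fixed(int bytes) {
        return new FixedRecvByteBufAllocator(bytes);
    }
}
```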