Allow only a fixed-size receive predictor #26165

Conversation

danielmitterdorfer
Member

With this commit we simplify our network layer by only allowing a fixed receive
predictor size to be defined instead of an adaptive one. This also means that
the following (previously undocumented) settings are removed:

* `http.netty.receive_predictor_min`
* `http.netty.receive_predictor_max`

Using an adaptive sizing policy in the receive predictor is a very low-level
optimization. Its implications for allocation behavior are extremely hard to grasp
(see our previous work in #23185), and adaptive sizing would only be beneficial
anyway if the message protocol produced messages of very different sizes at the
network level.
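
For context, the distinction at the Netty level looks roughly like the sketch below. It uses Netty's public allocator classes; the sizes, class name and wiring are illustrative, not the actual Elasticsearch code.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.AdaptiveRecvByteBufAllocator;
import io.netty.channel.ChannelOption;
import io.netty.channel.FixedRecvByteBufAllocator;

// Illustrative only: shows the adaptive vs. fixed receive buffer allocators
// that the receive predictor settings correspond to.
public class ReceivePredictorSketch {

    static void configureReceivePredictor(ServerBootstrap bootstrap, boolean adaptive) {
        if (adaptive) {
            // Adaptive: Netty starts at the initial size and grows/shrinks the
            // per-read buffer between min and max based on observed read sizes.
            bootstrap.childOption(ChannelOption.RCVBUF_ALLOCATOR,
                new AdaptiveRecvByteBufAllocator(16 * 1024, 32 * 1024, 64 * 1024));
        } else {
            // Fixed: every read uses a buffer of exactly this size (roughly what
            // http.netty.receive_predictor controls).
            bootstrap.childOption(ChannelOption.RCVBUF_ALLOCATOR,
                new FixedRecvByteBufAllocator(64 * 1024));
        }
    }
}
```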

To determine whether these settings are beneficial, we ran the PMC and
nyc_taxis benchmarks from our macrobenchmark suite with various heap
sizes (1GB, 2GB, 4GB, 8GB, 16GB). In one set of scenarios we used a fixed
receive predictor size (`http.netty.receive_predictor`) of 16kB, 32kB
and 64kB. We contrasted this with `http.netty.receive_predictor_min` = 16kB and
`http.netty.receive_predictor_max` = 64kB. The results (specifically indexing
throughput) were identical, accounting for natural run-to-run variance.

In summary, these settings offer no benefit but only add complexity.

@danielmitterdorfer
Member Author

@jasontedor Could you please have a look?

@jasontedor
Member

I had a look and I have a question. For the PMC benchmark, are the payloads roughly the same size so that we would not expect receive prediction to have any impact anyway? What about a workload where the payloads are variable in size so that receive prediction does have an impact?

@danielmitterdorfer
Member Author

danielmitterdorfer commented Aug 22, 2017

What about a workload where the payloads are variable in size so that receive prediction does have an impact?

Thanks for the feedback. I do not expect that we would benefit from a variable-size receive predictor even in that case, because it is not the size of the HTTP request that matters so much as the number of bytes read per read from the network layer (i.e. in io.netty.channel.nio.NioEventLoop#processSelectedKeys()). Consider this example (the numbers are made up):

Suppose we send one HTTP bulk request to Elasticsearch that is, say, 1MB in size. From the network layer we might get the data in 16kB chunks, and this is where the receive predictor comes into play: what we would need is variation at that level. Even if we varied the size of the HTTP request, we would still read chunks of roughly the same size; we would just read more of them.

What does have an influence, though, are the kernel's receive buffer settings. Below are histograms showing the number of bytes read at once (obtained by logging lastBytesRead() every time readComplete() is called by Netty) for three different kernel receive buffer settings. I tested all scenarios with a receive predictor size of 512kB (to ensure that our network layer can always read a full OS-level network buffer at once).
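
(For reference, the logging described above can be approximated with a small Netty inbound handler like the sketch below. This is not the exact instrumentation I used; the class name is made up, and it counts the readable bytes of each inbound buffer rather than calling the allocator handle's lastBytesRead() directly.)

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Hypothetical handler, placed early in the pipeline: it accumulates how many
// bytes the read events delivered and logs the total when the read cycle completes.
public class ReadSizeLoggingHandler extends ChannelInboundHandlerAdapter {

    private long bytesInCurrentReadCycle;

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        if (msg instanceof ByteBuf) {
            // readableBytes() approximates the number of bytes the transport
            // read from the socket in this read call.
            bytesInCurrentReadCycle += ((ByteBuf) msg).readableBytes();
        }
        ctx.fireChannelRead(msg);
    }

    @Override
    public void channelReadComplete(ChannelHandlerContext ctx) {
        System.out.println("bytes read in this read cycle: " + bytesInCurrentReadCycle);
        bytesInCurrentReadCycle = 0;
        ctx.fireChannelReadComplete();
    }
}
```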

Benchmark

esrally --track=pmc --challenge=append-no-conflicts-index-only --pipeline=benchmark-only --target-hosts=192.168.2.2:9200

Scenarios

# defaults on this machine:
sudo sysctl -w net.ipv4.tcp_rmem='10240 87380 16777216'
sudo sysctl -w net.ipv4.route.flush=1

[histogram: bytes read at once, default scenario]

Results (shortened):

|   Lap |                        Metric |    Operation |      Value |   Unit |
|------:|------------------------------:|-------------:|-----------:|-------:|
|   All |                 Indexing time |              |    58.9725 |    min |
|   All |                    Merge time |              |    45.4068 |    min |
|   All |                  Refresh time |              |    6.82535 |    min |
|   All |                    Flush time |              |    4.22022 |    min |
|   All |           Merge throttle time |              |    30.1738 |    min |
|   All |            Total Young Gen GC |              |     35.272 |      s |
|   All |              Total Old Gen GC |              |      0.859 |      s |
|   All |                Min Throughput | index-append |    1093.17 | docs/s |
|   All |             Median Throughput | index-append |    1110.63 | docs/s |
|   All |                Max Throughput | index-append |    1133.78 | docs/s |
|   All |  50th percentile service time | index-append |    3270.25 |     ms |
|   All |  90th percentile service time | index-append |    4543.66 |     ms |
|   All |  99th percentile service time | index-append |    5451.14 |     ms |
|   All | 100th percentile service time | index-append |    5905.43 |     ms |
|   All |                    error rate | index-append |          0 |      % |

# low
sudo sysctl -w net.ipv4.tcp_rmem='10240 10240 10240'
sudo sysctl -w net.ipv4.route.flush=1

[histogram: bytes read at once, low scenario]

Results (shortened):


|   Lap |                        Metric |    Operation |      Value |   Unit |
|------:|------------------------------:|-------------:|-----------:|-------:|
|   All |                 Indexing time |              |    86.5975 |    min |
|   All |                    Merge time |              |    116.911 |    min |
|   All |                  Refresh time |              |    16.4986 |    min |
|   All |                    Flush time |              |     3.0464 |    min |
|   All |           Merge throttle time |              |    93.1672 |    min |
|   All |            Total Young Gen GC |              |    495.635 |      s |
|   All |              Total Old Gen GC |              |     24.899 |      s |
|   All |                Min Throughput | index-append |    436.204 | docs/s |
|   All |             Median Throughput | index-append |     448.28 | docs/s |
|   All |                Max Throughput | index-append |    459.248 | docs/s |
|   All |  50th percentile service time | index-append |    8733.15 |     ms |
|   All |  90th percentile service time | index-append |    11474.5 |     ms |
|   All |  99th percentile service time | index-append |    14170.9 |     ms |
|   All | 100th percentile service time | index-append |      19552 |     ms |
|   All |                    error rate | index-append |          0 |      % |

# high
sudo sysctl -w net.ipv4.tcp_rmem='16777216 16777216 16777216'
sudo sysctl -w net.ipv4.route.flush=1

[histogram: bytes read at once, high scenario]

Results (shortened):

|   Lap |                        Metric |    Operation |     Value |   Unit |
|------:|------------------------------:|-------------:|----------:|-------:|
|   All |                 Indexing time |              |   58.7263 |    min |
|   All |                    Merge time |              |   43.5618 |    min |
|   All |                  Refresh time |              |   6.77613 |    min |
|   All |                    Flush time |              |    4.2515 |    min |
|   All |           Merge throttle time |              |   28.7809 |    min |
|   All |            Total Young Gen GC |              |    35.991 |      s |
|   All |              Total Old Gen GC |              |     0.896 |      s |
|   All |                Min Throughput | index-append |   1116.48 | docs/s |
|   All |             Median Throughput | index-append |   1152.24 | docs/s |
|   All |                Max Throughput | index-append |   1173.17 | docs/s |
|   All |  50th percentile service time | index-append |   3225.68 |     ms |
|   All |  90th percentile service time | index-append |   4566.68 |     ms |
|   All |  99th percentile service time | index-append |   6687.75 |     ms |
|   All | 100th percentile service time | index-append |   7174.17 |     ms |
|   All |                    error rate | index-append |         0 |      % |

@danielmitterdorfer
Member Author

sudo sysctl -w net.ipv4.tcp_rmem='16777216 16777216 16777216'

Actually, I should have run this case with only 512kB of OS receive buffer, but given that we see only a small number of cases where we read that many bytes, I think we are still ok here (i.e. there is no need to retest this case from my perspective).

@jasontedor
Member

I do not expect that we benefit from a variable size receive predictor even in that case because it's not the size of the HTTP request that matters so much but rather the number of bytes read per read from the network layer

@danielmitterdorfer I think you missed the point of my question, because the size of the payload does matter. Imagine a mix of requests on the HTTP layer: stats or single-doc indexing requests (very small payloads, well under the OS buffer) alongside multiple-doc indexing requests (flirting with the OS receive buffer size).

@danielmitterdorfer
Member Author

danielmitterdorfer commented Oct 9, 2017

It took a while but now I have some more numbers. I ran two different scenarios based on the pmc track:

    {
      "name": "append-and-search",
      "description": "",
      "index-settings": {
        "index.number_of_replicas": 0
      },
      "schedule": [
        {
          "parallel": {
            "completed-by": "index-append",
            "warmup-time-period": 240,
            "tasks": [
              {
                "operation": "index-append",
                "clients": 8
              },
              {
                "operation": "term",
                "clients": 1,
                "target-throughput": 20
              }
            ]
          }
        }
      ]
    }
    {
      "name": "append-and-info",
      "description": "",
      "index-settings": {
        "index.number_of_replicas": 0
      },
      "schedule": [
        {
          "parallel": {
            "completed-by": "index-append",
            "warmup-time-period": 240,
            "tasks": [
              {
                "operation": "index-append",
                "clients": 8
              },
              {
                "operation": "info",
                "clients": 1,
                "target-throughput": 100
              }
            ]
          }
        }
      ]
    }

(where info corresponds to RestMainAction, i.e. a call to "http://host:port/").

For append-and-search we see the following results (5 iterations):

| Metric                              | 64k (fixed) | 1k – 512k |
|------------------------------------:|------------:|----------:|
| median indexing throughput [docs/s] |        1371 |      1402 |
| total young gen GC time [ms]        |       87227 |     84565 |
| total old gen GC time [ms]          |        1819 |      1801 |

or graphically:

[charts: gc_times, throughput_index-append, service_time_term]

(service time percentile distribution is not shown in the table but from the graphs we can see that it is practically identical)

For append-and-info we see a similar picture:

| Metric                              | 64k (fixed) | 1k – 512k |
|------------------------------------:|------------:|----------:|
| median indexing throughput [docs/s] |        1411 |      1408 |
| total young gen GC time [ms]        |       91679 |     81641 |
| total old gen GC time [ms]          |        1869 |      1826 |

or graphically:

[charts: gc_times, throughput_index-append, service_time_info]

(service time percentile distribution is not shown in the table but from the graphs we can see that it is practically identical)

The only metric the adaptive receive predictor influenced was young gen collection time (a reduction of 3% and 11%, respectively), but we saw no improvement in throughput or service time. Given that I still consider these settings pretty esoteric, I'd like to simplify the configuration and remove them in favor of `http.netty.receive_predictor`.

@jasontedor
Copy link
Member

Thanks for doing this work @danielmitterdorfer. I'm sure it was time-consuming, but I think it's important to deeply understand the impact before making this simplification. I think it's fair to say that we do now and can proceed with simplifying here.

@jasontedor
Member

jasontedor left a comment

LGTM.

@jasontedor
Member

While these are esoteric settings, I think that we need a proper deprecation/removal cycle on these. I'm okay with deprecating this in 6.1.0, and removing in 7.0.0.

@danielmitterdorfer
Member Author

Thanks for the review.

I think that we need a proper deprecation/removal cycle on these. I'm okay with deprecating this in 6.1.0, and removing in 7.0.0.

So just to double-check: these settings have been undocumented so far. Should I introduce them in the 6.1 docs and immediately mark them as deprecated, or did you mean to just add a deprecated annotation on the respective fields in the source code?

@jasontedor
Member

I don't think we need to add them to the docs, adding deprecation logging in 6.1 is sufficient.

@danielmitterdorfer danielmitterdorfer merged commit e22844b into elastic:master Oct 10, 2017
danielmitterdorfer added a commit that referenced this pull request Oct 10, 2017
With this commit we deprecate the following settings:

* http.netty.receive_predictor_min
* http.netty.receive_predictor_max

These settings are undocumented and will be removed with Elasticsearch
7.0.

Relates #26165
@danielmitterdorfer
Member Author

Deprecated in 6.x with 0e14e71.

@jasontedor
Member

That is not what I meant by deprecation; I meant deprecation logging. Deprecating this in code has little benefit.

@danielmitterdorfer
Member Author

Sorry, I misread. I'll add deprecation logging.

jasontedor added a commit to olcbean/elasticsearch that referenced this pull request Oct 10, 2017
* master: (22 commits)
  Allow only a fixed-size receive predictor (elastic#26165)
  Add Homebrew instructions to getting started
  ingest: Fix bug that prevent date_index_name processor from accepting timestamps specified as a json number
  Scripting: Fix expressions to temporarily support filter scripts (elastic#26824)
  Docs: Add note to contributing docs warning against tool based refactoring (elastic#26936)
  Fix thread context handling of headers overriding (elastic#26068)
  SearchWhileCreatingIndexIT: remove usage of _only_nodes
  update Lucene version for 6.0-RC2 version
  Calculate and cache result when advanceExact is called (elastic#26920)
  Test query builder bwc against previous supported versions instead of just the current version.
  Set minimum_master_nodes on rolling-upgrade test (elastic#26911)
  Return List instead of an array from settings (elastic#26903)
  remove _primary and _replica shard preferences (elastic#26791)
  fixing typo in datehistogram-aggregation.asciidoc (elastic#26924)
  [API] Added the `terminate_after` parameter to the REST spec for "Count" API
  Setup debug logging for qa.full-cluster-restart
  Enable BWC testing against other remotes
  Use LF line endings in Painless generated files (elastic#26822)
  [DOCS] Added info about snapshotting your data before an upgrade.
  Add documentation about disabling `_field_names`. (elastic#26813)
  ...
danielmitterdorfer added a commit that referenced this pull request Oct 10, 2017
With this commit we mark the following settings as deprecated:

* http.netty.receive_predictor_min
* http.netty.receive_predictor_max

The settings infrastructure then emits a deprecation warning upon startup if
these settings are configured.

Relates #26165
@danielmitterdorfer
Member Author

Added deprecation logging now in e48d747 (deprecation warnings are emitted by our settings infrastructure), e.g.:

[2017-10-10T14:53:50,657][WARN ][o.e.d.c.s.Settings       ] [http.netty.receive_predictor_max] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.
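
For reference, registering the settings with the Deprecated property is what triggers this warning. A rough sketch follows (not the exact code in e48d747; the class name and default values are illustrative):

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Setting.Property;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;

// Settings registered with Property.Deprecated cause the settings infrastructure
// to emit a deprecation warning whenever they appear in the node configuration.
public final class DeprecatedReceivePredictorSettings {

    public static final Setting<ByteSizeValue> RECEIVE_PREDICTOR_MIN =
        Setting.byteSizeSetting("http.netty.receive_predictor_min",
            new ByteSizeValue(64, ByteSizeUnit.KB),          // illustrative default
            Property.NodeScope, Property.Deprecated);

    public static final Setting<ByteSizeValue> RECEIVE_PREDICTOR_MAX =
        Setting.byteSizeSetting("http.netty.receive_predictor_max",
            new ByteSizeValue(64, ByteSizeUnit.KB),          // illustrative default
            Property.NodeScope, Property.Deprecated);
}
```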

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Oct 10, 2017
* master:
  Fix handling of paths containing parentheses
  Allow only a fixed-size receive predictor (elastic#26165)
  Add Homebrew instructions to getting started
  ingest: Fix bug that prevent date_index_name processor from accepting timestamps specified as a json number
  Scripting: Fix expressions to temporarily support filter scripts (elastic#26824)
  Docs: Add note to contributing docs warning against tool based refactoring (elastic#26936)
  Fix thread context handling of headers overriding (elastic#26068)
  SearchWhileCreatingIndexIT: remove usage of _only_nodes
jasontedor added a commit that referenced this pull request Oct 12, 2017
* master: (35 commits)
  Create weights lazily in filter and filters aggregation (#26983)
  Use a dedicated ThreadGroup in rest sniffer (#26897)
  Fire global checkpoint sync under system context
  Update by Query is modified to accept short `script` parameter. (#26841)
  Cat shards bytes (#26952)
  Add support for parsing inline script (#23824) (#26846)
  Change default value to true for transpositions parameter of fuzzy query (#26901)
  Adding unreleased 5.6.4 version number to Version.java
  Rename TCPTransportTests to TcpTransportTests (#26954)
  Fix NPE for /_cat/indices when no primary shard (#26953)
  [DOCS] Fixed indentation of the definition list.
  Fix formatting in channel close test
  Check for closed connection while opening
  Clarify systemd overrides
  [DOCS] Plugin Installation for Windows (#21671)
  Painless: add tests for cached boxing (#24163)
  Don't detect source's XContentType in DocumentParser.parseDocument() (#26880)
  Fix handling of paths containing parentheses
  Allow only a fixed-size receive predictor (#26165)
  Add Homebrew instructions to getting started
  ...