Allow only a fixed-size receive predictor #26165

Conversation

danielmitterdorfer
Member

With this commit we simplify our network layer by only allowing a fixed receive
predictor size to be defined instead of an adaptive one. This also means that
the following (previously undocumented) settings are removed:

* `http.netty.receive_predictor_min`
* `http.netty.receive_predictor_max`

Using an adaptive sizing policy in the receive predictor is a very low-level
optimization. Its implications for allocation behavior are extremely hard to grasp
(see our previous work in #23185), and adaptive sizing would only be beneficial
anyway if the message protocol produced messages of very different sizes at the
network level.
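
For context, the distinction at the Netty level looks roughly like the sketch below. It uses Netty's public allocator classes; the sizes, class name and wiring are illustrative, not the actual Elasticsearch code.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.AdaptiveRecvByteBufAllocator;
import io.netty.channel.ChannelOption;
import io.netty.channel.FixedRecvByteBufAllocator;

// Illustrative only: shows the adaptive vs. fixed receive buffer allocators
// that the receive predictor settings correspond to.
public class ReceivePredictorSketch {

    static void configureReceivePredictor(ServerBootstrap bootstrap, boolean adaptive) {
        if (adaptive) {
            // Adaptive: Netty starts at the initial size and grows/shrinks the
            // per-read buffer between min and max based on observed read sizes.
            bootstrap.childOption(ChannelOption.RCVBUF_ALLOCATOR,
                new AdaptiveRecvByteBufAllocator(16 * 1024, 32 * 1024, 64 * 1024));
        } else {
            // Fixed: every read uses a buffer of exactly this size (roughly what
            // http.netty.receive_predictor controls).
            bootstrap.childOption(ChannelOption.RCVBUF_ALLOCATOR,
                new FixedRecvByteBufAllocator(64 * 1024));
        }
    }
}
```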

To determine whether these settings are beneficial, we ran the PMC and
nyc_taxis benchmarks from our macrobenchmark suite with various heap
sizes (1GB, 2GB, 4GB, 8GB, 16GB). In one set of scenarios we used a fixed
receive predictor size (`http.netty.receive_predictor`) of 16kB, 32kB
and 64kB. We contrasted this with `http.netty.receive_predictor_min` = 16kB and
`http.netty.receive_predictor_max` = 64kB. The results (specifically indexing
throughput) were identical, accounting for natural run-to-run variance.

In summary, these settings offer no benefit but only add complexity.

@danielmitterdorfer
Member Author

@jasontedor Could you please have a look?

@jasontedor
Member

I had a look and I have a question. For the PMC benchmark, are the payloads roughly the same size so that we would not expect receive prediction to have any impact anyway? What about a workload where the payloads are variable in size so that receive prediction does have an impact?

@danielmitterdorfer
Member Author

danielmitterdorfer commented Aug 22, 2017

What about a workload where the payloads are variable in size so that receive prediction does have an impact?

Thanks for the feedback. I do not expect that we would benefit from a variable-size receive predictor even in that case, because it is not the size of the HTTP request that matters so much as the number of bytes read per read from the network layer (i.e. in io.netty.channel.nio.NioEventLoop#processSelectedKeys()). Consider this example (the numbers are made up):

Suppose we send one HTTP bulk request to Elasticsearch that is, say, 1MB in size. From the network layer we might get the data in 16kB chunks, and this is where the receive predictor comes into play: what we would need is variation at that level. Even if we varied the size of the HTTP request, we would still read chunks of roughly the same size; we would just read more of them.

What does have an influence, though, are the kernel's receive buffer settings. Below are histograms showing the number of bytes read at once (obtained by logging lastBytesRead() every time readComplete() is called by Netty) for three different kernel receive buffer settings. I tested all scenarios with a receive predictor size of 512kB (to ensure that our network layer can always read a full OS-level network buffer at once).
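
(For reference, the logging described above can be approximated with a small Netty inbound handler like the sketch below. This is not the exact instrumentation I used; the class name is made up, and it counts the readable bytes of each inbound buffer rather than calling the allocator handle's lastBytesRead() directly.)

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Hypothetical handler, placed early in the pipeline: it accumulates how many
// bytes the read events delivered and logs the total when the read cycle completes.
public class ReadSizeLoggingHandler extends ChannelInboundHandlerAdapter {

    private long bytesInCurrentReadCycle;

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        if (msg instanceof ByteBuf) {
            // readableBytes() approximates the number of bytes the transport
            // read from the socket in this read call.
            bytesInCurrentReadCycle += ((ByteBuf) msg).readableBytes();
        }
        ctx.fireChannelRead(msg);
    }

    @Override
    public void channelReadComplete(ChannelHandlerContext ctx) {
        System.out.println("bytes read in this read cycle: " + bytesInCurrentReadCycle);
        bytesInCurrentReadCycle = 0;
        ctx.fireChannelReadComplete();
    }
}
```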

Benchmark

esrally --track=pmc --challenge=append-no-conflicts-index-only --pipeline=benchmark-only --target-hosts=192.168.2.2:9200

Scenarios

# defaults on this machine:
sudo sysctl -w net.ipv4.tcp_rmem='10240 87380 16777216'
sudo sysctl -w net.ipv4.route.flush=1

[histogram: bytes read at once, default scenario]

Results (shortened):

|   Lap |                        Metric |    Operation |      Value |   Unit |
|------:|------------------------------:|-------------:|-----------:|-------:|
|   All |                 Indexing time |              |    58.9725 |    min |
|   All |                    Merge time |              |    45.4068 |    min |
|   All |                  Refresh time |              |    6.82535 |    min |
|   All |                    Flush time |              |    4.22022 |    min |
|   All |           Merge throttle time |              |    30.1738 |    min |
|   All |            Total Young Gen GC |              |     35.272 |      s |
|   All |              Total Old Gen GC |              |      0.859 |      s |
|   All |                Min Throughput | index-append |    1093.17 | docs/s |
|   All |             Median Throughput | index-append |    1110.63 | docs/s |
|   All |                Max Throughput | index-append |    1133.78 | docs/s |
|   All |  50th percentile service time | index-append |    3270.25 |     ms |
|   All |  90th percentile service time | index-append |    4543.66 |     ms |
|   All |  99th percentile service time | index-append |    5451.14 |     ms |
|   All | 100th percentile service time | index-append |    5905.43 |     ms |
|   All |                    error rate | index-append |          0 |      % |

# low
sudo sysctl -w net.ipv4.tcp_rmem='10240 10240 10240'
sudo sysctl -w net.ipv4.route.flush=1

[histogram: bytes read at once, low scenario]

Results (shortened):


|   Lap |                        Metric |    Operation |      Value |   Unit |
|------:|------------------------------:|-------------:|-----------:|-------:|
|   All |                 Indexing time |              |    86.5975 |    min |
|   All |                    Merge time |              |    116.911 |    min |
|   All |                  Refresh time |              |    16.4986 |    min |
|   All |                    Flush time |              |     3.0464 |    min |
|   All |           Merge throttle time |              |    93.1672 |    min |
|   All |            Total Young Gen GC |              |    495.635 |      s |
|   All |              Total Old Gen GC |              |     24.899 |      s |
|   All |                Min Throughput | index-append |    436.204 | docs/s |
|   All |             Median Throughput | index-append |     448.28 | docs/s |
|   All |                Max Throughput | index-append |    459.248 | docs/s |
|   All |  50th percentile service time | index-append |    8733.15 |     ms |
|   All |  90th percentile service time | index-append |    11474.5 |     ms |
|   All |  99th percentile service time | index-append |    14170.9 |     ms |
|   All | 100th percentile service time | index-append |      19552 |     ms |
|   All |                    error rate | index-append |          0 |      % |

# high
sudo sysctl -w net.ipv4.tcp_rmem='16777216 16777216 16777216'
sudo sysctl -w net.ipv4.route.flush=1

[histogram: bytes read at once, high scenario]

Results (shortened):

|   Lap |                        Metric |    Operation |     Value |   Unit |
|------:|------------------------------:|-------------:|----------:|-------:|
|   All |                 Indexing time |              |   58.7263 |    min |
|   All |                    Merge time |              |   43.5618 |    min |
|   All |                  Refresh time |              |   6.77613 |    min |
|   All |                    Flush time |              |    4.2515 |    min |
|   All |           Merge throttle time |              |   28.7809 |    min |
|   All |            Total Young Gen GC |              |    35.991 |      s |
|   All |              Total Old Gen GC |              |     0.896 |      s |
|   All |                Min Throughput | index-append |   1116.48 | docs/s |
|   All |             Median Throughput | index-append |   1152.24 | docs/s |
|   All |                Max Throughput | index-append |   1173.17 | docs/s |
|   All |  50th percentile service time | index-append |   3225.68 |     ms |
|   All |  90th percentile service time | index-append |   4566.68 |     ms |
|   All |  99th percentile service time | index-append |   6687.75 |     ms |
|   All | 100th percentile service time | index-append |   7174.17 |     ms |
|   All |                    error rate | index-append |         0 |      % |

@danielmitterdorfer
Member Author

sudo sysctl -w net.ipv4.tcp_rmem='16777216 16777216 16777216'

Actually, I should have run this case with only 512kB of OS receive buffer, but given that we see only a small number of cases where we read that many bytes, I think we are still ok here (i.e. there is no need to retest this case from my perspective).

@jasontedor
Member

I do not expect that we benefit from a variable size receive predictor even in that case because it's not the size of the HTTP request that matters so much but rather the number of bytes read per read from the network layer

@danielmitterdorfer I think you missed the point of my question, because the size of the payload does matter. Imagine a mix of requests on the HTTP layer: stats or single-doc indexing requests (very small payloads, well under the OS buffer) alongside multiple-doc indexing requests (flirting with the OS receive buffer size).

@danielmitterdorfer
Member Author

danielmitterdorfer commented Oct 9, 2017

It took a while but now I have some more numbers. I ran two different scenarios based on the pmc track:

    {
      "name": "append-and-search",
      "description": "",
      "index-settings": {
        "index.number_of_replicas": 0
      },
      "schedule": [
        {
          "parallel": {
            "completed-by": "index-append",
            "warmup-time-period": 240,
            "tasks": [
              {
                "operation": "index-append",
                "clients": 8
              },
              {
                "operation": "term",
                "clients": 1,
                "target-throughput": 20
              }
            ]
          }
        }
      ]
    }
    {
      "name": "append-and-info",
      "description": "",
      "index-settings": {
        "index.number_of_replicas": 0
      },
      "schedule": [
        {
          "parallel": {
            "completed-by": "index-append",
            "warmup-time-period": 240,
            "tasks": [
              {
                "operation": "index-append",
                "clients": 8
              },
              {
                "operation": "info",
                "clients": 1,
                "target-throughput": 100
              }
            ]
          }
        }
      ]
    }

(where info corresponds to RestMainAction, i.e. a call to "http://host:port/").

For append-and-search we see the following results (5 iterations):

| Metric                              | 64k (fixed) | 1k – 512k |
|------------------------------------:|------------:|----------:|
| median indexing throughput [docs/s] |        1371 |      1402 |
| total young gen GC time [ms]        |       87227 |     84565 |
| total old gen GC time [ms]          |        1819 |      1801 |

or graphically:

[charts: gc_times, throughput_index-append, service_time_term]

(service time percentile distribution is not shown in the table but from the graphs we can see that it is practically identical)

For append-and-info we see a similar picture:

| Metric                              | 64k (fixed) | 1k – 512k |
|------------------------------------:|------------:|----------:|
| median indexing throughput [docs/s] |        1411 |      1408 |
| total young gen GC time [ms]        |       91679 |     81641 |
| total old gen GC time [ms]          |        1869 |      1826 |

or graphically:

[charts: gc_times, throughput_index-append, service_time_info]

(service time percentile distribution is not shown in the table but from the graphs we can see that it is practically identical)

The only metric the adaptive receive predictor influenced was young gen collection time (a reduction of 3% and 11%, respectively), but we saw no improvement in throughput or service time. Given that I still consider these settings pretty esoteric, I'd like to simplify the configuration and remove them in favor of `http.netty.receive_predictor`.

@jasontedor
Copy link
Member

Thanks for doing this work @danielmitterdorfer. I'm sure it was time-consuming, but I think it's important to deeply understand the impact before making this simplification. I think it's fair to say that we do now and can proceed with simplifying here.

@jasontedor
Member

jasontedor left a comment

LGTM.

@jasontedor
Member

While these are esoteric settings, I think that we need a proper deprecation/removal cycle on these. I'm okay with deprecating this in 6.1.0, and removing in 7.0.0.

@danielmitterdorfer
Member Author

Thanks for the review.

I think that we need a proper deprecation/removal cycle on these. I'm okay with deprecating this in 6.1.0, and removing in 7.0.0.

So just to double-check: these settings have been undocumented so far. Should I introduce them in the 6.1 docs and immediately mark them as deprecated, or did you mean to just add a deprecated annotation on the respective fields in the source code?

@jasontedor
Member

I don't think we need to add them to the docs, adding deprecation logging in 6.1 is sufficient.

@danielmitterdorfer danielmitterdorfer merged commit e22844b into elastic:master Oct 10, 2017
danielmitterdorfer added a commit that referenced this pull request Oct 10, 2017
With this commit we deprecate the following settings:

* http.netty.receive_predictor_min
* http.netty.receive_predictor_max

These settings are undocumented and will be removed with Elasticsearch
7.0.

Relates #26165
@danielmitterdorfer
Member Author

Deprecated in 6.x with 0e14e71.

@jasontedor
Member

That is not what I meant by deprecation; I meant deprecation logging. Deprecating this in code has little benefit.

@danielmitterdorfer
Member Author

Sorry, I misread. I'll add deprecation logging.

jasontedor added a commit to olcbean/elasticsearch that referenced this pull request Oct 10, 2017
* master: (22 commits)
  Allow only a fixed-size receive predictor (elastic#26165)
  Add Homebrew instructions to getting started
  ingest: Fix bug that prevent date_index_name processor from accepting timestamps specified as a json number
  Scripting: Fix expressions to temporarily support filter scripts (elastic#26824)
  Docs: Add note to contributing docs warning against tool based refactoring (elastic#26936)
  Fix thread context handling of headers overriding (elastic#26068)
  SearchWhileCreatingIndexIT: remove usage of _only_nodes
  update Lucene version for 6.0-RC2 version
  Calculate and cache result when advanceExact is called (elastic#26920)
  Test query builder bwc against previous supported versions instead of just the current version.
  Set minimum_master_nodes on rolling-upgrade test (elastic#26911)
  Return List instead of an array from settings (elastic#26903)
  remove _primary and _replica shard preferences (elastic#26791)
  fixing typo in datehistogram-aggregation.asciidoc (elastic#26924)
  [API] Added the `terminate_after` parameter to the REST spec for "Count" API
  Setup debug logging for qa.full-cluster-restart
  Enable BWC testing against other remotes
  Use LF line endings in Painless generated files (elastic#26822)
  [DOCS] Added info about snapshotting your data before an upgrade.
  Add documentation about disabling `_field_names`. (elastic#26813)
  ...
danielmitterdorfer added a commit that referenced this pull request Oct 10, 2017
With this commit we mark the following settings as deprecated:

* http.netty.receive_predictor_min
* http.netty.receive_predictor_max

The settings infrastructure then emits a deprecation warning upon startup if
these settings are configured.

Relates #26165
@danielmitterdorfer
Member Author

Added deprecation logging now in e48d747 (deprecation warnings are emitted by our settings infrastructure), e.g.:

[2017-10-10T14:53:50,657][WARN ][o.e.d.c.s.Settings       ] [http.netty.receive_predictor_max] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.
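
For reference, registering the settings with the Deprecated property is what triggers this warning. A rough sketch follows (not the exact code in e48d747; the class name and default values are illustrative):

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Setting.Property;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;

// Settings registered with Property.Deprecated cause the settings infrastructure
// to emit a deprecation warning whenever they appear in the node configuration.
public final class DeprecatedReceivePredictorSettings {

    public static final Setting<ByteSizeValue> RECEIVE_PREDICTOR_MIN =
        Setting.byteSizeSetting("http.netty.receive_predictor_min",
            new ByteSizeValue(64, ByteSizeUnit.KB),          // illustrative default
            Property.NodeScope, Property.Deprecated);

    public static final Setting<ByteSizeValue> RECEIVE_PREDICTOR_MAX =
        Setting.byteSizeSetting("http.netty.receive_predictor_max",
            new ByteSizeValue(64, ByteSizeUnit.KB),          // illustrative default
            Property.NodeScope, Property.Deprecated);
}
```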

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Oct 10, 2017
* master:
  Fix handling of paths containing parentheses
  Allow only a fixed-size receive predictor (elastic#26165)
  Add Homebrew instructions to getting started
  ingest: Fix bug that prevent date_index_name processor from accepting timestamps specified as a json number
  Scripting: Fix expressions to temporarily support filter scripts (elastic#26824)
  Docs: Add note to contributing docs warning against tool based refactoring (elastic#26936)
  Fix thread context handling of headers overriding (elastic#26068)
  SearchWhileCreatingIndexIT: remove usage of _only_nodes
jasontedor added a commit that referenced this pull request Oct 12, 2017
* master: (35 commits)
  Create weights lazily in filter and filters aggregation (#26983)
  Use a dedicated ThreadGroup in rest sniffer (#26897)
  Fire global checkpoint sync under system context
  Update by Query is modified to accept short `script` parameter. (#26841)
  Cat shards bytes (#26952)
  Add support for parsing inline script (#23824) (#26846)
  Change default value to true for transpositions parameter of fuzzy query (#26901)
  Adding unreleased 5.6.4 version number to Version.java
  Rename TCPTransportTests to TcpTransportTests (#26954)
  Fix NPE for /_cat/indices when no primary shard (#26953)
  [DOCS] Fixed indentation of the definition list.
  Fix formatting in channel close test
  Check for closed connection while opening
  Clarify systemd overrides
  [DOCS] Plugin Installation for Windows (#21671)
  Painless: add tests for cached boxing (#24163)
  Don't detect source's XContentType in DocumentParser.parseDocument() (#26880)
  Fix handling of paths containing parentheses
  Allow only a fixed-size receive predictor (#26165)
  Add Homebrew instructions to getting started
  ...