Add maxIdleSockets and idleSocketTimeout to Elasticsearch config #142019

Merged: rudolf merged 12 commits into main from es-socket-config on Oct 10, 2022

Conversation

rudolf
Contributor

@rudolf rudolf commented Sep 27, 2022

Summary

Closes #137673

Checklist

Delete any items that are not applicable to this PR.

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |
| --- | --- | --- | --- |
| Multiple Spaces: unexpected behavior in non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces. |
| Multiple nodes: Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure. |
| Code should gracefully handle cases when feature X or plugin Y are disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still results in our service operational. |
See more potential risk examples

For maintainers

@rudolf rudolf added Feature:elasticsearch Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc labels Sep 27, 2022
@github-actions
Contributor

Documentation preview:

@@ -38,8 +39,10 @@ export function parseClientOptions(
// fixes https://github.com/elastic/kibana/issues/101944
disablePrototypePoisoningProtection: true,
agent: {
maxSockets: config.maxSockets,
maxTotalSockets: config.maxSockets,
Contributor Author

@rudolf rudolf Sep 27, 2022

maxSockets only limits the number of sockets per host. What we want, and what our documentation describes, is the maximum number of sockets that Kibana opens to ES. Because there are multiple ES client instances, and hence multiple agents, this still doesn't behave 100% as advertised, but maxTotalSockets should at least bring the behaviour closer to what we advertise. https://nodejs.org/api/http.html#agentmaxtotalsockets
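
To illustrate the difference between the two limits described in this comment, here is a minimal sketch using a plain Node.js `http.Agent`. The value 100 is an arbitrary example, not a Kibana default, and the snippet is not taken from the PR.

```typescript
import * as http from "node:http";

// maxSockets caps concurrent sockets per host (origin), while maxTotalSockets
// caps concurrent sockets across all hosts served by this one agent.
const agent = new http.Agent({
  keepAlive: true,
  maxSockets: 100, // arbitrary example value
  maxTotalSockets: 100, // same value, applied across all hosts
});

// The configured limits are exposed as properties on the agent instance.
const limits = { perHost: agent.maxSockets, total: agent.maxTotalSockets };

// Each Agent enforces its own limits, so several agents (one per ES client
// instance) can together exceed the number configured here.
agent.destroy();
```

Because the limits are per agent, a process that creates N agents with this configuration could still open up to N times 100 sockets, which is the caveat raised in the comment.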

keepAlive: config.keepAlive ?? true,
timeout: getDurationAsMs(config.idleSocketTimeout),

Contributor

Looking at the documentation, it seems that Node itself does not close or destroy the socket.
Instead, it emits a 'timeout' event, and it's then up to the user to end the connection.
So I'm not sure the connections are actually being closed on timeout at the moment. I'll try it locally and get back to you.

Contributor

We're good, the Agent subscribes to the 'timeout' event and destroys the socket:
https://github.com/nodejs/node/blob/main/lib/_http_agent.js#L402-L414

keepAlive: config.keepAlive ?? true,
timeout: getDurationAsMs(config.idleSocketTimeout),
maxFreeSockets: config.maxIdleSockets,

import type { ElasticsearchClientConfig } from '@kbn/core-elasticsearch-server';
import { AgentOptions } from 'https';
Contributor Author

Although elasticsearch-js exposes an HttpAgentOptions type, it is not up to date with the Node.js AgentOptions type, so we import the Node.js type directly.
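
As a rough sketch of how the Node.js `AgentOptions` type can be used for the mapping shown in the diff above: `SocketConfig` and `toAgentOptions` are hypothetical names introduced for illustration, and the timeout is passed as plain milliseconds, whereas the PR converts a duration with `getDurationAsMs`.

```typescript
import type { AgentOptions } from "node:https";

// Hypothetical, trimmed-down config shape (not Kibana's ElasticsearchClientConfig).
interface SocketConfig {
  maxSockets: number;
  maxIdleSockets: number;
  idleSocketTimeoutMs: number;
  keepAlive?: boolean;
}

// Maps the config onto Node.js agent options, mirroring the fields in the diff.
function toAgentOptions(config: SocketConfig): AgentOptions {
  return {
    maxSockets: config.maxSockets,
    maxTotalSockets: config.maxSockets,
    keepAlive: config.keepAlive ?? true,
    timeout: config.idleSocketTimeoutMs,
    maxFreeSockets: config.maxIdleSockets,
  };
}

// Example values only.
const options = toAgentOptions({
  maxSockets: 1024,
  maxIdleSockets: 256,
  idleSocketTimeoutMs: 60_000,
});
```

Typing the return value as the Node.js `AgentOptions` means that fields such as `maxTotalSockets` are checked against the Node.js definitions rather than the client library's copy.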

@@ -37,6 +37,8 @@ export const configSchema = schema.object({
defaultValue: 'http://localhost:9200',
}),
maxSockets: schema.number({ defaultValue: Infinity, min: 1 }),
maxIdleSockets: schema.number({ defaultValue: 256, min: 1 }),
Contributor Author

256 is the Node.js default, so adding a default value to the config just makes this explicit and decouples us from any changes to Node.js itself.

@@ -37,6 +37,8 @@ export const configSchema = schema.object({
defaultValue: 'http://localhost:9200',
}),
maxSockets: schema.number({ defaultValue: Infinity, min: 1 }),
maxIdleSockets: schema.number({ defaultValue: 256, min: 1 }),
idleSocketTimeout: schema.duration({ defaultValue: '9m' }),
Contributor Author

@rudolf rudolf Sep 27, 2022

The services forwarder has a timeout of 10 minutes. But a race condition is possible: the proxy sends a close, the close packet is delayed in reaching Kibana, and at the same time Kibana might start a new request. So to be on the safe side we keep a margin greater than requestTimeout (30s by default), so that Kibana closes any idle sockets before the proxy might try to close them.

On Cloud we can configure Kibana to default to 9 minutes, but because this would be a behaviour change and might cause problems if there's a transparent proxy, I chose to default to 60s.
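
The margin reasoning in this comment can be sketched as a small calculation. `maxSafeIdleTimeoutMs` is a hypothetical helper written for illustration, not part of Kibana; the 10 minute and 30s figures come from the comment itself.

```typescript
// Hypothetical helper (not Kibana code): the upper bound for an idle-socket
// timeout that still leaves a margin of one full request timeout before the
// proxy's own idle timeout fires.
function maxSafeIdleTimeoutMs(proxyIdleTimeoutMs: number, requestTimeoutMs: number): number {
  return Math.max(0, proxyIdleTimeoutMs - requestTimeoutMs);
}

const MINUTE = 60_000;
const proxyIdleTimeoutMs = 10 * MINUTE; // services forwarder timeout
const requestTimeoutMs = 30_000; // default requestTimeout

// 10 minutes minus 30 seconds leaves an upper bound of 9.5 minutes.
const upperBoundMs = maxSafeIdleTimeoutMs(proxyIdleTimeoutMs, requestTimeoutMs);

// Both candidate values discussed in the comment stay below that bound.
const nineMinutesIsSafe = 9 * MINUTE < upperBoundMs;
const sixtySecondsIsSafe = 60_000 < upperBoundMs;
```

Under these assumptions both 9m and 60s satisfy the margin; the choice between them is about behaviour change and transparent proxies rather than the race itself.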

(This image shows the old services-forwarder socket timeout of 30s)

Contributor

Beware that your implementation is still setting '9m'

@rudolf
Contributor Author

rudolf commented Sep 28, 2022

Apart from adding the configuration options, this introduces a behaviour change: sockets now have a 60s timeout instead of elasticsearch-js's 30s default. Because we default to elasticsearch.maxSockets: Infinity, Node.js won't actually re-use the sockets, so this shouldn't impact the vast majority of clusters.

Tested this locally by creating load using https://github.com/elastic/kibana-k6-benchmarking and monitoring sockets with:

Kibana <-> Elasticsearch socket count:
watch "lsof -n -i -s | grep -e 'node.*127\.0\.0\.1:wap-wsp' | wc -l"

K6 <-> Kibana socket count:
watch "lsof -n -i -s | grep -e 'node.*127\.0\.0\.1:esmagent' | wc -l"

Some observations from the tests: when Kibana has a large number of open sockets, the socket count continuously jumps up and down by several hundred, e.g. 1500 sockets and then 1200 sockets 2s later. This is because the default maxIdleSockets is 256, so when there are e.g. 2000 sockets open, more than 256 of them quickly become free, and we then close a ton of sockets just to re-open them a second or so later.

When I set a really high maxIdleSockets value (e.g. the same as maxSockets, or > 2000), the number of open sockets is much more stable. This basically means sockets are only closed once they have been idle for idleSocketTimeout, which seems preferable to me. If there's a sudden spike in traffic, there will be a large number of unnecessary sockets for the next idleSocketTimeout, but that's not a huge issue in and of itself.
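
The churn described above can be approximated with a deliberately simplified model. This is an assumption made for illustration, not Node.js source: it treats all sockets as belonging to one host and all releases as happening at once. The figure of 600 freed sockets is invented for the example.

```typescript
// Simplified model (an assumption, not Node.js source): when `freed` sockets
// are released at once, at most `maxIdleSockets` are kept idle for reuse and
// the remainder are closed immediately.
function socketsClosedOnRelease(freed: number, maxIdleSockets: number): number {
  return Math.max(0, freed - maxIdleSockets);
}

// A burst ends and 600 of 2000 open sockets become free at the same moment.
const closedWithDefault = socketsClosedOnRelease(600, 256); // 600 - 256 closed right away
const closedWithHighCap = socketsClosedOnRelease(600, 2000); // nothing closed right away
```

With the high cap, the freed sockets stay open until they have been idle for idleSocketTimeout, which matches the more stable socket counts observed in the test.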

@rudolf rudolf marked this pull request as ready for review September 29, 2022 13:54
@rudolf rudolf requested review from a team as code owners September 29, 2022 13:54
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

docs/setup/settings.asciidoc (review thread outdated and resolved)
Member

@weltenwort weltenwort left a comment

monitoring test changes LGTM

@rudolf
Contributor Author

rudolf commented Oct 3, 2022

@elasticmachine merge upstream

Contributor

@pgayvallet pgayvallet left a comment

Overall looking good to me, just have one comment.

As a side note, do we want to allow-list these new settings in Cloud?

Comment on lines -56 to -66
const config = Object.assign(
{},
DEFAULT_CONFIG,
this.agentOptions,
agentOptions,
connectionOpts.tls
);
Contributor

I'm not sure I understand the exact implications of the changes in this file.

Contributor Author

We had a way to create a default agent config that was passed into the agent manager constructor. But this functionality was never used, and it sometimes confused me while developing because it wasn't obvious where the config of the final agent actually came from. It was just a small moment of "what's going on here", but it felt like if we're not using this "feature", the code is easier to reason about if we just remove it altogether.

Contributor Author

Actually, there were several levels:

  1. DEFAULT_CONFIG: static defaults in code
  2. this.agentOptions: an optional constructor parameter that can set defaults for any agent created by the agent factories of this agent manager
  3. finally, the kibana.yml defaults that are passed into the agent factories when they are created.

So now we just have (3).
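
The three levels listed above can be sketched with `Object.assign`, where later sources override earlier ones. All values and the `Options` type are invented for illustration, and the `connectionOpts.tls` source from the removed code is left out for brevity.

```typescript
// Local stand-in for the agent options; values below are illustrative only.
type Options = { keepAlive?: boolean; maxSockets?: number; maxFreeSockets?: number };

const DEFAULT_CONFIG: Options = { keepAlive: true, maxSockets: 256 }; // (1) static defaults
const constructorDefaults: Options = { maxSockets: 512 }; // (2) this.agentOptions
const fromKibanaYml: Options = { maxSockets: 1024, maxFreeSockets: 256 }; // (3) kibana.yml

// Before: three layers merged, with later sources overriding earlier ones.
const before: Options = Object.assign({}, DEFAULT_CONFIG, constructorDefaults, fromKibanaYml);

// After: only the options derived from kibana.yml remain.
const after: Options = Object.assign({}, fromKibanaYml);
```

With a single source, the final agent config can be read directly from what the agent factory was given, which is the readability gain described in the comment.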

Contributor

Makes sense! (1)'s values are equivalent to NodeJs's defaults, and (2) was not used.

@rudolf rudolf requested a review from gsoldevila October 3, 2022 11:56
@gsoldevila
Contributor

> Overall looking good to me, just have one comment.
>
> As a side note, do we want to allow-list these new settings in Cloud?

I would say it makes sense, so that we can tweak them once we have metrics about them.
At least, that was the reasoning when we allow-listed maxSockets.

LGTM!

@rudolf rudolf enabled auto-merge (squash) October 5, 2022 14:29
@rudolf
Contributor Author

rudolf commented Oct 10, 2022

@elasticmachine merge upstream

@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/core-elasticsearch-server | 51 | 53 | +2 |
Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/core-elasticsearch-server | 99 | 103 | +4 |
| core | 2689 | 2691 | +2 |
| total | | | +6 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@rudolf rudolf merged commit 4414692 into main Oct 10, 2022
@rudolf rudolf deleted the es-socket-config branch October 10, 2022 10:34
@kibanamachine kibanamachine added v8.6.0 backport:skip This commit does not require backporting labels Oct 10, 2022
WafaaNasr pushed a commit to WafaaNasr/kibana that referenced this pull request Oct 11, 2022
…stic#142019)

* Add maxIdleSockets and idleSocketTimeout to Elasticsearch config

* Simplify agent manager

* Fix types

* Fix types

* Reduce idleSocketTimeout default to 60s

* Fix tests

* Update docs/setup/settings.asciidoc

* Address review comments

Co-authored-by: Kibana Machine <[email protected]>
WafaaNasr pushed a commit to WafaaNasr/kibana that referenced this pull request Oct 14, 2022
Labels
backport:skip This commit does not require backporting Feature:elasticsearch performance release_note:enhancement Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v8.6.0
7 participants