Add maxIdleSockets and idleSocketTimeout to Elasticsearch config #142019
Conversation
@@ -38,8 +39,10 @@ export function parseClientOptions(
     // fixes https://github.com/elastic/kibana/issues/101944
     disablePrototypePoisoningProtection: true,
     agent: {
-      maxSockets: config.maxSockets,
+      maxTotalSockets: config.maxSockets,
maxSockets only limits the maximum number of sockets per host. What we want, and what our documentation describes, is the maximum number of sockets that Kibana would open to ES. Because there are multiple ES client instances, and hence multiple agents, this still doesn't behave 100% as advertised, but maxTotalSockets should at least bring the behaviour closer to what we advertise. https://nodejs.org/api/http.html#agentmaxtotalsockets
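A minimal sketch of the distinction (not the Kibana code; values are illustrative): `maxSockets` caps sockets per origin, while `maxTotalSockets` caps sockets across the whole agent.

```typescript
import * as http from 'http';

// maxSockets limits concurrent sockets *per host*, so an agent talking to
// several ES hosts could still open far more sockets overall.
// maxTotalSockets caps the agent-wide total.
const agent = new http.Agent({
  keepAlive: true,
  maxSockets: 256,       // per-host limit only
  maxTotalSockets: 1024, // agent-wide limit, closer to what the docs advertise
});

console.log(agent.maxSockets);      // 256
console.log(agent.maxTotalSockets); // 1024
```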
      keepAlive: config.keepAlive ?? true,
+     timeout: getDurationAsMs(config.idleSocketTimeout),
Looking at the documentation, it seems that Node itself does not close / destroy the socket. It rather emits a timeout event, and then it's up to the user to end the connection. Thus, I'm not sure the connections are being closed due to timeout at the moment. I'll give it a try locally and get back to you.
We're good, the Agent subscribes to the 'timeout' event and destroys the socket: https://github.com/nodejs/node/blob/main/lib/_http_agent.js#L402-L414
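The behaviour being discussed can be sketched on a bare socket (a hedged illustration, not the Agent's internals; http.Agent wires up the equivalent handler for its keep-alive sockets): Node only emits 'timeout' on an idle socket, and whoever listens must destroy it.

```typescript
import * as net from 'net';

// Node emits 'timeout' after the socket has been idle for the configured
// duration, but does NOT close the socket itself -- the listener must.
const socket = new net.Socket();
socket.setTimeout(60_000);
socket.on('timeout', () => {
  socket.destroy(); // without this, the idle socket would stay open
});

console.log(socket.timeout); // 60000
socket.destroy(); // clean up the demo socket
```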
      keepAlive: config.keepAlive ?? true,
      timeout: getDurationAsMs(config.idleSocketTimeout),
+     maxFreeSockets: config.maxIdleSockets,
  import type { ElasticsearchClientConfig } from '@kbn/core-elasticsearch-server';
+ import { AgentOptions } from 'https';
Although Elasticsearch-js exposes an HttpAgentOptions type, this type is not up to date with the Node.js type.
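A short sketch of why importing Node's own type helps (assumption: current @types/node definitions): newer agent fields type-check against Node's AgentOptions but may be missing from the client's HttpAgentOptions.

```typescript
// Using Node's own AgentOptions means fields like maxTotalSockets
// (newer additions to the http/https agent) type-check correctly.
import type { AgentOptions } from 'https';

const agentOptions: AgentOptions = {
  keepAlive: true,
  maxTotalSockets: 1024, // present in Node's types
  maxFreeSockets: 256,
};

console.log(agentOptions.maxTotalSockets); // 1024
```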
@@ -37,6 +37,8 @@ export const configSchema = schema.object({
     defaultValue: 'http://localhost:9200',
   }),
   maxSockets: schema.number({ defaultValue: Infinity, min: 1 }),
+  maxIdleSockets: schema.number({ defaultValue: 256, min: 1 }),
256 is the Node.js default, so adding a default value for the config just makes this explicit and decouples us from any changes to Node.js itself.
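A quick way to see the Node default being pinned here (assumption: current Node versions): an agent created without maxFreeSockets falls back to 256.

```typescript
import * as http from 'http';

// An agent with no explicit maxFreeSockets uses Node's built-in default,
// which is the same 256 the config schema now states explicitly.
const agent = new http.Agent({ keepAlive: true });
console.log(agent.maxFreeSockets); // 256 on current Node versions
```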
@@ -37,6 +37,8 @@ export const configSchema = schema.object({
     defaultValue: 'http://localhost:9200',
   }),
   maxSockets: schema.number({ defaultValue: Infinity, min: 1 }),
   maxIdleSockets: schema.number({ defaultValue: 256, min: 1 }),
+  idleSocketTimeout: schema.duration({ defaultValue: '9m' }),
The services forwarder has a timeout of 10 minutes. But a race condition is possible where the proxy sends a close, there's some delay in the close packet reaching Kibana, and at the same time Kibana might start a new request. So to be on the safe side we have a margin of > requestTimeout (30s by default) so that Kibana would close any idle sockets before the proxy might try to close them.
On Cloud we can configure Kibana to default to 9 minutes, but because this would be a behaviour change and might cause problems if there's a transparent proxy, I chose to default to 60s.
(This image shows the old services-forwarder socket timeout of 30s)
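The margin arithmetic above can be sanity-checked with a back-of-the-envelope calculation (the constant names are illustrative stand-ins, not Kibana identifiers; the proxy timeout is taken from the comment):

```typescript
// Kibana should close idle sockets before the proxy does, with a margin
// larger than the request timeout so an in-flight request can't collide
// with the proxy's close.
const requestTimeoutMs = 30_000;        // ES client default request timeout
const proxyIdleTimeoutMs = 10 * 60_000; // services forwarder, per the comment
const idleSocketTimeoutMs = 9 * 60_000; // proposed Cloud default

const marginMs = proxyIdleTimeoutMs - idleSocketTimeoutMs;
console.log(marginMs > requestTimeoutMs); // true: 60s margin > 30s request timeout
```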
Beware that your implementation is still setting '9m'.
Apart from the configuration options, this introduces a behaviour change in that sockets now have a 60s timeout instead of Elasticsearch-js's 30s default.

Tested this locally by creating load using K6.

Kibana <-> Elasticsearch socket count: (screenshot)

K6 <-> Kibana socket count: (screenshot)

Some observations from the tests. When Kibana has a large number of open sockets, the socket count continuously jumps up and down by several hundred, e.g. 1500 sockets and 2s later 1200 sockets. This is because the default maxIdleSockets is 256, so when there are e.g. 2000 sockets open it quickly happens that more than 256 of them become free, and then we close a ton of sockets just to re-open them a second or so later.

When I set a maxIdleSockets value that's really high (e.g. the same as maxSockets, or > 2000), the number of open sockets is much more stable. This basically means sockets will only be closed if they're idle, which seems preferable to me. If there's a sudden spike in traffic it means that for the next idleSocketTimeout there will be a large number of unnecessary sockets, but that's not a huge issue in and of itself.
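The tuning observed in these tests can be sketched as follows (a hedged illustration with made-up values, not the Kibana config): setting maxFreeSockets as high as the total cap avoids churning sockets during load spikes, while the idle timeout still reclaims them eventually.

```typescript
import * as http from 'http';

// With maxFreeSockets equal to the agent-wide cap, free sockets are not
// culled eagerly after a burst; they are only destroyed once idle for
// the configured timeout.
const agent = new http.Agent({
  keepAlive: true,
  maxTotalSockets: 2048,
  maxFreeSockets: 2048, // match the cap so sockets close only when idle
  timeout: 60_000,      // idle sockets are reclaimed after 60s
});

console.log(agent.maxFreeSockets === agent.maxTotalSockets); // true
```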
Pinging @elastic/kibana-core (Team:Core)
monitoring test changes LGTM
@elasticmachine merge upstream
Overall looking good to me, just have one comment.
As a side note, do we want to allow-list these new settings in Cloud?
const config = Object.assign(
  {},
  DEFAULT_CONFIG,
  this.agentOptions,
  agentOptions,
  connectionOpts.tls
);
Not sure I get the exact implications of the changes in this file?
We had a way to create a default agent config passed into the agent manager constructor. But this functionality was never used, and it sometimes confused me while developing because it wasn't obvious where the config of the final agent actually came from. It was just a small moment of "what's going on here", but it felt like if we're not using this "feature", it's easier to reason about the code if we just remove it altogether.
Actually there were several levels:
1. DEFAULT_CONFIG - static defaults in code
2. this.agentOptions - optional constructor parameter that can set defaults for any agent created by the agent factories of this agent manager
3. finally, the kibana.yml defaults that are passed into the agent factories when they are created.

So now we just have (3).
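The simplification can be mocked up like this (hypothetical stand-in types and values; only the kibana.yml-derived options (3) plus per-connection TLS remain, while DEFAULT_CONFIG (1) and this.agentOptions (2) are gone):

```typescript
// Stand-in shape for the agent options; not the real Kibana interface.
interface MockAgentOptions {
  keepAlive?: boolean;
  maxTotalSockets?: number;
  ca?: string;
}

// (3) options derived from kibana.yml, passed into the agent factory.
const agentOptions: MockAgentOptions = { keepAlive: true, maxTotalSockets: 1024 };
// Per-connection TLS settings still merged in last.
const connectionOpts = { tls: { ca: 'fake-ca' } };

const config: MockAgentOptions = { ...agentOptions, ...connectionOpts.tls };
console.log(config); // { keepAlive: true, maxTotalSockets: 1024, ca: 'fake-ca' }
```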
Makes sense! (1)'s values are equivalent to NodeJs's defaults, and (2) was not used.
I would say it makes sense, so that we can tweak them once we have metrics about them. LGTM!
@elasticmachine merge upstream
💚 Build Succeeded
Metrics [docs]: Public APIs missing comments
Unknown metric groups: API count
…stic#142019)

* Add maxIdleSockets and idleSocketTimeout to Elasticsearch config
* Simplify agent manager
* Fix types
* Fix types
* Reduce idleSocketTimeout default to 60s
* Fix tests
* Update docs/setup/settings.asciidoc
* Address review comments

Co-authored-by: Kibana Machine <[email protected]>
Summary
Closes #137673
Checklist
Delete any items that are not applicable to this PR.
Risk Matrix
Delete this section if it is not applicable to this PR.
Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.
When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:
For maintainers