
Elasticsearch client's maxSockets defaults to 256 #112756

Closed
kobelb opened this issue Sep 21, 2021 · 24 comments · Fixed by #113644
Labels
bug (Fixes for quality problems that affect the customer experience) · enhancement (New value added to drive a business result) · Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)

Comments

@kobelb (Contributor) commented Sep 21, 2021

The modern Elasticsearch client's http.Agent defaults to allowing a maximum of 256 sockets. However, the legacy Elasticsearch client allowed there to be an infinite number of sockets.

Was this accidentally changed with the introduction of the modern Elasticsearch client? If so, we should probably revert the accidental change. I do see benefit in allowing this to be configured via the kibana.yml so we or our users can alter this behavior.

/cc @pmuellr
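
For illustration, here's a minimal sketch (not Kibana's actual code) of where this default comes into play: the modern client accepts http(s).Agent options at instantiation, and without an explicit agent block it falls back to its own defaults, including maxSockets: 256. The node URL below is a placeholder.

    // Sketch: overriding the modern client's agent options at instantiation time.
    import { Client } from '@elastic/elasticsearch';

    const client = new Client({
      node: 'https://localhost:9200', // placeholder
      // Without this block the client applies its own agent defaults,
      // which include maxSockets: 256.
      agent: {
        keepAlive: true,
        maxSockets: Infinity, // or a finite cap chosen for the deployment
      },
    });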

kobelb added the bug, enhancement, and Team:Core labels on Sep 21, 2021
@elasticmachine (Contributor):

Pinging @elastic/kibana-core (Team:Core)

@pgayvallet (Contributor):

cc @delvedor this should be as easy as changing the agent.maxSockets option we're using when instantiating the Client, right? I see this option is typed as number | undefined. I'm guessing passing undefined will fall back to the default value of 256, so is there a way to provide a value that removes the limit on the number of sockets, or should we just use a very high number instead?

@delvedor (Member):

this should be as easy as changing the agent.maxSockets option we're using when instantiating the Client, right

Correct.

so is there a way to provide a value to remove the limit on the number of sockets, or should we just use a very high number instead?

You should configure a higher socket number. Be aware that if you configure Infinity, a new socket will be used for each request, increasing the memory pressure. I would recommend finding a good enough finite number, so you can keep the memory under control.

Some info about how the maxSockets option works can be found here: nodejs/node#26357.
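
To illustrate the trade-off described above, here's a rough sketch against a plain Node http.Agent (not the ES client; the host, port, and numbers are placeholders): with a finite maxSockets the agent queues excess requests internally, while Infinity opens one socket per in-flight request.

    // Sketch: how maxSockets changes agent behaviour under a burst of requests.
    import * as http from 'http';

    const capped = new http.Agent({ keepAlive: true, maxSockets: 10 });
    const unbounded = new http.Agent({ keepAlive: true, maxSockets: Infinity });

    function fire(agent: http.Agent, count: number) {
      for (let i = 0; i < count; i++) {
        http.get({ host: 'localhost', port: 9200, path: '/', agent }, (res) => res.resume());
      }
      // With `capped`, at most 10 sockets are open and the remaining requests wait
      // in agent.requests; with `unbounded`, all `count` sockets are opened at once.
    }

    fire(capped, 1000);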

@pgayvallet (Contributor):

@kobelb I guess the best option is, as you suggested, to allow users to configure this value via the config file. Would the current 256 client's default be suitable for the config's default value?

@mshustov (Contributor) commented Sep 29, 2021

I guess the best option is, as you suggested, to allow users to configure this value via the config file.

Let me play devil's advocate.
It's an option with great flexibility, but not the best UX. First of all, a Kibana user has to somehow detect the socket-starvation problem. It requires some effort to set up monitoring and keep an eye on it. Then, once the user has found the problem, they need to spend some time reading the Kibana docs to find the solution (increasing the maxSockets value).

I'd suggest increasing maxSockets in Core. There are at least two options:

  • set a static value, though this might have the downside of increased memory consumption for all Kibana instances.
  • change the value when a long queue in the ConnectionPool is detected. In theory, we can use the same approach we discussed for TLS cert hot-reloading (Support hot-reloading of TLS certificates #110082 (comment)).

@delvedor (Member):

The agent is configured to use the available sockets in a LIFO fashion, which means that if there is a peak of requests and many sockets get opened, most of those will be closed quickly afterwards, so having a higher maxSockets value is not a big issue. I agree with @mshustov that, from a UX point of view, this value should be configured in core. You can try with 512 and see how it goes.
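
For reference, a sketch of what such an agent configuration might look like (the numbers are illustrative, not a recommendation; scheduling is a standard Node http(s).Agent option):

    // Sketch: a keep-alive agent with a finite socket cap and LIFO scheduling,
    // so after a burst the most recently used sockets are reused and the surplus
    // idle ones are eventually closed.
    import * as https from 'https';

    const agent = new https.Agent({
      keepAlive: true,
      maxSockets: 512,      // the "try with 512" suggestion above
      maxFreeSockets: 256,
      scheduling: 'lifo',
    });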

@kobelb (Contributor, Author) commented Sep 29, 2021

That's great information, @delvedor. It's revealed another difference in the defaults between the modern and the legacy clients. The legacy Elasticsearch client defaults to keepAlive: true as well: https://github.com/elastic/elasticsearch-js-legacy/blob/f505170491887229a25ca5a015cf76e01161a94a/src/lib/connectors/http.js#L48

As a result, the legacy client will allocate as many sockets as needed when there's a burst of HTTP requests and reuse the sockets for subsequent requests. I'd prefer we stick with this behavior for Kibana's use of the modern Elasticsearch client as well. Elasticsearch supports a massive number of open sockets for incoming requests, and I don't think we should be setting maxSockets to a low number in Kibana and queuing the requests if Elasticsearch has the capacity to respond to them.

@kobelb (Contributor, Author) commented Sep 30, 2021

@mshustov if we default to using maxSockets: Infinity and keepAlive: true, are you opposed to us allowing users to configure these settings using their kibana.yml?

IMO, there's benefit in allowing users to do so. There are situations where it could benefit Kibana's performance by setting maxSockets to a lower number; however, this is largely dependent on how many Kibana nodes there are, how many Elasticsearch nodes there are, and the way that the servers are configured.

@mshustov (Contributor) commented Oct 1, 2021

Elasticsearch supports a massive number of open sockets for incoming requests, and I don't think we should be setting maxSockets to a low number in Kibana and queuing the requests if Elasticsearch has the capacity to respond to them.

I couldn't find a precise number in either the ES docs (https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html#tcp-settings) or the ES codebase. Do you know if there is any limit at all? Maybe it's an environment-specific configuration value?

if we default to using maxSockets: Infinity

I'm not opposed to it, considering that it was the default value in the legacy ES client.

if we default to using maxSockets: Infinity and keepAlive: true

The ES client falls back to keepAlive: true by default: https://github.com/elastic/elasticsearch-js/blob/3feda5d9f6633daca3c036d415c63388a57b809a/lib/Connection.js#L61
Note that the Kibana elasticsearch config doesn't accept the keepAlive flag:
https://github.com/elastic/kibana/blob/cc1598c1a6ac19d82ec47bf2bffce407b1b25dca/src/core/server/elasticsearch/elasticsearch_config.ts
This condition seems to be false by default:

    if (config.keepAlive) {
      clientOptions.agent = {
        keepAlive: config.keepAlive,
      };
    }

Kibana plugins can create a custom client instance with keepAlive: true, but none of them uses this option:

    elasticsearch.createClient('my-client', { keepAlive: true });
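
For what it's worth, one possible shape of the direction discussed in this thread would be to always build the agent options explicitly rather than only when config.keepAlive is set; this is just a sketch (not the actual #113644 change), and the maxSockets config field is hypothetical here.

    // Sketch: always configure the agent, so the client never falls back to its
    // maxSockets: 256 default. `maxSockets` as a Kibana config field is hypothetical.
    import type { AgentOptions } from 'http';

    function buildAgentOptions(config: { keepAlive?: boolean; maxSockets?: number }): AgentOptions {
      return {
        keepAlive: config.keepAlive ?? true,
        maxSockets: config.maxSockets ?? Infinity,
      };
    }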

are you opposed to us allowing users to configure these settings using their kibana.yml?

I'm not a big fan of increasing API surface until it's proven to be necessary.

There are situations where it could benefit Kibana's performance by setting maxSockets to a lower number;

Do you have examples in mind? I can think of higher memory pressure. If we see a real benefit in providing this config option, okay, let's add it. However, I don't recall us ever having any requests to support adjusting keepAlive via config, so I don't think we should support it.

@kobelb do you want the Core team to take care of the enhancement or do you have the capacity to work on the task yourself?

@kobelb (Contributor, Author) commented Oct 1, 2021

I couldn't find a precise number in either the ES docs (https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html#tcp-settings) or the ES codebase. Do you know if there is any limit at all? Maybe it's an environment-specific configuration value?

As far as I'm aware, it's environment-specific. Warning: I am not an expert on the following. For most network interfaces, there's a maximum limit of 65,535 sockets for each application, but to even get near this limit you first have to increase the maximum number of file descriptors that a process can use. I've also had to adjust my kernel settings to get near this limit; the Gatling docs cover this well: https://gatling.io/docs/gatling/reference/current/general/operations/

The ES client falls back to keepAlive: true by default: https://github.com/elastic/elasticsearch-js/blob/3feda5d9f6633daca3c036d415c63388a57b809a/lib/Connection.js#L61

@delvedor can you confirm this? Your prior comment made it sound like this wasn't the case.

Do you have examples in mind? I can think of higher memory pressure. If we see a real benefit in providing this config option, okay, let's add it. However, I don't recall us ever having any requests to support adjusting keepAlive via config, so I don't think we should support it

I suggested that we add a maxSockets setting, not a keepAlive setting. When Kibana's communication with Elasticsearch occurs over TLS, there's rather significant overhead in establishing the TLS socket. Therefore, if there's a stampeding herd of outbound requests to Elasticsearch, we can end up in a situation where we try to create so many new TLS sockets at the same time that we get TLS handshake timeouts. In these situations, it can be beneficial to set maxSockets to a lower number, so that we don't try to establish so many new TLS sockets at once and instead queue the HTTP requests in Kibana.

If we want to treat the addition of the maxSockets setting as an enhancement that should be added later, that's fine. However, it's a bug that maxSockets is currently hardcoded to 256, as this was done unintentionally and potentially has a significant impact on the number of concurrent users that a Kibana instance can support.

@kobelb do you want the Core team to take care of the enhancement or do you have the capacity to work on the task yourself?

I put up #113644 to set maxSockets to Infinity. I can open a separate enhancement issue to allow this to be configurable.

@delvedor (Member) commented Oct 1, 2021

The client uses keepAlive: true by default. My point is that if you set maxSockets to Infinity, there is the risk of a memory leak, as too many sockets get opened. If you configure a finite number, there is the risk that requests will wait a bit longer, but the memory pressure stays constant.

@rudolf (Contributor) commented Nov 23, 2021

After a user reported that their memory kept growing on 7.12, I did some investigation.

7.12 still had many uses of the legacy elasticsearch client, which uses keepAlive: true. With the legacy client, a loop that creates 10k ES API calls will create ~10k TCP connections. This introduces a huge event loop delay (probably because of the cost of the TLS handshakes?). However, the worst part is that, because of how malloc works, it assigns a huge number of "arenas" to Node.js and then cannot free them when the connections are closed. In my tests this led to > 2GB RSS while heap used was < 200MB, and the RSS never goes down. In our user's case there was a 12GB RSS with just a 2GB heap. (For a description of this problem see https://github.com/vmware/chap/blob/master/USERGUIDE.md#analyzing-memory-growth-due-to-free-allocations.)

Until 7.16 (#113644) the new platform elasticsearch client used keepAlive: false. So on 7.12 the new platform elasticsearch client had keepAlive: false, which in my testing restricted open sockets to 258. Here the event loop stayed responsive.

So our current configuration of keepAlive: true, maxSockets: Infinity is problematic because of high memory consumption during spikes and because the memory is never freed. It also seems like it could lead to high event loop delays, which possibly affect performance in worse ways than simply buffering requests until enough sockets are available.

Testing on 7.16 with the following snippet:

  // `client` is an instance of the Elasticsearch client.
  // Fire 10 consecutive bursts of 10,000 concurrent requests each.
  const startTime = process.hrtime();
  for (let i = 0; i < 10; i++) {
    const promises = new Array(10000).fill(null).map(() => client.indices.get({ index: '*' }));
    await Promise.allSettled(promises);
  }
  console.log('10*10k spike lasted: ', process.hrtime(startTime));

Resulted in the following:

keepAlive: true,  maxSockets: 1000     -> 219s,  event loop delay < 2s
keepAlive: false, maxSockets: 1000     -> 266s,  event loop delay < 2s
keepAlive: true,  maxSockets: Infinity -> 1004s, event loop delay with several spikes > 20s
keepAlive: false, maxSockets: Infinity -> 932s,  event loop delay with several spikes > 10s

This is an artificial scenario, so real-world conditions might be different, but it seems like a smaller maxSockets value generally improves performance in terms of both total completion time and event loop delay. So I think it's worth prioritizing defaulting to 1000 and making this configurable.

It would also be very helpful to have visibility into the number of Elasticsearch connections from Kibana in, e.g., the status API and the monitoring UI.
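
For anyone reproducing this, the event loop delay figures above could be captured with Node's built-in perf_hooks alongside the spike snippet; a minimal sketch:

    // Sketch: measure event loop delay while the request spike runs.
    import { monitorEventLoopDelay } from 'perf_hooks';

    const histogram = monitorEventLoopDelay({ resolution: 20 });
    histogram.enable();

    // ... run the 10 * 10k request spike from the snippet above ...

    histogram.disable();
    console.log('event loop delay max (ms):', histogram.max / 1e6);
    console.log('event loop delay p99 (ms):', histogram.percentile(99) / 1e6);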

@kobelb (Contributor, Author) commented Nov 23, 2021

This is a great discovery, @rudolf. It makes me think that we need to add a configurable maxSockets setting sooner rather than later so we can specify this when it makes sense.

Is increased memory usage the only indicator that we have that this is an issue?

@rudolf (Contributor) commented Nov 23, 2021

Is increased memory usage the only indicator that we have that this is an issue?

Yeah, it's probably causing spikes in event loop delay, but we don't have a way to confirm or trace this under real-world conditions.

@kobelb (Contributor, Author) commented Nov 23, 2021

@rudolf do we log anything when an outbound request is initiated, before the socket is created? If so, we should be able to see a large number of outbound requests, right?

@rudolf (Contributor) commented Nov 23, 2021

With elasticsearch.logQueries we only log when the request completes, but it can give us some idea of the volume. I didn't think about it previously, but we could also use cloud proxy logs: if we aggregate by cluster_id and Kibana IP into small buckets, like 5 minutes, we could get an idea of which Kibanas had the most completed requests in a short period. If we start with the top results, we could see if they have corresponding event loop delays and also get an idea of just how many outgoing requests could be created in such a burst.

@mshustov (Contributor) commented Nov 24, 2021

Until 7.16 (#113644) the new platform elasticsearch client used keepAlive: false.

The ES client falls back to keepAlive: true if not specified:
https://github.com/elastic/elasticsearch-js/blob/3feda5d9f6633daca3c036d415c63388a57b809a/lib/Connection.js#L61

#113644 made this logic explicit, but at the same time it changed maxSockets to Infinity.

This is an artificial scenario, so real-world conditions might be different, but it seems like a smaller maxSockets value generally improves performance in terms of both total completion time and event loop delay. So I think it's worth prioritizing defaulting to 1000 and making this configurable.

I'm not sure yet how to prove that 1000 is a good default; we need more real-world data to define a reasonable value.
In the meantime, maybe we can start by making maxSockets configurable: #119612

@kobelb (Contributor, Author) commented Nov 29, 2021

In the meantime, maybe we can start by making maxSockets configurable: #119612

👍

@rudolf (Contributor) commented Nov 29, 2021

By querying our logs for high concurrent connections, I came across several clusters affected by #96123.

My current assumption is that this bug is the biggest reason for the high concurrent connection counts and associated memory growth we've seen. I'm waiting to hear back from the users reporting the high memory consumption to confirm that the workaround solves their problems.

@kobelb (Contributor, Author) commented Mar 9, 2022

This change came up again when we were looking into some SDHs with net/http: TLS handshake timeout errors.

Until 7.16 (#113644) the new platform elasticsearch client used keepAlive: false. So on 7.12 the new platform elasticsearch client had keepAlive: false, which in my testing restricted open sockets to 258. Here the event loop stayed responsive.

I just double-checked the options that are passed when the legacy and new Elasticsearch clients create an instance of http(s).Agent, and I see the following in 7.15 and 7.12:

legacy

    keepAlive: true,
    keepAliveMsecs: 1000,
    maxSockets: Infinity,
    maxFreeSockets: 256,
    freeSocketKeepAliveTimeout: 60000,

new

    keepAlive: true,
    keepAliveMsecs: 1000,
    maxSockets: 256,
    maxFreeSockets: 256,
    scheduling: 'lifo'

@rudolf what am I missing? Where were you seeing keepAlive being false?

@rudolf (Contributor) commented May 20, 2022

I think I might have made a mistake in my notes. In my artificial testing scenario the keepAlive value has a very small effect. The real problem is that maxSockets: Infinity causes long event loop delays and memory spikes.

@pmuellr (Member) commented May 23, 2022

So, do we believe setting this value to something non-infinite will make most/all of the socket hang-up issues we see in deployments go away? What value would be good to use here? 256?

@rudolf (Contributor) commented May 24, 2022

Socket hang-ups are most likely caused by large event loop delays. A spike in requests/new sockets from the Elasticsearch client is one cause of such large event loop delays, but it's hard to say how many real-world socket timeouts are due to this particular event loop hog.

Either way, Infinity isn't helping us, and we should choose something lower as the default.

It would be helpful to create a benchmark to try to tune this variable, but the best value will depend very much on the exact load on Kibana. If each outgoing request has a large CPU overhead (such as receiving a large response payload and processing it in a tight loop), then a lower maxSockets value will protect Kibana from event loop spikes and socket timeouts. But if each outgoing request has a low CPU overhead, then throughput could be increased by raising maxSockets. (This uncertainty is probably the main reason we haven't done anything to fix this yet.)

It might be useful to add logging for when we're reaching the limit of the http agent socket pool and requests start being queued. That way we at least have a signal that the socket pool could be limiting performance, and if event loop delays are low we could recommend that a customer increase this limit. This would also give us some real-world data to help tune it. If we can get the logging in place, I would feel much more confident shipping a lower value like 1000.
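
A sketch of the kind of signal described here, assuming access to the http(s).Agent handed to the ES client (the interval and cap are arbitrary; in Kibana this would go through the platform logger rather than console):

    // Sketch: periodically sample the agent's pending-request queue and warn when
    // requests are waiting for a free socket.
    import * as https from 'https';

    const agent = new https.Agent({ keepAlive: true, maxSockets: 1000 });

    setInterval(() => {
      const queued = Object.values(agent.requests).reduce((n, reqs) => n + (reqs?.length ?? 0), 0);
      const open = Object.values(agent.sockets).reduce((n, socks) => n + (socks?.length ?? 0), 0);
      if (queued > 0) {
        console.warn(`http agent pool saturated: ${open} open sockets, ${queued} queued requests`);
      }
    }, 10_000).unref();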

@pmuellr (Member) commented Jul 8, 2022

I had a very busy deployment running that was generating ~10 socket hang up (SHU) related errors per hour, so I decided to try turning down maxSockets to see if it would help.

It doesn't appear to have helped. I'm getting about the same number of SHUs per hour so far, and the performance of the rules has degraded: execution duration has gone up.

The SHUs in this case may be due to how busy I made this deployment: I have 100 rules running at a 1s interval, with task manager set to 50 workers (default: 10) and a 1s polling interval (default: 3s). There's just too much running concurrently in this deployment. It may be too artificial...
