Improve performance of Cat Nodes API #99744

NEUpanning · 2023-09-21T10:32:13Z

Description

We found that executing the Cat Nodes API (regardless of query parameters) on the coordinating node of a large cluster can consume a huge amount of CPU. This could have a significant impact on cluster stability. I reproduced the problem in a cluster with 200 data nodes and 140k shards.
When I checked with the 'top' command, CPU usage fluctuated between 726% and 1173% for 3 seconds.
Most of the CPU usage comes from ShardStats.<init> and DiscoveryNode.writeTo. ShardStats.<init> is called when the coordinating node deserializes the responses sent by other nodes. DiscoveryNode.writeTo is called when the coordinating node serializes the requests that will be sent to other nodes.
[Flame graph image]

Some preliminary ideas for solving this issue:

1. The coordinating node could construct NodesStatsRequest#indices based on the query parameters, to filter indices stats, rather than calling NodesStatsRequest.indices(true), which includes all indices stats. For instance, if users call _cat/nodes?h=m, the coordinating node should not fetch indices stats from other nodes. This would avoid a lot of unnecessary deserialization of the response content in ShardStats.<init>.
2. Set NodeInfoRequest#concreteNodes to null after using NodeInfoRequest#concreteNodes to build the iterator that is used to send requests. This would avoid unnecessary serialization of the request content in DiscoveryNode.writeTo.
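
The first idea could look roughly like the sketch below. This is only an illustration under stated assumptions, not Elasticsearch code: the class CatNodesColumnFilter, the method needsIndicesStats, and the small set of column names are hypothetical, and the set is deliberately incomplete. It also ignores column aliases and wildcards, and assumes the requested columns are available as the raw value of the h parameter.

```java
import java.util.Arrays;
import java.util.Set;

final class CatNodesColumnFilter {

    // Hypothetical, deliberately incomplete list of columns assumed to be backed by indices stats.
    private static final Set<String> INDICES_STATS_COLUMNS =
        Set.of("search.query_total", "indexing.index_total", "segments.count");

    /**
     * Returns true if any requested column is assumed to need indices stats.
     * With no h parameter the default columns are unknown here, so the answer is conservatively true.
     */
    static boolean needsIndicesStats(String hParam) {
        if (hParam == null || hParam.isBlank()) {
            return true;
        }
        return Arrays.stream(hParam.split(","))
            .map(String::trim)
            .anyMatch(INDICES_STATS_COLUMNS::contains);
    }
}
```

With this sketch, needsIndicesStats("m") is false, so a request such as _cat/nodes?h=m would not need to ask other nodes for indices stats. The hard-coded column set has to be maintained by hand as columns change.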


elasticsearchmachine · 2023-09-21T23:15:49Z

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine · 2023-09-22T04:46:11Z

Pinging @elastic/es-data-management (Team:Data Management)

DaveCTurner · 2023-09-22T05:02:39Z

The _cat APIs are intended to be convenient wrappers around the more flexible JSON APIs, intended for occasional use by humans. I'm a little impressed that it only takes 3s to respond in such a huge cluster. If you need more control over the execution, use the JSON APIs directly.

> For instance, if users call _cat/nodes?h=m, the coordinating node should not fetch indices stats from other nodes. This would avoid a lot of unnecessary deserialization of the response content in ShardStats.<init>.

This seems like a nice idea but it would be fragile: we'd need to keep track of the columns that included stats to know whether or not the stats request was needed, and make sure to keep that list up to date as the columns change over time. Also the _cat APIs generally don't have a way to get a list of the requested columns when processing the request. So this idea seems possible but a little tricky and requiring quite some effort.

> Set NodeInfoRequest#concreteNodes to null

IMO this is a valid point, although I would not want to solve it as described. Today every node-level TransportNodesInfoAction.NodeInfoRequest carries the entire top-level NodesInfoRequest, but it only needs to carry the metrics. We should trim these things down for sure.
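
To make the overhead concrete, here is a simplified model of the two request shapes. It is a sketch only: NodeRequestSizeSketch and its methods are hypothetical, node identities are reduced to plain strings, and java.io.DataOutputStream stands in for the real wire format, so the byte counts are illustrative rather than what Elasticsearch actually sends.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Collection;
import java.util.List;

final class NodeRequestSizeSketch {

    /** "Before": the node-level request wraps the whole top-level request, node list included. */
    static int wrappedSize(Collection<String> metrics, Collection<String> concreteNodes) {
        return serializedSize(List.of(metrics, concreteNodes));
    }

    /** "After": the node-level request carries only the requested metrics. */
    static int trimmedSize(Collection<String> metrics) {
        return serializedSize(List.of(metrics));
    }

    // Writes each field as a length-prefixed list of strings and returns the byte count.
    private static int serializedSize(List<Collection<String>> fields) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (Collection<String> field : fields) {
                out.writeInt(field.size());
                for (String value : field) {
                    out.writeUTF(value);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }
}
```

In this model the wrapped form grows with the number of nodes, and since one such request is serialized per target node, the total work on the coordinating node grows roughly with the square of the cluster size. The trimmed form stays constant per request.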

DaveCTurner · 2023-09-22T05:10:51Z

Just to add: another idea would be to indicate in the stats request that we don't care about stats for individual shards, we are only going to use a summary. That'd save a bunch of effort and network traffic with the nodes stats and indices stats APIs too.
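
A minimal sketch of that idea, under loud assumptions: NodeStatsSketch, ShardDocs and NodeIndicesStats are hypothetical types, the stats are reduced to a single document count, and the flag is called includeShardsStats only to match the name used in the pull request title later in this thread. It is not the actual implementation.

```java
import java.util.List;

final class NodeStatsSketch {

    /** Hypothetical per-shard stats entry, reduced to a document count. */
    static final class ShardDocs {
        final String shardId;
        final long docCount;

        ShardDocs(String shardId, long docCount) {
            this.shardId = shardId;
            this.docCount = docCount;
        }
    }

    /** Hypothetical node-level response: always a summary, optionally the per-shard entries. */
    static final class NodeIndicesStats {
        final long totalDocs;
        final List<ShardDocs> shards; // empty when only a summary was requested

        NodeIndicesStats(long totalDocs, List<ShardDocs> shards) {
            this.totalDocs = totalDocs;
            this.shards = shards;
        }
    }

    /** Runs on the data node: aggregate locally, and attach per-shard entries only on request. */
    static NodeIndicesStats build(List<ShardDocs> localShards, boolean includeShardsStats) {
        long total = 0;
        for (ShardDocs shard : localShards) {
            total += shard.docCount;
        }
        return new NodeIndicesStats(total, includeShardsStats ? List.copyOf(localShards) : List.of());
    }
}
```

The point of the design is that the aggregation happens on each data node, so when the flag is false the coordinating node never has to deserialize per-shard entries at all.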

NEUpanning · 2023-09-22T11:04:42Z

@DaveCTurner Thanks for the reply.

> If you need more control over the execution, use the JSON APIs directly.

In most scenarios we will use the JSON APIs instead of _cat/nodes.

> Today every node-level TransportNodesInfoAction.NodeInfoRequest carries the entire top-level NodesInfoRequest, but it only needs to carry the metrics. We should trim these things down for sure.

This is a great idea, more elegant than what I had in mind, and it avoids unnecessary serialization of the request content in DiscoveryNode.writeTo.

> another idea would be to indicate in the stats request that we don't care about stats for individual shards, we are only going to use a summary.

This is also a great idea that would trim useless shard stats from responses and avoid unnecessary deserialization in ShardStats.<init>.

I think implementing these ideas could solve this issue.

NEUpanning · 2023-09-22T11:36:55Z

> That'd save a bunch of effort and network traffic with the nodes stats and indices stats APIs too.

I don't quite understand how the indices stats APIs can be optimized. I see that the coordinating node needs shard stats to build indices stats.

NEUpanning · 2023-09-22T11:42:16Z

> I think implementing these ideas could solve this issue.

I would like to do it. Do you think I could give it a try?

DaveCTurner · 2023-09-22T11:47:07Z

> I see that the coordinating node needs shard stats to build indices stats.

I think we can reduce the work here if the user specifies ?level=cluster, but I'm not sure it's worth the effort.

> I would like to do it. Do you think I could give it a try?

Sure, go for it. I recommend you don't do it all in one PR tho, try and separate the independent changes out to make them easier to review.

NEUpanning · 2023-09-27T07:45:47Z

> Today every node-level TransportNodesInfoAction.NodeInfoRequest carries the entire top-level NodesInfoRequest, but it only needs to carry the metrics.

@DaveCTurner I've opened a pull request (#99938) for this idea. Could you please have a look when you have some time? Thanks.

…equest (#99938) There's no need to include the whole top-level `NodesInfoRequest` in the requests for info from individual nodes, and this can add substantial overhead if there are lots of nodes in the cluster. With this commit we drop the wrapper in favour of just the parts of the top-level request needed for the node-level processing. Relates #99744


NEUpanning · 2023-10-07T07:37:15Z

> another idea would be to indicate in the stats request that we don't care about stats for individual shards, we are only going to use a summary.

I've opened a pull request for this idea in #100466.

DaveCTurner · 2023-10-07T07:57:35Z

Thanks @NEUpanning, I'll take a look next week. You might also be interested in #90631 which is kind of the same thing but for the GET _cluster/health API.

NEUpanning · 2023-10-07T08:25:19Z

I would like to resolve that issue too. After this PR is merged, I will try it using a similar approach.

… only fetch a summary (#100466) relates #99744

NEUpanning · 2023-10-13T08:01:07Z

With the ideas mentioned above implemented, the CPU usage and the time taken to fetch nodes stats (without shard-level stats) dropped to 1/1000th of their original levels in the cluster with 200 data nodes and 140k shards. So I am closing this issue as completed.

DaveCTurner · 2023-10-13T09:05:28Z

Nice work @NEUpanning, thanks for the report, the fixes, and for confirming that the problems are fixed.

NEUpanning · 2023-10-13T09:24:52Z

Thanks again David. Thanks for your help and patience in code review.

NEUpanning added >enhancement needs:triage Requires assignment of a team area label labels Sep 21, 2023

arteam added the :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. label Sep 21, 2023

elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Sep 21, 2023

elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Sep 21, 2023

elasticsearchmachine added the Team:Data Management Meta label for data/management team label Sep 22, 2023

NEUpanning mentioned this issue Sep 27, 2023

Prune unnecessary information from TransportNodesInfoAction.NodeInfoRequest #99938

Merged

NEUpanning mentioned this issue Oct 7, 2023

Introduce includeShardsStats in the stats request to indicate that we only fetch a summary #100466

Merged

elasticsearchmachine pushed a commit that referenced this issue Oct 13, 2023

Introduce includeShardsStats in the stats request to indicate that we…

9003cbe

… only fetch a summary (#100466) relates #99744

NEUpanning closed this as completed Oct 13, 2023

This was referenced Oct 14, 2023

Create new feature interface for querying features on a node/cluster #100330

Closed

Do not send full top-level TransportNodesAction request to individual nodes #100878

Closed
