[RFC] Admission Controller framework for OpenSearch #8910
Comments
Tagging @jainankitk @reta @sohami @dblock
💯 I see the usefulness of such a feature, thanks @bharath-techie; at a high level it looks great.
This looks more like a reactive than a proactive approach. Could we collect metrics from the nodes periodically and base admission decisions on that (something like adaptive load balancing [1], for example)? The issue I have seen in the past is that a single query could drive all affected nodes into endless GC cycles, but it seems we won't be able to detect that until the responses from the nodes are received. [1] https://doc.akka.io/docs/akka/2.8.2/cluster-metrics.html#adaptive-load-balancing
So we have a new type of node, the extension nodes (more generally, extensions as separate processes), and these could significantly impact search or indexing if they are part of the relevant processing pipeline. It might be too much to ask extension developers to provide resource utilization (can we bake it into the SDK?), but I believe the extension nodes should be factored into the admission controller in some way (track latency?). What do you think?
Thanks @bharath-techie for the proposal. This should help increase the overall resiliency of the OpenSearch cluster.
Yeah, agreed on it being a bit reactive. We evaluated making separate periodic calls to downstream nodes as well, but we thought we could live with letting the first search query pass through the AC framework when we don't have data, and then rely on the existing search backpressure to cancel such resource-intensive queries on the node. Let me know your thoughts on this.
Yeah, since we are planning to extend ResponseCollectorService, we can see how to utilize/extend existing data such as 'ResponseTime' and 'ServiceTime' in such cases.
Some of this is covered in the indexing AC doc - please refer to #8911. cc: @ajaymovva
We are planning to add support based on TransportActionNames (so we can use a prefix for any search/index actions if needed) in data nodes, or any RestEndpoints in the case of the coordinator.
We have a half-open state throughout - the AC framework will allow requests through periodically (or based on request count) to check whether target nodes are still under stress - which should solve this to some extent.
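For illustration, here is a minimal sketch of how such a half-open state could work, assuming a probe is admitted either every N rejected requests or once per probe interval (class and field names are hypothetical, not the actual plugin code):

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Illustrative half-open admission state: once a target node is marked as
 * stressed, most requests to it are rejected, but every Nth request (or one
 * per probe interval) is let through to re-check whether the node has
 * recovered. All names and thresholds here are hypothetical.
 */
class HalfOpenAdmissionState {
    private final long probeEveryNRequests;
    private final long probeIntervalMillis;
    private final AtomicLong rejectedSinceLastProbe = new AtomicLong();
    private volatile long lastProbeTimeMillis = 0L;

    HalfOpenAdmissionState(long probeEveryNRequests, long probeIntervalMillis) {
        this.probeEveryNRequests = probeEveryNRequests;
        this.probeIntervalMillis = probeIntervalMillis;
    }

    /** Returns true if this request should be admitted as a probe instead of rejected. */
    boolean shouldProbe() {
        long now = System.currentTimeMillis();
        if (now - lastProbeTimeMillis >= probeIntervalMillis
                || rejectedSinceLastProbe.incrementAndGet() >= probeEveryNRequests) {
            lastProbeTimeMillis = now;
            rejectedSinceLastProbe.set(0);
            return true;
        }
        return false;
    }
}
```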
Thanks @bharath-techie for the proposal. This looks great.
I am thinking that if each node takes a local decision based on its own resource view, that should be good enough for the admission control decision at the node level. Building a view of the downstream nodes' resource utilization may not be needed for the admission control mechanism. Even if we have that view, it may be out of sync with the downstream node's current utilization and probably not accurate enough for AC decisions. Note: I am assuming that admission control will come in after the routing decision is made, to either fail or admit the request. However, such a view can be useful for routing decisions and will be relevant to certain request types. For example, in the case of search it can help with ranking the shard replicas, whereas for others like indexing it doesn't necessarily matter since we need to route the request to all the shards. Also, with the SegRep model it may not be needed for indexing requests.
@sohami I don't think that the distributed framework can be used to build the node view, since this is also reactive and never collects the state back from the downstream node. (Though we are discussing in proposal #8918 whether traces can be communicated back to the coordinator node.)
Thanks for the comments @sohami. Yeah, the overhead of polling (all-node-to-all-node communication) is one of the reasons we didn't go for periodic API checks; it doesn't scale well with cluster size. The trade-off is that rejections at the coordinator are on a best-effort basis. We can look into the hybrid approach incrementally and make the decision based on whether we see a lack of accuracy during benchmarks.
Yes, the 'Other approaches' section has the information on that.
Yes, for indexing we need to route the request to all the shards, but this approach will help proactively reject the request at the coordinator before it lands on the target node. For every indexing request, we will evaluate the stats of the nodes (primary and replicas) where the request will land, based on the coordinator's cluster state, and take action at the coordinator.
For plain SegRep, we still write to replica translogs, so this approach will be useful.
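As a rough sketch of this coordinator-side check (all types, thresholds and method names below are hypothetical illustrations, not the proposed implementation), the coordinator could compare the latest known stats of every target node against configured limits before admitting the indexing request:

```java
import java.util.List;
import java.util.Map;

/**
 * Illustrative coordinator-side check for an indexing request: look up the
 * latest known resource utilization of each node (primary and replicas) the
 * request would be routed to, and reject if any of them is above a threshold.
 * NodeResourceStats and the thresholds are hypothetical placeholders.
 */
class IndexingAdmissionCheck {
    static class NodeResourceStats {
        final double cpuPercent;
        final double jvmHeapPercent;

        NodeResourceStats(double cpuPercent, double jvmHeapPercent) {
            this.cpuPercent = cpuPercent;
            this.jvmHeapPercent = jvmHeapPercent;
        }
    }

    private final double cpuThreshold = 90.0;   // assumed example threshold
    private final double heapThreshold = 85.0;  // assumed example threshold

    boolean admit(List<String> targetNodeIds, Map<String, NodeResourceStats> latestStats) {
        for (String nodeId : targetNodeIds) {
            NodeResourceStats stats = latestStats.get(nodeId);
            if (stats == null) {
                continue; // no data yet: admit and rely on node-level backpressure
            }
            if (stats.cpuPercent > cpuThreshold || stats.jvmHeapPercent > heapThreshold) {
                return false; // reject at the coordinator before fan-out to primary/replicas
            }
        }
        return true;
    }
}
```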
Background
Currently, in OpenSearch, we have various rejection mechanisms for in-flight and new requests when the cluster is overloaded.
We have backpressure mechanisms that react to node duress conditions, and circuit breakers in data nodes that reject requests based on real-time memory usage. We also have queue-size-based rejections. But there are gaps in each of these solutions.
We don't have a state-based admission control framework capable of rejecting incoming requests before we start the execution chain.
Existing throttling solutions
1. Search backpressure
Search backpressure in OpenSearch currently cancels resource-intensive tasks when a node is in duress. The coordinator node cancels the search tasks, and the data nodes cancel the shard tasks.
Challenges
1.1 There is no server-side rejection mechanism for incoming requests when a target node is in duress. Refer to 'Server side rejection of in-coming search requests based on resource consumption' #1180.
1.2 The existing routing methods don't intelligently route away from stressed nodes based on their resource utilization. Refer to the similar issue 'Adaptive replica Selection - Factor in node resource utilisation' #1183.
2. Indexing backpressure
OpenSearch has node-level and shard-level indexing backpressure that dynamically rejects indexing requests when the cluster is under strain.
Challenges
3. Circuit breaker
We have circuit breakers in data nodes that reject requests based on real-time memory usage. This is the last line of defence, and it prevents nodes from going down further.
Challenges
4. Queue sized based rejection
In OpenSearch, we currently have different queues for different operations such as search and indexing, and we reject new requests if the respective queue is full.
Challenges
Proposal
We propose to implement an admission control framework for OpenSearch that rejects incoming requests based on the resource utilization stats of the nodes. This will allow real-time, state-based admission control on the nodes.
We will build a new admission control core plugin that can intercept and reject requests at the REST layer and the transport layer. We will extend the 'ResponseCollectorService' to maintain the performance utilization of downstream nodes on the coordinator node.
Goals and benefits
High level design
Admission control plugin
We can add a new admission control core OpenSearch module/plugin that extends 'NetworkPlugin' and intercepts requests at the REST layer and the transport layer to perform rejections.
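A rough sketch of what such a plugin could look like, assuming OpenSearch's NetworkPlugin/TransportInterceptor extension points (package locations and exact signatures vary across versions, and AdmissionControlService is a hypothetical placeholder for the decision logic):

```java
import java.util.Collections;
import java.util.List;

import org.opensearch.common.io.stream.NamedWriteableRegistry;
import org.opensearch.common.util.concurrent.ThreadContext;
import org.opensearch.plugins.NetworkPlugin;
import org.opensearch.plugins.Plugin;
import org.opensearch.transport.TransportInterceptor;
import org.opensearch.transport.TransportRequest;
import org.opensearch.transport.TransportRequestHandler;

/**
 * Sketch of a plugin that registers a TransportInterceptor so an admission
 * decision can run before the actual transport handler executes. The
 * decision component (AdmissionControlService) is a hypothetical placeholder.
 */
public class AdmissionControlPlugin extends Plugin implements NetworkPlugin {

    private final AdmissionControlService admissionControlService = new AdmissionControlService();

    @Override
    public List<TransportInterceptor> getTransportInterceptors(
            NamedWriteableRegistry namedWriteableRegistry, ThreadContext threadContext) {
        return Collections.singletonList(new TransportInterceptor() {
            @Override
            public <T extends TransportRequest> TransportRequestHandler<T> interceptHandler(
                    String action, String executor, boolean forceExecution,
                    TransportRequestHandler<T> actualHandler) {
                // Wrap every handler; reject before delegating if the node is under stress.
                return (request, channel, task) -> {
                    admissionControlService.checkAdmission(action); // may throw a rejection exception
                    actualHandler.messageReceived(request, channel, task);
                };
            }
        });
    }

    /** Hypothetical decision component, shown only to make the sketch self-contained. */
    static class AdmissionControlService {
        void checkAdmission(String action) {
            // e.g. compare local/downstream resource utilization against thresholds
        }
    }
}
```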
Admission control service
Building resource utilization view
Response collector service
We'll extend the existing 'ResponseCollectorService' to collect performance statistics such as CPU, JVM and IO of the downstream nodes on the coordinator node.
We will also record node unresponsiveness/timeouts when a request fails, and these will be treated with greater severity.
The coordinator can use this service at any time to get the resource utilization of the downstream nodes.
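Purely as an illustration of the per-node view the coordinator could maintain (the names below are hypothetical and not the actual ResponseCollectorService API), each node's latest utilization could be stored with a timestamp so stale entries are ignored:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative per-node performance view kept on the coordinator. Each data
 * node's latest CPU / JVM / IO utilization (e.g. piggybacked on responses) is
 * stored with a timestamp so stale entries can be ignored. All names here are
 * hypothetical; the proposal extends the existing ResponseCollectorService.
 */
class DownstreamNodeStatsCollector {
    static class NodePerfStats {
        final double cpuPercent;
        final double jvmHeapPercent;
        final double ioUtilPercent;
        final long collectedAtMillis;

        NodePerfStats(double cpuPercent, double jvmHeapPercent, double ioUtilPercent, long collectedAtMillis) {
            this.cpuPercent = cpuPercent;
            this.jvmHeapPercent = jvmHeapPercent;
            this.ioUtilPercent = ioUtilPercent;
            this.collectedAtMillis = collectedAtMillis;
        }
    }

    private final Map<String, NodePerfStats> statsByNodeId = new ConcurrentHashMap<>();
    private final long maxAgeMillis;

    DownstreamNodeStatsCollector(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Called when a response from a downstream node carries fresh stats. */
    void onResponse(String nodeId, NodePerfStats stats) {
        statsByNodeId.put(nodeId, stats);
    }

    /** Returns the latest stats for a node only if they are recent enough. */
    Optional<NodePerfStats> latestFor(String nodeId) {
        NodePerfStats stats = statsByNodeId.get(nodeId);
        if (stats == null || System.currentTimeMillis() - stats.collectedAtMillis > maxAgeMillis) {
            return Optional.empty();
        }
        return Optional.of(stats);
    }
}
```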
Local node resource monitoring
We will reuse the node stats monitors such as process, jvm and fs, which already monitor node resources at a 1-second interval.
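These monitors already expose the required numbers through node stats; the standalone snippet below only illustrates the equivalent signals available from the JDK's standard MXBeans and is not part of the proposed change:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.OperatingSystemMXBean;

/**
 * Illustration of the local signals an admission decision can consult.
 * OpenSearch samples comparable data through its process/jvm/fs monitors at a
 * 1-second interval; this snippet only shows the JDK views of the same data.
 */
class LocalNodeSignals {
    static double heapUsedPercent() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        long used = memory.getHeapMemoryUsage().getUsed();
        long max = memory.getHeapMemoryUsage().getMax();
        return max > 0 ? (100.0 * used) / max : 0.0;
    }

    static double systemLoadAverage() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        return os.getSystemLoadAverage(); // -1 if not available on this platform
    }

    public static void main(String[] args) {
        System.out.printf("heap used: %.1f%%, load average: %.2f%n",
                heapUsedPercent(), systemLoadAverage());
    }
}
```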
Track the resource utilization of the downstream nodes
We will enhance the search and indexing flows to get the downstream node performance stats.
Approach 1 - Use the thread context to get the required stats from downstream nodes
Pros
This approach has no regression or backward-compatibility risks, as we don't alter any schema.
Risks
We need to check whether there are any security implications of carrying perf stats as part of the thread context.
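A minimal sketch of Approach 1, assuming the downstream node piggybacks its stats as a thread-context response header that the coordinator reads after the response arrives (the header name, encoding and package locations are assumptions):

```java
import java.util.List;
import java.util.Map;

import org.opensearch.common.settings.Settings;
import org.opensearch.common.util.concurrent.ThreadContext;

/**
 * Sketch of Approach 1: the downstream node attaches its current resource
 * stats as a response header on the thread context, and the coordinator reads
 * the header when the response arrives. The header key and the encoding are
 * hypothetical; package locations may differ across OpenSearch versions.
 */
class PerfStatsOverThreadContext {
    static final String PERF_STATS_HEADER = "x_perf_stats"; // hypothetical header key

    // On the data node, just before sending the response back.
    static void attachStats(ThreadContext threadContext, double cpuPercent, double heapPercent) {
        threadContext.addResponseHeader(PERF_STATS_HEADER, cpuPercent + "," + heapPercent);
    }

    // On the coordinator, after the response has been received.
    static double[] readStats(ThreadContext threadContext) {
        Map<String, List<String>> headers = threadContext.getResponseHeaders();
        List<String> values = headers.get(PERF_STATS_HEADER);
        if (values == null || values.isEmpty()) {
            return null; // no stats piggybacked on this response
        }
        String[] parts = values.get(0).split(",");
        return new double[] { Double.parseDouble(parts[0]), Double.parseDouble(parts[1]) };
    }

    public static void main(String[] args) {
        ThreadContext threadContext = new ThreadContext(Settings.EMPTY);
        attachStats(threadContext, 72.5, 64.0);
        double[] stats = readStats(threadContext);
        System.out.println("cpu=" + stats[0] + " heap=" + stats[1]);
    }
}
```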
Approach 2 - Schema change
Search flow
Indexing flow
Risks
Other approaches considered
We can enhance the follower check/leader check APIs to propagate the performance stats of the nodes to all other nodes.
Cons
This builds a dependency on the cluster manager and might impact the cluster manager node's performance.
These health checks are very critical, so any regression would be quite problematic.
Search flow enhancements - #8913
Indexing flow enhancements - #8911
Co-authored by @ajaymovva