Search flow enhancements for admission controller #8913
Labels
enhancement
Enhancement or improvement to existing feature or request
Search
Search query, autocomplete ...etc
Overview
The parent RFC #8910 describes an admission control framework that limits and restricts new incoming requests early, as soon as a node starts to come under stress.
As part of that work, we will enhance the search flow to intelligently route requests away from stressed nodes.
We also need to enhance the coordinator node logic (TransportSearchAction) to mark shard requests for rejection when the primary and all replicas of an index shard are on nodes under stress.
Routing enhancements
Enhance ARS (Adaptive Replica Selection)
We will adjust the ARS ranking algorithm based on the resource utilization of target nodes, which will help proactively route requests away from highly stressed nodes.
Ranking algorithm changes
We will increase the rank by a factor for each resource utilization threshold breached, so nodes with high resource utilization end up with higher ranks. Since ARS prefers the copy with the lowest rank, those nodes are selected less often.
The rank calculated by ARS is increased by a multiplier:
Rank = current rank * performance-based multiplier
Examples:
When I/O utilization is greater than 95%:
Rank = rank * 1.7 (rank is increased by 70%)
When both CPU and I/O are beyond their thresholds:
Rank = rank * 1.7 * 1.7 (rank is increased by 70% once for each resource utilization threshold breached)
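The multiplier rule above can be sketched roughly as follows. This is only an illustration: `StressAwareRank`, the resource keys, and the shape of the inputs are hypothetical and not part of the existing ARS code. The issue only states a 1.7 multiplier and a 95% I/O threshold, so any other threshold passed in is an assumption.

```java
import java.util.Map;

/** Hypothetical helper: scales an ARS rank once per breached resource threshold. */
final class StressAwareRank {
    // Multiplier taken from the examples in this issue; a real implementation
    // would presumably make it configurable.
    static final double MULTIPLIER = 1.7;

    private StressAwareRank() {}

    /**
     * @param baseRank    rank already computed by ARS for the node
     * @param utilization percent utilization per resource, e.g. "io" -> 97.0
     * @param thresholds  percent threshold per resource, e.g. "io" -> 95.0
     * @return the rank multiplied by MULTIPLIER once for every breached threshold
     */
    static double adjust(double baseRank,
                         Map<String, Double> utilization,
                         Map<String, Double> thresholds) {
        double rank = baseRank;
        for (Map.Entry<String, Double> threshold : thresholds.entrySet()) {
            Double used = utilization.get(threshold.getKey());
            // Resources with no reported utilization are treated as not breached.
            if (used != null && used > threshold.getValue()) {
                rank *= MULTIPLIER;
            }
        }
        return rank;
    }
}
```

With both thresholds set to 95, a node reporting CPU at 50% and I/O at 97% has its rank multiplied by 1.7 once, and a node above 95% on both has it multiplied by 1.7 twice, matching the two examples.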
Stats adjustment post ranking
We also need to adjust the stats of the bad nodes after ranking; otherwise, we will end up herding all requests to the good nodes.
So, for each new request that is not routed to a bad node:
Resource utilization stat of bad node = resource utilization stat * reduction factor
The reduction factor can be configured based on how quickly we want the stats of bad nodes to normalize.
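To get a feel for how the reduction factor controls recovery time, here is a minimal sketch. `StressStatDecay` and its method names are hypothetical, and the sketch assumes the reduction factor is a value strictly between 0 and 1, which the issue does not state explicitly.

```java
/** Hypothetical helper illustrating the post-ranking stat adjustment. */
final class StressStatDecay {
    private StressStatDecay() {}

    /** Applies one adjustment: stat * reductionFactor. */
    static double decay(double stat, double reductionFactor) {
        // Assumption: the factor shrinks the stat, so it must lie in (0, 1).
        if (reductionFactor <= 0.0 || reductionFactor >= 1.0) {
            throw new IllegalArgumentException("reductionFactor must be in (0, 1)");
        }
        return stat * reductionFactor;
    }

    /**
     * Counts how many requests routed elsewhere it takes for the stored stat
     * to drop to or below the threshold.
     */
    static int requestsUntilBelow(double stat, double threshold, double reductionFactor) {
        if (threshold <= 0.0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        int requests = 0;
        while (stat > threshold) {
            stat = decay(stat, reductionFactor);
            requests++;
        }
        return requests;
    }
}
```

For instance, with a stored I/O stat of 98, a threshold of 95, and a reduction factor of 0.99, the stat falls below the threshold after four skipped requests. A smaller factor normalizes the node sooner, at the cost of sending traffic back to a node that may still be stressed.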
Weighted round robin routing
For weight-based routing, we can use weighted ARS instead of weighted round robin, since ARS will already include the enhancements described above and will provide fairness to each new request.
Rejection of search requests
Rejection in coordinator node
When the primary and all replicas targeted by a search shard request are on stressed nodes, we can fail the shard request fast on the coordinator.
Approach
We build the set of ‘SearchShardIterator’ instances as part of ‘executeSearch’ in ‘TransportSearchAction’, and the search request is then executed against them.
Similar to the ‘skip’ option in ‘SearchShardIterator’, we can provide a new option, ‘failFast’. Based on it, the coordinator can fail the shard fast, mark the shard as failed in the search response, and skip sending the request to the actual data nodes.
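The coordinator-side check could look roughly like the sketch below. It deliberately uses stand-in types instead of the real ‘SearchShardIterator’: `ShardGroup`, `FailFastMarker`, and the way the set of stressed node IDs is obtained are all hypothetical, and only the ‘failFast’ flag name comes from this proposal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Stand-in for one shard's routing group; NOT the real SearchShardIterator. */
final class ShardGroup {
    final String shardId;
    final List<String> copyNodeIds; // nodes holding the primary and replicas
    boolean failFast;               // proposed flag, analogous to 'skip'

    ShardGroup(String shardId, List<String> copyNodeIds) {
        this.shardId = shardId;
        this.copyNodeIds = copyNodeIds;
    }
}

/** Hypothetical coordinator-side helper. */
final class FailFastMarker {
    private FailFastMarker() {}

    /**
     * Marks a group as failFast when every copy of the shard lives on a
     * stressed node, and returns the groups that should still be dispatched.
     */
    static List<ShardGroup> markAndFilter(List<ShardGroup> groups, Set<String> stressedNodeIds) {
        List<ShardGroup> toDispatch = new ArrayList<>();
        for (ShardGroup group : groups) {
            // A group with no known copies is left alone; only a group whose
            // copies are all stressed is failed fast.
            group.failFast = !group.copyNodeIds.isEmpty()
                    && stressedNodeIds.containsAll(group.copyNodeIds);
            if (!group.failFast) {
                toDispatch.add(group);
            }
        }
        return toDispatch;
    }
}
```

Groups left with `failFast` set would be reported as shard failures in the search response without a network round trip, while the returned list goes through normal routing, where ARS can still pick a healthy copy.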
Rejection in target nodes
We can reject incoming requests when the data node is under stress; this will be an extension of the existing search backpressure.
Co-authored by: @ajaymovva