[Meta] BackPressure in the OpenSearch Query (Search) path #1042
Based on further thoughts, breaking down the tasks as below:

1. Visibility metrics required for Backpressure framework

1.1 Resource Tracking framework
Build a resource tracking framework for search requests (queries), which tracks resource consumption on OpenSearch nodes for the various search operations at different levels of granularity:
i. Individual Search Request (Rest) - on the Coordinator Node, across phases (such as the search and query phase), for end-to-end resource tracking from the coordinator's perspective.
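A minimal sketch of what such per-request resource tracking could look like. The class, method names, and the choice of CPU/heap as the tracked dimensions are hypothetical illustrations, not the actual OpenSearch task resource framework.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical per-node tracker that accumulates resource usage for
 * in-flight search requests, keyed by task id and search phase.
 * Names and structure are illustrative only.
 */
public class SearchResourceTracker {

    /** Resource usage recorded for one task in one phase. */
    public static final class Usage {
        final LongAdder cpuNanos = new LongAdder();
        final LongAdder heapBytes = new LongAdder();
    }

    // taskId -> phase ("query", "fetch", ...) -> usage
    private final Map<Long, Map<String, Usage>> inflight = new ConcurrentHashMap<>();

    /** Called from instrumentation points while a phase executes. */
    public void record(long taskId, String phase, long cpuNanos, long heapBytes) {
        Usage u = inflight
            .computeIfAbsent(taskId, id -> new ConcurrentHashMap<>())
            .computeIfAbsent(phase, p -> new Usage());
        u.cpuNanos.add(cpuNanos);
        u.heapBytes.add(heapBytes);
    }

    /** Total heap attributed to a single request across all phases. */
    public long totalHeapBytes(long taskId) {
        return inflight.getOrDefault(taskId, Map.of()).values().stream()
            .mapToLong(u -> u.heapBytes.sum())
            .sum();
    }

    /** Drop tracking state once the request completes or is cancelled. */
    public void complete(long taskId) {
        inflight.remove(taskId);
    }
}
```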
2. Back Pressure accuracy and fairness

2.1 Server side rejection of incoming search requests based on resource consumption
Currently, search rejections are based solely on the number of tasks in the queue for the Search ThreadPool. That does not provide fairness in rejections: multiple smaller queries can exhaust this limit without being resource intensive, even though the node could take in many more requests, and vice versa. Essentially, count is not a reflection of actual work. Hence, based on the metrics in point 1 above, we want to build a framework which can perform more informed rejections based on point-in-time resource utilisation. The new model will take the admission decision for search requests on the node. These admission decisions or rejection limits can be added at:
The above will provide the required isolation of accounting and fairness in rejections, which is currently missing. This is still a reactive back-pressure mechanism, as it only focuses on the current consumption and does not estimate the future work.

2.2 Server side cancellation of in-flight search requests based on resource consumption
Cancel the ongoing, most resource-intensive search requests on a shard/node if the resource limits for that shard/node have started breaching the assigned limits (point 2.1) and there is no recovery within a certain time threshold. The backpressure model should support identification of the queries which are the most resource guzzling, with minimal wasteful work. These can then be cancelled to recover a node under load and continue doing useful work.

2.3 Query resource consumption estimation
Improve the framework added as part of point 2.1 to also estimate the cost of new search queries, based on their potential resource utilisation. This can be achieved by looking at past query patterns with similar query constructs and the actual data on the shard. It will help in building a pro-active back-pressure model for search requests, where estimates are compared against the available resources during the admission decision, for finer-grained control.

3. Single Slow Path bottle-necks

3.1 Adaptive replica selection - factor in node resource utilisation
Currently, for adaptive replica selection, the tasks in the search thread-pool queue are factored in while coming up with the ranking of the target shards. To be more accurate, we can also factor in the resource utilisation for search requests and take better routing decisions on the coordinator. This will help avoid shards under stress, giving them time to recover, while also allowing fast-fail decisions (point 2.2) on the coordinator, if required.

4. Server side rejections to account for Request Priority

4.1 Server side rejections based on Implicit Request Priority
With point 2 above, we have the ability to reject and cancel search requests on the server side based upon point-in-time resource consumption. This point-in-time state should also be considered while admitting forced tasks, such as retries and fetch operations, as these forced tasks can further increase the load on the nodes. The back-pressure framework should factor in the point-in-time resource consumption for a forced search request and decide whether to admit the query or not. "Forced" will act as a priority scheme for the search request, but will still not guarantee that the request is admitted; it will be the responsibility of the back-pressure framework to decide. There's a GitHub issue already for something similar - #1017

4.2 Server side rejections based on Explicit Request Priority
Extending point 4.1 above, with #1140 we propose that customers also have the ability to provide the priority of a request as a request field. The priority in the request is specified for a dynamic queue, such as one of (CRITICAL, HIGHEST, HIGH, NORMAL, LOW, LOWEST). The backpressure framework should be able to factor in this priority and honour it while taking rejection decisions. This will provide deeper isolation between queries, letting the low-priority queries be rejected first under duress, until there are serious reasons to reject high-priority ones.
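As an illustration of how an explicit priority could be honoured in rejection decisions, here is a minimal sketch. The priority-to-threshold mapping, the `RequestPriority` enum, and the utilisation signal are assumptions for illustration only, not part of OpenSearch today.

```java
/**
 * Illustrative priority-aware rejection thresholds. Under increasing node
 * pressure, lower-priority requests are rejected first; the mapping from
 * priority to threshold is an assumption, not a defined OpenSearch setting.
 */
public class PriorityAwareRejection {

    public enum RequestPriority { CRITICAL, HIGHEST, HIGH, NORMAL, LOW, LOWEST }

    /** Utilisation (0.0 - 1.0) above which a request of this priority is rejected. */
    private static double rejectionThreshold(RequestPriority priority) {
        switch (priority) {
            case CRITICAL: return 0.98;
            case HIGHEST:  return 0.95;
            case HIGH:     return 0.90;
            case NORMAL:   return 0.85;
            case LOW:      return 0.75;
            default:       return 0.65; // LOWEST
        }
    }

    /** Reject once the node's point-in-time utilisation crosses the priority's threshold. */
    public static boolean shouldReject(RequestPriority priority, double nodeUtilisation) {
        return nodeUtilisation >= rejectionThreshold(priority);
    }
}
```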
Is your feature request related to a problem? Please describe.
While serving search workloads from an OpenSearch cluster, the current protection mechanisms on data nodes, such as ThreadPoolQueueSize and CircuitBreakers, are not sufficient to protect the cluster against traffic surges, partial failures, slow nodes, or a single rogue (resource-guzzling) query.
Problem Statement in Detail
1. The count of requests in the search queue is not an accurate reflection of the load on the node.
Search queue sizes are effectively fixed, and the count of requests in the search queue is not an accurate reflection of the actual load on the node. There is a need to estimate the cost of a query (based on its query context) and map it against the available resources on the node to take an admission decision. This is applicable to individual search tasks (shard-search-requests) on the data nodes as well. Essentially, we need to model the queue sizes into resource maps to selectively control the inflow.
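A minimal sketch of the kind of admission check this implies, mapping an estimated query cost against point-in-time resource headroom instead of a queue count. The watermarks, parameters, and normalisation are assumptions for illustration, not existing OpenSearch APIs.

```java
/**
 * Illustrative admission check that rejects a search request when node-level
 * resource utilisation (rather than queue length) crosses configured limits.
 * Thresholds and inputs are assumptions, not OpenSearch settings.
 */
public class SearchAdmissionController {

    private final double cpuHighWatermark;   // e.g. 0.85 = 85% CPU used by search
    private final double heapHighWatermark;  // e.g. 0.90 = 90% of the search heap budget

    public SearchAdmissionController(double cpuHighWatermark, double heapHighWatermark) {
        this.cpuHighWatermark = cpuHighWatermark;
        this.heapHighWatermark = heapHighWatermark;
    }

    /**
     * @param currentCpu    fraction of CPU currently consumed by search tasks
     * @param currentHeap   fraction of the search heap budget currently in use
     * @param estimatedCost normalised cost estimate of the incoming request
     */
    public boolean shouldAdmit(double currentCpu, double currentHeap, double estimatedCost) {
        // Reject only when the node is genuinely saturated; many small queries
        // no longer exhaust a fixed count-based limit.
        if (currentCpu >= cpuHighWatermark) {
            return false;
        }
        // Admit only if the estimated cost still fits under the heap watermark.
        return currentHeap + estimatedCost < heapHighWatermark;
    }
}
```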
2. A single rogue query can adversely impact data nodes and create a cascading effect of node failures.
A single bad query today can potentially end up tripping data nodes and create a cascading effect of failures across the cluster. We need a fail-safe mechanism where problematic queries are proactively identified, blocked, or terminated when a data node is running under duress. This is an extension of point 1 above, where we need checkpointing so that a rogue query can be terminated once it has breached its original estimate by some margin.
3. A single slow path can bottleneck coordinators and impact the overall service quality and availability
Each fan-out transport request from the coordinator to the data nodes, for transport actions such as
indices:data/read/search[phase/query]
uses async I/O from the coordinator. While the coordinator is still aggregating or awaiting responses from the data nodes, the ongoing execution is not accounted for in the search queue size on the coordinator node. For this reason the coordinator continues to process more incoming requests, creating more pressure on the downstream data nodes. This lack of backpressure often results in a huge number of in-flight requests on the coordinator node if data nodes are taking longer to respond. Since the coordinator manages the request contexts, partial results, and failure retries, a single slow path in the cluster can bottleneck coordinators and impact the overall service quality and availability.
4. Force retries under heavy workload can worsen the situation.
Once the coordinator receives responses for shard requests from the data nodes, the response handling part (execute-next) or the failure handling part (retry-failed-shard) is executed on the search thread pool. In scenarios where an individual shard search request has failed, for example due to a connection or GC issue, the subsequent retry on another shard copy (replica) happens by forking the request and submitting it forcefully to the search thread pool. In such scenarios the current search thread pool size (load) is not honoured, by design, to reduce wastage of work already done. In isolated failure scenarios this is desirable. However, under heavy workload, if the node is already under duress, this forced retry actually worsens the situation.
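A simplified, hypothetical illustration of why forced retries bypass the queue limit: a bounded queue rejects normal submissions when full, but a forced submission is always accepted, adding load exactly when the node is already saturated. This is not OpenSearch's actual thread-pool or queue implementation.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Simplified stand-in for a bounded task queue with a "forced" submission path.
 * Illustrative only; OpenSearch's real search queue works differently in detail.
 */
public class ForceableTaskQueue {

    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final int capacity;

    public ForceableTaskQueue(int capacity) {
        this.capacity = capacity;
    }

    /** Normal path: honour the configured queue size. */
    public boolean offer(Runnable task) {
        if (queue.size() >= capacity) {
            return false; // rejected; the caller sees a queue-full error
        }
        return queue.offer(task);
    }

    /** Forced path used for retries: skips the capacity check to avoid wasting work already done. */
    public void forcePut(Runnable task) {
        queue.add(task); // always enqueued, even when the node is saturated
    }
}
```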
Describe the solution you'd like
Aligned with the principles of how we are addressing back-pressure in the indexing path (#478), the new back-pressure model in the search/query path will need to account for the in-flight requests in the system and their associated cost based on the query contexts. Consideration needs to be given to the query attributes, search phases, and the cardinality of requests across their lifecycle at the different node roles, i.e. coordinator and data. Based on these parameters, the cost of a new request can be evaluated against the existing workload on the node, and a throttling decision can be made under duress. This should allow shard search requests to fail fast on a problematic path (node/shard) based on recent/outstanding requests, so that these requests can instead be retried on another active path (one of the replicas) or returned with partial results (best-effort), without melting down the nodes.
In addition to smart rejections as a prevention mechanism, the new system should also leverage the search cancellation model extensively, to checkpoint and cancel the tasks (and sub-tasks) associated with problematic or stuck queries, in order to provide predictable recovery. This will allow the cluster to continue doing as much useful work as possible.
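A minimal sketch of how the most resource-guzzling in-flight tasks could be selected for cancellation under sustained duress. The `TaskView` interface, the heap-based selection heuristic, and the cancel callback are assumptions for illustration, not existing OpenSearch APIs.

```java
import java.util.Comparator;
import java.util.List;

/**
 * Illustrative selection of in-flight search tasks to cancel when a node
 * stays above its resource limits beyond a recovery window.
 */
public class SearchTaskCanceller {

    /** Hypothetical read-only view of a running search task. */
    public interface TaskView {
        long taskId();
        long heapBytesUsed();
        long runningTimeNanos();
        void cancel(String reason);
    }

    /**
     * Cancels the most resource-consuming tasks until the projected usage drops
     * back under the limit, breaking ties in favour of tasks with the least
     * elapsed work to minimise wasted effort.
     */
    public void relieve(List<TaskView> inflight, long currentHeapBytes, long heapLimitBytes) {
        long projected = currentHeapBytes;
        List<TaskView> candidates = inflight.stream()
            .sorted(Comparator.<TaskView>comparingLong(TaskView::heapBytesUsed).reversed()
                .thenComparingLong(TaskView::runningTimeNanos))
            .toList();
        for (TaskView task : candidates) {
            if (projected <= heapLimitBytes) {
                break;
            }
            task.cancel("node under sustained search backpressure");
            projected -= task.heapBytesUsed();
        }
    }
}
```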
Proposal in detail
To manage resource utilization effectively across the cluster
Instead of limiting requests based on queue occupancy, we will utilize a Cost Model to identify how cheap or expensive a query is in terms of its resource consumption. Query cost on a coordinator node will be a function of the query type, its constructs (contexts and clauses), aggregation type, number of buckets, cardinality/fan-out (number of shards to search), and the documents to be returned as part of the response, while query cost on the data nodes will be a function of the data to be scanned, document sizes, etc. This will help estimate and compute the cost of a query at intermediate stages and map it to the key resources on the node, such as memory, compute, and disk/network IO. Resource availability will be modelled as different TokenBuckets, and a request is permitted if the total available tokens across buckets are greater than the estimated cost of the request. The total available tokens on the node will always be finite, based on the instance type, and token exhaustion will be treated as a soft-limit breach. Burst support will be provided by the accounting model, to absorb transient spikes, while additional metrics are observed to confirm the state of duress before taking a throttling decision for a request.
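A minimal token-bucket sketch of the proposed accounting model, assuming a single resource dimension: tokens are reserved on admission against the estimated cost and returned on completion, with a burst allowance that is only honoured while other duress signals are healthy. Capacities and the duress signal are illustrative only.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Illustrative token bucket for one resource dimension (e.g. memory).
 * Capacity, burst allowance, and the duress flag are assumptions.
 */
public class ResourceTokenBucket {

    private final long capacity;       // finite tokens, derived from instance size
    private final long burstAllowance; // extra tokens permitted for transient spikes
    private final AtomicLong available;

    public ResourceTokenBucket(long capacity, long burstAllowance) {
        this.capacity = capacity;
        this.burstAllowance = burstAllowance;
        this.available = new AtomicLong(capacity);
    }

    /** Try to reserve tokens equal to the estimated cost of a request. */
    public boolean tryAcquire(long estimatedCost, boolean nodeUnderDuress) {
        // Allow bursting below zero only while other duress signals are healthy.
        long floor = nodeUnderDuress ? 0 : -burstAllowance;
        while (true) {
            long current = available.get();
            long next = current - estimatedCost;
            if (next < floor) {
                return false; // soft-limit breach: reject or defer the request
            }
            if (available.compareAndSet(current, next)) {
                return true;
            }
        }
    }

    /** Return tokens once the request completes (actual cost may differ from the estimate). */
    public void release(long tokens) {
        available.accumulateAndGet(tokens, (cur, add) -> Math.min(capacity, cur + add));
    }
}
```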
To recover from duress situations
In addition to the original estimation, the intermediate resource consumption of a query will be calculated at different stages, such as during partial query/fetch result aggregation on the coordinator. This will help compare and checkpoint the original estimates against the actual utilisation, and take burst-vs-throttle decisions based on the overall load on the node. Highly resource-guzzling queries under duress situations can be proactively cancelled to prevent a complete meltdown of the host. Intermediate checkpointing and cancellation will allow recovery and the continuation of useful work.
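A minimal sketch of such an intermediate checkpoint: actual consumption so far is compared against the original estimate, and the outcome is to continue, tolerate the overshoot as a burst, or cancel. The threshold factor and the decision outcomes are assumptions for illustration.

```java
/**
 * Illustrative checkpoint performed at intermediate stages (e.g. after partial
 * query-result aggregation) comparing actual usage against the original estimate.
 */
public class QueryCheckpointer {

    public enum Decision { CONTINUE, ALLOW_BURST, CANCEL }

    private final double overshootCancelFactor; // e.g. 3.0 = cancel at 3x the estimate under duress

    public QueryCheckpointer(double overshootCancelFactor) {
        this.overshootCancelFactor = overshootCancelFactor;
    }

    public Decision checkpoint(long estimatedCost, long actualCostSoFar, boolean nodeUnderDuress) {
        if (actualCostSoFar <= estimatedCost) {
            return Decision.CONTINUE;
        }
        // Over the estimate: tolerate the overshoot as a burst while the node is
        // healthy, cancel once the node is under duress and the overshoot is large.
        if (nodeUnderDuress && actualCostSoFar > estimatedCost * overshootCancelFactor) {
            return Decision.CANCEL;
        }
        return Decision.ALLOW_BURST;
    }
}
```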
To build backpressure for node-to-node resiliency
We need to come up with dedicated shard tracking structures, maintained per primary and replica shard on every coordinator node, to track the complete lifecycle of search operation tasks across those shards. This has to take account of every request, at the granularity of the different phases, such as the query and fetch phases. Also, each data node will have similar tracking structures per shard on the node, maintained and updated at the transport action layer, to track the in-flight search operations on the node. This will allow maintaining an end-to-end view of a request, along with its current state, thereby allowing easy identification of stuck or slow tasks in the system. The system can track lingering requests on the node in case of a single black/grey shard issue, as well as issues due to multiple shards on the node. This will allow the back-pressure to propagate early, from the data nodes to the coordinator nodes, while preventing build-up on the coordinator. Additionally, coordinators can prevent new requests from taking the problematic path. Coordinators in such cases can take early decisions to either fast-fail (partial results) or retry on another shard until the system recovers.
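A minimal sketch of a per-shard, per-phase tracking structure a coordinator (or data node) could maintain to spot lingering requests on a slow path. Shard ids are plain strings here and the fields and method names are illustrative assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/**
 * Illustrative per-shard, per-phase tracker of outstanding shard-level search
 * requests, used to detect slow or grey paths from the coordinator's view.
 */
public class ShardRequestTracker {

    public static final class PhaseStats {
        final LongAdder outstanding = new LongAdder(); // currently in-flight requests
    }

    // shardId -> phase ("query", "fetch") -> stats
    private final Map<String, Map<String, PhaseStats>> perShard = new ConcurrentHashMap<>();

    public void onPhaseStart(String shardId, String phase) {
        stats(shardId, phase).outstanding.increment();
    }

    public void onPhaseEnd(String shardId, String phase) {
        stats(shardId, phase).outstanding.decrement();
    }

    /** A shard with many lingering requests signals a slow or problematic path. */
    public long outstanding(String shardId, String phase) {
        return stats(shardId, phase).outstanding.sum();
    }

    private PhaseStats stats(String shardId, String phase) {
        return perShard
            .computeIfAbsent(shardId, s -> new ConcurrentHashMap<>())
            .computeIfAbsent(phase, p -> new PhaseStats());
    }
}
```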