Server side Cancellation of in-flight search requests based on resource consumption #1181
Following are some detailed thoughts on the proposed approach:

### Problem Statement
Many times a single resource-intensive search query can guzzle a lot of resources, and a bunch of such queries can degrade the performance of the cluster. Currently we do not have a mechanism to identify and terminate the problematic queries when a node is in duress. Existing mechanisms like the circuit breaker and thread pool size threshold act as blanket mechanisms and do not specifically target the problematic queries alone.

### Goals
Milestone 1: Identify and reject the on-going resource-intensive tasks on a shard/node if they have breached limits and do not recover within a certain threshold. This only rejects the task on a particular shard; tasks on other shards can still execute successfully.

### Non-Goals
We are not targeting to build backpressure for spikes in the search request rate as a part of this task. That would be handled as part of the rejection of incoming requests (#1180) task.

### Key Considerations
### Proposed Approach
[The following sections describe the approach for Milestone 1]

Measure the resource consumption (CPU, heap memory) at frequent checkpoints within the query phase of a shard search request. If the node is in duress (JVM memory pressure above threshold, CPU utilization at threshold) and the total heap memory occupied by search shard tasks is >= 5% of the total heap, then check the following criteria for each search task: CPU cycles spent and heap memory occupied by the task. If the task has exceeded the CPU cycles threshold and is among the top tasks by heap memory occupied, with a large variance from the average resource consumption, then we will cancel the search task.

**Different checkpoints to consider in the Query Phase**

Approach 1 - Using multiple checkpoints to track within the same task thread

Approach 2 - Using a separate observer thread

**Deciding if a node is in duress**

The current JVM memory pressure on the node and CPU utilization are used as criteria to determine if the node is in duress.

**Identifying top resource-consuming tasks**

When TaskResourceTrackingService measures the resource stats, it will also keep track of the top-N tasks based on heap memory consumption. This would be used to identify and cancel the top resource-intensive tasks if the variance is considerably higher.

**Why not add cancellation logic to the Fetch phase also?**

Every search request goes through two phases - Query and Fetch. The Query phase is responsible for doing the actual search and getting the matching document ids from each shard. The Fetch phase enriches the document ids with document information. The Query phase is usually very heavy, and its resource consumption varies depending on the nature of the query and the workload, hence we track the Query phase extensively.

### PoC Testing
Did code changes for PoC testing which included the following logic: a heap to track the top-N requests, measuring resource utilization after every sub-phase, and cancelling the top-most resource-consuming query. (Did not include logic for the duration of the running request or the variance logic.)

Executed two different types of queries - light and heavy - as follows:

Light-weight query:
Comparatively heavy aggregation query:
While the queries were executing, the top queries consuming a lot of heap were cancelled as shown below, whereas the light-weight queries always completed successfully:
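As an illustration of the cancellation decision described above (node in duress, search shard tasks holding >= 5% of the heap, then cancelling the top heap consumers that have also exceeded the CPU threshold and deviate strongly from the average), here is a minimal sketch. All class names, thresholds and the top-N limit are assumptions for illustration, not the actual OpenSearch implementation.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative stand-in for per-task resource stats; not the real OpenSearch task classes.
record TaskStats(long taskId, long heapBytes, long cpuNanos) {}

class CancellationDecider {
    static final double SEARCH_HEAP_SHARE_THRESHOLD = 0.05;    // search tasks must hold >= 5% of total heap
    static final long CPU_NANOS_THRESHOLD = 15_000_000_000L;   // assumed per-task CPU budget (15s)
    static final double VARIANCE_FACTOR = 2.0;                 // "huge variance": well above the average
    static final int MAX_TASKS_TO_CANCEL = 3;                  // assumed top-N limit

    /** Returns the tasks that should be cancelled, or an empty list if no action is needed. */
    static List<TaskStats> tasksToCancel(boolean nodeInDuress, long totalHeapBytes, List<TaskStats> searchTasks) {
        if (!nodeInDuress || searchTasks.isEmpty()) {
            return List.of();
        }
        long searchHeap = searchTasks.stream().mapToLong(TaskStats::heapBytes).sum();
        if (searchHeap < SEARCH_HEAP_SHARE_THRESHOLD * totalHeapBytes) {
            return List.of();   // search is not a significant contributor to heap pressure
        }
        double avgHeap = (double) searchHeap / searchTasks.size();
        // Top heap consumers that have also burned too many CPU cycles and sit far above the average.
        return searchTasks.stream()
            .filter(t -> t.cpuNanos() > CPU_NANOS_THRESHOLD)
            .filter(t -> t.heapBytes() > VARIANCE_FACTOR * avgHeap)
            .sorted(Comparator.comparingLong(TaskStats::heapBytes).reversed())
            .limit(MAX_TASKS_TO_CANCEL)
            .toList();
    }
}
```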
### Other Approaches considered

### Points to Note

### Open Items
I will evaluate both approaches further and share more details.

### Metrics to be added
The following would be rolling-window metrics computed over every 1 minute and exposed through the stats API.
### Additional changes
Please share your thoughts on the above.

cc-ing folks for comments @reta @dblock @andrross @nknize @sruti1312 @getsaurabh02
More thoughts on the above open item:

### Approach for Resource Consumption Tracking
We need to track the resource consumption of the currently executing tasks, which can be done in two ways: track at different milestones within the same task thread, or use a separate observer thread for monitoring.
**Preferred approach** - A separate observer thread is preferred due to its low performance overhead, simplicity and code maintainability. Although tracking using a cancellable listener would enable tracking more closely (multiple invocations within a second), we do not gain much by tracking at sub-second frequency. Also, if a task gets stuck it cannot be identified until it reaches its next checkpoint.

### Implementation Details
A separate observer thread will be running which will execute the following logic every 1 second (see the sketch below).
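The actual per-second logic is not reproduced here, but the overall shape of such an observer could look roughly like the following sketch (reusing the illustrative TaskStats and CancellationDecider types from the earlier sketch; the helper methods are placeholders for the real node-monitoring, task-tracking and cancellation services):

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class SearchBackpressureObserver {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void start() {
        // Run the monitoring pass once per second, independently of the search threads themselves.
        scheduler.scheduleAtFixedRate(this::monitorOnce, 1, 1, TimeUnit.SECONDS);
    }

    private void monitorOnce() {
        if (!nodeInDuress()) {
            return; // healthy node: keep collecting stats, but never cancel
        }
        List<TaskStats> tasks = currentSearchShardTaskStats();
        for (TaskStats offender : CancellationDecider.tasksToCancel(true, totalHeapBytes(), tasks)) {
            cancel(offender, "resource consumption exceeded search backpressure thresholds");
        }
    }

    // --- placeholders for the real node/task tracking and cancellation services ---
    private boolean nodeInDuress() { return false; }
    private long totalHeapBytes() { return Runtime.getRuntime().maxMemory(); }
    private List<TaskStats> currentSearchShardTaskStats() { return List.of(); }
    private void cancel(TaskStats task, String reason) { /* delegate to the task manager */ }
}
```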
If we considered only the currently executing tasks, resourceConsumptionCurrentTasksAvg would be much lower and we might mistake a task that is simply nearing completion for a rogue query. Hence we take into account both the completed-tasks average and the currently-executing-tasks average.

### Configurable Settings
All the settings below would be dynamically configurable.
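The concrete settings are not listed here, but dynamically updatable node settings in OpenSearch are typically declared along these lines; the keys, defaults and thresholds below are purely illustrative assumptions, not the actual setting names:

```java
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;
import org.opensearch.common.unit.TimeValue;

// Hypothetical setting keys; the real keys and defaults would live in the search backpressure module.
final class SearchBackpressureSettings {
    // Overall mode, e.g. a shadow/monitor-only mode vs an enforced mode.
    static final Setting<String> MODE =
        Setting.simpleString("search_backpressure.mode", "monitor_only", Property.Dynamic, Property.NodeScope);

    // Node duress thresholds (fractions of heap and CPU).
    static final Setting<Double> HEAP_DURESS_THRESHOLD =
        Setting.doubleSetting("search_backpressure.node_duress.heap_threshold", 0.70, 0.0, Property.Dynamic, Property.NodeScope);
    static final Setting<Double> CPU_DURESS_THRESHOLD =
        Setting.doubleSetting("search_backpressure.node_duress.cpu_threshold", 0.90, 0.0, Property.Dynamic, Property.NodeScope);

    // How often the observer thread wakes up.
    static final Setting<TimeValue> MONITOR_INTERVAL =
        Setting.positiveTimeSetting("search_backpressure.interval", TimeValue.timeValueSeconds(1), Property.Dynamic, Property.NodeScope);
}
```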
### PoC Test Results
Did sample runs on the nyc_taxis workload with both approaches. Please find the comparison test results here. Also find the comparison with the base run for both approaches here: compare with same thread, compare with observer.
Few possible scenarios and how they are handled
Thanks @nssuresh2007
+1 to that, collecting metrics in a separate thread makes a lot of sense.
On a general note, I think the cancellation logic should be smarter than just making the decision based on collected metrics. As an example, there could be only one heavy query (scattered across many tasks) which keeps the cluster busy for a long time. Yes, it is probably not good overall, but if the cluster is able to fulfill it (without timing out), it is not as bad as it looks. Consulting the picture at large - what the cluster is doing - is probably necessary, otherwise such queries will always be cancelled, no matter which replica they are retried on, even when the cluster is doing nothing else. Also, the cancellation should take the age of the search request into account; it is highly likely that serving recent requests is more valuable, since the client may have already given up waiting for the older ones. This is somewhat related.

Now, the most important question to me is: what should the user do when the search request execution is cancelled with such an error?

I think one of the missing features, which could enormously help in decision making, is search query cost estimation: taking the search query and the cluster configuration + topology, estimate the resource consumption (cost). This is obviously out of scope of this particular issue.
There's a lot of great stuff in this proposal. A complement to resource-consumption-based cancellation could be a quality-of-service-based evaluation. Meaning that instead of treating an increase in resource consumption as a red flag (rightfully noted as potentially not a problem), we would attempt to detect that the cluster is deteriorating in its ability to provide a certain quality of service, which would cause the limits of acceptable resource consumption to be lowered, and more heavy requests to be cancelled.

This is because I want consistency in my service. If a cluster serves X requests successfully with Y quality (time, resources), I want to ensure that if I see X + 1 requests, that 1 additional request does not produce less than Y quality overall for each of the X requests, and would prefer to reject or cancel that 1 extra request before quality degrades.

So, since I think we all want to see a "Task was cancelled before it took the cluster down" error message, consider this QoS measure as an alternative or complement to the above proposals of adjusting thresholds. Having moving averages as proposed above is possibly not the right metric; what we will want to look at is the percentile of operations succeeding within a certain threshold.
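As a side note, one very simple form of such a QoS signal would be to track, over a sliding window, the fraction of recent search operations completing within a latency target. The sketch below is only an illustration of the idea and is not part of the proposal:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sliding-window "served within the latency target" ratio; a rough QoS signal.
class QosTracker {
    private final long sloMillis;
    private final int windowSize;
    private final Deque<Boolean> window = new ArrayDeque<>();
    private int withinSlo = 0;

    QosTracker(long sloMillis, int windowSize) {
        this.sloMillis = sloMillis;
        this.windowSize = windowSize;
    }

    synchronized void record(long tookMillis) {
        boolean ok = tookMillis <= sloMillis;
        window.addLast(ok);
        if (ok) withinSlo++;
        if (window.size() > windowSize && window.removeFirst()) {
            withinSlo--;
        }
    }

    /** Fraction of recent operations that met the latency target (1.0 when no data yet). */
    synchronized double qos() {
        return window.isEmpty() ? 1.0 : (double) withinSlo / window.size();
    }
}
```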
Thanks a lot for the feedback and your time @reta and @dblock! A few points I wanted to add @reta:

The decision to trigger task cancellation is made only if the node is in duress and the search tasks have contributed a significant portion of the resource usage. Hence such queries would not get cancelled on a normally operating cluster.

Yes, it is a very valid point. We can use the elapsed time of each task to prioritize older requests for cancellation.

We expect the client behavior to be similar to when ESRejectedExecutionException is thrown by the cluster: it would mean that the cluster is overloaded, and we expect the client to retry with sufficient backoff. In case partial results are allowed, we would return results only from the other shards whose tasks were not cancelled.

Since it depends on the workload (indexing and search), a recommendation to increase the heap size might not always be applicable. Please let me know your thoughts.

Just a few follow-up thoughts on your comments @dblock:

There are multiple parts that we want to build as part of search back-pressure, as mentioned here: #1042. We are targeting only to recover a node in duress with task (1) above by cancelling resource-guzzling tasks. Unlike indexing operations, the resource utilization of a search request is hard to estimate since it depends on multiple factors like the query type, its constructs (contexts and clauses), aggregation type, number of buckets, cardinality/fan-out (number of shards to search) and the documents to be returned as part of the response.

We are trying to identify the most resource-guzzling tasks among the currently executing ones by checking the variance of each task's resource consumption from the average. Please let me know your thoughts.
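For illustration, "variance of resource consumption from the average" could be read as flagging tasks that sit more than k standard deviations above the mean. A minimal sketch, reusing the illustrative TaskStats type from the earlier sketch, with k as an assumed tunable:

```java
import java.util.List;

// Flags tasks whose heap usage sits well above the mean of the currently running search tasks.
class OutlierDetector {
    static List<TaskStats> outliers(List<TaskStats> tasks, double k) {
        int n = tasks.size();
        if (n < 2) return List.of();
        double mean = tasks.stream().mapToLong(TaskStats::heapBytes).average().orElse(0);
        double variance = tasks.stream()
            .mapToDouble(t -> (t.heapBytes() - mean) * (t.heapBytes() - mean))
            .sum() / n;
        double stdDev = Math.sqrt(variance);
        return tasks.stream()
            .filter(t -> t.heapBytes() > mean + k * stdDev)
            .toList();
    }
}
```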
Agree with @dblock on the QoS, which is also the eventual goal once we are able to add more granular instrumentation on the latency breakdown across various layers and network interactions. This is also pretty challenging since the QoS could degrade not just because of an overload but also due to an I/O slowdown. Some of the work to track gray failures is being taken up as a part of #4244. Once we have that level of detail, we could maybe take the right action more deterministically. I think the current proposal lays down steps to prevent a cluster from getting into an unmanageable state by applying load-shedding mechanisms, allowing the cluster to recover.
@tushar-kharbanda72: Today, Sep 07, is the code freeze date for OpenSearch. If this is not planned for 2.3, can you please update the label accordingly?
@reta @dblock @Bukhtawar @sachinpkale The following additional metadata would be added to the stats API:

Kindly let me know your comments on the above.
I think the shape of this response doesn't align with other APIs, but I could be wrong
@dblock, thanks a lot for your feedback. I have updated the structure as follows to address your comments, and have also revamped it to be easily extensible for future needs (currently we only emit stats on shard search tasks; in the future we may also add stats on coordinator tasks).

### Metrics behavior in Shadow vs Enforced mode
In Enforced mode, all the stats present in the response would be populated.

Response to your comments:
Removed the redundant "search_task" text in every field.
Updated.
Removed the limits section since those are already listed as part of the cluster settings API, so they are not duplicated here. Kindly let me know your thoughts.
I like this better!
I really like this proposal! I have a couple questions/comments:
I think we're missing logging for cancelled tasks, similar to slow query logging. Users will want to turn on something like "cancellable tasks logging" to enable dumping the query body in the logs upon cancellation, to debug whether the cancellation is typical for one problematic query or index.
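Something analogous to the existing slow log could work here, e.g. a dedicated logger that dumps the cancelled task's description (which carries the search source for search shard tasks) at cancellation time. The logger name and method below are only a sketch of the idea, not an existing OpenSearch API:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Hypothetical "cancellation log", analogous to the slow query log suggested above.
final class SearchCancellationLog {
    private static final Logger logger = LogManager.getLogger("index.search.cancellation");

    static void logCancellation(String taskId, String taskDescription, String reason) {
        // The description of a search shard task normally carries the query source,
        // which is what makes the cancelled query debuggable after the fact.
        logger.warn("cancelled search task [{}], reason [{}], source [{}]", taskId, reason, taskDescription);
    }
}
```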
We are proposing the following update to the stats API structure in order to address the following points:
Please let me know if you have any comments on the below stats API structure:
@tushar-kharbanda72 I've added this to the roadmap per the 2.4 label and wanted to confirm this is on track for 2.4. Thanks!

@elfisher Yes, we are on track for the 2.4 release for Milestone 1 (tracking and cancellation of search shard tasks only; does not include coordinator tasks).
In general I agree with the approach above and the direction it is taking. A point to consider: for asynchronous or long-running queries, cancelling them without consideration of priority, or without the ability to constrain/sandbox a query, might be an oversimplification of the issue. I might want to run a query that is going to be long-running and scan a large data set. It might be slow and take time, but the results are important. Cancelling should be employed only when identifying a rogue query/task, and I understand that is the primary intention here. But if we want to automate this we need to consider the priority and kind of query as well, so QoS and query cost estimation might be expensive but required. I think the uber goal would be to be able to prioritize and control resource consumption at execution time. This might not directly fit into this issue, but it should be something to consider as we look at query execution overall.
@tushar-kharbanda72 do you still track this for the 2.4 release? Code freeze is on 11/3.

@nssuresh2007 are you on track for the 2.4 release? Today is the code freeze.

@anasalkouz Yes, as per the plan, the code changes for Milestone 1 of this issue are merged for 2.4.
### Performance comparison after the Search Backpressure changes

**Summary:** We did not see any degradation in performance due to Search BP.

**Domain Setup**
Data nodes: r5d.large, 2 nodes

Benchmark command used:

Baseline: Without Search BP changes

**Detailed Results**
### How Search BP behaved with a Rogue Query
We added the following rogue query to the nyc_taxis workload:
Summary
### Test setup
Used locust with the following configuration:

This test used the same queries used by the opensearch-benchmark tool.

**With Search BP enabled**
**With Search BP disabled**
Note: OpenSearch crashed on one of the instances because the heap space limit was reached.
@rramachand21 Is this still on track for v2.6.0? Code freeze is today (Feb 21, 2023). Also, I'm assuming this is for Milestone 2 (#1329), since #1181 (comment) notes that Milestone 1 was released with v2.4.0.

@kartg yes, this is on track for v2.6.0, and yes, it's for Milestone 2.
Bumping tag to 2.7.0 since #6455 was merged in for 2.6 |
Hi @tushar-kharbanda72, this issue will be marked for next-release
Tagging it for |
Is your feature request related to a problem? Please describe.
#1042 aims to build back-pressure support for search requests. This will help in recovering a node which is running short on system resources while the already-running search requests are not finishing and are making things worse.
Describe the solution you'd like
Cancel the most resource-intensive on-going search requests on a shard/node if that shard/node has started breaching its assigned resource limits (#1180) and there is no recovery within a certain time threshold. The back-pressure model should support identifying the most resource-guzzling queries with minimal wasted work. These can then be cancelled to recover a node under load so that it can continue doing useful work.
Describe alternatives you've considered
Additional context
#1180 - This issue covers rejection of incoming requests.