
Server side Cancellation of in-flight search requests based on resource consumption #1181

Open
tushar-kharbanda72 opened this issue Aug 31, 2021 · 29 comments
Labels
discuss (Issues intended to help drive brainstorming and decision making), distributed framework, enhancement (Enhancement or improvement to existing feature or request), v2.8.0 (Issues and PRs related to version v2.8.0)

Comments

@tushar-kharbanda72
Contributor

tushar-kharbanda72 commented Aug 31, 2021

Is your feature request related to a problem? Please describe.
#1042 aims to build back-pressure support for search requests. This will help recover a node that is running short on system resources while its already-running search requests are not finishing and are making things worse.

Describe the solution you'd like

Cancel the most resource-intensive on-going search requests on a shard/node if that shard/node has started breaching its assigned resource limits (#1180) and there is no recovery within a certain time threshold. The back-pressure model should support identifying the queries that guzzle the most resources, with minimal wasteful work; these can then be cancelled to recover a node under load so it can continue doing useful work.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
#1180 - This issue covers rejection of incoming requests.

@nssuresh2007

nssuresh2007 commented Aug 3, 2022

Following are some detailed thoughts on the proposed approach:

Problem Statement

A single resource-intensive search query can guzzle a lot of resources, and a bunch of such queries can degrade the performance of the cluster. Currently we do not have a mechanism to identify and terminate the problematic queries when a node is in duress. Existing mechanisms like the circuit breaker and thread pool size thresholds act as blanket mechanisms and do not specifically target the problematic queries.

Goals

Milestone 1: Identify and cancel the on-going resource-intensive tasks on a shard/node if they have breached limits and do not recover within a certain threshold. Only the task on that particular shard is cancelled; tasks on other shards can still execute successfully.
Milestone 2: The previous milestone tracks only shard-level tasks. This milestone adds tracking of coordinator task resource consumption and cancellation logic based on the threshold.
Milestone 3: Build an aggregated view of resource consumption for a query by rolling up the consumption stats from all the shards and aggregating them under the parent task ID. This aggregated view can be used to build guard-rails to track and cancel requests that consume excessive resources across nodes.

Non-Goals

Building back-pressure for spikes in the search request rate is not a goal of this task; it would be handled as part of the rejection of incoming requests (#1180).

Key Considerations

  • Resource Tracking Framework would be utilized to provide stats on task resource consumption.
  • Initially, we are targeting the shard search task and not the co-ordinator task.

Proposed Approach

[Following sections describe the approach for Milestone 1]

Measure the resource consumption (CPU, heap memory) at frequent checkpoints within the query phase of a shard search request. If the node is in duress (JVM MP above threshold, CPU utilization above threshold) and the total heap memory occupied by search shard tasks is >= 5% of total heap, then check the following criteria for each search task: CPU cycles spent and heap memory occupied by the task. If a task has exceeded the CPU cycles threshold and is among the top tasks by heap memory occupied, with a large variance from the average resource consumption, we will cancel the search task.

Different checkpoints to consider in Query Phase

Approach 1 - Using multiple checkpoints to track within the same task thread:
The query phase of each search shard task goes through different sub-phases (pre-aggregation, search, rescoring, suggest, post-aggregation), and we will checkpoint after each sub-phase completes.
Among these, the search sub-phase does the actual Lucene search and is very intensive. Hence we cannot checkpoint only after the search completes; we will also add a cancellable callback that periodically checkpoints during the actual search execution itself (a sketch follows below).
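
For illustration, a minimal sketch of such a checkpoint, assuming a hypothetical cancellation hook on the context searcher and a hypothetical resource-usage helper; the concrete QueryPhase integration points may differ:

// Sketch only: addQueryCancellation, resourceUsageOf and the threshold helpers are assumptions,
// not the actual OpenSearch API surface.
void registerResourceCheckpoint(SearchShardTask task, SearchContext searchContext) {
    Runnable checkpoint = () -> {
        ResourceUsage usage = resourceUsageOf(task);            // hypothetical: current CPU/heap usage of this task
        if (nodeIsInDuress()                                    // JVM MP / CPU above node-level thresholds
            && usage.cpuNanos() > cpuCyclesThresholdNanos()
            && isTopHeapConsumerWithHighVariance(task, usage.heapBytes())) {
            // Surfaced by the query phase as a cancelled shard request
            throw new TaskCancelledException("Task is cancelled due to high resource consumption");
        }
    };
    // Invoked between sub-phases and periodically while the Lucene search itself is running
    searchContext.searcher().addQueryCancellation(checkpoint);  // hypothetical hook
}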

Approach 2 - Using a separate observer thread
We will use a separate observer thread to monitor the tasks at a fixed frequency. Instead of evaluating at different checkpoints, tasks are tracked at that fixed frequency.

Deciding if node is in duress

Current JVM MP on the node and CPU utilization are used as criteria to determine if the node is in duress.

Identifying top resource consuming tasks

When TaskResourceTrackingService measures the resource stats, it will also keep track of the top-N tasks by heap memory consumption. This is used to identify and cancel the most resource-intensive tasks when their variance from the average is considerably higher (a sketch of bounded top-N tracking follows).
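
A minimal, self-contained sketch of bounded top-N tracking with a min-heap keyed on heap bytes; the task identifier and the integration with TaskResourceTrackingService are simplified assumptions:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Keeps the N tasks with the highest heap consumption; the smallest of them sits at the heap root,
// so each update is O(log N) and non-top tasks are evicted cheaply.
class TopNHeapConsumers {
    private final int n;
    private final PriorityQueue<long[]> minHeap =               // [taskId, heapBytes], ordered by heapBytes ascending
            new PriorityQueue<>(Comparator.comparingLong((long[] e) -> e[1]));

    TopNHeapConsumers(int n) { this.n = n; }

    synchronized void offer(long taskId, long heapBytes) {
        if (minHeap.size() < n) {
            minHeap.add(new long[] { taskId, heapBytes });
        } else if (heapBytes > minHeap.peek()[1]) {
            minHeap.poll();                                     // evict the current smallest of the top N
            minHeap.add(new long[] { taskId, heapBytes });
        }
    }

    synchronized List<long[]> snapshot() {
        return new ArrayList<>(minHeap);
    }
}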

Why not add cancellation logic in Fetch phase also?

Every search request goes through two phases: the query phase and the fetch phase. The query phase does the actual search and gets the matching document IDs from each shard; the fetch phase enriches those document IDs with document information. The query phase is usually very heavy, and its resource consumption varies with the nature of the query and the workload, so we track it extensively.
Also, once the query phase completes, the search request is about to finish, and cancelling it at that point would waste the work already done.

PoC Testing

Made code changes for PoC testing, which included the following logic — a heap to track the top-N requests, measuring resource utilization after every sub-phase, and cancelling the top-most resource-consuming query. (The running-duration and variance logic were not included.)

Executed two different types of queries, light and heavy, as follows:

Light weight query:

curl "localhost:9200/_search?q=*:*"

Comparatively heavy aggregation query:

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' \
-d'{"query": {"range":{"dropoff_datetime":{"from":"01/01/2015","to":"21/01/2015",
"include_lower":true,"include_upper":true,"format":"dd/MM/yyyy","boost":1.0}}},
"aggregations":{"dropoffs_over_time":{"date_histogram":{"field":"dropoff_datetime",
"calendar_interval":"day","offset":0,"order":{"_key":"asc"},"keyed":false,"min_doc_count":0}}}}'

While the queries were being executed, the top queries consuming a lot of heap were cancelled as shown below, whereas the light-weight queries were always successful:

[2022-07-28T18:29:08,400][INFO ][o.o.s.q.QueryPhase       ] [ip-172-31-51-111.us-west-2.compute.internal] This task 1317 is currently the highest resource consuming task and it is being cancelled
[2022-07-28T18:29:08,401][TRACE][o.o.s.SearchService      ] [ip-172-31-51-111.us-west-2.compute.internal] Query phase failed
org.opensearch.tasks.TaskCancelledException: Task is cancelled due to high resource consumption
        at org.opensearch.search.query.QueryPhase.measureResourceConsumption(QueryPhase.java:157) ~[opensearch-2.1.0.jar:2.1.0-SNAPSHOT]
        at org.opensearch.search.query.QueryPhase.execute(QueryPhase.java:187) ~[opensearch-2.1.0.jar:2.1.0-SNAPSHOT]
        at org.opensearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:455) ~[opensearch-2.1.0.jar:2.1.0-SNAPSHOT]
        at org.opensearch.search.SearchService.executeQueryPhase(SearchService.java:523) [opensearch-2.1.0.jar:2.1.0-SNAPSHOT]
        at org.opensearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:490) [opensearch-2.1.0.jar:2.1.0-SNAPSHOT]
        at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:73) [opensearch-2.1.0.jar:2.1.0-SNAPSHOT]
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:88) [opensearch-2.1.0.jar:2.1.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.1.0.jar:2.1.0-SNAPSHOT]
        at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78) [opensearch-2.1.0.jar:2.1.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.1.0.jar:2.1.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59) [opensearch-2.1.0.jar:2.1.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [opensearch-2.1.0.jar:2.1.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.1.0.jar:2.1.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
[ec2-user@ip-172-31-51-111 opensearch]$ curl -X GET "localhost:9200/_search?pretty" 
-H 'Content-Type: application/json' 
-d'{"query": {"range":{"dropoff_datetime":{"from":"01/01/2015","to":"21/01/2015",
"include_lower":true,"include_upper":true,"format":"dd/MM/yyyy","boost":1.0}}},
"aggregations":{"dropoffs_over_time":{"date_histogram":{"field":"dropoff_datetime",
"calendar_interval":"day","offset":0,"order":{"_key":"asc"},"keyed":false,"min_doc_count":0}}}}'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "task_cancelled_exception",
        "reason" : "Task is cancelled due to high resource consumption"
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "nyc_taxis",
        "node" : "MGkMkg9wREW3IVewZ7U_jw",
        "reason" : {
          "type" : "task_cancelled_exception",
          "reason" : "Task is cancelled due to high resource consumption"
        }
      }
    ]
  },
  "status" : 500
}

Other Approaches considered

  • Measure resource consumption and add cancellation logic at the end of each phase/sub-phase.
    • As mentioned above, the query phase does all the heavy lifting, and measuring at the end of the phase is too late to cancel the request.
  • Cancelling based on fixed thresholds for per-request resource consumption.
    • Each workload differs based on index and hardware configuration, so fixed per-request thresholds might not suit all the diverse workloads.

Points to Note

  • [Milestone 1] If the search task is cancelled, it should be retried on another replica.
  • [Milestone 1] Should the tracking of top-N requests be moved to shard level instead of node level? — Tracking top-N at the shard level might over-complicate things and does not add much value. Instead, we can track only the total heap memory occupied by search requests at the shard level.
  • [Milestone 3] We can also expose configuration for users to set a threshold for resource consumption beyond which the request will be cancelled.
  • [Milestone 3] Cancelled queries can be added to slow logs, so that it is easier for users to debug.

Open Items

  • From where can we measure the resource consumption of currently executing tasks? Should it be done on a separate observer thread, or can it happen on the same task's thread?
    • Separate observer thread: Use a separate observer thread to checkpoint resource consumption periodically at a fixed frequency. This avoids any performance impact on the search thread, since measurement runs on a separate thread. It also improves code maintainability, since it does not require interleaving monitoring code within the search logic.
    • Using the same task thread: We can measure the resource consumption of a task within the task itself at different milestones. The additional overhead of measuring resource consumption is then handled by the same task thread.

I will evaluate both approaches further and share more details.

Metrics to be added

The following would be rolling-window metrics over a 1-minute window and would be exposed through the stats API (a sketch of such a rolling window follows the list).

  • ExpensiveTaskCancelledCount - Counter which indicates the number of expensive tasks that are cancelled.
  • TopNTaskCPUUtilization - Absolute value of CPU utilization for all top-N requests for every minute.
  • TopNTaskHeapMem - Absolute value of Heap Mem utilization for all top-N requests for every minute.
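
A self-contained sketch of a 1-minute rolling-window counter of the kind that could back ExpensiveTaskCancelledCount; the bucketing granularity and thread-safety strategy are assumptions:

import java.util.ArrayDeque;
import java.util.Deque;

// Counts events that happened within the last 60 seconds by remembering their timestamps
// and evicting anything older than the window on every read/write.
class RollingWindowCounter {
    private static final long WINDOW_NANOS = 60L * 1_000_000_000L;
    private final Deque<Long> eventTimestamps = new ArrayDeque<>();

    synchronized void increment() {
        long now = System.nanoTime();
        evictExpired(now);
        eventTimestamps.addLast(now);
    }

    synchronized int count() {
        evictExpired(System.nanoTime());
        return eventTimestamps.size();
    }

    private void evictExpired(long now) {
        while (!eventTimestamps.isEmpty() && now - eventTimestamps.peekFirst() > WINDOW_NANOS) {
            eventTimestamps.removeFirst();
        }
    }
}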

Additional changes

  • Shard level tracking of resource consumption — Resource consumption is currently measured per task, and we do not build a shard-specific view of how much heap memory is occupied by each shard. Similar to Indexing Backpressure ([Meta] Shard level Indexing Back-Pressure  #478), we could cancel requests based on the pressure on a shard (a small aggregation sketch follows).
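
A minimal sketch of aggregating per-task heap usage into a per-shard view, assuming a plain string shard ID and externally supplied per-task deltas:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Maintains the total heap bytes currently attributed to search tasks of each shard,
// so shard-level back-pressure decisions can be made similar to indexing back-pressure.
class ShardSearchMemoryTracker {
    private final ConcurrentMap<String, LongAdder> heapBytesPerShard = new ConcurrentHashMap<>();

    void onTaskResourceUpdate(String shardId, long heapBytesDelta) {
        heapBytesPerShard.computeIfAbsent(shardId, id -> new LongAdder()).add(heapBytesDelta);
    }

    void onTaskCompleted(String shardId, long heapBytesTracked) {
        onTaskResourceUpdate(shardId, -heapBytesTracked);       // release the task's contribution
    }

    long currentHeapBytes(String shardId) {
        LongAdder adder = heapBytesPerShard.get(shardId);
        return adder == null ? 0L : adder.sum();
    }
}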

Please share your thoughts on the above.

@Bukhtawar Bukhtawar added the discuss Issues intended to help drive brainstorming and decision making label Aug 10, 2022
@Bukhtawar
Collaborator

cc-ing folks for comments @reta @dblock @andrross @nknize @sruti1312 @getsaurabh02

@nssuresh2007

nssuresh2007 commented Aug 10, 2022

More thoughts on the above open item:

Approach for Resource Consumption Tracking

We need to track the resource consumption of the currently executing tasks, which can be done in two ways — track at different milestones within the same task, or use a separate observer thread for monitoring.
  • Use the same task thread to monitor at different milestones — Within each task thread, at different milestones of query phase execution, we periodically measure the resource consumption of the task and check whether it has exceeded the threshold. If yes, the task will throw a rejected execution exception.
  • Use a separate observer thread — At a fixed frequency (every second), a separate observer thread tracks the resource consumption of each task and cancels the task if it breaches the threshold.
| | Same task thread, milestone-based monitoring | Separate observer thread |
|---|---|---|
| Performance overhead | Comparatively higher, because each search task does additional work to track itself at different milestones. | Lower, because a separate thread does the tracking and the search task thread does not do much additional work. |
| Accuracy | More accurate, since the query phase cancellable listener gets executed multiple times within a second. But if a search request is stuck in a phase, we would not be able to track it until it reaches the next checkpoint. | Less accurate, since tracking of each task depends on the frequency of execution. Also, a larger number of tasks means that tracking of each task becomes less frequent. |
| Code maintainability | Creates slightly more maintenance overhead because the query phase logic is interleaved with monitoring logic; any refactoring needs to take care of both. | Observer thread logic is totally independent of search, hence does not add maintenance overhead. |

Preferred approach - A separate observer thread is preferred due to its low performance overhead, simplicity and code maintainability. Although tracking via the cancellable listener would track more closely (multiple invocations within a second), we do not gain much by tracking at sub-second frequency. Also, if a task gets stuck, it cannot be identified until it reaches the next checkpoint.

Implementation Details

A separate observer thread will run the following logic every 1 second.
  • Consider the node to be in duress if any of the thresholds below is met:
    • JVM MP on the node (> 70%) for 3 minutes
    • CPU utilization on the node (> 80%) for 3 minutes
  • The following values are continuously tracked:
    • resourceConsumptionCompletedTasksMovingAvg — Moving average of the resource consumption of tasks once they complete, over a window of the last 100 tasks. This value is maintained and exposed by TaskResourceTrackingService.
    • resourceConsumptionCurrentTasksAvg — Average resource consumption of the currently executing query phase tasks at that point in time.
  • If the node is in duress:
    • Check whether any of the currently executing tasks has exceeded the configurable ShardSearchTaskMemoryThreshold or ShardSearchCPUCyclesThreshold limits. If yes, cancel the task.
    • Check whether at least 5% of total heap is occupied by currently executing search tasks, or at least 10% of CPU is contributed by search tasks. If not, return. [This implies that search requests are not contributing to the resource pressure.]
    • Get the top 10 tasks by resource consumption and execute the following steps:
      • Take the max of the two averages above and compare it with each of the top 10 tasks. If the variance is higher than the threshold (1.5x if the max is completedTasksAvg, or 2x if it is currentTasksAvg), cancel the task.
Note: If most of the tasks have just started executing, resourceConsumptionCurrentTasksAvg would be much lower and we might mistake a task that is nearing completion for a rogue query. Hence we take into account both the completed-tasks average and the currently-executing-tasks average. A minimal sketch of this decision loop is shown below.
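
The sketch is illustrative only; the TaskView type, the helper methods and the constants are simplified assumptions, not the actual implementation:

import java.util.Comparator;
import java.util.List;

// Sketch only: all abstract helpers below are assumed to be provided by node stats,
// the task resource tracking framework and the task manager.
abstract class SearchBackpressureObserver implements Runnable {

    record TaskView(long taskId, long cpuNanos, long heapBytes) {}

    static final double JVM_MP_THRESHOLD = 0.70, CPU_THRESHOLD = 0.80;
    static final double COMPLETED_TASKS_VARIANCE = 1.5, RUNNING_TASKS_VARIANCE = 2.0;
    static final double SEARCH_HEAP_SHARE = 0.05, SEARCH_CPU_SHARE = 0.10;

    @Override
    public void run() {                                          // scheduled every 1 second
        if (nodeJvmMemoryPercent() < JVM_MP_THRESHOLD && nodeCpuPercent() < CPU_THRESHOLD) {
            return;                                              // node is not in duress
        }
        List<TaskView> running = runningSearchShardTasks();

        // Per-task hard limits (ShardSearchTaskMemoryThreshold / ShardSearchCPUCyclesThreshold)
        for (TaskView task : running) {
            if (task.heapBytes() > taskMemoryLimitBytes() || task.cpuNanos() > taskCpuLimitNanos()) {
                cancel(task, "per-task resource limit breached while node is in duress");
            }
        }

        // Act further only if search tasks contribute meaningfully to the pressure
        long searchHeapBytes = running.stream().mapToLong(TaskView::heapBytes).sum();
        if (searchHeapBytes < SEARCH_HEAP_SHARE * totalHeapBytes() && searchCpuShare() < SEARCH_CPU_SHARE) {
            return;
        }

        double completedAvg = completedTasksMovingAvgHeapBytes();   // window of the last 100 completed tasks
        double currentAvg = running.stream().mapToLong(TaskView::heapBytes).average().orElse(0);
        double reference = Math.max(completedAvg, currentAvg);
        double variance = (reference == completedAvg) ? COMPLETED_TASKS_VARIANCE : RUNNING_TASKS_VARIANCE;

        running.stream()
            .sorted(Comparator.comparingLong(TaskView::heapBytes).reversed())
            .limit(10)                                           // top 10 tasks by heap consumption
            .filter(task -> task.heapBytes() > reference * variance)
            .forEach(task -> cancel(task, "resource consumption far above average while node is in duress"));
    }

    abstract double nodeJvmMemoryPercent();
    abstract double nodeCpuPercent();
    abstract double searchCpuShare();
    abstract long totalHeapBytes();
    abstract long taskMemoryLimitBytes();
    abstract long taskCpuLimitNanos();
    abstract double completedTasksMovingAvgHeapBytes();
    abstract List<TaskView> runningSearchShardTasks();
    abstract void cancel(TaskView task, String reason);
}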

Configurable Settings

All the settings below would be dynamically configurable (a registration sketch using the OpenSearch settings framework follows the list).
  • JVM MP Threshold (Default 70%)
  • CPU Utilization Threshold (Default 80%)
  • CompletedTasksVarianceThreshold (Default 1.5)
  • RunningTasksVarianceThreshold (Default 2.0)
  • ShardSearchTaskMemoryThreshold (Default 1% of total heap)
  • ShardSearchCPUCyclesThreshold (Default 10 secs)
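
As an illustration, a sketch of how such dynamic settings might be registered with the OpenSearch Setting API; the setting keys, defaults and grouping shown here are placeholders, not the final names, and the settings would still need to be registered with the node's ClusterSettings:

import org.opensearch.common.settings.ClusterSettings;
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;
import org.opensearch.common.settings.Settings;

// Sketch only: hypothetical setting keys. Property.Dynamic makes them updatable at runtime.
class SearchBackpressureSettings {

    static final Setting<Double> NODE_DURESS_HEAP_THRESHOLD =
        Setting.doubleSetting("search_backpressure.node_duress.heap_threshold", 0.70, 0.0, Property.Dynamic, Property.NodeScope);

    static final Setting<Double> NODE_DURESS_CPU_THRESHOLD =
        Setting.doubleSetting("search_backpressure.node_duress.cpu_threshold", 0.80, 0.0, Property.Dynamic, Property.NodeScope);

    static final Setting<Double> COMPLETED_TASKS_VARIANCE_THRESHOLD =
        Setting.doubleSetting("search_backpressure.completed_tasks_variance", 1.5, 1.0, Property.Dynamic, Property.NodeScope);

    static final Setting<Double> RUNNING_TASKS_VARIANCE_THRESHOLD =
        Setting.doubleSetting("search_backpressure.running_tasks_variance", 2.0, 1.0, Property.Dynamic, Property.NodeScope);

    static final Setting<Long> SHARD_SEARCH_TASK_CPU_TIME_NANOS_THRESHOLD =
        Setting.longSetting("search_backpressure.search_shard_task.cpu_time_nanos", 10_000_000_000L, 0L, Property.Dynamic, Property.NodeScope);

    private volatile double nodeDuressHeapThreshold;

    SearchBackpressureSettings(Settings settings, ClusterSettings clusterSettings) {
        this.nodeDuressHeapThreshold = NODE_DURESS_HEAP_THRESHOLD.get(settings);
        // React to dynamic updates made through the cluster settings API
        clusterSettings.addSettingsUpdateConsumer(NODE_DURESS_HEAP_THRESHOLD, value -> this.nodeDuressHeapThreshold = value);
    }

    double nodeDuressHeapThreshold() {
        return nodeDuressHeapThreshold;
    }
}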

PoC Test Results

Did sample runs on the nyc_taxis workload with both the approaches. Please find the comparison test results here.
Also find the comparison with base run for both approaches here: compare with same thread, compare with observer
  • We could see that using a separate observer thread has lower performance overhead.
  • When the workload had a mix of query types, the heavy queries such as aggregations had a few tasks cancelled, while the lightweight queries continued to execute and did not face any errors.
  • Due to GC activity, we observed large fluctuations in JVM MP. Should we consider subsequent data points to decide if the JVM is under pressure?

Few possible scenarios and how they are handled

  • The cluster receives a heterogeneous mix of queries, including some rogue queries that consume a lot of resources. The rogue queries are expected to get cancelled.
    • resourceConsumptionCompletedTasksMovingAvg tracks the average consumption of completed tasks. If the rogue queries spike beyond the variance threshold, they will be cancelled.
  • The cluster initially receives only light-weight search requests. After more than 100 requests, heavy-workload requests start flowing. We should not cancel all heavy-workload tasks.
    • resourceConsumptionCompletedTasksMovingAvg will be low due to the average from light-weight requests, but resourceConsumptionCurrentTasksAvg will be higher due to the heavy workload. We compare against the max of these two, i.e. resourceConsumptionCurrentTasksAvg, to measure the variance of each task. Hence a heavy-workload request would not get cancelled unless it consumes 2x more than the average among the heavy-workload requests.
  • The cluster receives an almost homogeneous search workload, but some tasks are just starting while others are nearing completion. Naturally, the tasks nearing completion have higher resource consumption, but they should not be mistaken for resource-intensive tasks and cancelled.
    • resourceConsumptionCurrentTasksAvg would be lower, but resourceConsumptionCompletedTasksMovingAvg would be higher since it tracks resource consumption at completion. Since we compare each task's variance against resourceConsumptionCompletedTasksMovingAvg, even the tasks nearing completion would not get cancelled.
  • The cluster initially receives all heavy requests. After some time, it starts receiving only light requests.
    • The completed-tasks moving average would be higher, while the current-tasks average is lower. Since we compare against the max of these two values, the completed-tasks average is used, and no light-weight tasks would be cancelled.
  • The cluster receives a good mix of light and heavy search queries. We should cancel only the highly resource-intensive queries, if any, and no other queries should be impacted.
    • The completed-tasks average would be higher due to the heavy search queries. Unless a task has a huge variance from the average, no task would get cancelled.

@Bukhtawar Bukhtawar added the v2.3.0 'Issues and PRs related to version v2.3.0' label Aug 12, 2022
@reta
Collaborator

reta commented Aug 15, 2022

Thanks @nssuresh2007

Preferred approach - Separate observer thread is preferred due to its low performance overhead, simplicity and code maintainability.

+1 to that, collecting metrics in separate thread makes a lot of sense.

A separate observer thread will be running which will execute the following logic every 1 second.

On a general note, I think the cancellation logic should be smarter than just making the decision based on collected metrics. As an example, there could be only one heavy query (scattered across many tasks) which keeps the cluster busy for a long time. Yes, it is probably not good overall, but if the cluster is able to fulfill it (w/o timeout), it is not as bad as it looks. Consulting the picture at large - what the cluster is doing - is probably necessary, otherwise such queries will always be cancelled, even when the cluster is doing nothing else, no matter which replica they are retried on.

Also, the cancellation should take the age of the search request into account; it is highly likely that serving recent requests is more valuable, since the client may have already given up waiting for the older ones. This is somewhat related to "some tasks might be starting just now and some may be nearing completion", because by and large it is not possible to tell whether a task is nearing completion (without heavy instrumentation of the steps being executed); the heuristic in this case is the age of the task / search request. If a timeout is configured, that is yet another mechanism which could impact the decision to cancel vs. reject new search requests: otherwise there could be long-running queries which have no chance to complete despite the user's desire to wait.

Now, the most important question to me is: what should the user do when the search request execution is cancelled with "Task is cancelled due to high resource consumption"? She will probably retry it (maybe even several times), is that what we want? Out of all the indicators, heap consumption is probably the most critical sign of coming trouble - should we recommend increasing the heap size? In general, we need to provide meaningful advice on what the user should do to succeed.

I think one of the missing features, which could enormously help in decision making, is search query cost estimation: taking the search query and the cluster configuration + topology, estimate the resource consumption (cost). This is obviously out of scope of this particular issue.

@dblock
Member

dblock commented Aug 16, 2022

There's a lot of great stuff in this proposal.

A complement to resource-consumption-based cancellation could be a quality-of-service-based evaluation. Meaning that instead of treating an increase in resource consumption as a red flag (rightfully noted as potentially not a problem), we would attempt to detect that the cluster is deteriorating in its ability to provide a certain quality of service, which would cause the limits of acceptable resource consumption to be lowered and more heavy requests to be cancelled.

This is because I want consistency in my service. If a cluster serves X requests successfully with Y quality (time, resources), I want to ensure that when I see X + 1 requests, that 1 additional request does not reduce the overall quality below Y for each of the X requests, and I would prefer to reject or cancel that 1 extra request before quality degrades.

So, since I think we all want to see a "Task was cancelled before it took the cluster down" error message, consider the alternative or complementary QoS measure alongside the above proposals of adjusting thresholds. Having moving averages as proposed above is possibly not the right metric; what we will want to look at are percentiles of operations succeeding within a certain threshold.

@nssuresh2007

nssuresh2007 commented Aug 22, 2022

Thanks a lot for the feedback and your time @reta and @dblock!

Few points I wanted to add @reta:

Consulting the picture at large - what cluster is doing - is probably necessary, otherwise such queries will be always cancelled, despite cluster doing nothing, no matter which replica they are retried on.

The decision to trigger task cancellation is considered only if the node is in duress and search tasks have contributed a significant portion of the load. Hence such queries would not get cancelled on a normally operating cluster.

Also, the cancellation should take the age of the search request into account, it is highly likely that serving recent requests is more valuable, since the client may already gave up waiting for the older ones.

Yes, it is a very valid point. We can use the elapsed time of each task to prioritize older requests for cancellation (a tiny ordering sketch follows below).
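
For illustration, a minimal sketch of ordering cancellation candidates by elapsed time so that the oldest requests are considered first; the candidate view with a start timestamp is a hypothetical type, not part of the actual change:

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical candidate view: oldest requests first, since their clients may have
// already given up waiting, making them the cheaper ones to cancel.
class AgeAwareCancellation {
    record Candidate(long taskId, long startTimeNanos, long heapBytes) {}

    static List<Candidate> orderOldestFirst(List<Candidate> candidates) {
        return candidates.stream()
            .sorted(Comparator.comparingLong(Candidate::startTimeNanos))   // earliest start time = oldest request
            .collect(Collectors.toList());
    }
}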

Now, the most important question to me is: what should the user do when the search request execution is cancelled with "Task is cancelled due to high resource consumption"? She will probably retry it (may be even several times), is that what we want?

We expect the client behavior to be similar to when an ESRejectedExecutionException is thrown by the cluster: it means the cluster is overloaded and we expect the client to retry with sufficient backoff. If partial results are allowed, we would return results only from the shards where tasks were not cancelled.

Should we recommend increasing the heap size?

Since it depends on the workload (indexing and search), a recommendation to increase the heap size might not always be applicable. Please let me know your thoughts.

Just a few follow-up thoughts on your comments @dblock:

A complement to a resource consumption cancellation could be a quality of service-based evaluation. So, since I think we all want to see a "Task was cancelled before it took the cluster down" error message, consider the alternative or complementary QoS measure to the above proposals of adjusting thresholds.

There are multiple parts that we want to build as a part of Search Back-pressure as mentioned here: #1042
(1) Recovering a node under duress - Once a node has gone into duress, we identify and cancel the most resource-guzzling tasks to prevent a node meltdown and let the node recover.
(2) Back-pressure for node-to-node resiliency - Track heap memory consumption at the shard level, build a coordinator-level view of consumption for each search request, and decide whether to accept or reject the request.
(3) Manage resource utilization effectively across the cluster - Estimate the query cost for each incoming query and check whether sufficient resources are available before admitting the request.

We are targeting only recovery of a node in duress with task (1) above, by cancelling resource-guzzling tasks. Unlike indexing operations, the resource utilization of a search request is hard to estimate, since it depends on multiple factors such as query type, its constructs (contexts and clauses), aggregation type, number of buckets, cardinality/fan-out (number of shards to search) and the number of documents to be returned in the response.
Hence, while working on (3), we will build query cost estimation and check whether QoS has deteriorated, and take decisions accordingly.
This issue targets only task (1) above, where we reactively try to recover a node after it has gone into distress.
Please let me know if that makes sense.

Having moving averages as proposed above is possibly not the right metric.

We are trying to identify the most resource-guzzling task among the currently executing ones by checking the variance of each task's resource consumption from the average.
To ensure that a task nearing completion is not mistaken for a rogue task when compared to tasks that have just started executing, we use the moving average of resource consumption of completed tasks as a reference minimum.

Please let me know your thoughts.

@Bukhtawar
Collaborator

Agree with @dblock on the QoS, which is also the eventual goal once we are able to add more granular instrumentation of the latency breakdown across various layers and N/W interactions. This is also pretty challenging, since QoS could degrade not just because of an overload but also due to an I/O slowdown. Some of the work to track gray failures is being taken up as part of #4244. Once we have that level of detail, we could maybe take the right action more deterministically.

I think the current proposal lays down steps to prevent a cluster from getting into an unmanageable state by applying load-shedding mechanisms, allowing the cluster to recover.

@dreamer-89
Member

@tushar-kharbanda72: Today, Sep 07, is the code freeze date for OpenSearch. If this is not planned for 2.3, can you please update the label accordingly?

@Bukhtawar Bukhtawar added v2.4.0 'Issues and PRs related to version v2.4.0' and removed v2.3.0 'Issues and PRs related to version v2.3.0' labels Sep 8, 2022
@nssuresh2007

@reta @dblock @Bukhtawar @sachinpkale
Would like to have your inputs on the metrics that would be added to the stats API as part of this change. The purpose of these metrics is to give the user some idea of why requests are being rejected (or not) and what the current task resource consumption looks like.

The following additional metadata would be added to the stats API:

"search_backpressure": {
    "stats": {
        "node_stats": {
            "<node_id_1>": {
               "current_max_search_task_heap_memory_consumed": 0,
                "current_avg_search_task_heap_memory_consumed": 0,
                "current_max_search_task_cpu_time_consumed": 0,
               "current_avg_search_task_cpu_time_consumed": 0,
               "current_total_search_task_heap_memory_consumed": 0
            },
           "<node_id_2>": {
               "current_max_search_task_heap_memory_consumed": 0,
               "current_avg_search_task_heap_memory_consumed": 0,
               "current_max_search_task_cpu_time_consumed": 0,
               "current_avg_search_task_cpu_time_consumed": 0,
               "current_total_search_task_heap_memory_consumed": 0
            }
        },
        "cancellation_stats": {
            "<node_id_1>": {
                "search_task_cancellation_count": 0,
                "last_cancelled_task_memory_consumed": 0,
                "last_cancelled_task_cpu_consumed": 0
            },
           "<node_id_2>": {
                "search_task_cancellation_count": 0,
                "last_cancelled_task_memory_consumed": 0,
                "last_cancelled_task_cpu_consumed": 0
           }
        }
    },
    "limits": {
        "search_task_memory_limit_bytes": 0,
        "search_task_cpu_time_limit": 0
    }
    "enabled": true,
    "enforced": false
}

cancellation_stats - Stats from the time OpenSearch process came up.

  • search_task_cancellation_count — Count of task cancellations done on that node since the process started.
  • last_cancelled_task_memory_consumed — Heap memory consumption in bytes for the most recent cancelled task.
  • last_cancelled_task_cpu_consumed — CPU cycles spent in nano secs for the most recent cancelled task.

node_stats - Current stats at that point of time.

  • current_max_search_task_heap_memory_consumed — Maximum of heap consumption in bytes per task of currently running tasks on that node.
  • current_avg_search_task_heap_memory_consumed — Average of heap consumption in bytes per task of currently running tasks on that node.
  • current_avg_search_task_cpu_time_consumed — Average of CPU cycles in nano secs per task of currently running tasks on that node.
  • current_max_search_task_cpu_time_consumed — Maximum CPU cycles in nano secs spent by a task among the currently running tasks on that node.
  • current_total_search_task_heap_memory_consumed — Current total heap memory in bytes consumed by all running tasks.

enabled — Whether search backpressure is enabled or not.
enforced — If set to true, tasks would be cancelled based on the criteria. If false, only logs and metrics are emitted, which is useful for shadow mode.

Kindly let me know your comments on the above.

@dblock
Member

dblock commented Sep 13, 2022

I think the shape of this response doesn't align with other APIs, but I could be wrong

  • is search_task in current_max_search_task_heap_memory_consumed redundant, looks like other APIs would call this current_max_heap_memory_consumed
  • in the stats doc I see _in_metric everywhere, so should it be current_max_heap_memory_consumed_in_bytes?
  • existing APIs seem to have some logical grouping (e.g. "limit" vs. "current")

@nssuresh2007

nssuresh2007 commented Sep 15, 2022

@dblock, thanks a lot for your feedback.

I have updated the structure as follows to address your comments, and have also revamped it to make it easily extensible for future needs (i.e., currently we only emit stats for the shard search task; in future we may also add stats for coordinator tasks).

"search_backpressure": {
    "current_stats": {
        "search_shard_task": {
            "heap_memory_consumed_bytes": {
                "current_avg": 0,
                "current_max": 0,
                "rolling_avg": 0
            },
            "cpu_time_consumed_nanos": {
                "current_max": 0,
                "current_avg": 0
            },
            "elapsed_time_nanos": {
                "current_max": 0,
                "current_avg": 0
            }
        }
    },
    "cancellation_stats": {
        "search_shard_task": {
            "cancellation_count": 0,
            "cancellation_breakup": {
                "heap_memory_limits": 0,
                "cpu_cycle_limits": 0,
                "elapsed_time_limits": 0
            },
            "cancellation_limit_reached_count": 0,
            "last_cancelled_task": {
                "memory_consumed_bytes": 0,
                "cpu_consumed_nanos": 0,
                "elapsed_time_nanos": 0
            }
        }
    },
    "enabled": true,
    "enforced": true
}
  • current_stats - Section containing the stats at that point in time.
  • cancellation_stats - Section containing stats on the cancellation of tasks since the process started.
  • cancellation_count — Count of task cancellations done on that node since the process started.
  • last_cancelled_task — Section containing the resource consumption stats of the most recently cancelled task.
  • search_shard_task — Section containing the stats specific to the search shard task.
  • heap_memory_consumed_bytes — Section containing stats on heap memory consumed by the task, in bytes.
  • cpu_time_consumed_nanos — Section containing stats on CPU cycles consumed by the task, in nanoseconds.
  • elapsed_time_nanos — Section containing stats on elapsed time of the task, in nanoseconds.
  • current_max - Maximum value of resource consumption among currently executing tasks. (Applicable to heap_memory_consumed, cpu_time_consumed, elapsed_time)
  • current_avg - Average value of resource consumption among currently executing tasks. (Applicable to heap_memory_consumed, cpu_time_consumed, elapsed_time)
  • rolling_avg - Average task heap memory consumption of the most recent tasks within the rolling window. (Applicable to the heap_memory_consumed section alone)
  • enabled — Whether search backpressure is enabled or not.
  • enforced — If set to true, tasks would be cancelled based on the criteria. If false, only logs and metrics are emitted, which is useful for shadow mode.
Metrics behavior in Shadow vs Enforced mode

In enforced mode, all the stats present in the response are populated.
In shadow mode, the stats under the cancellation_stats section are not populated and remain zero, since actual task cancellation does not happen. Values under current_stats are still relevant and correctly populated in shadow mode as well. (A small gating sketch follows.)
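
For illustration, a small sketch of how the enforced flag might gate the action; the stats holders, logger and cancellation call are placeholders, not the actual implementation:

// Sketch only: the stats trackers, logger and cancellation call are placeholders.
void onTaskBreachingLimits(SearchShardTask task, String reason) {
    currentStatsTracker.record(task);                       // current_stats: populated in both modes
    if (enforced) {
        cancellationStatsTracker.recordCancellation(task);  // cancellation_stats: populated only when enforcing
        cancelTask(task, reason);                           // actual cancellation
    } else {
        logger.info("[shadow mode] would have cancelled task [{}]: {}", task.getId(), reason);
    }
}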

Response to your comments:

is search_task in current_max_search_task_heap_memory_consumed redundant, looks like other APIs would call this current_max_heap_memory_consumed

Removed the redundant "search_task" text in every field.

in the stats doc I see _in_metric everywhere, so should it be current_max_heap_memory_consumed_in_bytes?

Updated.

existing APIs seem to have some logical grouping (e.g. "limit" vs. "current")

Removed the limits section, since these limits are already listed as part of the cluster settings API and are not duplicated here.

Kindly let me know your thoughts.

@dblock
Member

dblock commented Sep 16, 2022

I like this better!

@elfisher

elfisher commented Sep 27, 2022

I really like this proposal! I have a couple questions/comments:

  1. I think we should aim to have milestone 2 done before we mark this as production-ready. Milestone 1 looks like it might leave too many ongoing tasks across shards even if one search on a shard is cancelled, so it is difficult to reconcile with what the user would expect (partial results, query failure, or something else).
  2. Is the task cancellation manual via an API, automatic, or both? I think we would want both to be available. Automatic cancellation will prevent cluster outages, but as an administrator, one might want the ability to cancel a long-running query independent of cluster health, to avoid that query taking up resources they would rather have allocated to other queries.
  3. I agree with @dblock that we should be also looking at the QoS based approach alongside the one we have documented here.

@dblock
Member

dblock commented Sep 27, 2022

I think we're missing logging for cancelled tasks similar to slow query logging. Users will want to turn on something like "cancellable tasks logging" to enable dumping the query body in the logs upon cancellation to debug whether the cancellation is typical for one problematic query or index.

@kgcreative
Member

in search_shard_task -- I'm not seeing a shard ID being returned. I imagine this is because you are querying a specific shard already, but including this information in the response might be helpful

@nssuresh2007

We are proposing the following update to the stats API structure in order to address these points:

  1. Include shard ID information in the cancellation breakup, which would be useful. (Addresses the comment from @kgcreative.)
  2. search_shard_task is made the top-level node, within which all metadata related to the search shard task is added. With this, we are no longer tied to the same structure for future milestones, which makes it more flexible.
  3. current_stats is renamed to resource_tracker_stats since this section now contains the cancellation_count as well and hence no longer contains only "current" stats.

Please let me know if you have any comments on the below stats API structure:

"search_backpressure": {
    "search_shard_task": {
        "resource_tracker_stats": {
            "heap_memory_consumed_bytes": {
                "current_avg": 0,
                "current_max": 0,
                "rolling_avg": 0,
                "cancellation_count": 0
            },
            "cpu_time_consumed_nanos": {
                "current_max": 0,
                "current_avg": 0,
                "cancellation_count": 0
            },
            "elapsed_time_nanos": {
                "current_max": 0,
                "current_avg": 0,
                "cancellation_count": 0
            }
        },
        "cancellation_stats": {
            "cancellation_count": 0,
            "shard_cancellation_count": {
                "<shard_id_1>": 0,
                "<shard_id_2>": 0
            },
            "cancellation_limit_reached_count": 0,
            "last_cancelled_task": {
                "memory_consumed_bytes": 0,
                "cpu_consumed_nanos": 0,
                "elapsed_time_nanos": 0
            }
        }
    },
    "enabled": true,
    "enforced": true
}

@elfisher

@tushar-kharbanda72 I've added this to the roadmap per the 2.4 label and wanted to confirm this is on track for 2.4. Thanks!

@nssuresh2007

@elfisher Yes, we are on track for 2.4 release for Milestone 1 (Tracking and cancellation of search shard tasks alone. Does not include co-ordinator tasks).

@rohin

rohin commented Oct 26, 2022

In general I agree with the approach above and the direction it is taking. A point to consider: cancelling asynchronous or long-running queries without consideration of priority, or without the ability to constrain/sandbox a query, might be an over-simplification of the issue. I might want to run a query that is going to be long-running and scan a large data set; it might be slow and take time, but the results are important. Cancellation should be employed while identifying a rogue query/task, and I understand that is the primary intention here. But if we want to automate this, we need to consider the priority and kind of query as well, so QoS and query cost, while a high bar, might be required. I think the uber goal would be the ability to prioritize and control resource consumption at execution time. That might not directly fit into this issue, but it should be something to consider as we look at query execution overall.

@anasalkouz
Member

@tushar-kharbanda72 do you still track this for the 2.4 release? Code freeze is on 11/3.
Is there anything pending? Otherwise, feel free to close it.

@anasalkouz
Member

anasalkouz commented Nov 3, 2022

@nssuresh2007 are you on track for 2.4 release? Today is the code freeze.

@nssuresh2007

@anasalkouz Yes, as per the plan, code changes for Milestone 1 of this issue are merged to 2.4.

@nssuresh2007

Performance comparison after the Search Backpressure changes

Summary: We did not see any degradation in performance due to Search BP.

Domain Setup

Data nodes: r5d.large, 2 nodes
Master node: r5d.large
nyc_taxis workload
5 primary shards, 1 replica shard

Benchmark command used:

opensearch-benchmark execute_test --workload nyc_taxis --include-tasks="default, range, distance_amount_agg, autohisto_agg, date_histogram_agg" --pipeline=benchmark-only --target-hosts=172.31.24.13:9200,172.31.19.56:9200 --workload-params "search_clients:10"

Baseline: Without Search BP changes
Contender: With Search BP enabled and enforced

Detailed Results

[ec2-user@ip-172-31-31-148 ~]$ opensearch-benchmark compare --baseline 127a9078-dad7-4f87-8c51-5c5e89fae478 --contender ef93e21a-8263-44a3-a886-84d912bb34f7

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/


Comparing baseline
  TestExecution ID: 127a9078-dad7-4f87-8c51-5c5e89fae478
  TestExecution timestamp: 2022-11-17 09:16:43
  TestProcedure: append-no-conflicts
  ProvisionConfigInstance: external

with contender
  TestExecution ID: ef93e21a-8263-44a3-a886-84d912bb34f7
  TestExecution timestamp: 2022-11-17 10:11:13
  TestProcedure: append-no-conflicts
  ProvisionConfigInstance: external

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                        Metric |                Task |    Baseline |   Contender |     Diff |   Unit |
|--------------------------------------------------------------:|--------------------:|------------:|------------:|---------:|-------:|
|                    Cumulative indexing time of primary shards |                     |           0 |           0 |        0 |    min |
|             Min cumulative indexing time across primary shard |                     |           0 |           0 |        0 |    min |
|          Median cumulative indexing time across primary shard |                     |           0 |           0 |        0 |    min |
|             Max cumulative indexing time across primary shard |                     |           0 |           0 |        0 |    min |
|           Cumulative indexing throttle time of primary shards |                     |           0 |           0 |        0 |    min |
|    Min cumulative indexing throttle time across primary shard |                     |           0 |           0 |        0 |    min |
| Median cumulative indexing throttle time across primary shard |                     |           0 |           0 |        0 |    min |
|    Max cumulative indexing throttle time across primary shard |                     |           0 |           0 |        0 |    min |
|                       Cumulative merge time of primary shards |                     |           0 |           0 |        0 |    min |
|                      Cumulative merge count of primary shards |                     |           0 |           0 |        0 |        |
|                Min cumulative merge time across primary shard |                     |           0 |           0 |        0 |    min |
|             Median cumulative merge time across primary shard |                     |           0 |           0 |        0 |    min |
|                Max cumulative merge time across primary shard |                     |           0 |           0 |        0 |    min |
|              Cumulative merge throttle time of primary shards |                     |           0 |           0 |        0 |    min |
|       Min cumulative merge throttle time across primary shard |                     |           0 |           0 |        0 |    min |
|    Median cumulative merge throttle time across primary shard |                     |           0 |           0 |        0 |    min |
|       Max cumulative merge throttle time across primary shard |                     |           0 |           0 |        0 |    min |
|                     Cumulative refresh time of primary shards |                     |           0 |           0 |        0 |    min |
|                    Cumulative refresh count of primary shards |                     |          16 |          16 |        0 |        |
|              Min cumulative refresh time across primary shard |                     |           0 |           0 |        0 |    min |
|           Median cumulative refresh time across primary shard |                     |           0 |           0 |        0 |    min |
|              Max cumulative refresh time across primary shard |                     |           0 |           0 |        0 |    min |
|                       Cumulative flush time of primary shards |                     |           0 |           0 |        0 |    min |
|                      Cumulative flush count of primary shards |                     |           4 |           4 |        0 |        |
|                Min cumulative flush time across primary shard |                     |           0 |           0 |        0 |    min |
|             Median cumulative flush time across primary shard |                     |           0 |           0 |        0 |    min |
|                Max cumulative flush time across primary shard |                     |           0 |           0 |        0 |    min |
|                                       Total Young Gen GC time |                     |       3.157 |       1.695 |   -1.462 |      s |
|                                      Total Young Gen GC count |                     |         714 |         177 |     -537 |        |
|                                         Total Old Gen GC time |                     |           0 |           0 |        0 |      s |
|                                        Total Old Gen GC count |                     |           0 |           0 |        0 |        |
|                                                    Store size |                     |     45.1477 |     45.1477 |        0 |     GB |
|                                                 Translog size |                     | 4.09782e-07 | 4.09782e-07 |        0 |     GB |
|                                        Heap used for segments |                     |           0 |           0 |        0 |     MB |
|                                      Heap used for doc values |                     |           0 |           0 |        0 |     MB |
|                                           Heap used for terms |                     |           0 |           0 |        0 |     MB |
|                                           Heap used for norms |                     |           0 |           0 |        0 |     MB |
|                                          Heap used for points |                     |           0 |           0 |        0 |     MB |
|                                   Heap used for stored fields |                     |           0 |           0 |        0 |     MB |
|                                                 Segment count |                     |         100 |         100 |        0 |        |
|                                                Min Throughput |             default |     2.69059 |     2.67181 | -0.01878 |  ops/s |
|                                               Mean Throughput |             default |     2.86302 |     2.83447 | -0.02856 |  ops/s |
|                                             Median Throughput |             default |     2.88521 |     2.84319 | -0.04202 |  ops/s |
|                                                Max Throughput |             default |     2.93564 |     2.91102 | -0.02462 |  ops/s |
|                                       50th percentile latency |             default |     5876.45 |        5252 | -624.449 |     ms |
|                                       90th percentile latency |             default |     10509.5 |     8915.29 | -1594.18 |     ms |
|                                       99th percentile latency |             default |     16759.6 |     11515.6 | -5243.97 |     ms |
|                                     99.9th percentile latency |             default |     19499.4 |     12009.4 | -7489.99 |     ms |
|                                      100th percentile latency |             default |     19502.8 |     13962.1 | -5540.67 |     ms |
|                                  50th percentile service time |             default |     1088.66 |     967.335 | -121.329 |     ms |
|                                  90th percentile service time |             default |     7847.06 |     6894.62 | -952.439 |     ms |
|                                  99th percentile service time |             default |       11214 |     11282.5 |  68.4265 |     ms |
|                                99.9th percentile service time |             default |     12032.7 |     11944.4 |  -88.277 |     ms |
|                                 100th percentile service time |             default |     12305.8 |     11966.6 |  -339.18 |     ms |
|                                                    error rate |             default |           0 |           0 |        0 |      % |
|                                                Min Throughput |               range |    0.694275 |    0.688415 | -0.00586 |  ops/s |
|                                               Mean Throughput |               range |    0.696822 |     0.69355 | -0.00327 |  ops/s |
|                                             Median Throughput |               range |    0.697106 |    0.694122 | -0.00298 |  ops/s |
|                                                Max Throughput |               range |    0.698065 |    0.696061 |   -0.002 |  ops/s |
|                                       50th percentile latency |               range |      4482.7 |     3537.39 | -945.315 |     ms |
|                                       90th percentile latency |               range |        7434 |     6644.22 |  -789.78 |     ms |
|                                       99th percentile latency |               range |     11593.8 |     10287.5 | -1306.29 |     ms |
|                                     99.9th percentile latency |               range |     12456.5 |     11315.5 | -1140.94 |     ms |
|                                      100th percentile latency |               range |     12460.8 |     11315.7 | -1145.16 |     ms |
|                                  50th percentile service time |               range |     4478.09 |      3534.5 | -943.581 |     ms |
|                                  90th percentile service time |               range |     7429.25 |     6637.51 | -791.743 |     ms |
|                                  99th percentile service time |               range |     11590.1 |     10284.6 | -1305.59 |     ms |
|                                99.9th percentile service time |               range |     12456.2 |       11308 | -1148.19 |     ms |
|                                 100th percentile service time |               range |       12459 |     11314.3 | -1144.64 |     ms |
|                                                    error rate |               range |           0 |           0 |        0 |      % |
|                                                Min Throughput | distance_amount_agg |     1.94796 |      1.9031 | -0.04486 |  ops/s |
|                                               Mean Throughput | distance_amount_agg |     1.98886 |     1.95379 | -0.03507 |  ops/s |
|                                             Median Throughput | distance_amount_agg |     1.99014 |     1.95777 | -0.03237 |  ops/s |
|                                                Max Throughput | distance_amount_agg |     1.99999 |     1.97455 | -0.02545 |  ops/s |
|                                       50th percentile latency | distance_amount_agg |     4270.56 |     3670.89 | -599.661 |     ms |
|                                       90th percentile latency | distance_amount_agg |     7183.21 |     6759.99 | -423.225 |     ms |
|                                       99th percentile latency | distance_amount_agg |      9472.4 |      8844.1 | -628.297 |     ms |
|                                     99.9th percentile latency | distance_amount_agg |     10226.2 |     9790.65 | -435.596 |     ms |
|                                      100th percentile latency | distance_amount_agg |     10285.7 |     9790.78 | -494.954 |     ms |
|                                  50th percentile service time | distance_amount_agg |     3631.92 |     3152.59 | -479.324 |     ms |
|                                  90th percentile service time | distance_amount_agg |     7078.86 |     6639.41 | -439.452 |     ms |
|                                  99th percentile service time | distance_amount_agg |     9393.15 |     8843.08 | -550.071 |     ms |
|                                99.9th percentile service time | distance_amount_agg |      9896.1 |     9789.52 | -106.583 |     ms |
|                                 100th percentile service time | distance_amount_agg |     9896.23 |     9790.48 | -105.753 |     ms |
|                                                    error rate | distance_amount_agg |           0 |           0 |        0 |      % |
|                                                Min Throughput |       autohisto_agg |     1.46191 |     1.45333 | -0.00858 |  ops/s |
|                                               Mean Throughput |       autohisto_agg |     1.48084 |     1.47418 | -0.00666 |  ops/s |
|                                             Median Throughput |       autohisto_agg |     1.48235 |     1.47627 | -0.00607 |  ops/s |
|                                                Max Throughput |       autohisto_agg |     1.49368 |     1.48551 | -0.00817 |  ops/s |
|                                       50th percentile latency |       autohisto_agg |     5598.59 |     5020.03 | -578.563 |     ms |
|                                       90th percentile latency |       autohisto_agg |      7406.7 |     6705.89 | -700.807 |     ms |
|                                       99th percentile latency |       autohisto_agg |     8952.92 |     7160.13 | -1792.78 |     ms |
|                                     99.9th percentile latency |       autohisto_agg |     11580.6 |       11251 | -329.612 |     ms |
|                                      100th percentile latency |       autohisto_agg |     12109.1 |     11375.5 | -733.623 |     ms |
|                                  50th percentile service time |       autohisto_agg |     5148.51 |     4952.59 | -195.914 |     ms |
|                                  90th percentile service time |       autohisto_agg |     7391.63 |     6704.59 | -687.043 |     ms |
|                                  99th percentile service time |       autohisto_agg |     8734.02 |     7158.87 | -1575.15 |     ms |
|                                99.9th percentile service time |       autohisto_agg |     11579.5 |     11249.2 | -330.245 |     ms |
|                                 100th percentile service time |       autohisto_agg |     12106.3 |     11374.1 | -732.222 |     ms |
|                                                    error rate |       autohisto_agg |           0 |           0 |        0 |      % |
|                                                Min Throughput |  date_histogram_agg |     1.47286 |     1.45333 | -0.01953 |  ops/s |
|                                               Mean Throughput |  date_histogram_agg |     1.48684 |     1.47402 | -0.01282 |  ops/s |
|                                             Median Throughput |  date_histogram_agg |     1.48777 |     1.47616 | -0.01161 |  ops/s |
|                                                Max Throughput |  date_histogram_agg |     1.49736 |     1.48482 | -0.01254 |  ops/s |
|                                       50th percentile latency |  date_histogram_agg |     5802.18 |     4954.77 | -847.412 |     ms |
|                                       90th percentile latency |  date_histogram_agg |      7516.1 |     6699.56 |  -816.54 |     ms |
|                                       99th percentile latency |  date_histogram_agg |     8955.02 |     7149.38 | -1805.64 |     ms |
|                                     99.9th percentile latency |  date_histogram_agg |     11201.3 |     11243.9 |  42.5528 |     ms |
|                                      100th percentile latency |  date_histogram_agg |     11576.5 |     11375.5 | -201.003 |     ms |
|                                  50th percentile service time |  date_histogram_agg |     5369.37 |     4869.96 | -499.414 |     ms |
|                                  90th percentile service time |  date_histogram_agg |     7515.15 |     6699.08 | -816.063 |     ms |
|                                  99th percentile service time |  date_histogram_agg |      8908.5 |      7148.5 |    -1760 |     ms |
|                                99.9th percentile service time |  date_histogram_agg |     11101.4 |     11241.7 |  140.382 |     ms |
|                                 100th percentile service time |  date_histogram_agg |     11576.1 |     11374.4 | -201.663 |     ms |
|                                                    error rate |  date_histogram_agg |           0 |           0 |        0 |      % |

@nssuresh2007

How Search BP behaved with a rogue query:

We added the following rogue query to the nyc_taxis workload (date_histogram_agg_rogue below):

{
  "query": {
    "range": {
      "dropoff_datetime": {
        "from": "01/01/2015",
        "to": "15/01/2015",
        "include_lower": true,
        "include_upper": true,
        "format": "dd/MM/yyyy",
        "boost": 1.0
      }
    }
  },
  "aggregations": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "calendar_interval": "second",
        "offset": 0,
        "order": { "_key": "asc" },
        "keyed": false,
        "min_doc_count": 0
      }
    }
  }
}

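This query is deliberately expensive: a date_histogram with a calendar_interval of one second over a two-week range materializes on the order of 1.2 million buckets (14 days × 86,400 seconds), and min_doc_count: 0 keeps even the empty ones. Below is a minimal sketch (not part of the actual test harness) of submitting it with Python; the endpoint and index name are assumptions for a local test cluster.

import requests

# The rogue query shown above, written as a Python dict.
rogue_query = {
    "query": {"range": {"dropoff_datetime": {
        "from": "01/01/2015", "to": "15/01/2015",
        "include_lower": True, "include_upper": True,
        "format": "dd/MM/yyyy", "boost": 1.0,
    }}},
    "aggregations": {"dropoffs_over_time": {"date_histogram": {
        "field": "dropoff_datetime", "calendar_interval": "second",
        "offset": 0, "order": {"_key": "asc"}, "keyed": False,
        "min_doc_count": 0,
    }}},
}

# Assumed local endpoint; with Search BP enabled this request may be rejected
# with a 429 up front, or fail later if its shard tasks get cancelled.
resp = requests.post("http://localhost:9200/nyc_taxis/_search",
                     json=rogue_query, timeout=600)
print(resp.status_code, len(resp.content))
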
Summary

  • Without Search BP, the rogue query occupied a lot of heap and eventually brought down one of the nodes.
  • With Search BP enabled, it was able to cancel the tasks spawned by the rogue query and kept the instance from going down (a sketch for verifying this via node stats follows this list).
  • Since one of the nodes went down during the test (with Search BP disabled), it did not make sense to compare failure rates and latencies between the two runs.
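
To confirm that the cancellations actually came from Search Backpressure rather than from client-side timeouts, the per-node stats it exposes can be checked. This is a minimal sketch, assuming the _nodes/stats/search_backpressure endpoint introduced in 2.4; exact field names may differ slightly between versions.

import requests

resp = requests.get("http://localhost:9200/_nodes/stats/search_backpressure", timeout=10)
resp.raise_for_status()

for node_id, node in resp.json().get("nodes", {}).items():
    sbp = node.get("search_backpressure", {})
    shard_task = sbp.get("search_shard_task", {})
    # Assumption: cancellation counters live under cancellation_stats.
    cancelled = shard_task.get("cancellation_stats", {}).get("cancellation_count", 0)
    print(f"{node_id}: mode={sbp.get('mode')}, shard task cancellations={cancelled}")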

Test setup

Used Locust with the following configuration:

locustfile = ~/locust/locustfile.py
headless = true
users = 50
spawn-rate = 10
run-time = 1h

This test used the same queries as the opensearch-benchmark nyc_taxis workload.
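
The locustfile itself is not reproduced here; below is a minimal, hypothetical sketch of what it could look like (the host, query file paths and task set are assumptions, with the query bodies taken from the nyc_taxis workload).

import json
from locust import HttpUser, task

def load_query(name):
    # Hypothetical layout: one JSON file per benchmark query.
    with open(f"queries/{name}.json") as f:
        return json.load(f)

class NycTaxisSearchUser(HttpUser):
    host = "http://localhost:9200"  # assumed test cluster endpoint

    @task
    def date_histogram_agg_rogue(self):
        self.client.post("/nyc_taxis/_search",
                         json=load_query("date_histogram_agg_rogue"),
                         name="date_histogram_agg_rogue")

    @task
    def range_query(self):
        self.client.post("/nyc_taxis/_search",
                         json=load_query("range"),
                         name="range")

    # ...one @task per remaining query (default, distance_histo, auto_histo_agg, date_histogram_agg)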

With Search BP enabled

Type     Name                                                                   # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|---------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     auto_histo_agg                                                            908   98(10.79%) |  27443       1  124668  24000 |    0.25        0.03
POST     date_histogram_agg                                                        863    75(8.69%) |  39613       1  144642  39000 |    0.24        0.02
POST     date_histogram_agg_rogue                                                  180   37(20.56%) | 126966       1  334989 126000 |    0.05        0.01
POST     default                                                                   928    45(4.85%) |  34300       1  145420  31000 |    0.26        0.01
POST     distance_histo                                                            874  107(12.24%) |  26328       1  122862  23000 |    0.24        0.03
POST     range                                                                     942    37(3.93%) |  36249       1  133501  35000 |    0.26        0.01
--------|---------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                               4695   399(8.50%) |  36410       1  334989  30000 |    1.31        0.11
Type     Name                                                                           50%    66%    75%    80%    90%    95%    98%    99%  99.9%   100% # reqs
--------|-------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     auto_histo_agg                                                               24000  32000  38000  41000  54000  70000  87000 103000 125000 125000 125000    908
POST     date_histogram_agg                                                           39000  48000  55000  59000  76000  89000 106000 118000 145000 145000 145000    863
POST     date_histogram_agg_rogue                                                    129000 151000 162000 177000 211000 244000 268000 275000 335000 335000 335000    180
POST     default                                                                      31000  42000  48000  53000  71000  87000 102000 110000 145000 145000 145000    928
POST     distance_histo                                                               23000  31000  37000  40000  55000  70000  88000  96000 123000 123000 123000    874
POST     range                                                                        35000  45000  50000  56000  73000  86000 105000 113000 134000 134000 134000    942
--------|-------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                   30000  41000  48000  54000  74000  96000 131000 159000 257000 335000 335000   4695
Error report
#occurrences      Error                                                                                               
------------------|--------------------------------------------------------------------------------------------------------------------------------------
45                 POST default: HTTPError('429 Client Error: Too Many Requests for url: default')                     
33                 POST range: HTTPError('429 Client Error: Too Many Requests for url: range')                         
73                 POST date_histogram_agg: HTTPError('429 Client Error: Too Many Requests for url: date_histogram_agg')
98                 POST auto_histo_agg: HTTPError('429 Client Error: Too Many Requests for url: auto_histo_agg')       
103                POST distance_histo: HTTPError('429 Client Error: Too Many Requests for url: distance_histo')       
26                 POST date_histogram_agg_rogue: HTTPError('429 Client Error: Too Many Requests for url: date_histogram_agg_rogue')
4                  POST range: HTTPError('500 Server Error: Internal Server Error for url: range')                     
4                  POST distance_histo: HTTPError('500 Server Error: Internal Server Error for url: distance_histo')   
11                 POST date_histogram_agg_rogue: HTTPError('500 Server Error: Internal Server Error for url: date_histogram_agg_rogue')
2                  POST date_histogram_agg: HTTPError('500 Server Error: Internal Server Error for url: date_histogram_agg')
------------------|--------------------------------------------------------------------------------------------------------------------------------------

With Search BP disabled

Type     Name                                                                   # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|---------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     auto_histo_agg                                                           1550 1113(71.81%) |  17322       0  242124      2 |    0.43        0.31
POST     date_histogram_agg                                                       1550 1129(72.84%) |  22824       0  269577      2 |    0.43        0.31
POST     date_histogram_agg_rogue                                                  273  213(78.02%) |  44443       0  279034      2 |    0.08        0.06
POST     default                                                                  1522  908(59.66%) |  22112       0  240342      2 |    0.42        0.25
POST     distance_histo                                                           1588 1106(69.65%) |  15864       0  220096      2 |    0.44        0.31
POST     range                                                                    1560  930(59.62%) |  22026       0  241322      2 |    0.43        0.26
--------|---------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                               8043 5399(67.13%) |  20834       0  279034      2 |    2.23        1.5
Response time percentiles (approximated)
Type     Name                                                                           50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|-------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     auto_histo_agg                                                                   2  25000  35000  41000  49000  57000  73000  85000 240000 242000 242000   1550
POST     date_histogram_agg                                                               2  34000  47000  52000  66000  80000  94000 123000 243000 270000 270000   1550
POST     date_histogram_agg_rogue                                                         2  44000  68000 106000 141000 191000 269000 270000 279000 279000 279000    273
POST     default                                                                          2  36000  48000  52000  65000  79000  89000 110000 238000 240000 240000   1522
POST     distance_histo                                                                   2  21000  33000  37000  47000  55000  69000  78000 220000 220000 220000   1588
POST     range                                                                            2  37000  48000  52000  66000  80000  90000 101000 190000 241000 241000   1560
--------|-------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                       2 29000  42000  47000  60000  75000  94000 129000 243000 279000 279000   8043
Error report
#occurrences      Error                                                                                               
------------------|--------------------------------------------------------------------------------------------------------------------------------------
168                POST range: HTTPError('429 Client Error: Too Many Requests for url: range')                         
355                POST auto_histo_agg: HTTPError('429 Client Error: Too Many Requests for url: auto_histo_agg')       
367                POST date_histogram_agg: HTTPError('429 Client Error: Too Many Requests for url: date_histogram_agg')
333                POST distance_histo: HTTPError('429 Client Error: Too Many Requests for url: distance_histo')       
155                POST default: HTTPError('429 Client Error: Too Many Requests for url: default')                     
77                 POST date_histogram_agg_rogue: HTTPError('429 Client Error: Too Many Requests for url: date_histogram_agg_rogue')
7                  POST distance_histo: RemoteDisconnected('Remote end closed connection without response')            
7                  POST auto_histo_agg: RemoteDisconnected('Remote end closed connection without response')            
2                  POST default: ConnectionResetError(104, 'Connection reset by peer')                                 
1                  POST range: ConnectionResetError(104, 'Connection reset by peer')                                   
9                  POST date_histogram_agg: RemoteDisconnected('Remote end closed connection without response')        
3                  POST range: RemoteDisconnected('Remote end closed connection without response')                     
5                  POST default: RemoteDisconnected('Remote end closed connection without response')                   
6                  POST date_histogram_agg_rogue: RemoteDisconnected('Remote end closed connection without response')  
6                  POST auto_histo_agg: ConnectionResetError(104, 'Connection reset by peer')                          
3                  POST date_histogram_agg: ConnectionResetError(104, 'Connection reset by peer')                      
1                  POST distance_histo: ConnectionResetError(104, 'Connection reset by peer')                          
745                POST auto_histo_agg: ConnectionRefusedError(111, 'Connection refused')                              
758                POST range: ConnectionRefusedError(111, 'Connection refused')                                       
750                POST date_histogram_agg: ConnectionRefusedError(111, 'Connection refused')                          
765                POST distance_histo: ConnectionRefusedError(111, 'Connection refused')                              
746                POST default: ConnectionRefusedError(111, 'Connection refused')                                     
130                POST date_histogram_agg_rogue: ConnectionRefusedError(111, 'Connection refused')                    
------------------|--------------------------------------------------------------------------------------------------------------------------------------

Note: OpenSearch crashed on one of the instances after hitting the heap space limit.
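
In both runs a sizeable share of requests were shed with HTTP 429 (Too Many Requests); regardless of which protection rejected them, callers are expected to back off and retry rather than resend immediately. A minimal, hypothetical retry wrapper (backoff parameters are arbitrary):

import time
import requests

def search_with_backoff(url, body, max_retries=4, base_delay=1.0):
    """POST a search request, retrying with exponential backoff on HTTP 429."""
    resp = None
    for attempt in range(max_retries + 1):
        resp = requests.post(url, json=body, timeout=300)
        if resp.status_code != 429:
            break
        time.sleep(base_delay * (2 ** attempt))
    return resp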

@rramachand21 rramachand21 added v2.5.0 'Issues and PRs related to version v2.5.0' v2.6.0 'Issues and PRs related to version v2.6.0' and removed v2.4.0 'Issues and PRs related to version v2.4.0' v2.5.0 'Issues and PRs related to version v2.5.0' labels Jan 17, 2023
@kartg
Member

kartg commented Feb 21, 2023

@rramachand21 Is this still on track for v2.6.0? Code freeze is today (Feb 21, 2023)

Also, I'm assuming this is for Milestone 2 (#1329), since #1181 (comment) notes that Milestone 1 was released with v2.4.0.

@PritLadani
Contributor

@kartg yes, this is on track for v2.6.0, and yes, it's for milestone 2.

@kartg
Member

kartg commented Feb 23, 2023

Bumping tag to 2.7.0 since #6455 was merged in for 2.6

@DarshitChanpura
Member

Hi @tushar-kharbanda72, this issue will be marked for the next release, v2.8.0, on Apr 17, as that is the code-freeze date for v2.7.0. Please let me know if that's not the case.

@DarshitChanpura
Member

Tagging it for v2.8.0

@DarshitChanpura DarshitChanpura added v2.8.0 'Issues and PRs related to version v2.8.0' and removed v2.7.0 labels Apr 17, 2023