Test impact of using `refresh: false` for task manager internals #99444

dgieselaar · 2021-05-06T07:01:47Z

In #99160, we are considering setting `refresh` to `false` for Task Manager internal operations, like creating/updating/deleting tasks. By default, the Saved Objects client uses `wait_for`, which keeps the connection to Elasticsearch open until the affected shard is refreshed. With the default refresh interval, that can take up to 1 second. As a result, workers are not freed up as quickly as they could be, which can reduce the rate at which tasks are executed.
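To make the worker-throughput argument concrete, here is a rough back-of-the-envelope model. It is only a sketch: the function name, the input fields, and the assumption that a `wait_for` write waits about half the refresh interval on average are all illustrative and do not come from Kibana or from any measurement.

```typescript
// Hypothetical model; none of these names exist in Kibana.
type RefreshMode = 'wait_for' | false;

interface ThroughputInput {
  workers: number; // concurrent task slots
  taskWorkMs: number; // time spent on the task's actual work
  writesPerTask: number; // task document writes per run
  refreshIntervalMs: number; // Elasticsearch refresh interval (1000 by default)
  refresh: RefreshMode;
}

function estimateTasksPerMinute(input: ThroughputInput): number {
  // Assumption: with wait_for, a write returns only after the next refresh,
  // so on average it waits about half the interval. With false it does not wait.
  const waitPerWriteMs = input.refresh === 'wait_for' ? input.refreshIntervalMs / 2 : 0;
  const msPerTask = input.taskWorkMs + input.writesPerTask * waitPerWriteMs;
  return Math.floor((input.workers * 60000) / msPerTask);
}

const base = { workers: 10, taskWorkMs: 100, writesPerTask: 2, refreshIntervalMs: 1000 };
console.log(estimateTasksPerMinute({ ...base, refresh: 'wait_for' }));
console.log(estimateTasksPerMinute({ ...base, refresh: false }));
```

With these made-up inputs the model gives roughly 545 tasks per minute for `wait_for` and 6000 for `false`. The exact numbers mean little; the point is that the wait time is multiplied by every write a task performs, which is why an actual load test is worth running.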

We should test not just the functionality, but also the performance impact, by running a small (local) load test.

We should also do a larger-scale load test, but given the complexity, we will address that separately.

cc @pmuellr @gmmorris


elasticmachine · 2021-05-06T07:01:48Z

Pinging @elastic/apm-ui (Team:apm)

elasticmachine · 2021-05-06T07:02:07Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

gmmorris · 2021-05-06T10:34:56Z

Thanks Dario, added to our project for triage 👍

I would separate the additional instrumentation from this change. These are two distinct deliverables, and delivering them together will make it harder to tell which one is the source of an issue.
Both changes have potential downstream impact in terms of perf and lifecycle eccentricities, so I'd like to keep them apart if possible.

mikecote · 2021-05-12T15:48:39Z

For this issue, we should help finalize this PR #99919 and do some load testing before the end of the release cycle (#95194).

YulNaumenko · 2021-06-07T22:32:56Z

This issue seems to be a part of the current one or should be implemented as the next step.

YulNaumenko · 2021-06-11T18:34:25Z

Based on the discussion with Dario, here are some details on setting up a local environment:

1. Configure your local Kibana `kibana.dev.yml` with the following options:

       elastic.apm.transactionSampleRate: 1
       elastic.apm.breakdownMetrics: true
       elastic.apm.active: true
       elastic.apm.environment: yuliia
       elastic.apm.disableInstrumentations:
         - http
         - https

2. `yarn es snapshot --ssl`
3. `yarn start --ssl`
4. Create a Rule with actions. I'm using `es-apm-sys-sim -r 40 15 es-apm-sys-sim https://elastic:changeme@localhost:9200` for the data.
5. Make sure it runs successfully.
6. Open https://ela.st/kibana-ops-ci-stats
7. Select your local environment name from the list of environments.
8. Navigate to APM Transactions and select transaction type `taskManager run`.
9. Select some of the longest-running transactions to see what could be improved.

mikecote · 2021-07-14T11:39:18Z

This issue didn't make it part of the 7.15 planning. @YulNaumenko, how much effort is left before resolving this? We are planning to move this to the backlog for now.

YulNaumenko · 2021-07-19T18:16:09Z

> This issue didn't make it part of the 7.15 planning. @YulNaumenko, how much effort is left before resolving this? We are planning to move this to the backlog for now.

Most of the remaining effort is in the functional and performance testing. It looks like loe:week.

ymao1 · 2022-12-08T14:18:41Z

Closing as done

sorenlouv · 2022-12-09T01:06:08Z

@ymao1 Did this get implemented or what was the outcome of the perf test?

mikecote · 2022-12-12T12:19:06Z

> @ymao1 Did this get implemented or what was the outcome of the perf test?

@sqren I did some tests during my April ON-week and couldn't find places that were missing refresh: false. So we decided to close it during last week's grooming session. Everything should be 🔥 fast 😎 .

dgieselaar · 2022-12-12T12:22:50Z

@mikecote just to check: the default from the SO client is 'wait_for', no? Do you mean that the Alerting/Task Manager sets it to refresh: false where possible? It still would have been nice to see the impact :)

mikecote · 2022-12-12T12:49:06Z

> @mikecote just to check: the default from the SO client is 'wait_for', no? Do you mean that the Alerting/Task Manager sets it to refresh: false where possible?

That's correct, so we had to change a bunch of places to `refresh: false` explicitly where we didn't rely on/need the data to be searchable right away. We made those changes a long time ago (7.10 - 7.11 / Alerting GA) and have been holding onto this issue to see if there are places we've missed. Based on some testing earlier this year, it's all good! So unfortunately we don't have anything to compare against, as Task Manager has been running with these optimizations since the beginning of alerting.
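As a sketch of the pattern described here (opting out of `wait_for` per call unless the data must be searchable right away), the snippet below uses simplified stand-in types. `RecordingClient` and `updateTask` are hypothetical names for illustration and are not Kibana's actual interfaces.

```typescript
// Sketch only: simplified stand-ins, not Kibana's actual interfaces.
type RefreshSetting = boolean | 'wait_for';

interface WriteOptions {
  refresh?: RefreshSetting;
}

interface RecordedWrite {
  id: string;
  attributes: Record<string, unknown>;
  refresh: RefreshSetting;
}

// In-memory fake that records which refresh setting each write ends up using.
// Kept synchronous for brevity; a real client would return a Promise.
class RecordingClient {
  public readonly writes: RecordedWrite[] = [];

  update(id: string, attributes: Record<string, unknown>, options: WriteOptions = {}): void {
    // Mirrors the default described in this thread: wait_for unless overridden.
    // `??` (not `||`) is needed so an explicit `false` is preserved.
    this.writes.push({ id, attributes, refresh: options.refresh ?? 'wait_for' });
  }
}

// Hypothetical helper: skip the wait unless the caller needs to search for
// the document immediately after writing it.
function updateTask(
  client: RecordingClient,
  id: string,
  attributes: Record<string, unknown>,
  mustBeSearchableNow = false
): void {
  client.update(id, attributes, { refresh: mustBeSearchableNow ? 'wait_for' : false });
}
```

Routing writes through one helper like this would make a missed call site easier to spot than passing the option by hand everywhere, though whether that fits the real code base is an open question.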

> It still would have been nice to see the impact :)

I think it was bad enough when using refresh: wait_for that we couldn't run many tasks per minute nor GA anything 🙈

dgieselaar · 2022-12-12T12:54:12Z

The linked PR in this ticket is tagged as 7.14. How do you reconcile that with:

> So unfortunately we don't have anything to compare as Task Manager has been running with these optimizations since the beginning of alerting.

Do you define "beginning of alerting" as GA? Because I cannot see how it is true in any other case. When did alerting go GA?

mikecote · 2022-12-12T13:20:21Z

There might be a few `refresh: false` settings that are missing or that were added over time, but there isn't a specific release where we would see significant changes to alerting/task manager performance. Alerting went GA in 7.11.

Could be cool to compare 7.11 to now though if someone had spare cycles.

dgieselaar · 2022-12-12T13:31:39Z

@mikecote have you looked at APM traces? Or more generally, how have you verified your assumptions are correct? E.g., the PR I put up as a result of me looking into this has 4 changes - out of those 4, only one place actually seems to have added a refresh: false. Those changes were made based on looking at an actual trace and identifying bottlenecks. Maybe that one change is all we can do but this feels hand-wavy to me.
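One way to check a trace for remaining `wait_for` writes, sketched under assumptions: suppose the Elasticsearch spans of a transaction have been exported as plain records with a request URL and a duration. The `SpanRecord` shape and the `findWaitForSpans` function are hypothetical and do not correspond to an APM API.

```typescript
// Hypothetical record shape for an exported span; not an APM agent type.
interface SpanRecord {
  name: string;
  url: string; // request path including the query string
  durationMs: number;
}

// Returns the spans whose request asked for refresh=wait_for, slowest first.
function findWaitForSpans(spans: SpanRecord[]): SpanRecord[] {
  const waitFor = /[?&]refresh=wait_for(?:&|$)/;
  return spans
    .filter((span) => waitFor.test(span.url))
    .sort((a, b) => b.durationMs - a.durationMs);
}
```

Any span this returns is a candidate for review: either the write really needs to be searchable immediately, or it is a place where `refresh: false` was missed.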

dgieselaar added the Team:APM All issues that need APM UI Team support label May 6, 2021

dgieselaar assigned dgieselaar and pmuellr May 6, 2021

dgieselaar added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label May 6, 2021

dgieselaar changed the title ~~Test impact of removing wait_for for task manager internals~~ Test impact of using refresh: false for task manager internals May 6, 2021

mikecote unassigned pmuellr and dgieselaar May 12, 2021

sorenlouv added the [zube]: Backlog label May 27, 2021

YulNaumenko mentioned this issue Jun 10, 2021

[actions] every action execution showing evidence of refresh: wait_for #99101

Closed

YulNaumenko self-assigned this Jun 28, 2021

gmmorris added the Feature:Task Manager label Jul 2, 2021

gmmorris added the resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility label Jul 15, 2021

YulNaumenko removed their assignment Jul 19, 2021

gmmorris added the loe:needs-research This issue requires some research before it can be worked on or estimated label Aug 11, 2021

gmmorris added the estimate:needs-research Estimated as too large and requires research to break down into workable issues label Aug 18, 2021

gmmorris removed the loe:needs-research This issue requires some research before it can be worked on or estimated label Sep 2, 2021

gmmorris added the impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. label Oct 1, 2021

xcrzx mentioned this issue Nov 15, 2021

[Security Solution] Improve RuleExecutionLog performance #118511

Closed

4 tasks

XavierM added this to AppEx: ResponseOps - Execution & Connectors Jan 6, 2022

kobelb added the needs-team Issues missing a team label label Jan 31, 2022

botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022

mikecote moved this to Todo in AppEx: ResponseOps - Execution & Connectors Aug 11, 2022

mikecote added the performance label Aug 11, 2022

ymao1 closed this as completed Dec 8, 2022

Repository owner moved this from Todo to Done in AppEx: ResponseOps - Execution & Connectors Dec 8, 2022

zube bot added [zube]: Done and removed [zube]: Backlog labels Dec 8, 2022

banderror mentioned this issue Dec 19, 2022

[ResponseOps] Add support for the "running" flag to the rule object #147759

Closed

zube bot removed the [zube]: Done label Mar 9, 2023
