Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test impact of using refresh: false for task manager internals #99444

Closed
dgieselaar opened this issue May 6, 2021 · 16 comments
Closed

Test impact of using refresh: false for task manager internals #99444

dgieselaar opened this issue May 6, 2021 · 16 comments
Labels
estimate:needs-research Estimated as too large and requires research to break down into workable issues Feature:Task Manager impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. performance resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility Team:APM All issues that need APM UI Team support Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@dgieselaar
Copy link
Member

In #99160, we are considering setting refresh to false for Task Manager internal operations, like creating/updating/deleting tasks. By default, the Saved Objects client will use wait_for, which means that it will keep a connection open to Elasticsearch until a shard gets refreshed. By default, this is 1 second. This means that workers are not freed up as quickly as they could be, and can have a negative impact on the rate of tasks that can be executed.

We should test not just the functionality, but also the performance impact, by running a small (local) load test.

We should also do a larger-scale load test, but given the complexity, we will address that separately.

cc @pmuellr @gmmorris

@dgieselaar dgieselaar added the Team:APM All issues that need APM UI Team support label May 6, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/apm-ui (Team:apm)

@dgieselaar dgieselaar added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label May 6, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@dgieselaar dgieselaar changed the title Test impact of removing wait_for for task manager internals Test impact of using refresh: false for task manager internals May 6, 2021
@gmmorris
Copy link
Contributor

gmmorris commented May 6, 2021

Thank Dario, added to out project for triage 👍

I would separate the additional instrumentation from this change - these are two distinct deliverables, and tying them together will make it harder to tell what the source of an issue is if we deliver them together.
Both changes have potential downstream impact in terms of perf and lifecycle eccentricities, so I'd like to keep them apart if possible.

@mikecote
Copy link
Contributor

For this issue, we should help finalize this PR #99919 and do some load testing before the end of the release cycle (#95194).

@YulNaumenko
Copy link
Contributor

YulNaumenko commented Jun 7, 2021

This issue seems to be a part of the current one or should be implemented as the next step.

@YulNaumenko
Copy link
Contributor

Based on the discussion with Dario, here is some details about setup a local environment:

  1. Configure your local Kibana kibana.dev.yml with the next options:
elastic.apm.transactionSampleRate: 1
elastic.apm.breakdownMetrics: true
elastic.apm.active: true
elastic.apm.environment: yuliia
elastic.apm.disableInstrumentations:
  - http
  - https
  1. yarn es snapshot --ssl
  2. yarn start --ssl
  3. Create Rule with actions. I'm using es-apm-sys-sim -r 40 15 es-apm-sys-sim https://elastic:changeme@localhost:9200 for the data
  4. Makes sure it run successfully
  5. Open https://ela.st/kibana-ops-ci-stats
  6. Select your local environment name from the list of the environments:

Screen Shot 2021-06-11 at 10 34 05 AM

8. Navigate to APM Transactions and select transaction type `taskManager run`:

Screen Shot 2021-06-11 at 11 32 13 AM

9. Select some of the longest running the transactions could be improved:

Screen Shot 2021-06-11 at 11 33 34 AM

@mikecote
Copy link
Contributor

This issue didn't make it part of the 7.15 planning. @YulNaumenko, how much effort is left before resolving this? We are planning to move this to the backlog for now.

@gmmorris gmmorris added the resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility label Jul 15, 2021
@YulNaumenko YulNaumenko removed their assignment Jul 19, 2021
@YulNaumenko
Copy link
Contributor

This issue didn't make it part of the 7.15 planning. @YulNaumenko, how much effort is left before resolving this? We are planning to move this to the backlog for now.

The most effort currently is on the functional and performance testing. It looks like loe:week

@gmmorris gmmorris added the loe:needs-research This issue requires some research before it can be worked on or estimated label Aug 11, 2021
@gmmorris gmmorris added the estimate:needs-research Estimated as too large and requires research to break down into workable issues label Aug 18, 2021
@gmmorris gmmorris removed the loe:needs-research This issue requires some research before it can be worked on or estimated label Sep 2, 2021
@gmmorris gmmorris added the impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. label Oct 1, 2021
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
@ymao1
Copy link
Contributor

ymao1 commented Dec 8, 2022

Closing as done

@ymao1 ymao1 closed this as completed Dec 8, 2022
Repository owner moved this from Todo to Done in AppEx: ResponseOps - Execution & Connectors Dec 8, 2022
@sorenlouv
Copy link
Member

@ymao1 Did this get implemented or what was the outcome of the perf test?

@mikecote
Copy link
Contributor

@ymao1 Did this get implemented or what was the outcome of the perf test?

@sqren I did some tests during my April ON-week and couldn't find places that were missing refresh: false. So we decided to close it during last week's grooming session. Everything should be 🔥 fast 😎 .

@dgieselaar
Copy link
Member Author

@mikecote just to check: the default from the SO client is 'wait_for', no? Do you mean that the Alerting/Task Manager sets it to refresh: false where possible? It still would have been nice to see the impact :)

@mikecote
Copy link
Contributor

mikecote commented Dec 12, 2022

@mikecote just to check: the default from the SO client is 'wait_for', no? Do you mean that the Alerting/Task Manager sets it to refresh: false where possible?

That's correct, so we had to change a bunch of places to refresh: false explicitly where we didn't rely on/need the data to be searchable right away. We've done those changes a long time ago (7.10 - 7.11 / Alerting GA) and have been holding onto this issue to see if there are places we've missed. Based on some testing earlier this year, it's all good! So unfortunately we don't have anything to compare as Task Manager has been running with these optimizations since the beginning of alerting.

It still would have been nice to see the impact :)

I think it was bad enough when using refresh: wait_for that we couldn't run many tasks per minute nor GA anything 🙈

@dgieselaar
Copy link
Member Author

The linked PR in this ticket is tagged as 7.14. How do you reconcile that with:

So unfortunately we don't have anything to compare as Task Manager has been running with these optimizations since the beginning of alerting.

Do you define "beginning of alerting" as GA? Because I cannot see how it is true in any other case. When did alerting go GA?

@mikecote
Copy link
Contributor

There might be a few refresh missing or that have been added over time but there isn't a specific release where we would see significant changes to alerting/task manager performance. Alerting went GA in 7.11.

Could be cool to compare 7.11 to now though if someone had spare cycles.

@dgieselaar
Copy link
Member Author

@mikecote have you looked at APM traces? Or more generally, how have you verified your assumptions are correct? E.g., the PR I put up as a result of me looking into this has 4 changes - out of those 4, only one place actually seems to have added a refresh: false. Those changes were made based on looking at an actual trace and identifying bottlenecks. Maybe that one change is all we can do but this feels hand-wavy to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
estimate:needs-research Estimated as too large and requires research to break down into workable issues Feature:Task Manager impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. performance resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility Team:APM All issues that need APM UI Team support Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)
Projects
No open projects
Development

No branches or pull requests

9 participants