Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Task Manager] adds capacity estimation to the TM health endpoint #100475

Merged
merged 46 commits into from
Jun 14, 2021

Conversation

gmmorris
Copy link
Contributor

@gmmorris gmmorris commented May 24, 2021

Summary

closes #98625

Adds Capacity Estimation to the Task Manager Health Endpoint.
Below is a diagram depicting what information we use to estimate the varying capacity variables.

Please use the user facing docs to understand how it fits together. If the docs aren't clear enough - make a review comment and I'll clarify in the docs.

Documentation changes:

Task Manager Capacity Estimation@2x (6)

Notes for Reviewers

  1. I recommend reading the documentation that I have added both as they are designed to be user facing requiring review, and because they should give you the context needed to review the code.
  2. Do you think the diagram belongs in the docs? Or is it too technically detailed and adds more noise than signal?
  3. These estimations are very rough and are difficult to put together as we never really know how many Kibana instances are in the cluster and it's very hard to infer from these current stats what the non-recurring workload looks like. I hope the variable naming makes the calculations clearer, but I am fearful this might be had to maintain. If you have thoughts on how we can restructure the calculation code to make it easier to maintain- I'm all ears.
  4. You will see references to ephemeral tasks in the docs and API output - this is in preparation for @chrisronline 's [Alerting] Change execution of alerts from async to sync #97311 PR, but as you can see in the code we don't actually detect any ephemeral tasks yet, as they don't actually exist in the code in any way yet (so the % of ephemeral task will always be zero). If [Alerting] Change execution of alerts from async to sync #97311 doesn't get merged in 7.14, then we can make a small fix to this code to remove references to ephemeral tasks. If it does get merged, then we'll need a follow up PR to hook them into each other, but I expect that will be a relatively small PR.

Checklist

Delete any items that are not applicable to this PR.

For maintainers

gmmorris added 16 commits May 24, 2021 17:50
* master: (60 commits)
  [Docs] Index patterns REST API docs (elastic#100549)
  [Ingest pipelines] add support for fingerprint processor (elastic#100541)
  ping Core team when renovate bot bumps es client version (elastic#100662)
  [Maps] Add draw wizard (elastic#100278)
  Use documentation link service for index pattern field editor (elastic#100609)
  [Maps] filter dashboard by map extent (elastic#99860)
  [ftr] migrate screenshots and snapshots services to FtrService class (elastic#100514)
  fix anomaly functional test (elastic#100504)
  update breaking changes template to incorporate ES deprecations (elastic#100621)
  improve default time ranges (elastic#100536)
  [Gauge] Fixes wrong translations on ranges less than symbol (elastic#100535)
  [ftr] migrate "globalNav" service to FtrService class (elastic#100604)
  [ftr] migrate "testSubjects" to FtrService class (elastic#100512)
  Fix spaces test flakyness (elastic#100605)
  [Ingest pipelines] add support for ip type in convert processor (elastic#100531)
  [ftr] migrate "browser" to FtrService class (elastic#100507)
  [ftr] migrate "find" service to FtrService class (elastic#100509)
  [telemetry] report config deprecations (elastic#99887)
  [ftr] migrate "docTable" service to FtrService class (elastic#100595)
  [ftr] migrate "listingTable" service to FtrService class (elastic#100606)
  ...
* master: (77 commits)
  [RAC][Security Solution] Register Security Detection Rules with Rule Registry (elastic#96015)
  [Enterprise Search] Log warning for Kibana/EntSearch version mismatches (elastic#100809)
  updating the saved objects test to include more saved object types (elastic#100828)
  [ML] Fix categorization job view examples link when datafeed uses multiple indices (elastic#100789)
  Fixing ES archive mapping failure (elastic#100835)
  Fix bug with Observability > APM header navigation (elastic#100845)
  [Security Solution][Endpoint] Add event filters summary card to the fleet endpoint tab (elastic#100668)
  [Actions] Taking space id into account when creating email footer link (elastic#100734)
  Ensure comments on parameters in arrow functions are captured in the docs and ci metrics. (elastic#100823)
  [Security Solution] Improve find rule and find rule status route performance (elastic#99678)
  [DOCS] Adds video to introduction (elastic#100906)
  [Fleet] Improve combo box for fleet settings (elastic#100603)
  [Security Solution][Endpoint] Endpoint generator and data loader support for Host Isolation (elastic#100813)
  [DOCS] Adds Lens video (elastic#100898)
  [TSVB] [Table tab] Fix "Math" aggregation (elastic#100765)
  chore(NA): moving @kbn/io-ts-utils into bazel (elastic#100810)
  [Alerting] Adding feature flag for enabling/disabling rule import and export (elastic#100718)
  [TSVB] Fix Upgrading from 7.12.1 to 7.13.0 breaks TSVB (elastic#100864)
  [Lens] Adds dynamic table cell coloring (elastic#95217)
  [Security Solution][Endpoint] Do not display searchbar in security-trusted apps if there are no items (elastic#100853)
  ...
* master: (68 commits)
  Unskip advanced settings a11y test (elastic#100558)
  [App Search] Crawler Landing Page (elastic#100822)
  [DOCS] Clarify when to use kbn clean (elastic#101155)
  change label behavior (elastic#100991)
  skip flaky suite (elastic#101126)
  Fix cases plugin ownership (elastic#101073)
  [Home] Adding file upload to add data page (elastic#100863)
  [ML] Functional tests - reenable categorization tests (elastic#101137)
  [DOCS] Adds server.uuid to settings docs (elastic#101121)
  Fix newsfeed unread notifications always on when reloading Kibana (elastic#100357)
  [Lens] Time shift metrics (elastic#98781)
  [Deprecations service] make `correctiveActions.manualSteps` required (elastic#100997)
  Add "Risk Matrix" section to the PR template (elastic#100649)
  [Maps] spatially filter by all geo fields (elastic#100735)
  [Security Solution] Add Ransomware canary advanced policy option (elastic#101068)
  [Exploratory view] Core web vitals (elastic#100320)
  [Security solution][Endpoint] Add unit tests for fleet event filters/trusted apps cards (elastic#101034)
  [Lens] Use a setter function for the dimension panel (elastic#101123)
  [Index Patterns] Fix return saved index pattern object (elastic#101051)
  [CI] For PRs, build TS refs before public api docs check (elastic#100791)
  ...
@gmmorris gmmorris changed the title added capacity estimation to TM health endpoint [Task Manager] adds capacity estimation to the TM health endpoint Jun 2, 2021
@gmmorris gmmorris added Feature:Task Manager release_note:enhancement Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.14.0 v8.0.0 labels Jun 3, 2021
@gmmorris gmmorris marked this pull request as ready for review June 3, 2021 09:26
@gmmorris gmmorris requested a review from a team as a code owner June 3, 2021 09:26
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@gmmorris
Copy link
Contributor Author

gmmorris commented Jun 3, 2021

Hi @elastic/kibana-docs 👋 ,
I've added extensive docs here, and would appreciate a review. :)

Thanks

@gmmorris gmmorris requested a review from a team June 3, 2021 09:27
Comment on lines +45 to +47
function persistenceOf(task: ConcreteTaskInstance) {
return task.schedule ? TaskPersistence.Recurring : TaskPersistence.NonRecurring;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can see here - we only ever identify a task as having a "recurring" or a "non-recurring" persistence.
"Ephemeral" tasks never get detected because they don't yet exist- this is plumbing in preparation for #97311.

If that PR isn't merged in 7.14, then we'll have to merge a small fix here removing ephemeral tasks from the output of this API.

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked through the docs but not the code yet. Everything makes sense but the docs are quite dense. Can we consider adding a tldr section to the capacity estimation? Maybe direct users upfront to look at proposed.proposed_kibana to see the suggested number of kibana instances, but keep the detailed docs for users who want to understand why that number was proposed?


| This section provides a rough estimate about the sufficiency of its capacity. As the name suggests, these are estimates based on historical data and should not be used as predictions. We recommend using these estimations when following the Task Manager <<task-manager-scaling-guidance>>.
+
[NOTE]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is showing up kind of funky in the docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops looks like you can't put a Note inside of a definitions list 😬

}
}
--------------------------------------------------
<1> These estimates assume that there is one {kib} instance actively executing tasks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me observed_kibana_instances means the number of Kibanas running, while this description says estimate. Is this an estimate or an actual number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an estimate based on the observed number of OwnerIds.
There is no way, currently, for Kibana to know how many instances are running, which is frustrating but we have to work with it.
In the code you'll see a bunch of calculations where we use the term "assumed" - as in, we assume this top be the case, but it's not 100% as we're basing this off of the observed number of Kibana. In order for us to produce any kind of estimate for the non-recurring we need some kind of assumption to work off of, and so observing the OwnerIds is the closest we can come to producing an estimate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spoke to @chrisronline and he thinks Stack Monitoring could provide us a more reliable number.
I'll file a follow up issue for us to look into that in the near future, so we can improve this incrementally.

}
--------------------------------------------------
<1> These estimates assume that there is one {kib} instance actively executing tasks
<2> Based on past throughput the overdue tasks in the system could be executed within 1 minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a guideline as to how many minutes_to_drain_overdue is too many? Comparing this example to the below example of an overloaded deployment, 1 minute to drain vs 12 minutes to drain doesn't seem too bad?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really.
In theory, we shouldn't have any overdue tasks, so anything > 0 is a sign that we are low on capacity.
12 minutes isn't too bad until you realized that during those 12 minutes, additional overdue tasks will build up.

I tried to calrify that in the docs saying:

* There appear to be so many overdue tasks that it would take 12 minutes of executions to catch up with that backlog. This does not take into account tasks that might become overdue during those 12 minutes, so while this congestion might be temporary, the system could also remain consistently under provisioned and might never drain the backlog entirely.

gmmorris and others added 9 commits June 8, 2021 11:28
…s/kibana into task-manager/capacity-estimation

* 'task-manager/capacity-estimation' of github.com:gmmorris/kibana:
  Apply grammer suggestions
  Update docs/user/production-considerations/task-manager-production-considerations.asciidoc
  Update docs/user/production-considerations/task-manager-production-considerations.asciidoc
  Update docs/user/production-considerations/task-manager-production-considerations.asciidoc
Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

gmmorris added 11 commits June 9, 2021 10:20
* master: (54 commits)
  Implement "select all" rules feature (elastic#100554)
  [ML] Remove script fields from the Anomaly detection alerting rule executor  (elastic#101607)
  [Security solutions][Endpoint] Update event filtering texts (elastic#101563)
  [Enterprise Search] Mocks/tests tech debt - avoid hungry mocking (elastic#101107)
  [FTR] Updates esArchive paths
  [FTR] Updates esArchive paths
  [Security Solution][Detection Engine] Adds runtime field tests (elastic#101664)
  Added APM PHP agent to the list of agent names (elastic#101062)
  [CI] Restore old version_info behavior when .git directory is present (elastic#101642)
  [Fleet] Add fleet server telemetry (elastic#101400)
  [APM] Syncs agent config settings to APM Fleet policies (elastic#100744)
  [esArchiver] drop support for --dir, use repo-relative paths instead (elastic#101345)
  Revert "[xpack/test] restore incremental: false in ts project"
  [Security Solution] Remove Host Isolation feature flag (elastic#101655)
  [xpack/test] restore incremental: false in ts project
  [DOCS] Adds link to video landing page (elastic#101413)
  [ML] Move Index Data Visualizer into separate plugin (Part 1) (elastic#100922)
  Improve security plugin return types (elastic#101492)
  [ts] migrate `x-pack/test` to composite ts project (elastic#101441)
  [App Search] Updated Search UI to new URL (elastic#101320)
  ...
* master:
  clarify which parts of TM are experimental (elastic#101757)
  Add sh scripts with _bulk_action route usage examples (elastic#101736)
  [Uptime] Only register route in side nav if uptime show capability is true (elastic#101709)
  Use KIBANA_DOCS in doc link service (elastic#101667)
  [Alerting][Event log] Persisting duration information for active alerts in event log (elastic#101387)
  Address design issues in Discover/Graph (elastic#101584)
  Optimize performance for document table (elastic#101715)
  Change file data visualizer links to point to new location in home application (elastic#101393)
  [Fleet] Tighten policy permissions, take II (elastic#97366)
  [ML] Add debounce to the severity control update  (elastic#101581)
  [Fleet] Fix routing issues with `getPath` and `history.push` (elastic#101658)
  [APM] Add link-to/transaction route (elastic#101731)
  [Index Patterns] Runtime fields CRUD REST API  (elastic#101164)
  [ILM] Refactor types and fix missing aria labels (elastic#101518)
  [Lens] New summary row feature for datatable (elastic#101075)
  Blocks save event filter with a white space name (elastic#101599)
  Improve security server types (elastic#101661)
  [APM] Replace side nav with tabs on Settings page (elastic#101460)
  [APM] Only register items in side nav if user has permissions to see app (elastic#101707)
  [Security solution][Endpoint] Add back button when to the event filters list (elastic#101280)
* master: (68 commits)
  skip flaky suite (elastic#94043)
  skip flaky suite (elastic#102012)
  [esArchive] Persists updates for management/saved_objects/* (elastic#101992)
  skip flaky suite (elastic#101449)
  remove unnecessary hack (elastic#101909)
  [Exploratory View] Use human readable formats (elastic#101520)
  [Expressions] Refactor expression functions to use observables underneath (elastic#100409)
  [esArchives] Persist migrated Kibana archives (elastic#101950)
  [kbnArchiver] fix save to non-existent file (elastic#101974)
  [Enterprise Search] Add owner and description properties to kibana.json (elastic#101957)
  [DOCS] Fixes terminology in Stack Monitoring:Kibana alerts (elastic#101696)
  [Observability] [Cases] Cases in the observability app (elastic#101487)
  [Alerting][Docs] Combine rule creation and management pages (elastic#101498)
  temporarily disable build-buddy
  [Fleet] Fix fleet server collector in case settings are not set (elastic#101752)
  [Event Log] Populated rule.* ECS fields for alert events. (elastic#101132)
  [APM] Fleet support for merging input.config values with other nested properties in the policy input (elastic#101690)
  Add comments to some alerting plugin public API items (elastic#101551)
  [Alerting][Docs] Moving alerting setup to its own page (elastic#101323)
  remove uptime public API, it's not used. (elastic#101799)
  ...
* master:
  [Security solution][Endpoint] Removes zip compression when creating artifacts (elastic#101379)
  [Dashboard]: Fixes disabled viz filter is applied (elastic#101859)
  [Discover] Deangularization of search embeddable (elastic#100552)
@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@gmmorris gmmorris merged commit daa3f62 into elastic:master Jun 14, 2021
gmmorris added a commit to gmmorris/kibana that referenced this pull request Jun 14, 2021
…astic#100475)

Adds Capacity Estimation to the Task Manager Health Endpoint.
Below is a diagram depicting what information we use to estimate the varying capacity variables.

Please use the user facing docs to understand how it fits together. If the docs aren't clear enough - make a review comment and I'll clarify in the docs.
gmmorris added a commit that referenced this pull request Jun 14, 2021
…00475) (#102054)

Adds Capacity Estimation to the Task Manager Health Endpoint.
Below is a diagram depicting what information we use to estimate the varying capacity variables.

Please use the user facing docs to understand how it fits together. If the docs aren't clear enough - make a review comment and I'll clarify in the docs.
cuff-links pushed a commit to cuff-links/kibana that referenced this pull request Jun 15, 2021
…astic#100475)

Adds Capacity Estimation to the Task Manager Health Endpoint.
Below is a diagram depicting what information we use to estimate the varying capacity variables.

Please use the user facing docs to understand how it fits together. If the docs aren't clear enough - make a review comment and I'll clarify in the docs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Task Manager release_note:enhancement Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.14.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Task Manager] Health Metrics capacity estimation enhancements
7 participants