-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Task Manager] adds capacity estimation to the TM health endpoint #100475
[Task Manager] adds capacity estimation to the TM health endpoint #100475
Conversation
* master: (60 commits) [Docs] Index patterns REST API docs (elastic#100549) [Ingest pipelines] add support for fingerprint processor (elastic#100541) ping Core team when renovate bot bumps es client version (elastic#100662) [Maps] Add draw wizard (elastic#100278) Use documentation link service for index pattern field editor (elastic#100609) [Maps] filter dashboard by map extent (elastic#99860) [ftr] migrate screenshots and snapshots services to FtrService class (elastic#100514) fix anomaly functional test (elastic#100504) update breaking changes template to incorporate ES deprecations (elastic#100621) improve default time ranges (elastic#100536) [Gauge] Fixes wrong translations on ranges less than symbol (elastic#100535) [ftr] migrate "globalNav" service to FtrService class (elastic#100604) [ftr] migrate "testSubjects" to FtrService class (elastic#100512) Fix spaces test flakyness (elastic#100605) [Ingest pipelines] add support for ip type in convert processor (elastic#100531) [ftr] migrate "browser" to FtrService class (elastic#100507) [ftr] migrate "find" service to FtrService class (elastic#100509) [telemetry] report config deprecations (elastic#99887) [ftr] migrate "docTable" service to FtrService class (elastic#100595) [ftr] migrate "listingTable" service to FtrService class (elastic#100606) ...
…ll TMs are active
* master: (77 commits) [RAC][Security Solution] Register Security Detection Rules with Rule Registry (elastic#96015) [Enterprise Search] Log warning for Kibana/EntSearch version mismatches (elastic#100809) updating the saved objects test to include more saved object types (elastic#100828) [ML] Fix categorization job view examples link when datafeed uses multiple indices (elastic#100789) Fixing ES archive mapping failure (elastic#100835) Fix bug with Observability > APM header navigation (elastic#100845) [Security Solution][Endpoint] Add event filters summary card to the fleet endpoint tab (elastic#100668) [Actions] Taking space id into account when creating email footer link (elastic#100734) Ensure comments on parameters in arrow functions are captured in the docs and ci metrics. (elastic#100823) [Security Solution] Improve find rule and find rule status route performance (elastic#99678) [DOCS] Adds video to introduction (elastic#100906) [Fleet] Improve combo box for fleet settings (elastic#100603) [Security Solution][Endpoint] Endpoint generator and data loader support for Host Isolation (elastic#100813) [DOCS] Adds Lens video (elastic#100898) [TSVB] [Table tab] Fix "Math" aggregation (elastic#100765) chore(NA): moving @kbn/io-ts-utils into bazel (elastic#100810) [Alerting] Adding feature flag for enabling/disabling rule import and export (elastic#100718) [TSVB] Fix Upgrading from 7.12.1 to 7.13.0 breaks TSVB (elastic#100864) [Lens] Adds dynamic table cell coloring (elastic#95217) [Security Solution][Endpoint] Do not display searchbar in security-trusted apps if there are no items (elastic#100853) ...
* master: (68 commits) Unskip advanced settings a11y test (elastic#100558) [App Search] Crawler Landing Page (elastic#100822) [DOCS] Clarify when to use kbn clean (elastic#101155) change label behavior (elastic#100991) skip flaky suite (elastic#101126) Fix cases plugin ownership (elastic#101073) [Home] Adding file upload to add data page (elastic#100863) [ML] Functional tests - reenable categorization tests (elastic#101137) [DOCS] Adds server.uuid to settings docs (elastic#101121) Fix newsfeed unread notifications always on when reloading Kibana (elastic#100357) [Lens] Time shift metrics (elastic#98781) [Deprecations service] make `correctiveActions.manualSteps` required (elastic#100997) Add "Risk Matrix" section to the PR template (elastic#100649) [Maps] spatially filter by all geo fields (elastic#100735) [Security Solution] Add Ransomware canary advanced policy option (elastic#101068) [Exploratory view] Core web vitals (elastic#100320) [Security solution][Endpoint] Add unit tests for fleet event filters/trusted apps cards (elastic#101034) [Lens] Use a setter function for the dimension panel (elastic#101123) [Index Patterns] Fix return saved index pattern object (elastic#101051) [CI] For PRs, build TS refs before public api docs check (elastic#100791) ...
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
Hi @elastic/kibana-docs 👋 , Thanks |
function persistenceOf(task: ConcreteTaskInstance) { | ||
return task.schedule ? TaskPersistence.Recurring : TaskPersistence.NonRecurring; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you can see here - we only ever identify a task as having a "recurring" or a "non-recurring" persistence.
"Ephemeral" tasks never get detected because they don't yet exist- this is plumbing in preparation for #97311.
If that PR isn't merged in 7.14, then we'll have to merge a small fix here removing ephemeral tasks from the output of this API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked through the docs but not the code yet. Everything makes sense but the docs are quite dense. Can we consider adding a tldr section to the capacity estimation? Maybe direct users upfront to look at proposed.proposed_kibana
to see the suggested number of kibana instances, but keep the detailed docs for users who want to understand why that number was proposed?
|
||
| This section provides a rough estimate about the sufficiency of its capacity. As the name suggests, these are estimates based on historical data and should not be used as predictions. We recommend using these estimations when following the Task Manager <<task-manager-scaling-guidance>>. | ||
+ | ||
[NOTE] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is showing up kind of funky in the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops looks like you can't put a Note inside of a definitions list 😬
docs/user/production-considerations/task-manager-production-considerations.asciidoc
Outdated
Show resolved
Hide resolved
docs/user/production-considerations/task-manager-production-considerations.asciidoc
Show resolved
Hide resolved
} | ||
} | ||
-------------------------------------------------- | ||
<1> These estimates assume that there is one {kib} instance actively executing tasks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me observed_kibana_instances
means the number of Kibanas running, while this description says estimate
. Is this an estimate or an actual number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an estimate based on the observed number of OwnerIds.
There is no way, currently, for Kibana to know how many instances are running, which is frustrating but we have to work with it.
In the code you'll see a bunch of calculations where we use the term "assumed" - as in, we assume this top be the case, but it's not 100% as we're basing this off of the observed number of Kibana. In order for us to produce any kind of estimate for the non-recurring we need some kind of assumption to work off of, and so observing the OwnerIds is the closest we can come to producing an estimate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spoke to @chrisronline and he thinks Stack Monitoring could provide us a more reliable number.
I'll file a follow up issue for us to look into that in the near future, so we can improve this incrementally.
} | ||
-------------------------------------------------- | ||
<1> These estimates assume that there is one {kib} instance actively executing tasks | ||
<2> Based on past throughput the overdue tasks in the system could be executed within 1 minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a guideline as to how many minutes_to_drain_overdue
is too many? Comparing this example to the below example of an overloaded deployment, 1
minute to drain vs 12
minutes to drain doesn't seem too bad?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really.
In theory, we shouldn't have any overdue tasks, so anything > 0 is a sign that we are low on capacity.
12 minutes isn't too bad until you realized that during those 12 minutes, additional overdue tasks will build up.
I tried to calrify that in the docs saying:
* There appear to be so many overdue tasks that it would take 12 minutes of executions to catch up with that backlog. This does not take into account tasks that might become overdue during those 12 minutes, so while this congestion might be temporary, the system could also remain consistently under provisioned and might never drain the backlog entirely.
…nsiderations.asciidoc Co-authored-by: gchaps <[email protected]>
…nsiderations.asciidoc Co-authored-by: gchaps <[email protected]>
…nsiderations.asciidoc Co-authored-by: gchaps <[email protected]>
Co-authored-by: gchaps <[email protected]>
…s/kibana into task-manager/capacity-estimation * 'task-manager/capacity-estimation' of github.com:gmmorris/kibana: Apply grammer suggestions Update docs/user/production-considerations/task-manager-production-considerations.asciidoc Update docs/user/production-considerations/task-manager-production-considerations.asciidoc Update docs/user/production-considerations/task-manager-production-considerations.asciidoc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
* master: (54 commits) Implement "select all" rules feature (elastic#100554) [ML] Remove script fields from the Anomaly detection alerting rule executor (elastic#101607) [Security solutions][Endpoint] Update event filtering texts (elastic#101563) [Enterprise Search] Mocks/tests tech debt - avoid hungry mocking (elastic#101107) [FTR] Updates esArchive paths [FTR] Updates esArchive paths [Security Solution][Detection Engine] Adds runtime field tests (elastic#101664) Added APM PHP agent to the list of agent names (elastic#101062) [CI] Restore old version_info behavior when .git directory is present (elastic#101642) [Fleet] Add fleet server telemetry (elastic#101400) [APM] Syncs agent config settings to APM Fleet policies (elastic#100744) [esArchiver] drop support for --dir, use repo-relative paths instead (elastic#101345) Revert "[xpack/test] restore incremental: false in ts project" [Security Solution] Remove Host Isolation feature flag (elastic#101655) [xpack/test] restore incremental: false in ts project [DOCS] Adds link to video landing page (elastic#101413) [ML] Move Index Data Visualizer into separate plugin (Part 1) (elastic#100922) Improve security plugin return types (elastic#101492) [ts] migrate `x-pack/test` to composite ts project (elastic#101441) [App Search] Updated Search UI to new URL (elastic#101320) ...
* master: clarify which parts of TM are experimental (elastic#101757) Add sh scripts with _bulk_action route usage examples (elastic#101736) [Uptime] Only register route in side nav if uptime show capability is true (elastic#101709) Use KIBANA_DOCS in doc link service (elastic#101667) [Alerting][Event log] Persisting duration information for active alerts in event log (elastic#101387) Address design issues in Discover/Graph (elastic#101584) Optimize performance for document table (elastic#101715) Change file data visualizer links to point to new location in home application (elastic#101393) [Fleet] Tighten policy permissions, take II (elastic#97366) [ML] Add debounce to the severity control update (elastic#101581) [Fleet] Fix routing issues with `getPath` and `history.push` (elastic#101658) [APM] Add link-to/transaction route (elastic#101731) [Index Patterns] Runtime fields CRUD REST API (elastic#101164) [ILM] Refactor types and fix missing aria labels (elastic#101518) [Lens] New summary row feature for datatable (elastic#101075) Blocks save event filter with a white space name (elastic#101599) Improve security server types (elastic#101661) [APM] Replace side nav with tabs on Settings page (elastic#101460) [APM] Only register items in side nav if user has permissions to see app (elastic#101707) [Security solution][Endpoint] Add back button when to the event filters list (elastic#101280)
* master: (68 commits) skip flaky suite (elastic#94043) skip flaky suite (elastic#102012) [esArchive] Persists updates for management/saved_objects/* (elastic#101992) skip flaky suite (elastic#101449) remove unnecessary hack (elastic#101909) [Exploratory View] Use human readable formats (elastic#101520) [Expressions] Refactor expression functions to use observables underneath (elastic#100409) [esArchives] Persist migrated Kibana archives (elastic#101950) [kbnArchiver] fix save to non-existent file (elastic#101974) [Enterprise Search] Add owner and description properties to kibana.json (elastic#101957) [DOCS] Fixes terminology in Stack Monitoring:Kibana alerts (elastic#101696) [Observability] [Cases] Cases in the observability app (elastic#101487) [Alerting][Docs] Combine rule creation and management pages (elastic#101498) temporarily disable build-buddy [Fleet] Fix fleet server collector in case settings are not set (elastic#101752) [Event Log] Populated rule.* ECS fields for alert events. (elastic#101132) [APM] Fleet support for merging input.config values with other nested properties in the policy input (elastic#101690) Add comments to some alerting plugin public API items (elastic#101551) [Alerting][Docs] Moving alerting setup to its own page (elastic#101323) remove uptime public API, it's not used. (elastic#101799) ...
* master: [Security solution][Endpoint] Removes zip compression when creating artifacts (elastic#101379) [Dashboard]: Fixes disabled viz filter is applied (elastic#101859) [Discover] Deangularization of search embeddable (elastic#100552)
💚 Build Succeeded
Metrics [docs]
History
To update your PR or re-run it, just comment with: |
…astic#100475) Adds Capacity Estimation to the Task Manager Health Endpoint. Below is a diagram depicting what information we use to estimate the varying capacity variables. Please use the user facing docs to understand how it fits together. If the docs aren't clear enough - make a review comment and I'll clarify in the docs.
…00475) (#102054) Adds Capacity Estimation to the Task Manager Health Endpoint. Below is a diagram depicting what information we use to estimate the varying capacity variables. Please use the user facing docs to understand how it fits together. If the docs aren't clear enough - make a review comment and I'll clarify in the docs.
…astic#100475) Adds Capacity Estimation to the Task Manager Health Endpoint. Below is a diagram depicting what information we use to estimate the varying capacity variables. Please use the user facing docs to understand how it fits together. If the docs aren't clear enough - make a review comment and I'll clarify in the docs.
Summary
closes #98625
Adds Capacity Estimation to the Task Manager Health Endpoint.
Below is a diagram depicting what information we use to estimate the varying capacity variables.
Please use the user facing docs to understand how it fits together. If the docs aren't clear enough - make a review comment and I'll clarify in the docs.
Documentation changes:
Notes for Reviewers
Checklist
Delete any items that are not applicable to this PR.
Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n supportAny UI touched in this PR is usable by keyboard only (learn more about keyboard accessibility)Any UI touched in this PR does not create any new axe failures (run axe in browser: FF, Chrome)If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker listThis renders correctly on smaller devices using a responsive layout. (You can test this in your browser)This was checked for cross-browser compatibilityFor maintainers