[Task Manager] adds capacity estimation to the TM health endpoint #100475

Merged · 46 commits · Jun 14, 2021
e3515ad
added capacity estimation to TM health endpoint
gmmorris May 24, 2021
8cefde7
separated estimation from capacity requirements
gmmorris May 26, 2021
4417858
Merge branch 'master' into task-manager/capacity-estimation
gmmorris May 26, 2021
9aab904
track owner IDs in running average as there may be cycles where not a…
gmmorris May 26, 2021
ae2bd52
added docs
gmmorris May 28, 2021
fc0b23b
made assumptions clearer in estimations
gmmorris May 28, 2021
ea2da9c
split estimations by observations and proposal
gmmorris May 28, 2021
d1de507
Merge branch 'master' into task-manager/capacity-estimation
gmmorris May 28, 2021
37bf439
fixed doc
gmmorris May 28, 2021
f754d51
fixed test
gmmorris May 29, 2021
dfb113f
split proposed from min
gmmorris Jun 1, 2021
ae5e884
clarified that instances need to be identical
gmmorris Jun 1, 2021
9839c36
reworded some docs
gmmorris Jun 1, 2021
367b9e0
use average execution duration when estimating task capacity
gmmorris Jun 1, 2021
3a52341
fixed typing issues
gmmorris Jun 2, 2021
2b9bd67
Merge branch 'master' into task-manager/capacity-estimation
gmmorris Jun 2, 2021
e41d413
tweaked doc
gmmorris Jun 3, 2021
e648c7c
Update docs/user/production-considerations/task-manager-troubleshooti…
gmmorris Jun 7, 2021
6c100f7
Update docs/user/production-considerations/task-manager-troubleshooti…
gmmorris Jun 7, 2021
03ed669
Update docs/user/production-considerations/task-manager-troubleshooti…
gmmorris Jun 7, 2021
7242f15
Update docs/user/production-considerations/task-manager-troubleshooti…
gmmorris Jun 7, 2021
fa0cc24
cleaned up docs
gmmorris Jun 7, 2021
8c43c28
replace isFinite with direct reference to NaN
gmmorris Jun 7, 2021
e40b071
Merge branch 'master' into task-manager/capacity-estimation
gmmorris Jun 7, 2021
6922b91
Merge branch 'task-manager/capacity-estimation' of github.com:gmmorri…
gmmorris Jun 7, 2021
7c3f489
removed dead code
gmmorris Jun 8, 2021
89ce8a2
removed unused import
gmmorris Jun 8, 2021
86b24fe
Update docs/user/production-considerations/task-manager-production-co…
gmmorris Jun 8, 2021
c05279c
Update docs/user/production-considerations/task-manager-production-co…
gmmorris Jun 8, 2021
5ac3c8d
Update docs/user/production-considerations/task-manager-production-co…
gmmorris Jun 8, 2021
c17730e
Apply grammar suggestions
gmmorris Jun 8, 2021
bdddc96
grammatical corrections
gmmorris Jun 8, 2021
0ce3321
Merge branch 'task-manager/capacity-estimation' of github.com:gmmorri…
gmmorris Jun 8, 2021
727c5a5
marked as experimental
gmmorris Jun 8, 2021
0b0daf1
can't use notes in tables
gmmorris Jun 8, 2021
2fa1abd
tweaked docs
gmmorris Jun 9, 2021
6bfc2b8
fixed docs
gmmorris Jun 9, 2021
0270133
Merge branch 'master' into task-manager/capacity-estimation
gmmorris Jun 9, 2021
2b14158
cleaned up docs
gmmorris Jun 9, 2021
982c513
mark entire health monitoring endpoint as experimental
gmmorris Jun 9, 2021
3b9f2b1
rename proposed_kibana to provisioned_kibana
gmmorris Jun 9, 2021
53e584b
Merge branch 'master' into task-manager/capacity-estimation
gmmorris Jun 9, 2021
49c8579
fixed grammar
gmmorris Jun 10, 2021
cc347ea
improved grammar
gmmorris Jun 10, 2021
4b96500
Merge branch 'master' into task-manager/capacity-estimation
gmmorris Jun 13, 2021
63fe0cf
Merge branch 'master' into task-manager/capacity-estimation
gmmorris Jun 14, 2021
@@ -92,10 +92,18 @@
a| Runtime

| This section tracks the execution performance of Task Manager, covering task _drift_, worker _load_, and execution stats broken down by type, including duration and execution results.

a| Capacity Estimation

| This section provides a rough estimate of whether Task Manager has sufficient capacity. As the name suggests, these are estimates based on historical data and should not be treated as predictions. Use these estimations when following the Task Manager <<task-manager-scaling-guidance>>.

|===

Each section has a `timestamp` and a `status` that indicate when the section was last updated and whether its health was evaluated as `OK`, `Warning`, or `Error`.

The root `status` indicates the `status` of the system overall.

The Runtime `status` indicates whether task executions have exceeded any of the <<task-manager-configuring-health-monitoring,configured health thresholds>>. An `OK` status means none of the thresholds has been exceeded. A `Warning` status means that at least one warning threshold has been exceeded. An `Error` status means that at least one error threshold has been exceeded.

The Capacity Estimation `status` indicates the sufficiency of the observed capacity. An `OK` status means capacity is sufficient. A `Warning` status means that capacity is sufficient for the scheduled recurring tasks, but non-recurring tasks often cause the cluster to exceed capacity. An `Error` status means that there is insufficient capacity across all types of tasks.

By monitoring the `status` of the system overall, and the `status` of specific task types of interest, you can evaluate the health of the {kib} Task Management system.
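For example, the health stats can be retrieved over HTTP and each section's `status` inspected programmatically. A minimal sketch (assuming the documented `/api/task_manager/_health` route and basic authentication; error handling omitted):

```python
import json
import urllib.request
from base64 import b64encode

# Assumed route for the Task Manager health API, relative to the Kibana base URL.
HEALTH_PATH = "/api/task_manager/_health"

def summarize_health(stats: dict) -> dict:
    """Map the overall root status plus each monitored section to its status."""
    sections = {name: section.get("status")
                for name, section in stats.get("stats", {}).items()}
    return {"overall": stats.get("status"), "sections": sections}

def fetch_health(kibana_url: str, username: str, password: str) -> dict:
    """GET the health endpoint with basic auth (sketch only)."""
    request = urllib.request.Request(kibana_url.rstrip("/") + HEALTH_PATH)
    token = b64encode(f"{username}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A `Warning` or `Error` in any section surfaced by `summarize_health` is a prompt to consult the relevant evaluation guide above.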
@@ -68,11 +68,7 @@
This means that you can expect a single {kib} instance to support up to 200 _tasks per minute_.

In practice, a {kib} instance will only achieve the upper bound of `200/tpm` if the duration of task execution is below the polling rate of 3 seconds. For the most part, the duration of tasks is below that threshold, but it can vary greatly as {es} and {kib} usage grow and task complexity increases (such as alerts executing heavy queries across large datasets).

By <<task-manager-health-evaluate-the-workload,evaluating the workload>>, you can make a rough estimate as to the required throughput as a _tasks per minute_ measurement.

For example, suppose your current workload reveals a required throughput of `440/tpm`. You can address this scale by provisioning 3 {kib} instances, with an upper throughput of `600/tpm`. This scale would provide approximately 25% additional capacity to handle ad-hoc non-recurring tasks and potential growth in recurring tasks.

It is highly recommended that you maintain at least 20% additional capacity beyond your expected workload, as spikes in ad-hoc tasks are possible at times of high activity (such as a spike in actions in response to an active alert).
By <<task-manager-rough-throughput-estimation, estimating a rough throughput requirement>>, you can calculate the number of {kib} instances required to reliably execute tasks in a timely manner, and provision your deployment to match that scale.

For details on monitoring the health of {kib} Task Manager, follow the guidance in <<task-manager-health-monitoring>>.

@@ -126,6 +122,35 @@
Throughput is best thought of as a measurement in _tasks per minute_.

A default {kib} instance can support up to `200/tpm`.
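This default upper bound follows from the arithmetic of polling: each poll cycle can claim up to one task per worker. A sketch of that calculation (the defaults of 10 workers and a 3 second poll interval are assumptions; consult the Task Manager settings for authoritative values):

```python
def max_throughput_per_minute(max_workers: int = 10,
                              poll_interval_seconds: float = 3.0) -> float:
    """Upper-bound task executions per minute for one Kibana instance:
    each poll cycle can claim up to max_workers tasks."""
    polls_per_minute = 60 / poll_interval_seconds
    return polls_per_minute * max_workers

print(max_throughput_per_minute())  # 200.0 with the assumed defaults
```

This is an upper bound: it holds only while task execution duration stays below the poll interval, as discussed above.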

[float]
===== Automatic estimation

experimental[]

As demonstrated in <<task-manager-health-evaluate-the-capacity-estimation, Evaluate your capacity estimation>>, the Task Manager <<task-manager-health-monitoring, health monitoring>> performs these estimations automatically.

These estimates are based on historical data and should not be used as predictions, but can be used as a rough guide when scaling the system.

We recommend provisioning enough {kib} instances to ensure a buffer between the observed maximum throughput (as estimated under `observed.max_throughput_per_minute`) and the average required throughput (as estimated under `observed.avg_required_throughput_per_minute`). Otherwise there might be insufficient capacity to handle spikes of ad-hoc tasks. How much of a buffer is needed largely depends on your use case, but keep in mind that estimated throughput takes into account recent spikes and, as long as they are representative of your system's behaviour, shouldn't require much of a buffer.

We recommend provisioning at least as many {kib} instances as proposed by `proposed.provisioned_kibana`, but keep in mind that this number is based on the estimated required throughput, which is based on average historical performance, and cannot accurately predict future requirements.

[WARNING]
============================================================================
Automatic capacity estimation is performed by each {kib} instance independently. This estimation is performed by observing the task throughput in that instance, the number of {kib} instances executing tasks at that moment in time, and the recurring workload in {es}.

If a {kib} instance is idle at the moment of capacity estimation, the number of active {kib} instances might be miscounted and the available throughput miscalculated.

When evaluating the proposed {kib} instance number under `proposed.provisioned_kibana`, we highly recommend verifying that the `observed.observed_kibana_instances` matches the number of provisioned {kib} instances.
============================================================================
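As a sketch of how these fields might be read together, the helper below flags the conditions described above. The field names (`observed.observed_kibana_instances`, `observed.max_throughput_per_minute`, `observed.avg_required_throughput_per_minute`, `proposed.provisioned_kibana`) come from this documentation; the helper itself and any payload values are illustrative:

```python
def check_capacity_estimation(estimation: dict, provisioned_instances: int) -> list:
    """Flag common issues when reading the capacity_estimation section."""
    observed, proposed = estimation["observed"], estimation["proposed"]
    warnings = []
    # Idle instances are not counted, so a mismatch suggests the estimate is unreliable.
    if observed["observed_kibana_instances"] != provisioned_instances:
        warnings.append("observed instance count differs from provisioned count")
    # Without headroom between required and maximum throughput, ad-hoc spikes
    # may exceed capacity.
    if observed["avg_required_throughput_per_minute"] >= observed["max_throughput_per_minute"]:
        warnings.append("no buffer between required and maximum throughput")
    if provisioned_instances < proposed["provisioned_kibana"]:
        warnings.append("fewer instances provisioned than proposed")
    return warnings
```

An empty result suggests the deployment matches the estimator's assumptions; any warning is a cue to re-evaluate the workload.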

[float]
===== Manual estimation

By <<task-manager-health-evaluate-the-workload,evaluating the workload>>, you can make a rough estimate as to the required throughput as a _tasks per minute_ measurement.

For example, suppose your current workload reveals a required throughput of `440/tpm`. You can address this scale by provisioning 3 {kib} instances, with an upper throughput of `600/tpm`. This scale would provide approximately 25% additional capacity to handle ad-hoc non-recurring tasks and potential growth in recurring tasks.
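The `440/tpm` example can be worked through with a short helper (a sketch, assuming the default `200/tpm` per instance stated above):

```python
import math

def plan_capacity(required_tpm: float, per_instance_tpm: float = 200.0):
    """Instances needed for a required throughput, plus the spare headroom."""
    instances = math.ceil(required_tpm / per_instance_tpm)
    upper_tpm = instances * per_instance_tpm
    spare = (upper_tpm - required_tpm) / upper_tpm  # share of capacity left unused
    return instances, upper_tpm, spare

instances, upper, spare = plan_capacity(440)
# 3 instances, a 600/tpm upper bound, and roughly a quarter of capacity spare
```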

Given a deployment of 100 recurring tasks, estimating the required throughput depends on the scheduled cadence.
Suppose you expect to run 50 tasks at a cadence of `10s` and the other 50 tasks at `20m`. In addition, you expect a couple dozen non-recurring tasks every minute.

@@ -136,8 +161,11 @@
A recurring task requires as many executions as its cadence can fit in a minute.

For this reason, we recommend grouping tasks by _tasks per minute_ and _tasks per hour_, as demonstrated in <<task-manager-health-evaluate-the-workload,Evaluate your workload>>, averaging the _per hour_ measurement across all minutes.

It is highly recommended that you maintain at least 20% additional capacity beyond your expected workload, as spikes in ad-hoc tasks are possible at times of high activity (such as a spike in actions in response to an active alert).

Given the predicted workload, you can estimate a lower bound throughput of `340/tpm` (`6/tpm` * 50 + `3/tph` * 50 + 20% buffer).
As a default, a {kib} instance provides a throughput of `200/tpm`. A good starting point for your deployment is to provision 2 {kib} instances. You could then monitor their performance and reassess as the required throughput becomes clearer.
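The grouping described above can be sketched as a small helper (illustrative only, assuming the default `200/tpm` per instance; rounding conventions may differ slightly from the in-text estimate, but both approaches land on 2 instances):

```python
import math

def required_throughput(per_minute_cadences, per_hour_cadences, buffer=0.20):
    """Lower-bound required throughput in tasks per minute: sub-minute cadences
    plus hourly cadences averaged across the hour, with a safety buffer."""
    tpm = sum(per_minute_cadences)
    tph_as_tpm = sum(per_hour_cadences) / 60.0  # average hourly executions per minute
    return (tpm + tph_as_tpm) * (1 + buffer)

# The workload above: 50 tasks at 6/tpm, 50 tasks at 3/tph, with a 20% buffer.
required = required_throughput([6] * 50, [3] * 50)
instances = math.ceil(required / 200)  # assuming 200/tpm per instance -> 2
```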

Although this is a _rough_ estimate, the _tasks per minute_ measurement provides the lower bound needed to execute tasks on time.
Once you calculate the rough _tasks per minute_ estimate, add a 20% buffer for non-recurring tasks. How much of a buffer is required largely depends on your use case, so <<task-manager-health-evaluate-the-workload,evaluate your workload>> as it grows to ensure enough of a buffer is provisioned.

Once you estimate _tasks per minute_, add a buffer for non-recurring tasks. How much of a buffer is required largely depends on your use case. Ensure enough of a buffer is provisioned by <<task-manager-health-evaluate-the-workload,evaluating your workload>> as it grows and tracking the ratio of recurring to non-recurring tasks by <<task-manager-health-evaluate-the-runtime,evaluating your runtime>>.