Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Response Ops][Task Manager] Expose SLI metrics in HTTP API #162178

Merged
merged 33 commits into from
Aug 10, 2023

Conversation

ymao1
Copy link
Contributor

@ymao1 ymao1 commented Jul 18, 2023

Towards #160334

Summary

Exposes a new HTTP API at /api/task_manager/metrics that collects SLI metrics for task manager. The following metrics are exposed:

  • count of task claim successes & count of task claim tries - this is a counter metric that keeps track over overall task claim success, not task claim success of individual background task workers
  • count of task run success & count of task runs - this is a counter metric that keeps track of overall task run successes, as well as successes grouped by task type. Alerting and action task types are rolled up into an alerting and an actions group to allow us to calculate SLIs across all alerting rules and all actions
  • task claim duration in milliseconds - this is a histogram counter metric that is bucketed into 100 ms buckets

These counter metrics are incremented until a reset event is received, in which case the counter is reset back to 0. This allows the collection mechanism (in this case Elastic Agent) to determine the interval at which these metrics are collected as well as to collect the rate of change for these SLI metrics without having to perform complicated Elasticsearch aggregation math. In addition, the counters are reset every 30 seconds (this is configurable) to avoid providing the metrics collector with stale data in case of a collector outage.

Flaky test runner: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2813

@ymao1 ymao1 added the ci:cloud-deploy Create or update a Cloud deployment label Jul 19, 2023
@ymao1
Copy link
Contributor Author

ymao1 commented Jul 19, 2023

@elasticmachine merge upstream

max_attempts: schema.number({ defaultValue: 100, min: 1, max: 500 }),
});

export const configSchema = schema.object(
{
allow_reading_invalid_state: schema.boolean({ defaultValue: true }),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these changes are just alphabetized the config keys

@ymao1 ymao1 changed the title Adding counter for task polling success rate [Response Ops][Task Manager] Expose SLI metrics in HTTP API Aug 1, 2023
@ymao1 ymao1 added Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.10.0 labels Aug 7, 2023
@@ -321,9 +322,9 @@ export class TaskManagerRunner implements TaskRunner {

let taskParamsValidation;
if (this.requeueInvalidTasksConfig.enabled) {
taskParamsValidation = this.validateTaskParams();
taskParamsValidation = this.validateTaskParams(modifiedContext);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When adding functional tests to the task manager test suite, I noticed that the beforeSave and beforeRun middleware could potentially modify the task params which could potentially invalidate the shape of the params before this params validation occurs.

Specifically in x-pack/test/plugin_api_integration/plugins/sample_task_plugin/server/plugin.ts, the beforeSave middleware changes the task params shape while the beforeRun middleware restores them. In the task runner, we're validating on the task params as read from disk instead of after using beforeRun, so this change is to update that. I don't think it makes a difference here because we're not adding any task manager middleware but not sure what the right approach is here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we ever find anyone using the middleware? I'm +1 on deleting it :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with @mikecote and we'll keep this change for now and have opened an issue to discuss removing the middleware hooks: #163401

taskTiming
)
);
async ({ runAt, schedule, hasError }: SuccessfulRunResult) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emitting the correct task run event for alerting rules.

@ymao1 ymao1 marked this pull request as ready for review August 8, 2023 02:02
@ymao1 ymao1 requested a review from a team as a code owner August 8, 2023 02:02
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@ymao1
Copy link
Contributor Author

ymao1 commented Aug 8, 2023

@elasticmachine merge upstream

Copy link
Contributor

@ersin-erdal ersin-erdal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Tested locally, works as expected.

@ymao1
Copy link
Contributor Author

ymao1 commented Aug 10, 2023

@elasticmachine merge upstream

@ymao1 ymao1 mentioned this pull request Aug 10, 2023
@kibana-ci
Copy link
Collaborator

kibana-ci commented Aug 10, 2023

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

id before after diff
taskManager 24 25 +1

Total ESLint disabled count

id before after diff
taskManager 24 25 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

@ymao1 ymao1 merged commit 582d97d into elastic:main Aug 10, 2023
@ymao1 ymao1 deleted the alerting/slis branch August 10, 2023 14:23
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Aug 10, 2023
jloleysens added a commit to jloleysens/kibana that referenced this pull request Aug 10, 2023
* main: (108 commits)
  [Telemetry Schema Validation] Allow `null` on `string` (elastic#163499)
  [Search] Add Slack and Gmail connectors (elastic#163321)
  [ML] Provide hints for empty fields in dropdown options in Anomaly detection & Transform creation wizards, Change point detection view (elastic#163371)
  chore(slo): Add response required fields (elastic#163430)
  [AO] Fix add_to_case functional test (elastic#163155)
  unskip license type functional test (elastic#163199)
  fix(NA): yarn env vars for node_modules mirrors (elastic#163549)
  [Response Ops][Task Manager] Expose SLI metrics in HTTP API (elastic#162178)
  [Logs UI] Adapt test to ES highlighting changes and unskip (elastic#163592)
  [Infra UI] Implement Telemetry on 'Show' buttons within Inventory (elastic#163587)
  [Enterprise Search]Migrate all usages of EuiPage*_Deprecated (elastic#163482)
  fix(slo): settings and access for serverless (elastic#163514)
  [Infra UI] Implement telemetry for the asset details flyout (elastic#163078)
  [Fleet] Add a banner to the top of the Kafka Output UI to say that Elastic Defend integration is not supported (elastic#163579)
  [Fleet] Re-enable and fix Fleet policy secret integration tests (elastic#163428)
  [Fleet] add managed to imported saved object (elastic#163526)
  [Index Management] Disable index actions using contextRef (elastic#163475)
  [Discover] Inline shard failures warnings (elastic#161271)
  [Security Solution][Detection engine] skips geo_point non-ecs validation (elastic#163487)
  Update EUI layout components in bfetch example plugin (elastic#163490)
  ...
ymao1 added a commit to ymao1/kibana that referenced this pull request Aug 10, 2023
jbudz pushed a commit that referenced this pull request Aug 10, 2023
#163639)

…#162178)"

This reverts commit 582d97d.

## Summary

Summarize your PR. If it involves visual changes include a screenshot or
gif.


### Checklist

Delete any items that are not applicable to this PR.

- [ ] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] Any UI touched in this PR is usable by keyboard only (learn more
about [keyboard accessibility](https://webaim.org/techniques/keyboard/))
- [ ] Any UI touched in this PR does not create any new axe failures
(run axe in browser:
[FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/),
[Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US))
- [ ] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [ ] This renders correctly on smaller devices using a responsive
layout. (You can test this [in your
browser](https://www.browserstack.com/guide/responsive-testing-on-local-server))
- [ ] This was checked for [cross-browser
compatibility](https://www.elastic.co/support/matrix#matrix_browsers)


### Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to
identify risks that should be tested prior to the change/feature
release.

When forming the risk matrix, consider some of the following examples
and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |

|---------------------------|-------------|----------|-------------------------|
| Multiple Spaces—unexpected behavior in non-default Kibana Space.
| Low | High | Integration tests will verify that all features are still
supported in non-default Kibana Space and when user switches between
spaces. |
| Multiple nodes—Elasticsearch polling might have race conditions
when multiple Kibana nodes are polling for the same tasks. | High | Low
| Tasks are idempotent, so executing them multiple times will not result
in logical error, but will degrade performance. To test for this case we
add plenty of unit tests around this logic and document manual testing
procedure. |
| Code should gracefully handle cases when feature X or plugin Y are
disabled. | Medium | High | Unit tests will verify that any feature flag
or plugin combination still results in our service operational. |
| [See more potential risk
examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) |


### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
ymao1 added a commit that referenced this pull request Aug 11, 2023
…163652)

## Summary

Redo of this PR #162178 but
without the `native-hdr-histogram` library which caused issues in the
serverless build. In the future we may want to pursue generating a
custom build of this native library but for our current purposes, a
simple bucketed histogram should suffice. The only changes from the
original PR are in this commit:
dde5245,
where we create a `SimpleHistogram` class to bucket task claim durations
into `100ms` buckets.

Please reference the original PR for more description about this HTTP
API

---------

Co-authored-by: Kibana Machine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting ci:cloud-deploy Create or update a Cloud deployment Feature:Task Manager release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.10.0
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

6 participants