-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Response Ops][Task Manager] Expose SLI metrics in HTTP API #162178
Conversation
@elasticmachine merge upstream |
max_attempts: schema.number({ defaultValue: 100, min: 1, max: 500 }), | ||
}); | ||
|
||
export const configSchema = schema.object( | ||
{ | ||
allow_reading_invalid_state: schema.boolean({ defaultValue: true }), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these changes are just alphabetized the config keys
@@ -321,9 +322,9 @@ export class TaskManagerRunner implements TaskRunner { | |||
|
|||
let taskParamsValidation; | |||
if (this.requeueInvalidTasksConfig.enabled) { | |||
taskParamsValidation = this.validateTaskParams(); | |||
taskParamsValidation = this.validateTaskParams(modifiedContext); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When adding functional tests to the task manager test suite, I noticed that the beforeSave
and beforeRun
middleware could potentially modify the task params which could potentially invalidate the shape of the params before this params validation occurs.
Specifically in x-pack/test/plugin_api_integration/plugins/sample_task_plugin/server/plugin.ts
, the beforeSave
middleware changes the task params shape while the beforeRun
middleware restores them. In the task runner, we're validating on the task params as read from disk instead of after using beforeRun
, so this change is to update that. I don't think it makes a difference here because we're not adding any task manager middleware but not sure what the right approach is here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we ever find anyone using the middleware? I'm +1 on deleting it :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
taskTiming | ||
) | ||
); | ||
async ({ runAt, schedule, hasError }: SuccessfulRunResult) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emitting the correct task run event for alerting rules.
Pinging @elastic/response-ops (Team:ResponseOps) |
@elasticmachine merge upstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Tested locally, works as expected.
@elasticmachine merge upstream |
💚 Build Succeeded
Metrics [docs]Unknown metric groupsESLint disabled line counts
Total ESLint disabled count
History
To update your PR or re-run it, just comment with: cc @ymao1 |
* main: (108 commits) [Telemetry Schema Validation] Allow `null` on `string` (elastic#163499) [Search] Add Slack and Gmail connectors (elastic#163321) [ML] Provide hints for empty fields in dropdown options in Anomaly detection & Transform creation wizards, Change point detection view (elastic#163371) chore(slo): Add response required fields (elastic#163430) [AO] Fix add_to_case functional test (elastic#163155) unskip license type functional test (elastic#163199) fix(NA): yarn env vars for node_modules mirrors (elastic#163549) [Response Ops][Task Manager] Expose SLI metrics in HTTP API (elastic#162178) [Logs UI] Adapt test to ES highlighting changes and unskip (elastic#163592) [Infra UI] Implement Telemetry on 'Show' buttons within Inventory (elastic#163587) [Enterprise Search]Migrate all usages of EuiPage*_Deprecated (elastic#163482) fix(slo): settings and access for serverless (elastic#163514) [Infra UI] Implement telemetry for the asset details flyout (elastic#163078) [Fleet] Add a banner to the top of the Kafka Output UI to say that Elastic Defend integration is not supported (elastic#163579) [Fleet] Re-enable and fix Fleet policy secret integration tests (elastic#163428) [Fleet] add managed to imported saved object (elastic#163526) [Index Management] Disable index actions using contextRef (elastic#163475) [Discover] Inline shard failures warnings (elastic#161271) [Security Solution][Detection engine] skips geo_point non-ecs validation (elastic#163487) Update EUI layout components in bfetch example plugin (elastic#163490) ...
…lastic#162178)" This reverts commit 582d97d.
#163639) …#162178)" This reverts commit 582d97d. ## Summary Summarize your PR. If it involves visual changes include a screenshot or gif. ### Checklist Delete any items that are not applicable to this PR. - [ ] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md) - [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials - [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] Any UI touched in this PR is usable by keyboard only (learn more about [keyboard accessibility](https://webaim.org/techniques/keyboard/)) - [ ] Any UI touched in this PR does not create any new axe failures (run axe in browser: [FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/), [Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US)) - [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [ ] This renders correctly on smaller devices using a responsive layout. (You can test this [in your browser](https://www.browserstack.com/guide/responsive-testing-on-local-server)) - [ ] This was checked for [cross-browser compatibility](https://www.elastic.co/support/matrix#matrix_browsers) ### Risk Matrix Delete this section if it is not applicable to this PR. Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release. When forming the risk matrix, consider some of the following examples and how they may potentially impact the change: | Risk | Probability | Severity | Mitigation/Notes | |---------------------------|-------------|----------|-------------------------| | Multiple Spaces—unexpected behavior in non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces. | | Multiple nodes—Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure. | | Code should gracefully handle cases when feature X or plugin Y are disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still results in our service operational. | | [See more potential risk examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) | ### For maintainers - [ ] This was checked for breaking API changes and was [labeled appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
…163652) ## Summary Redo of this PR #162178 but without the `native-hdr-histogram` library which caused issues in the serverless build. In the future we may want to pursue generating a custom build of this native library but for our current purposes, a simple bucketed histogram should suffice. The only changes from the original PR are in this commit: dde5245, where we create a `SimpleHistogram` class to bucket task claim durations into `100ms` buckets. Please reference the original PR for more description about this HTTP API --------- Co-authored-by: Kibana Machine <[email protected]>
Towards #160334
Summary
Exposes a new HTTP API at
/api/task_manager/metrics
that collects SLI metrics for task manager. The following metrics are exposed:alerting
and anactions
group to allow us to calculate SLIs across all alerting rules and all actionsThese counter metrics are incremented until a reset event is received, in which case the counter is reset back to 0. This allows the collection mechanism (in this case Elastic Agent) to determine the interval at which these metrics are collected as well as to collect the rate of change for these SLI metrics without having to perform complicated Elasticsearch aggregation math. In addition, the counters are reset every 30 seconds (this is configurable) to avoid providing the metrics collector with stale data in case of a collector outage.
Flaky test runner: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/2813