Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Onboard Inventory Metric Threshold rule type with FAAD #178506

Merged
merged 13 commits into from
Mar 22, 2024

Conversation

ersin-erdal
Copy link
Contributor

@ersin-erdal ersin-erdal commented Mar 12, 2024

Towards: #169867

This PR onboards Inventory Metric Threshold rule type with FAAD.

To verify.

I used data-generator to generate metric data.

Then created an Inventory Threshold rule with actions (alert and recovered),
conitions: For Hosts, When CPU usage is above 10.
Inventory Threshold uses the following formula to calculate the result:

(system.cpu.user.pct + system.cpu.system.pct) / system.cpu.cores

Set
system.cpu.user.pct = 1
system.cpu.system.pct = 1
system.cpu.cores = 4
in the cpu-001. This makes the CPU usage 0.5 (50%) for the host-1

and run the generator with ./generate metrics

Your rule should create an alert and should saved it in .internal.alerts-observability.metrics.alerts-default-000001

Then set system.cpu.user.pct=0 and system.cpu.system.pct=0.

The alert should be recovered and the AAD in the above index should be updated kibana.alert.status: recovered.

@ersin-erdal ersin-erdal added release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.14.0 labels Mar 12, 2024
@ersin-erdal ersin-erdal self-assigned this Mar 12, 2024
@ersin-erdal
Copy link
Contributor Author

/ci

@ersin-erdal
Copy link
Contributor Author

/ci

@ersin-erdal ersin-erdal requested a review from umbopepato March 13, 2024 17:11
@ersin-erdal ersin-erdal marked this pull request as ready for review March 13, 2024 17:11
@ersin-erdal ersin-erdal requested a review from a team as a code owner March 13, 2024 17:11
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@ersin-erdal ersin-erdal requested a review from pmuellr March 13, 2024 17:12
evaluationValues?: Array<number | null>,
thresholds?: Array<number | null>
) => InventoryMetricThresholdAlert;

export const createInventoryMetricThresholdExecutor = (libs: InfraBackendLibs) =>
libs.metricsRules.createLifecycleRuleExecutor<
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to remove this usage of createLifecycleRuleExecutor so that this no longer goes through the rule registry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally missed that! Thanks for pointing out.
Fixed with 5c5315f

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; noted some potential source changes to reduce casts. Also left a question about the threshold values when using a "between" comparator, but the behaviour here (not writing threshold values to the alerts docs) seems the same as es query, so perhaps it's as good as we can do today.

alerts: {
...MetricsRulesTypeAlertDefinition,
shouldWrite: true,
} as IRuleTypeAlerts<InventoryMetricThresholdAlert>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to get rid of this cast by changing MetricsRulesTypeAlertDefinition from IRuleTypeAlerts to IRuleTypeAlerts<InventoryMetricThresholdAlert> in x-pack/plugins/observability_solution/infra/server/lib/alerting/register_rule_types.ts, and it seemed to like it - I was able to remove the cast here completely anyway. Had to import InventoryMetricThresholdAlert in the register_rule_types.ts module ...

Though looking at that MetricsRulesTypeAlertDefinition, it has shouldWrite: false. Can we change the definition to shouldWrite: true and then remove the override here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple rule types using the MetricsRulesTypeAlertDefinition. If all of them are onboarded to the framework AAD and no longer using the lifecycle executor, then we can set shouldWrite: true on the MetricsRulesTypeAlertDefinition and use it as is without the override.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already talked about it with Alexi.
Whoever merges their PR last, can remove this line :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This done now.

> & {
// Defining a custom type for this because the schema generation script doesn't allow explicit null values
[ALERT_EVALUATION_VALUES]?: Array<number | null>;
[ALERT_EVALUATION_THRESHOLD]?: Array<number | null>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like THRESHOLD isn't omitted above ... I was able to add it to the Omit<> with no errors ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it, thanks.

[ALERT_REASON]: reason,
[ALERT_ACTION_GROUP]: actionGroup,
[ALERT_EVALUATION_VALUES]: evaluationValues,
[ALERT_EVALUATION_THRESHOLD]: thresholds,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a "between" comparator, which should be passing an array of values to thresholds. The rule works fine, but this field is not set in the alert document. For a single-value threshold (anything but between or not between), it printed the threshold fine.

This may be a pre-existing issue with this rule type - I noticed that es query seems to not populate this field if it's multi-value as well, perhaps that's the pattern for these?

Copy link
Contributor Author

@ersin-erdal ersin-erdal Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is intentional:
https://github.com/elastic/kibana/blob/main/x-pack/plugins/observability_solution/infra/server/lib/alerting/common/get_values.ts#L34

We don't save threshold if at least one of the comparators is a multiple threshold one such as 'in between'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmuellr @ersin-erdal I've created a ticket for this topic to save the range threshold as an array. I also saw the the same behaviour in the es query rule, what do you think about adding the range threshold as an array in all rules (including es query)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be good, but I suspect we will have pre-existing data with just a scalar value - so any typing on such fields in Typescript will have to be number | number[].

fields: {
alertsClient.setAlertData({
id: UNGROUPED_FACTORY_KEY,
payload: {
[ALERT_REASON]: reason,
[ALERT_ACTION_GROUP]: actionGroup,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shouldn't need to report this back since it's set by the framework.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


const esClient = services.scopedClusterClient.asCurrentUser;

const { savedObjectsClient, alertFactory: baseAlertFactory, alertsClient } = services;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need the baseAlertFactory anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@maryam-saeidi maryam-saeidi self-requested a review March 21, 2024 09:17
Copy link
Member

@maryam-saeidi maryam-saeidi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested creating alerts and recover them + notification, everything seemed as expected! 💪🏻

Regarding code, I've checked it but I rely mostly on ResponseOps team review :)

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

References to deprecated APIs

id before after diff
infra 9 7 -2

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ersin-erdal

@ersin-erdal ersin-erdal merged commit f35b1ae into elastic:main Mar 22, 2024
24 checks passed
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Mar 22, 2024
@ersin-erdal ersin-erdal deleted the 169867-inventory-faad branch March 22, 2024 12:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.14.0
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

7 participants