Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Alerting] Add more rule execution context #117504

Merged
merged 15 commits into from
Nov 28, 2021

Conversation

dgieselaar
Copy link
Member

@dgieselaar dgieselaar commented Nov 4, 2021

Closes #113506.

CleanShot 2021-11-05 at 14 49 04@2x

CleanShot 2021-11-05 at 14 51 55@2x

The following changes were made:

  • When rule execution starts, update the transaction name: Executing alerting rule
  • When rule execution starts, update the transaction labels with the rule id
  • When the rule is loaded, update the transaction name by appending the rule name
  • When the rule is loaded, update the transaction labels with the consumer, rule type id, rule name, rule params and rule tags
  • When the rule has executed, set the appropriate outcome
  • When the rule has executed, update the transaction labels with the number of active, recovered and new instances

@dgieselaar dgieselaar added v8.0.0 release_note:skip Skip the PR/issue when compiling release notes auto-backport Deprecated - use backport:version if exact versions are needed v7.16.0 v8.1.0 labels Nov 4, 2021
@cyrille-leclerc
Copy link
Contributor

The added label and the new transaction name will be very helpful!

@cyrille-leclerc
Copy link
Contributor

cyrille-leclerc commented Nov 4, 2021

@dgieselaar Can you please remind us how we identify as spans the different rule actions: email, index, server-log, ServiceNow-itsm, webhook, pagerduty...

For example,

@dgieselaar
Copy link
Member Author

@cyrille-leclerc unfortunately the trace waterfall is broken, @cauemarcondes is working on a fix, I'll update the PR with a screenshot for actions if that's fixed before this PR lands. I'm not sure if I get your second point though, can you elaborate?

@dgieselaar dgieselaar marked this pull request as ready for review November 5, 2021 14:02
@dgieselaar dgieselaar requested a review from a team as a code owner November 5, 2021 14:02
@cyrille-leclerc
Copy link
Contributor

cyrille-leclerc commented Nov 5, 2021

Thanks @dgieselaar

I'm not sure if I get your second point though, can you elaborate?

For the rule actions (email, index, server-log, servicenow-itsm, webhook, pagerduty...), it would be great to capture span labels that characterize the execution like the URL invoked, the authentication username ... in order to slice and dice traces in any dimension and also enable visualization of the destination on the service map and as an uninstrumented backend.

Here is an example that has a lot of commonalities with the labels we collect on CI/CD pipelines steps like the pipeline checkout step:

Labels
labels.ci_pipeline_run_user SYSTEM
labels.git_branch master
labels.git_clone_depth 0
labels.git_clone_shallow FALSE
labels.git_repository cyrille-leclerc/my-war
labels.git_username cyrille-leclerc
labels.host_ip 192.168.1.46
labels.host_name cyrillerclaptop.localdomain
labels.jenkins_computer_name #controller#
labels.jenkins_pipeline_step_id 7
labels.jenkins_pipeline_step_name Check out from version control
labels.jenkins_pipeline_step_plugin_name workflow-scm-step
labels.jenkins_pipeline_step_plugin_version 2.13
labels.jenkins_pipeline_step_type checkout
labels.jenkins_url http://localhost:9600/
labels.service_namespace jenkins
Trace
trace.id dd7cc4c4220d72dc85f2a6de6b0c0d69
Span
span.id a6c33c39e2e3141a
Service
service.name jenkins

image

image

@dgieselaar
Copy link
Member Author

For the rule actions (email, index, server-log, servicenow-itsm, webhook, pagerduty...), it would be great to capture span labels that characterize the execution like the URL invoked, the authentication username ... in order to slice and dice traces in any dimension and also enable visualization of the destination on the service map and as an uninstrumented backend.

Hmm, I don't want to inadvertently leak sensitive data, I'm not sure how to prevent that if we for instance stringify params or config. Any thoughts here @elastic/kibana-alerting-services?

@cyrille-leclerc
Copy link
Contributor

I don't want to inadvertently leak sensitive data,

Good catch @dgieselaar , we sanitized a bunch of attributes in the Jenkins Otel integration, typically parsing URLs and reconstructing them to ensure they don't leak credentials.

Here is an example:
https://github.com/jenkinsci/opentelemetry-plugin/blob/opentelemetry-0.21/src/main/java/io/jenkins/plugins/opentelemetry/job/AbstractGitStepHandler.java#L133-L153

@dgieselaar
Copy link
Member Author

@elasticmachine merge upstream

@xcrzx
Copy link
Contributor

xcrzx commented Nov 15, 2021

@dgieselaar Is it possible to view latency distribution for all rules of a specific rule type with these changes? Let's say I'm investigating performance issues with siem.queryRule. I used to do that in the following way:

  1. Select all transactions of the siem.queryRule rule type
  2. On the latency distribution diagram, drag mouse to select slowest 10%
  3. Examine their spans to identify performance bottlenecks

But with this PR, I cannot select all query rules in a single view, as transactions now split but rule name. So of I have 100+ activated rules, it becomes tedious to examine them one by one. Or I'm missing something?

@dgieselaar
Copy link
Member Author

@dgieselaar Is it possible to view latency distribution for all rules of a specific rule type with these changes?

Not in the APM app. You can use e.g. Lens to gather that data, but it won't allow you to inspect a trace without manually copying trace ids.

Usually we separate transaction groups if they have different performance characteristics, which I would expect to be the case here from rule instance to rule instance.

@dgieselaar
Copy link
Member Author

@elasticmachine merge upstream

@cyrille-leclerc
Copy link
Contributor

Usually we separate transaction groups if they have different performance characteristics, which I would expect to be the case here from rule instance to rule instance.

From my observations, performance characteristics are homogeneous for Security rules of the same rule type. That's why I think grouping by rule type would be helpful.

Would it make sense to group rules by rule type by default as before and leave the ability to filter by rule name (it is already possible, labels.alerting_rule_name : "Rule Name") for those who need that level of granularity?

That's a very good point: for guided rules, the execution path and the performance characteristics are likely to be very homogeneous and thus it could make sense to have the same transaction group.
This would be different for generic rules that can have completely different performance characteristics and probably different execution path.
For this reason of homogeneity of "guided rules", it could make sense to use the rule type name as the name of the transaction. I'm not sure.

@dgieselaar
Copy link
Member Author

For this reason of homogeneity of "guided rules", it could make sense to use the rule type name as the name of the transaction. I'm not sure.

I'm not sure how to make that distinction (between guided and generic rules) from the alerting framework's perspective. It is something that the security rule types can set themselves in the rule executor. Maybe that's a good compromise?

@cyrille-leclerc
Copy link
Contributor

I'm not sure how to make that distinction (between guided and generic rules) from the alerting framework's perspective. It is something that the security rule types can set themselves in the rule executor. Maybe that's a good compromise?

That could be an interesting starting point

@xcrzx
Copy link
Contributor

xcrzx commented Nov 16, 2021

I'm not sure how to make that distinction (between guided and generic rules) from the alerting framework's perspective. It is something that the security rule types can set themselves in the rule executor. Maybe that's a good compromise?

Yeah, it looks like we can use apm.setTransactionName to overwrite transaction names in Security to whatever value we want. Well, that sounds ok with me then.

@dgieselaar
Copy link
Member Author

@elasticmachine merge upstream

@dgieselaar dgieselaar requested a review from pmuellr November 19, 2021 14:36
@dgieselaar
Copy link
Member Author

@elasticmachine merge upstream

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Left some nits on naming but looks great otherwise. Thanks!

@@ -105,7 +105,7 @@ export class ActionExecutor {
name: `execute_action`,
type: 'actions',
labels: {
actionId,
actions_action_id: actionId,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following our new terminology, I think this should be actions_connector_id and actions_connector_type_id

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated the labels w/ new terminology!

@@ -855,6 +881,12 @@ function generateNewAndRecoveredInstanceEvents<
const recoveredAlertInstanceIds = Object.keys(recoveredAlertInstances);
const newIds = without(currentAlertInstanceIds, ...originalAlertInstanceIds);

if (apm.currentTransaction) {
apm.currentTransaction.addLabels({
alerting_new_instances: newIds.length,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following our updated terminology, I believe this should be alerting_new_alerts, alerting_active_alerts, alerting_recovered_alerts

@dgieselaar dgieselaar enabled auto-merge (squash) November 28, 2021 11:00
@dgieselaar dgieselaar merged commit a0650c7 into elastic:main Nov 28, 2021
@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Nov 28, 2021
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Nov 28, 2021
@kibanamachine
Copy link
Contributor

💚 Backport successful

Status Branch Result
8.0
7.16

The backport PRs will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Nov 28, 2021
Co-authored-by: Kibana Machine <[email protected]>

Co-authored-by: Dario Gieselaar <[email protected]>
kibanamachine added a commit that referenced this pull request Nov 28, 2021
Co-authored-by: Kibana Machine <[email protected]>

Co-authored-by: Dario Gieselaar <[email protected]>
TinLe pushed a commit to TinLe/kibana that referenced this pull request Dec 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated - use backport:version if exact versions are needed release_note:skip Skip the PR/issue when compiling release notes v7.16.1 v8.0.0 v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Alerting] Add more context in kibana rule execution traces
7 participants