
[RAM][Maintenance Window][8.8]Fix window maintenance workflow #156427

Merged

Conversation

@JiaweiWu (Contributor) commented May 2, 2023

Summary

Canceling every notification in the alert life cycle during an active maintenance window did not match what our customers expected. For our persisted security solution alerts we did not have to change the logic, because each of those is always a new alert. Therefore, @shanisagiv1, @mdefazio, @JiaweiWu, and @XavierM discussed the problem and decided the following:

To summarize, we will only keep a notification during a maintenance window if the alert was created/became active outside of the maintenance window. We created three scenarios to explain the new logic, assuming the alert has one action per status change. The legend below explains the diagrams:
image

Scenario I

If an alert is active/created before a maintenance window and recovered inside of the maintenance window, then we will send notifications
image

Scenario II

If an alert is active/created and recovered inside of the maintenance window, then we will NOT send notifications
image

Scenario III

If an alert is active/created in a maintenance window and recovered outside of the maintenance window, then we will NOT send notifications
image
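The three scenarios boil down to one rule: an alert that first became active inside a maintenance window carries that window's IDs for the rest of its life cycle and stays silent, while an alert that became active outside any window keeps notifying, including for a recovery that happens during the window. Below is a minimal TypeScript sketch of that rule; the AlertLifecycle type and shouldNotify helper are illustrative only, not the actual Kibana implementation.

```typescript
// Illustrative sketch only; these names are hypothetical, not Kibana's real API.
interface AlertLifecycle {
  // Maintenance window IDs attached when the alert first became active.
  // Empty if the alert became active outside of any maintenance window.
  maintenanceWindowIds: string[];
}

// Returns true when a status-change notification should be sent.
function shouldNotify(alert: AlertLifecycle): boolean {
  // Scenario I:   active before the window, recovered inside it -> []    -> notify
  // Scenario II:  active and recovered inside the window        -> [ids] -> suppress
  // Scenario III: active inside the window, recovered after it  -> [ids] -> suppress
  return alert.maintenanceWindowIds.length === 0;
}

// Example: an alert created during maintenance window "mw-1" stays silent,
// even for the recovery that fires after the window ends.
console.log(shouldNotify({ maintenanceWindowIds: ['mw-1'] })); // false
console.log(shouldNotify({ maintenanceWindowIds: [] }));       // true
```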

Checklist

- [ ] Unit or functional tests were updated or added to match the most common scenarios

@JiaweiWu JiaweiWu added bug Fixes for quality problems that affect the customer experience backport release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesManagement Issues related to the Rules Management UX v8.8.0 v8.9.0 labels May 2, 2023
@XavierM XavierM changed the title [RAM][Maintenance Window][8.8]Fix MW IDs not clearing on alerts after finishing the MW [RAM][Maintenance Window][8.8]Fix window maintenance workflow May 3, 2023
}": has active maintenance windows ${alert.getMaintenanceWindowIds()}.`
);
continue;
}
Contributor

Can you move this check into the generateExecutables method please?
We create the executables (action + alert or summarizedAlerts) array in there, then iterate through them and schedule here. We skip them here only for rate limiting, so we can log how many actions were created and how many of them have been skipped.

@ersin-erdal (Contributor) commented May 3, 2023

Moving this to just after for (const [alertId, alert] of alertsArray) { should be enough.
Note: AFAIU you want to send notifications for recovered alerts if the alert was created before the maintenance window. In order to have that check, you may need to add an if (!this.isRecoveredAlert(action.group)) condition as well.

@JiaweiWu (Contributor, Author) commented May 4, 2023

Can you move this check into generateExecutables method please?

We decided to have the code here because we still want to count these actions as generated actions via ruleRunMetricsStore.incrementNumberOfGeneratedActionsByConnectorType(actionTypeId). We can move it if we don't care about this count. @XavierM, what do you think?

AFAIU you want to send notifications for recovered alerts if the alert was created before the maintenance window. In order to have that check, you may need to add if(!this.isRecoveredAlert(action.group)) condition as well.

We make this determination using the maintenance_window_ids field, so we don't need to check isRecoveredAlert. Any alert marked with a maintenance window ID will never send notifications, even if it recovers after the window ends. Conversely, an alert that became active before a window but recovers during the window will still send its recovery notification.
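To make the placement concrete, here is a rough sketch of the shape of the check described above: the generated-actions metric is still incremented before the action is skipped. The types, logger, and helper signatures are simplified placeholders, not the exact code in the ExecutionHandler.

```typescript
// Simplified sketch; the real ExecutionHandler also handles summaries,
// throttling, and rate limiting. Types and helpers here are placeholders.
interface ActiveAlert {
  id: string;
  getMaintenanceWindowIds(): string[];
}

interface Executable {
  actionTypeId: string;
  alert: ActiveAlert;
}

function scheduleActions(
  executables: Executable[],
  incrementGenerated: (actionTypeId: string) => void,
  schedule: (executable: Executable) => void,
  log: (message: string) => void
): void {
  for (const executable of executables) {
    const { actionTypeId, alert } = executable;

    // Count the action as generated first, so rule run metrics still reflect
    // everything the rule produced, even during a maintenance window.
    incrementGenerated(actionTypeId);

    // Skip scheduling when the alert is tied to one or more maintenance
    // windows; this is what suppresses the notification.
    const maintenanceWindowIds = alert.getMaintenanceWindowIds();
    if (maintenanceWindowIds.length > 0) {
      log(
        `no scheduling of actions for alert "${alert.id}": has active maintenance windows ${maintenanceWindowIds}`
      );
      continue;
    }

    schedule(executable);
  }
}
```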

@XavierM (Contributor) commented May 4, 2023

I like the way we did it here because the logs tell a nice story to our users. We talked about it internally and decided that, for now, we will follow @ersin-erdal's advice and add new metrics for snooze and maintenance window in the future, so we can surface this indicator to our users.

@JiaweiWu JiaweiWu marked this pull request as ready for review May 4, 2023 00:14
@JiaweiWu JiaweiWu requested review from a team as code owners May 4, 2023 00:14
@elasticmachine (Contributor)

Pinging @elastic/response-ops (Team:ResponseOps)

@JiaweiWu (Contributor, Author) commented May 4, 2023

@elasticmachine merge upstream

@@ -49,6 +50,9 @@ const augmentAlerts = <T>({
[ALERT_START]: currentTimeOverride ?? new Date(),
[ALERT_LAST_DETECTED]: currentTimeOverride ?? new Date(),
[VERSION]: kibanaVersion,
...(options?.maintenanceWindowIds?.length
Contributor (Author)

Since these are persisted alerts, we do not need to update previous alert instances like we do for AAD; we just need to make sure they reflect the currently active maintenance windows.
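For illustration, a hedged sketch of that conditional stamping: each newly persisted alert document only picks up the maintenance window IDs that are active at the moment it is written. The field constant and option shape below are assumptions, not copied from the Kibana source.

```typescript
// Illustrative sketch; the constant value and option shape are assumed.
const ALERT_MAINTENANCE_WINDOW_IDS = 'kibana.alert.maintenance_window_ids';

interface AugmentOptions {
  maintenanceWindowIds?: string[];
}

function augmentAlertDoc(
  doc: Record<string, unknown>,
  options?: AugmentOptions
): Record<string, unknown> {
  return {
    ...doc,
    // Only new documents are stamped; previously persisted alerts are never
    // rewritten, because a persisted security solution alert is always new.
    ...(options?.maintenanceWindowIds?.length
      ? { [ALERT_MAINTENANCE_WINDOW_IDS]: options.maintenanceWindowIds }
      : {}),
  };
}
```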

@ersin-erdal (Contributor) left a comment

Thanks for addressing the comments.
LGTM. Tested locally; works as expected.

@@ -520,6 +525,16 @@ export class ExecutionHandler<
if (alert.isFilteredOut(summarizedAlerts)) {
continue;
}

if (alert && alert.getMaintenanceWindowIds().length > 0) {
Contributor

nit: alert && is not needed. alert is always there.

@JiaweiWu (Contributor, Author) commented May 4, 2023

@elasticmachine merge upstream

@XavierM XavierM enabled auto-merge (squash) May 4, 2023 21:30
@XavierM XavierM merged commit ea40798 into elastic:main May 5, 2023
@kibana-ci (Collaborator)

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Rules, Alerts and Exceptions ResponseOps Cypress Tests on Security Solution / Detections : Page Filters "after each" hook for "Changed banner should hide on Reset"
  • [job] [logs] Rules, Alerts and Exceptions ResponseOps Cypress Tests on Security Solution / Detections : Page Filters Changed banner should hide on Reset

Metrics [docs]

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100 kB.

id | before | after | diff
alerting | 49.0KB | 49.0KB | +39.0B

Unknown metric groups

ESLint disabled line counts

id | before | after | diff
enterpriseSearch | 19 | 21 | +2
securitySolution | 398 | 401 | +3
total | | | +5

Total ESLint disabled count

id | before | after | diff
enterpriseSearch | 20 | 22 | +2
securitySolution | 478 | 481 | +3
total | | | +5

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request May 5, 2023
…c#156427)

Co-authored-by: Xavier Mouligneau <[email protected]>
Co-authored-by: Kibana Machine <[email protected]>
(cherry picked from commit ea40798)
@kibanamachine (Contributor)

💚 All backports created successfully

Status Branch Result
8.8

Note: Successful backport PRs will be merged automatically after passing CI.

Questions?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request May 5, 2023
…156427) (#156770)

# Backport

This will backport the following commits from `main` to `8.8`:
- [[RAM][Maintenance Window][8.8]Fix window maintenance workflow
(#156427)](#156427)


### Questions?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: Jiawei Wu <[email protected]>
jloleysens added a commit that referenced this pull request May 5, 2023
* main: (153 commits)
  [Security Solution] {{state.signals_count}} Object not working (#156472) (#156707)
  [Synthetics] refresh data on visualization scrubbing (#156777)
  [RAM] Docs for slack improvements (#153885)
  [RAM] Alert search bar only KQL (#155947)
  [ML] Functional tests - stabilize export job tests (#156586)
  [Saved Search] Update saved search schema to allow empty `sort` arrays (#156769)
  [ML] Rename `curated` model type to `elastic` (#156684)
  [Discover] Enable sharing for text based languages (#156652)
  [api-docs] 2023-05-05 Daily api_docs build (#156781)
  Upgrade EUI to v77.2.2 (#155208)
  [RAM][Maintenance Window][8.8]Fix window maintenance workflow (#156427)
  [DOCS] Case file attachments (#156459)
  [D4C] additional error handling for 'block' action added + policy editor UI fixes (#156629)
  [Enterprise Search] refactor(SearchApplications): rename telemetry ids (#156733)
  [Enterprise Search] Add telemetry to ELSER deployment buttons + error (#156545)
  [Security Solution] fixes Data Quality dashboard errors when a `basePath` is configured (#156233)
  [Logs onboarding] StepsFooter outside of main panel (#156686)
  [Security Solution] Add a migration to unmute custom Security Solution rules (#156593)
  [Enterprise Search][Behavioral Analytics] Update formulas (#156704)
  Add API Events to Endpoint Security Advanced Policy (#156718)
  ...