Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for correlation periods for alerts #65

Merged
merged 27 commits into from
Jul 17, 2023

Conversation

sfc-gh-nlele
Copy link
Collaborator

@sfc-gh-nlele sfc-gh-nlele commented Jun 9, 2023

This PR adds support for specifying custom correlation periods for individual alerts, extending having a single correlation period for all of the alerts.

This correlation period will be specified in the alert object as a negative number indicating the duration prior to the current time, for which the alerts would be correlated. If it's absent, the default correlation period set for the alert processor will be used.

E.g.

.
.
.
, 'q9u9z3b' AS query_id
  , 'USING CRON 0 0 * * * UTC' AS schedule
  , '1d' AS lookback
  , ARRAY_CONSTRUCT(
      OBJECT_CONSTRUCT(
        'type', 'ef-slack',
        'message',  ' test',
        'channel', 'test'
      )
    ) AS handlers
   , -180 as correlation_period
.
.
.

Test:

Correlation period = -10
Screenshot 2023-06-13 at 7 04 08 PM

@sfc-gh-nlele sfc-gh-nlele changed the title Add support for correlation periods in alerts Add support for correlation periods for alerts Jun 9, 2023
@sfc-gh-pkommini
Copy link
Collaborator

@sfc-gh-nlele can correlation period be +ve? What would that mean? Do we check for that?

Copy link
Collaborator

@sfc-gh-pkommini sfc-gh-pkommini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Just a question to cover my gap in knowledge. (re: -ve vs +ve correlation_period)

@sfc-gh-nlele
Copy link
Collaborator Author

@sfc-gh-pkommini positive correlation periods can be entered, but they wouldn't really be fruitful I think. The two scenarios I can think of with them are:

  1. There are, e.g., three un-correlated alerts during a run, and the alert that was most recent gets processed first. That alert will get a new correlation ID and the rest of the 2 alerts will then receive this same ID, if they're in this positive window.

  2. There are no alerts in the future, thus every alert gets its own correlation ID.

@sfc-gh-pkommini
Copy link
Collaborator

@sfc-gh-pkommini positive correlation periods can be entered, but they wouldn't really be fruitful I think. The two scenarios I can think of with them are:

  1. There are, e.g., three un-correlated alerts during a run, and the alert that was most recent gets processed first. That alert will get a new correlation ID and the rest of the 2 alerts will then receive this same ID, if they're in this positive window.
  2. There are no alerts in the future, thus every alert gets its own correlation ID.

So CORRELATION_PERIOD being negative means look back in last n minutes if there is an alert with the same fingerprint then reuse tthat existing correlation ID. It being positive means look forward if there is any alert with the fingerprint. Hence when it is positive since the current batch of alerts which have same fingerprint within the window will get correlated but won't get correlated with past batches making it not as effective as when used with a negative value.

Makes sense to me. Please merge. This was not meant to be a blocker.

@sfc-gh-pkommini
Copy link
Collaborator

sfc-gh-pkommini commented Jun 21, 2023

@sfc-gh-pkommini positive correlation periods can be entered, but they wouldn't really be fruitful I think. The two scenarios I can think of with them are:

  1. There are, e.g., three un-correlated alerts during a run, and the alert that was most recent gets processed first. That alert will get a new correlation ID and the rest of the 2 alerts will then receive this same ID, if they're in this positive window.
  2. There are no alerts in the future, thus every alert gets its own correlation ID.

So CORRELATION_PERIOD being negative means look back in last n minutes if there is an alert with the same fingerprint then reuse tthat existing correlation ID. It being positive means look forward if there is any alert with the fingerprint. Hence when it is positive, only alerts within the current batch which have same fingerprint will get correlated but won't get correlated with past batches making it not as effective or useful.

Makes sense to me. Please merge. This was not meant to be a blocker.

@sfc-gh-afedorov sfc-gh-afedorov self-requested a review June 23, 2023 15:46
@sfc-gh-afedorov
Copy link
Collaborator

sfc-gh-afedorov commented Jun 23, 2023

Instead of -180 as the value, could we have it be '-180s', '-180m', or '-180d' to specify second, minutes, or days?

@sfc-gh-nlele
Copy link
Collaborator Author

sfc-gh-nlele commented Jun 26, 2023

@sfc-gh-afedorov Sure 👍, but maybe positive values could indicate looking back (the usual case I think, it seems implied that we usually look back), and negative values could indicate looking forward? Either way works!

Altho it seems like we wouldn't be able to use something like this:

  AND event_time > ? - INTERVAL COALESCE(alert:CORRELATION_PERIOD, $${CORRELATION_PERIOD_MINUTES})

and will have to use CASE and manually parse the periods, since INTERVAL doesn't seem to support expressions within it.

AND event_time > DATEADD(minutes, $${CORRELATION_PERIOD_MINUTES}, ?)
AND event_time >
CASE
WHEN REGEXP_SUBSTR(LOWER(COALESCE(alert:CORRELATION_PERIOD, $${CORRELATION_PERIOD_MINUTES})), '[a-z]') = 's' THEN DATEADD(seconds, - TO_NUMBER(REGEXP_SUBSTR(COALESCE(alert:CORRELATION_PERIOD, $${CORRELATION_PERIOD_MINUTES})), '\\d+'),?)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a good place to use the alternative syntax for CASE

Screenshot 2023-06-29 at 9 27 19 AM

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oo right! This will be better 👍

WHERE alert:EVENT_TIME > DATEADD(minutes, $${CORRELATION_PERIOD_MINUTES}, ?)
WHERE alert:EVENT_TIME >
CASE
WHEN REGEXP_SUBSTR(LOWER(COALESCE(alert:CORRELATION_PERIOD, $${CORRELATION_PERIOD_MINUTES})), '[a-z]') = 's' THEN DATEADD(seconds, - TO_NUMBER(REGEXP_SUBSTR(COALESCE(alert:CORRELATION_PERIOD, $${CORRELATION_PERIOD_MINUTES})), '\\d+'),?)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@@ -34,6 +34,21 @@ function fillArray(value, len) {
return arr
}

function ifColumnExists(column_name) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-afedorov would something like this be okay for backward compatibility?

Checking if CORRELATION_PERIOD exists in the rule definition, and then adding it to the insert statement body if it does.

@@ -62,6 +77,7 @@ SELECT '$${RUN_ID}' run_id
'EVENT_DATA', IFNULL(EVENT_DATA::VARIANT, PARSE_JSON('null')),
'SEVERITY', IFNULL(SEVERITY::VARIANT, PARSE_JSON('null')),
'HANDLERS', IFNULL(OBJECT_CONSTRUCT(*):HANDLERS::VARIANT, PARSE_JSON('null'))
$${ifColumnExists('CORRELATION_PERIOD')}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems better to repeat the pattern above --

'CORRELATION_PERIOD', IFNULL(OBJECT_CONSTRUCT(*):CORRELATION_PERIOD::VARIANT, PARSE_JSON('null'))

or break it into a helper that applies this pattern like

'CORRELATION_PERIOD', $${defaultNullReference('CORRELATION_PERIOD::VARIANT')}

SET correlation_id = COALESCE(:1, UUID_STRING())
WHERE alert:EVENT_TIME >
CASE REGEXP_SUBSTR(LOWER(COALESCE(alert:CORRELATION_PERIOD, '$${CORRELATION_PERIOD_MINUTES}')), '[a-z]')
WHEN 's' THEN DATEADD(seconds, - TO_NUMBER(REGEXP_SUBSTR(COALESCE(alert:CORRELATION_PERIOD, '$${CORRELATION_PERIOD_MINUTES}'), '\\\\d+')), :2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since these are repeated eight times, maybe another approach is better, e.g. putting the number of seconds into JS and then the SQL just uses that variable

Copy link
Collaborator

@sfc-gh-afedorov sfc-gh-afedorov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two changes pls and then feel free to merge or punt back for re-review

@sfc-gh-nlele
Copy link
Collaborator Author

Tested the latest changes.

@sfc-gh-nlele sfc-gh-nlele merged commit cc05344 into main Jul 17, 2023
@sfc-gh-nlele sfc-gh-nlele deleted the add_custom_correlation_period branch July 17, 2023 07:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants