
New "search alert" type based on ES dsl query and hit count #61313

Closed
mikecote opened this issue Mar 25, 2020 · 40 comments · Fixed by #88528
Assignees
Labels
Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@mikecote
Contributor

mikecote commented Mar 25, 2020

There is value in an alert type that works purely from an ES DSL query and alerts on the data returned.

Things to investigate:

  • Look at current query based alerting as implemented by Watcher. Many questions have been raised by users who wish to achieve the same in Alerting as they currently do in Watcher. We don't have to have feature parity, but knowing where the differences are is valuable.
  • What constitutes an Alert Instance? Is it (most of these can be roadmap items but we can start with one):
    • per document
    • per count
    • above/below count
    • can we base instances off of an aggregation result somehow?
  • Painless support - are there limitations on what we can support? (Can we support this for example)
  • UI experience (none at the beginning, JSON editor or expression style field selector "x is y")
  • Using the --examples flag or POC to integrate with console app
  • Using the --examples flag or POC to integrate with discover app (see Proof of concept integrating "search alert" with discover #61314)
@mikecote mikecote added Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Mar 25, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Mar 26, 2020

If this alert is going to have a complex input element - eg, query DSL, or perhaps even KQL - I wonder if we'd want to "hide" it in the places we allow customers to create connectors. Only allow it to be created in an app like Discover, where we won't have to deal with validating a complex input object.

@dsztykman

I think we should only allow saved search for this

@gmmorris gmmorris self-assigned this Apr 6, 2020
@gmmorris
Contributor

gmmorris commented Apr 6, 2020

I think we should only allow saved search for this

@dsztykman Am I right in thinking that by saved search you actually mean saved query?
Unless I'm misunderstanding something I can't see how a saved search would be useful for an alert.
I was misunderstanding, as I thought saved searches save the results as well, which isn't the case - it only saves the underlying Discover configurations.

As for limiting this to Saved Queries - what do you see as the value in limiting alerts to a saved query? 🤔
Supporting saved queries makes sense to me, as that could provide a much simpler alert creation/editing UI, but limiting alerts to saved queries has some downsides.

For example, it would force the users to create noise in their saved queries list if they need to define a bunch of alerts. It also means we would create a coupling between a saved query and its underlying alert, which means a user might change the behaviour of an alert when they change a query without even knowing they're having that effect (unless we can somehow make this link explicit in Discover, but not sure how that would be done).

Anyway, would love to hear your thoughts. :)

@gmmorris
Contributor

gmmorris commented Apr 6, 2020

Looking at how Discover supports both Lucene and KQL I don't see an obvious reason not to support both in the Alert Type.

I'm even considering how we can best have the Alert Type mirror Discover (perhaps by building off of the types exposed by Discover so that they are truly linked), though that creates a coupling I'm not yet 100% sure about... bottom line, I think the Alert could, and probably should, support both Lucene and KQL. :)

@dsztykman

KQL is a bit limited in certain usage, and saved queries only apply for KQL, that's why I talked about saved search.

@gmmorris
Contributor

gmmorris commented Apr 6, 2020

Had a chat with @dsztykman and here are some notes from his feedback that are definitely worth keeping in mind:

David's main concern is that if you don't use a saved search/query as the basis users can't verify that their search actually returns what they want. I explained the plan is to expose this as an action in Discover where you create an Alert based off of the current search, which addresses this. This does bring to light the need to have a chart (like in Index Threshold) that gives the user an idea of how many results their search currently returns.

This raises a question worth asking: Do we want to offer the option in the flyout of creating an alert based on a saved search / query?
If so, what do we do if the user wants to create a new search?
We could take some free text search and rely on the chart for validating the criteria, but perhaps instead, we could redirect the user to Discover where they have a much better experience in doing so?

@AlonaNadler

I think we should only allow saved search for this

Having alerts only on saved searches would be very limiting; it is a struggle we have today with export to CSV. This would hide the alerting feature and create many saved search objects that are not needed.

Regarding saved queries, this is rather a new feature that has low usage.

Ideally, users can explicitly create an alert in Discover that is not tied to a saved search or saved query. If it helps the implementation, you could maybe create a saved search or query in the background, transparently to the user, for the sake of the alert. If a saved query/search becomes a mandatory step to create an alert, I'm concerned we'll run into the same challenge we have with CSV.

If I had to choose I would focus on KQL first, since this is the default and close to 90% of the clusters are using KQL

@pmuellr
Member

pmuellr commented Apr 7, 2020

If I had to choose I would focus on KQL first, since this is the default and close to 90% of the clusters are using KQL

+1

Seems like this would be much easier for customers to deal with than query DSL anyway. I have an ESSQL based alert I was playing with a while back also - https://github.com/pmuellr/kbn-sample-plugins/blob/master/plugins/alert_type_examples/server/alert_types/essql.ts

@gmmorris
Contributor

gmmorris commented Apr 8, 2020

I'm looking into how we can use the underlying data plugin that Discover uses.

The advantage to this approach is that we would get a 1-to-1 match with how Discover runs its queries, which means that whatever the user sees in the Discover UI should match what the query in the Alert sees. This means that when the user hits this theoretical "Create Alert from Query" button, they should get a starting point which matches what Discover displays and can then do things like change the time range etc.

It also means that changes that are made in Discover in the future, such as changes to the default params, would apply to the underlying Alerts as well and won't go out of sync.

In terms of work needed this would mean:

  1. We would have to expose an api on the data plugin that would allow the Alerting plugin to create a search API like we have on context. You can see an example of this in the PR below.
  2. We would need to figure out which fields populated by Discover should be modified at Alert creation (such as stripping out the date range and replacing with a date range relative to when the Alert runs) and what triggering options we need to expose (do we trigger an instance per document? Do we do so from the moment a document appears or does it require a threshold? etc.)
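The date-range replacement mentioned in point 2 could, for example, look something like this (a hypothetical helper for illustration, not actual alerting code):

```typescript
// Hypothetical sketch: build a time range relative to the moment the alert
// runs, standing in for the absolute range that Discover stored.
interface TimeRange {
  from: string;
  to: string;
}

function relativeTimeRange(lookbackSeconds: number, now: Date = new Date()): TimeRange {
  const from = new Date(now.getTime() - lookbackSeconds * 1000);
  return { from: from.toISOString(), to: now.toISOString() };
}
```

Each run would then query the window ending at its own execution time, instead of whatever absolute range was saved from Discover.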

Would love to hear thoughts from @elastic/kibana-alerting-services and @elastic/kibana-app-arch (as I think you own the data plugin?)

Example PR of the changes we'd need to make in data and what the Alert Type would then have to do: #62932

@pmuellr
Member

pmuellr commented Apr 8, 2020

Exposing a scoped data plugin accessor like we do callCluster and savedObjectClient for alertTypes seems like it makes a lot of sense - though I'm not familiar with the data plugin.

@gmmorris
Contributor

gmmorris commented Apr 8, 2020

Exposing a scoped data plugin accessor like we do callCluster and savedObjectClient for alertTypes seems like it makes a lot of sense - though I'm not familiar with the data plugin.

It's the plugin used by Discover to query.

@lukeelmers
Member

@gmmorris Looking at your PR, what you are describing sounds like what we are already doing here, where we register route handler context to provide scoped search:

```ts
core.http.registerRouteHandlerContext<'search'>('search', context => {
  return createApi({
    caller: context.core.elasticsearch.dataClient.callAsCurrentUser,
    searchStrategies: this.searchStrategies,
  });
});
```

However, it looks like you need the adminClient instead of dataClient, correct?

I wonder if it would be worth us providing an admin-scoped API as well via the router handler context? Or alternatively if we exported createApi statically and made the search strategies available on the runtime contract, you could probably construct the scoped api directly in your plugin.

Anyway, @lukasolson knows more on our long-term plans for the data search service on the server so he might have a better answer to this question.

@gmmorris
Contributor

gmmorris commented Apr 14, 2020

@gmmorris Looking at your PR, what you are describing sounds like what we are already doing here, where we register route handler context to provide scoped search:

```ts
core.http.registerRouteHandlerContext<'search'>('search', context => {
  return createApi({
    caller: context.core.elasticsearch.dataClient.callAsCurrentUser,
    searchStrategies: this.searchStrategies,
  });
});
```

However, it looks like you need the adminClient instead of dataClient, correct?

Thanks @lukeelmers, I might not have explained this well enough, let me expand a bit:

It's not that we're using the adminClient, rather what we're doing is creating a scoped dataClient by using the adminClient such that the scope is of a specific user - specifically the user that created the Alert.
The reason we can't use the context created in the snippet you reference, is that we do not actually come through a router, but rather this is a scheduled task executed by Task Manager. RouteHandlerContext simply isn't available to us when we're running a scheduled task.

I wonder if it would be worth us providing an admin-scoped API as well via the router handler context?

As we're not using the admin scoped client, but rather a user scoped client that wasn't created via a route handler, this wouldn't help in this case. 😬

Or alternatively if we exported createApi statically and made the search strategies available on the runtime contract, you could probably construct the scoped api directly in your plugin.

The thinking behind exposing the API that I added was that it provides the ability to create a scoped search using precisely the same code that creates it for the context. Using the same code means that even though these are two different entry points, they will always provide the same API and we shouldn't have any divergence in the future.

While we could expose createApi and searchStrategies, it felt cleaner to me to keep the domain of "creating a search api" scoped to the Search plugin, rather than leak that into other plugins. 🤔
Part of the goal here is to have parity (and alignment) with the API that Discover uses, so this change made the most sense to me.

Would you still prefer to expose createApi and searchStrategies over exposing just the handler that takes a dataClient same way the contextHandler does? It seems like a bigger (and less maintainable) change to me... Happy to chat over zoom, as I might just be missing some context.

I hope that all makes sense. 😄

Anyway, @lukasolson knows more on our long-term plans for the data search service on the server so he might have a better answer to this question.

I spoke to @lukasolson last week and he generally seemed to support the direction I took, but there's always a chance I missed something.

@lukeelmers
Member

Thanks for clarifying @gmmorris! I misunderstood your use case. If you already talked with @lukasolson and he's on board, then no worries on my end 😄

gmmorris added a commit that referenced this issue Apr 16, 2020
… and uptime (#63489)

Work on #61313 has revealed that we don't have a mock for AlertServices, which creates coupling between us and any solution depending on us, which makes it harder to make changes in our own code.

This PR adds mocks and uses them in SIEM, Monitoring and Uptime, so that we can make future changes without having to change outside solutions.
gmmorris added a commit to gmmorris/kibana that referenced this issue Apr 16, 2020
… and uptime (elastic#63489)

Work on elastic#61313 has revealed that we don't have a mock for AlertServices, which creates coupling between us and any solution depending on us, which makes it harder to make changes in our own code.

This PR adds mocks and uses them in SIEM, Monitoring and Uptime, so that we can make future changes without having to change outside solutions.
gmmorris added a commit that referenced this issue Apr 16, 2020
…toring and uptime (#63489) (#63662)

Work on #61313 has revealed that we don't have a mock for AlertServices, which creates coupling between us and any solution depending on us, which makes it harder to make changes in our own code.

This PR adds mocks and uses them in SIEM, Monitoring and Uptime, so that we can make future changes without having to change outside solutions.
@gmmorris
Contributor

gmmorris commented Apr 16, 2020

I've spent the day trying to figure out how the pieces can fit together and I think I now have a mental model of what might work well.

The direction I've been experimenting with is thus: The ES Query AlertType will essentially be parameterised with the following:

  1. A Query which is formatted as a SavedQuery - this would mean that anything we can save as a saved query can also back the ES Query AlertType.
    This would include the query itself (KQL initially), filter, time filter etc., as seen in this type:

```ts
export interface SavedQueryAttributes {
  // ...
}
```

  2. The Alert Type can then use the data plugin's search api in the same manner that Discover uses it, which means we can keep this Alert aligned with Discover in terms of behavior. This, though, will entail mimicking the process that happens in discover where the Saved Query is converted into the query that is passed to the search api. This is tricky as much of that implementation is encapsulated in the data plugin. I have managed to expose these required helpers locally on my machine (getTime for timeFilters, buildEsQuery to turn the KQL query into a DSL Query etc.), but it looks like we'll also need access to the IndexPatterns service which, currently, is only exposed on the client side (public plugin, not the server plugin) and so, this will require additional changes in the data plugin. This would need to be done with guidance from the @elastic/kibana-app-arch team.

  3. The Alert Type could then get back the results of this search and decide whether to trigger AlertInstances based on whatever parameters are specified by the alert creator - presumably a Minimal Hit Count above which we trigger the instances.
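As a rough illustration of that conversion step, here is a simplified stand-in for what the buildEsQuery/getTime helpers would produce; the real conversion lives in the data plugin, and all names and shapes here are assumed:

```typescript
// Illustrative conversion of a SavedQuery-like object into a search body.
// The real implementation would use the data plugin's buildEsQuery and
// getTime helpers rather than this hand-rolled shape.
interface SavedQueryLike {
  query: { language: 'kuery' | 'lucene'; query: string };
  timefilter?: { from: string; to: string };
}

function buildSearchBody(saved: SavedQueryLike): object {
  const filter: object[] = [
    // buildEsQuery would compile KQL properly; query_string stands in here.
    { query_string: { query: saved.query.query } },
  ];
  if (saved.timefilter) {
    filter.push({
      range: {
        '@timestamp': { gte: saved.timefilter.from, lte: saved.timefilter.to },
      },
    });
  }
  return { query: { bool: { filter } } };
}
```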

The benefit of this approach is that it should be straightforward to create an Alert directly from within Discover, as the format will match that of Saved Queries, and the matching documents seen in Discover should always match the ones seen in the alert. Imagine going from Discover to Alert and back to Discover: we'd be able to maintain a unified experience there, which I think is how we envisioned the original requirement.
The challenge will be that quite a few of the data plugin's internals (APIs & behaviour) would need to be exposed/mimicked in a maintainable manner, that will allow data to keep its encapsulation, allow alerting to avoid unnecessary coupling with data and will still provide us with the consistent behaviour we expect.

I think additional work on this alert should wait until we confirm this is the right direction and align with App-Arch to make sure we don't step on their toes with changes we'd need to make in the data plugin.

@gmmorris
Contributor

I'm going to try and get my local prototype PR to a stable enough point that it can express, more or less, how this might work.

@AlonaNadler

Thanks @gmmorris for sharing your approach here!
@timroes @kertal not sure if you see that. You might have thoughts on Gidi's approach

@timroes
Contributor

timroes commented Apr 17, 2020

@gmmorris That all sounds very reasonable. The part about server-side index patterns is really important, since I assume that's a rather large change (one that has been on the list for a long time), and I agree that we can't proceed with this alert before we have that. So I think it's worth talking to App Arch and clarifying planning around this with them.

@AlonaNadler
Copy link

cc: @alexh97 @ppisljar it seems that this needs index pattern on server side

@pmuellr
Member

pmuellr commented Nov 17, 2020

It would be really neat if this one could support Painless scripts as well, since that would add the possibility to cover certain edge cases.

That seems like that could be nice as a way to boil down the query result into some "simple" variables that could then be used in action parameters. We should probably look at all the scripting options available in Watcher as potential candidates for this kind of capability. Eg, https://www.elastic.co/guide/en/elasticsearch/reference/current/transform.html
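For reference, a Watcher script transform of the kind linked above boils the payload down to a few named values, roughly like this (the field names are illustrative):

```json
{
  "transform": {
    "script": {
      "source": "return [ 'error_count': ctx.payload.hits.total ]",
      "lang": "painless"
    }
  }
}
```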

@gmmorris
Contributor

short update from our side, IndexPatterns and SearchSource are now available on the server.

🎉 🎉 🎉 🎉

That's awesome, thanks @ppisljar
IIRC IndexPatterns was the main blocker on our end. 🤔

@pmuellr
Member

pmuellr commented Nov 24, 2020

We get requests from customers asking for more data in the context variables from the existing alerts. In some cases, this isn't really practical because of the way the alert builds its queries - and it's likely going to be difficult to give the customer a nice interface to specify what other data they want.

The DSL query alert is of course pretty nicely set up for this, since the customer presumably will be able to create open-ended queries. Or at least that's the end goal. The next question is then - how do we make this data available to customers such that it can easily be used as context variables in actions. I'd guess we may need something like issue #77793 in some use cases, as it's hard to imagine doing much more than providing the raw query response in the context, which I'm guessing won't be easily usable within a mustache template.

@gmmorris
Contributor

gmmorris commented Dec 2, 2020

Please note Infra's needs over here: https://github.com/elastic/sre/issues/342
Sounds like they broadly need "the full power of the Elasticsearch Query DSL combined with Painless scripting", so definitely worth syncing with them when this issue is picked up to ensure we're addressing their needs.

@ymao1 ymao1 self-assigned this Jan 5, 2021
@ymao1
Contributor

ymao1 commented Jan 5, 2021

  • Look at current query based alerting as implemented by Watcher. Many questions have been raised by users who wish to achieve the same in Alerting as they currently do in Watcher. We don't have to have feature parity, but knowing where the differences are is valuable.

Watcher Queries

Watcher Search Input type supports:

  • full ES DSL, including aggregations
  • Painless scripting
  • specifying indexed search templates with parameters

When a watch is triggered, the input query determines what data is loaded into the execution context (the payload). This payload is accessible during subsequent watch execution phases, such as evaluating the condition and firing actions, via ctx.payload inside mustache templates.
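For example, a Watcher logging action can read the payload through mustache (an illustrative snippet):

```json
{
  "actions": {
    "log_hits": {
      "logging": {
        "text": "The watch matched {{ctx.payload.hits.total}} documents"
      }
    }
  }
}
```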

Watcher Conditions

Watcher allows users to specify conditions that determine whether an alert becomes active (and an action is executed).

compare

Simple direct comparisons against values inside the payload.

Example `compare` condition:

```json
{
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gte": 5
      }
    }
  }
}
```

array_compare

Compares a value against an array of objects (such as aggregation output)

Example `array_compare` condition:

```json
{
  "condition": {
    "array_compare": {
      "ctx.payload.aggregations.group_by_host.buckets": {
        "path": "doc_count",
        "gte": {
          "value": 1000
        }
      }
    }
  }
}
```

script

Painless script for more complex comparison logic

Example `script` condition:

```json
{
  "condition": {
    "script": {
      "source": "return ctx.payload.errors.hits.total >= 15 && ctx.payload.all_messages.hits.total >= 50",
      "lang": "painless"
    }
  }
}
```

@ymao1
Contributor

ymao1 commented Jan 6, 2021

Security solutions have two rule types that allow for custom KQL or Lucene queries

Custom Query Rule Type

Custom Query rule type allows users to specify either a KQL or Lucene query. Both are converted into ES DSL in the alert executor (Lucene query is converted to the query string query). There is no ability to specify aggregations or conditions for this rule type. The implied condition is the existence of documents that match the query. Alerts (or alert instances as they're currently called inside the alerting framework) are the individual documents that match the query and actions are executed (at the specified action interval) when that number of documents exceeds 0.
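For illustration, a Lucene query such as `user.name:root AND event.outcome:failure` (field names are examples) would be wrapped into the DSL as a query string query:

```json
{
  "query": {
    "query_string": {
      "query": "user.name:root AND event.outcome:failure"
    }
  }
}
```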

Threshold Rule Type

Threshold rule type builds on the Custom Query rule type by allowing users to specify a field within the document to threshold against and a threshold value. Alerts (alert instances) are the individual documents that match the query and where the specified field exceeds the specified threshold value.

@ymao1
Contributor

ymao1 commented Jan 7, 2021

  • Painless support - are there limitations on what we can support? (Can we support this for example)

There is nothing inherently preventing the usage of Painless scripts within the query DSL. A query like this should work out of the box

{
    "query": {
        "bool" : {
            "filter" : [
                {
                    "script" : {
                        "script" : {
                            "source": "doc['field1'].value > doc['field2'].value",
                            "lang": "painless"
                         }
                    }
                }
            ]
        }
    }
}

In the cited Watcher example, the Watch is using mustache templating to populate the query with user specified (static) metadata values. If we wanted to provide feature parity, we would need to provide a way for a user to specify generic metadata for the query and then template the search query in the alert executor. However, since the metadata fields are static, if we didn't provide this functionality the user would just need to manually inject the metadata into the query when creating the alert.
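To make that parity gap concrete, a Watcher search input can reference static metadata through mustache, roughly like this (indices and fields are illustrative); without templating support, the user would simply inline the value when creating the alert:

```json
{
  "metadata": {
    "monitored_host": "web-01"
  },
  "input": {
    "search": {
      "request": {
        "indices": ["logs-*"],
        "body": {
          "query": {
            "term": { "host.name": "{{ctx.metadata.monitored_host}}" }
          }
        }
      }
    }
  }
}
```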

@ymao1
Contributor

ymao1 commented Jan 7, 2021

I'm thinking something like this for the initial search alert type:

  • JSON editor for specifying just the query portion of an Elasticsearch query
  • No support for aggregations
  • No support for mustache templating within the query
  • Ability to set a condition based on document count and time range
  • This would output a single alert instance matching_documents that would be Active (Threshold Met) when the doc count for the query meets the condition and OK when it doesn’t
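The document-count condition described above amounts to a simple comparison in the executor; a minimal sketch (the comparator names are assumed, not a final API):

```typescript
// Minimal sketch of the proposed condition check: the alert is Active
// (Threshold Met) when the hit count satisfies the comparison, OK otherwise.
type Comparator = 'gt' | 'gte' | 'lt' | 'lte';

function conditionMet(hitCount: number, comparator: Comparator, threshold: number): boolean {
  switch (comparator) {
    case 'gt':
      return hitCount > threshold;
    case 'gte':
      return hitCount >= threshold;
    case 'lt':
      return hitCount < threshold;
    case 'lte':
      return hitCount <= threshold;
    default:
      throw new Error('unknown comparator');
  }
}
```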

@arisonl @elastic/kibana-alerting-services Do you think this would be sufficient first step to start addressing the use cases you have seen for a search alert type?

After this, we could look into adding aggregation support. At this point, we could

  1. Add limited aggregation support, similar to what the Index Threshold alert offers which is count | average | sum | min | max plus the ability to group by a term.
  2. Add full aggregation support by allowing users to specify their own aggregations within the JSON query editor. This would bring us closer to Watcher parity but brings a lot of complexity into defining the alert conditions.
  • if you have a query with no aggregations, you likely want to set a threshold on the doc count of the query or on a field within the query results.
  • if you have a query with aggregations, you likely want to set a threshold on an aggregation value (for metric aggregations) or on a value inside an aggregation result (for bucket aggregations)

@mikecote
Contributor Author

@ymao1 I gave your proposal a read and it looks good. Great analysis btw!

The one piece I can see discussion on is how the alert instances are determined (matching_documents vs per document). I can see a case for matching_documents where the user may want to be notified once if something matches and be able to mute / throttle if necessary. I can also see a case for per document where the user may want to do a custom action for each document that was found (ex: creating a PagerDuty incident for each document matched). We could always explore making this a parameter as well and let the user choose how to configure this..

Is there potential for leveraging, merging, or re-using the alert type(s) from the Security solution? It could be worth starting to merge the similar alert types to avoid having too many types that do the same thing. We could add a user option to choose what type of query (ES DSL, KQL, Lucene) they want to use. This could give a few more options to the user for free 🙂 and vice-versa in SIEM in the future.

@ymao1
Contributor

ymao1 commented Jan 11, 2021

@mikecote That's a good point that just having matching_documents as an alert instance would not give any insight into exactly which documents match a query, which can be very useful.

Re: re-using alert types/code

  • Security solutions offers a single alert type with an executor that handles all the different types of security solutions detection rules (Custom Query, Threshold, ML, EQL, Indicator Match). There is the possibility to extract out and reuse some of the functions used to build the DSL query from the KQL/Lucene input for the Custom Query and Threshold types but we would not be able to reuse the entire alert type. Whatever functionality we extract out would also need to handle security solution specific (for now) concepts like exception lists.
  • Because security solutions considers each document an alert instance, it implements de-duplication logic as part of the search in order to make sure no document has multiple alert instances created for it. If we wanted to have an alert instance per document match, we would have to implement similar logic. It does seem like we could extract out and reuse the same methods for this though.

@mikecote
Contributor Author

@ymao1

Re: re-using alert types/code

Gotcha, thanks! That makes sense with the possibility of re-using / copying some code, but overall we'd create a new alert type due to the other one being under the signals alert type. 👍

@ymao1
Contributor

ymao1 commented Jan 13, 2021

Threshold Rule Type

Threshold rule type builds up on the Custom Query rule type by allowing users to specify a field within the document to threshold against and a threshold value. Alerts (alert instances) are the individual documents that match the query and where the specified field exceeds the specified threshold value.

Correction on how the security solutions threshold rule type works. For this they are generating a single alert instance if the number of (de-duplicated) documents that match the query for the time range exceeds the threshold. In order to get the individual documents that matched, they re-construct the query based on information stored in the alert instance (signal) during the investigation phase.

For overlapping alert intervals (for example, an alert that runs every minute, looking back 5 minutes each time, with a threshold of 50), this is what would happen:

  • run 1 - 100 matches against the query -> 100 exceeds threshold -> alert is active and a single alert instance created
  • run 2 - 80 matches against the query, but 40 are duplicates from the first run -> 40 does not exceed threshold -> alert not active
  • run 3 - 100 matches, but 5 are duplicates from Run1, 40 were matched in Run2 (but no alert instance created) and 55 are brand new -> 95 exceeds threshold -> alert is active and single alert instance created.
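The de-duplication behaviour in the runs above can be sketched as follows (a hypothetical helper, not the actual security solution code):

```typescript
// Only documents not seen in earlier runs count toward the threshold.
function newMatches(seen: Set<string>, hitIds: string[]): string[] {
  const fresh = hitIds.filter((id) => !seen.has(id));
  fresh.forEach((id) => seen.add(id));
  return fresh;
}
```

With a threshold of 50, a run would create an alert instance only when `newMatches(...)` returns 50 or more fresh documents.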

@arisonl
Contributor

arisonl commented Jan 14, 2021

@ymao1 Thanks for this analysis. Leaving aggregations initially out for the MVP is fine, but I suspect that to the users that are interested in the search alert, aggregations are important, so we should expect that we need to pick it up soon and plan accordingly.

A couple of notes that may be worth pointing out in this context. Not requirements for this story, rather for your and @mikecote's consideration as we think through this and its next steps. Users often need to be able to interact with the search result for the purposes of:

  • Transforming it in order to populate action parameters accordingly as they need. I am linking a related enhancement request below, as well as internal requirements (Infra team).
  • Setting up a dynamic condition. For example set up a dynamic threshold of the type trigger if the aggregation of the metric within the set window is X times greater than that of the previous window.
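The window-over-window condition in the last bullet could reduce to something like this (an illustrative sketch; `factor` is the "X" in "X times greater"):

```typescript
// Trigger when the current window's aggregated metric is at least `factor`
// times the previous window's value.
function dynamicThresholdMet(current: number, previous: number, factor: number): boolean {
  return previous > 0 && current >= previous * factor;
}
```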

@ymao1
Contributor

ymao1 commented Jan 29, 2021

Follow-on issues created:

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022