[RAC] Alerts as Data Schema Definition #93728
Comments
Pinging @elastic/security-solution (Team: SecuritySolution) |
Pinging @elastic/security-detections-response (Team:Detections and Resp) |
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
I assume

I'm not sure what

Do we want

For current alerting purposes, we are mapping

For

For

For the current event log, we store the current Kibana server's uuid in the docs. Turns out this has been useful to identify problematic Kibanas (eg, configured with the wrong encryption key) and for other diagnostic purposes, since this is all happening on the back-end and we have nothing like http logs for traceability. |
Within the
|
I don't think I've seen yet how we propose to handle RBAC concerns here. For Kibana alerting, we use standard feature controls and a

Currently, there are at least the following
I'm guessing SIEM hasn't had to deal with this since:
We can "roll our own story" here, like we did with event log, where we don't provide open-ended access to the indices, but gate access by requiring queries to specify the saved objects they want to do queries over for the event data. We ensure the user can "read" the requested saved objects, then do the queries filtering over those saved object ids (and the current space the query was targeted at). But this doesn't lend itself to a "use the alert data in Discover or Lens", because of that saved objects filter barrier. |
Good question @pmuellr! On the SIEM/Security side it's been achieved through both the space-awareness of the indices, and for users wanting to keep everything within a single space, but maintain some sort of multi-tenancy, document level security is used to restrict access based on usually some namespace-like field within the alert document. We've had users report success with both methods, with the only real complaint being that it needs to be manually configured as spaces are added. This sorta flexibility has proved to be really nice for users as it makes the whole "use the alert data in Discover or Lens" pretty frictionless. |
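As a concrete sketch of the document-level-security approach described above: a role can gate read access to the alert indices with a DLS query on a namespace-like field. The role name and the tenant.id field here are made up for illustration; the API shape is standard Elasticsearch security:

```
PUT /_security/role/alerts_read_team_a
{
  "indices": [
    {
      "names": [".siem-signals-*"],
      "privileges": ["read"],
      "query": { "term": { "tenant.id": "team-a" } }
    }
  ]
}
```

Users assigned this role can then use the alert data freely in Discover or Lens, but only ever see their tenant's documents.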
Here are a few more questions from the unified architecture workstream doc:
|
I'm thinking so -- these fields will be extremely useful for tracking/dashboarding, so keeping them mapped alongside the alert will ensure that's easy to do (no join). Question is how do you track assignment/status over time? Can we rely on the Kibana Audit Log for that data (logged API calls to assign/change status)? Other question is where do they belong. @MikePaquette, are there any ECS plans for capturing workflow fields like these?
I think this is a must in an effort to allow users and UIs to break down alerts by rule-specific fields, e.g. show all alerts where
Would default to |
I’ve been exploring various options over the last week to see what we can do with alerts as data. This is more of a brain dump than a complete proposal. I might not use the right terminology in certain places, or the right field names, and my intent is not to cover all scenarios. I would need to learn more about Security/Maps/ML rules to figure out what is generally useful and what isn’t, but I’ve tried to keep whatever I know about other teams’ approaches in the back of my mind. I think we have the following rule type categories:
I think up until now we’ve mostly been talking about log or event rule types, which are one-offs: if an event violates the rule, it’s indexed, needs a human to close it, and should only notify (execute actions) once. I.e., the alert is an event, without a lifecycle.

For the other rule types, there is a lifecycle. An alert can activate, then recover. It can also become more or less severe over time. I think it’d be valuable to capture and display that progression as well.

We also need grouping - because of the underlying progression, but also because one rule can generate many alerts, and they all might point to the same underlying problem. Consider an infrastructure alert for CPU usage of Kubernetes pods for a user that has 10k pods. The execution interval is set to 1m. If there is a GCP outage, the rule will generate 10k alerts per minute. In this scenario, we’d like to group by rule (id), and display a summary of the alerts as “Reason”. We also might want to group by hostname, or show alerts that recovered as different entries in the table. Here’s an example of the latter (which also shows the progression of the severity over time).

To make this work, we would need another layer of data: alert events. One alert can generate multiple alert events during its lifespan. If an alert recovers, it reaches the end of its lifespan. The next time the violation happens, a new alert is created. Changes in severity do not create a new alert. Alerts can be grouped together by a grouping key. E.g., in the aforementioned infrastructure rule, the grouping key would be host.hostname. In the UI, we could then allow the user to investigate the history of alerts for host.hostname (or any other grouping key). Conceptually, a rule type, a rule, an alert, and a violation would all map to event data, and for the alert event, this data would be merged together. Each alert event captures the following data about the alert:
Additionally, information specific to the violation is stored. This allows us to visualize the progression of the severity over time.
On all events, some data about the rule type and the rule itself should be added as well:
I would consider open/closed to be different from active/recovered. E.g., an alert could auto-close, meaning that it is closed immediately after the alert recovers. Or, it could auto-close 30m after the alert recovers. The alert would be kept “alive” by the framework until the timeout expires. For now, I’ve left open/closed out of the equation, as it might not be necessary now, but a possibility is that if an alert closes later than it recovers, a new event is indexed, with the state from the last alert event, but with alert.closed: true.

Additionally, there’s one other concept that I’ve been experimenting with that might be interesting: influencers. Similar to ML, these would be suspected contributors to the reason of the violation. These values would be indexed under alert.violation.influencers, which can be a keyword field, or a flattened field. For instance, if I have a transaction duration rule for the opbeans-java service, and it creates an alert, one of the influencers would be service.name:opbeans-java. But we also might want to run a significant terms aggregation, compare data that exceeded the threshold with data that didn’t, surface terms that occur more in one dataset than the other, and index them as influencers. This could surface host.hostname:x as an influencer, and would allow us to correlate data between alerts from different solutions. It could also be multiple hostnames.

These fields might be ECS-compatible, but I would still index them under alert.violation.influencers. The difference is that I see fields stored as ECS as a guarantee that the alert relates to that field, whereas influencers are a suspicion. E.g., if we suspect that a host contributed to the violation of the transaction duration rule for opbeans-java, we would index it as an influencer only. If we have an infrastructure rule for a host name, and a violation of the rule occurs, it is indexed both under host.hostname and alert.violation.influencers. This allows the Infrastructure UI to recall alerts that are guaranteed to be relevant for the displayed host, but also display possibly related alerts in a different manner.

The event-based approach would also allow us to capture state changes over time, e.g. we could use them to answer @spong’s question about tracking changes in open/closed over time. Any state change would be a new event, inheriting its state from the previous event.

Querying this data generally means: get me the values of the last alert event if I group by x/y. The easiest way to do this is a terms aggregation with a nested top_metrics aggregation, sorted by @timestamp.

As you can see, there are gaps, which means that the alert recovered, but then activated again. In this case, that's because I'm disabling and enabling the rule. But this could also show that the threshold is too low, allowing the user to adjust that. Or, hypothetically, the framework could keep an alert alive for a short time period, to prevent it from flip-flapping. |
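A rough sketch of that query shape (index pattern is hypothetical, field names follow the alert.* proposal above): a terms aggregation keyed on the alert identity with a top_metrics sub-aggregation sorted by @timestamp descending returns the latest values per alert in a single request:

```
POST /.alerts-observability-*/_search
{
  "size": 0,
  "aggs": {
    "alerts": {
      "terms": { "field": "alert.uuid", "size": 100 },
      "aggs": {
        "latest": {
          "top_metrics": {
            "metrics": [
              { "field": "alert.severity.value" },
              { "field": "alert.duration.us" }
            ],
            "sort": { "@timestamp": "desc" }
          }
        }
      }
    }
  }
}
```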
I would highly recommend applying lessons learned from #50213 to the schema. We've explored different paths but the one that works is to create an index per rule type. This would mean having |
I would like to capture the "why" that the schema has to be in ECS format (from an observability and Security perspective). The reasons will be useful input for the saved-objects as data story (#94502). |
Also, I don't see workflow-related fields in the schema above (status, etc), are they part of |
I believe @FrankHassanabad captured it best in a demo awhile back, but the power of using ECS (in these docs) across solutions is in the use and re-use of visualizations and workflows, being able to view these alerts in custom solution-based views like an authentication table or transaction log, and allowing users to quickly and easily filter on fields that are common and familiar.
I don't think they were captured in the initial doc(?), for now I was thinking we'd namespace them under

I made a few small updates to the schema in the above description, namespacing to

As mentioned earlier, we'll need to capture the alerts-on-alerts fields as well -- will finalize those as part of implementing the different rule types within the rac test plugin. I've created a draft PR of what I've had a chance to put together so far (not much), but it at least covers the bootstrapping of the index/template/ilm and has the same script for generating the mapping like the |
@sqren @dgieselaar is the above the same for Observability's use case to use ECS?
@spong are the workflow fields something used by the Security solution, or only by Observability? I would like to capture and discuss why we wouldn't use Cases instead of duplicating their workflows.
Some thoughts:
|
In our case the amount of fields (ECS or not) that we store with the alert document is fairly limited, I suspect. E.g. it might just be a handful per rule type/observer.
|
I had a good meeting with @dgieselaar, @spong, @smith, and a few others last week to discuss alert concepts to make sure we are all on the same page with how to talk about all of these moving parts. I think we all have a common understanding of the general issues around merging our two sets of concepts, but there are still a few things I'm not clear on. Here are some images to try to help push this discussion forward.

These are the concepts that I understand for observability rules + alerts:

These are the related concepts that I (admittedly very vaguely) understand for security rules + alerts:

My questions are about this concept I've called an "alert stream", or that Dario refers to as "alert events", and how observability treats each detected violation as an immutable event, with user-centric "alerts" being aggregations of those events (including an eventual recovery event). Whereas security, I think, creates a mutable violation/alert and then updates that object with different workflow states (open|in-progress|closed). It's not yet clear to me how we are planning to merge these concepts in a schema we both share.

Also, I have a question about security violations/alerts as depicted in my crude drawing: can a given security detection generate multiple violations/alerts in this way? What am I misunderstanding about this model still? |
Yes, one rule execution can result in multiple alerts. Each alert has its own status and can be marked in progress/closed independently. The typical example is that a search rule matches multiple documents and we create an alert for each. This example is our most simple/common rule type, but it's worth looking at our more complex rule types as well. In particular, the Event Correlation rule type has some similarities with the Observability alerts. A correlation rule detects a sequence of events (each event respects some condition, in a particular order). To use your terminology, we create "alert events" for each individual event and then also a single user-centric mutable alert that we display to the user. In our terminology, we call the events "building-block alerts" and the user-centric alert, simply Alert. The "alert events" are good because they save the documents in case they get deleted by ILM and capture the state of the docs as they were at rule execution time. The user-centric alert is good because it contains the mutable state and makes it easy to page through in the UI and create custom visualizations on top. I'm thinking the same model can apply to Observability alerts. WDYT? |
@tsg:
To some extent, yes. But I'm not sure if we need to mutate things. We can "just" write the latest alert state to an index, and then use aggregations or collapse to get the last value. Otherwise, we would either need some kind of job that cleans up old alerts, or have a user do that manually (and maybe bulk it). For the mutable alerts in security, how do they get deleted? (fwiw I think there's definitely common ground here, and at least for APM we will be looking at rules that might be more similar to Security rules, e.g. detection of new errors, so I'm not too worried about diverging). |
To make sure I understand, you are saying that you could only index the "alert event", and that the "user-centric alert" doesn't need to be explicitly present in the index? I think we'll need mutations for marking in progress/closed/acknowledged/etc., right? That is related to the MTTx discussion earlier. Also, I think indexing a "user-centric" alert makes querying and visualizations easier because you don't need an aggregation layer at query time.
The .alerts indices have ILM policies. By default, they are not deleted, but users can configure a Delete phase in ILM.
++, I think though that the decision of always having a user-centric alert indexed is important because then we know we can rely on it in any new UI. |
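For reference, a minimal sketch of what such an ILM policy could look like (the policy name and the thresholds are made up, and the actual .alerts policy may differ): rollover in the hot phase, plus the optional Delete phase mentioned above:

```
PUT /_ilm/policy/alerts-default-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "30d" }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```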
In my head, we wouldn't need mutations. We'd just append a new event, with the updated state.
Yeah, maybe? I don't know. I put a lot of faith in aggregations and I think ES can help here as well. I think for Observability we almost always just want to aggregate over events. My perception is that there are better ways to surface the most important data than asking the user to paginate through the whole dataset. Aggregation/grouping might be an interesting default for the Security alerts table as well.
I'm not sure if I fully understand how ILM works, but suppose the user configures their policy to delete alerts after 30 days, corresponding to the retention period of their machine data, and an alert is in progress for longer than 30 days, is it deleted? That seems like an edge case, but the answer would help me understand the implications of mutating alert documents better. Maybe mutating data is easier. But I think that you'll end up having to manage (sync?) two data sources (the events, and the user-centric alert), and things might get ugly quickly. |
Not sure if this is the right place so feel free to slack or email about it, but what are use cases today that cannot be satisfied by aggregations or collapse? |
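To make the collapse option concrete, a sketch (hypothetical index pattern; field names as in the examples later in this thread): collapsing on alert.uuid with a descending @timestamp sort returns only the most recent event per alert, which is the "last value" read described above, without an aggregation layer:

```
POST /.alerts-observability-*/_search
{
  "size": 25,
  "query": { "term": { "event.kind": "alert" } },
  "collapse": { "field": "alert.uuid" },
  "sort": [{ "@timestamp": "desc" }]
}
```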
Maybe the right word is "evaluation" instead of "check". Thinking out loud: an evaluation of the rule might be a violation of the rule, but not always. (e.g., you'd have

Maybe there are two distinct phases to rule execution: evaluate, and alert. The severity belongs on the alert, not on the evaluation. Whatever the last severity level is will be stored on the alert. If the changes in severity are important we can query the metrics - we should be able to store most if not all alert fields on the evaluation metric document. If not, we can just query the alerts.

Here's an example for a rule that monitors latency for all production services. A violation occurs at 13:00, opening an alert; the alert stays active at 13:01 while the latency is above the threshold, and closes at 13:02 when the latency drops below the threshold. [
{
"@timestamp": "2021-03-28T13:00:00.000Z",
"event.kind": "metric",
"event.action": "evaluate",
"rule.uuid": "9c00495e-0e91-4653-ab7a-30b62298085e",
"rule.id": "apm.transaction_duration_threshold",
"rule.name": "Transaction duration for production services",
"rule.category": "Transaction duration",
"producer": "apm",
"alert.uuid": "19d83250-f773-4499-92c7-f7d1346ed747",
"alert.id": "opbeans-java",
"alert.start": "2021-03-28T13:00:00.000Z",
"alert.duration.us": 0,
"alert.severity.value": 90,
"alert.severity.level": "critical",
"alert.status": "open",
"evaluation.value": 1000,
"evaluation.threshold": 900,
"evaluation.status": "violation",
"service.name": "opbeans-java",
"service.environment": "production"
},
{
"@timestamp": "2021-03-28T13:00:00.000Z",
"event.kind": "alert",
"rule.uuid": "9c00495e-0e91-4653-ab7a-30b62298085e",
"rule.id": "apm.transaction_duration_threshold",
"rule.name": "Transaction duration for production services",
"rule.category": "Transaction duration",
"producer": "apm",
"alert.uuid": "19d83250-f773-4499-92c7-f7d1346ed747",
"alert.id": "opbeans-java",
"alert.start": "2021-03-28T13:00:00.000Z",
"alert.duration.us": 0,
"alert.severity.value": 90,
"alert.severity.level": "critical",
"alert.status": "open",
"service.name": "opbeans-java",
"service.environment": "production"
},
{
"@timestamp": "2021-03-28T13:01:00.000Z",
"event.kind": "metric",
"event.action": "evaluate",
"rule.uuid": "9c00495e-0e91-4653-ab7a-30b62298085e",
"rule.id": "apm.transaction_duration_threshold",
"rule.name": "Transaction duration for production services",
"rule.category": "Transaction duration",
"producer": "apm",
"alert.uuid": "19d83250-f773-4499-92c7-f7d1346ed747",
"alert.id": "opbeans-java",
"alert.start": "2021-03-28T13:00:00.000Z",
"alert.duration.us": 60000000,
"alert.severity.value": 90,
"alert.severity.level": "critical",
"alert.status": "open",
"evaluation.value": 1050,
"evaluation.threshold": 900,
"evaluation.status": "violation",
"service.name": "opbeans-java",
"service.environment": "production"
},
{
"@timestamp": "2021-03-28T13:01:00.000Z",
"event.kind": "alert",
"rule.uuid": "9c00495e-0e91-4653-ab7a-30b62298085e",
"rule.id": "apm.transaction_duration_threshold",
"rule.name": "Transaction duration for production services",
"rule.category": "Transaction duration",
"producer": "apm",
"alert.uuid": "19d83250-f773-4499-92c7-f7d1346ed747",
"alert.id": "opbeans-java",
"alert.start": "2021-03-28T13:00:00.000Z",
"alert.duration.us": 60000000,
"alert.severity.value": 90,
"alert.severity.level": "critical",
"alert.status": "open",
"service.name": "opbeans-java",
"service.environment": "production"
},
{
"@timestamp": "2021-03-28T13:02:00.000Z",
"event.kind": "metric",
"event.action": "evaluate",
"rule.uuid": "9c00495e-0e91-4653-ab7a-30b62298085e",
"rule.id": "apm.transaction_duration_threshold",
"rule.name": "Transaction duration for production services",
"rule.category": "Transaction duration",
"producer": "apm",
"alert.uuid": "19d83250-f773-4499-92c7-f7d1346ed747",
"alert.id": "opbeans-java",
"alert.start": "2021-03-28T13:00:00.000Z",
"alert.end": "2021-03-28T13:02:00.000Z",
"alert.duration.us": 120000000,
"alert.severity.value": 90,
"alert.severity.level": "critical",
"alert.status": "close",
"evaluation.value": 500,
"evaluation.threshold": 900,
"evaluation.status": "ok",
"service.name": "opbeans-java",
"service.environment": "production"
},
{
"@timestamp": "2021-03-28T13:02:00.000Z",
"event.kind": "alert",
"rule.uuid": "9c00495e-0e91-4653-ab7a-30b62298085e",
"rule.id": "apm.transaction_duration_threshold",
"rule.name": "Transaction duration for production services",
"rule.category": "Transaction duration",
"producer": "apm",
"alert.uuid": "19d83250-f773-4499-92c7-f7d1346ed747",
"alert.id": "opbeans-java",
"alert.start": "2021-03-28T13:00:00.000Z",
"alert.end": "2021-03-28T13:02:00.000Z",
"alert.duration.us": 120000000,
"alert.severity.value": 90,
"alert.severity.level": "critical",
"alert.status": "close",
"service.name": "opbeans-java",
"service.environment": "production"
},
{
"@timestamp": "2021-03-28T13:03:00.000Z",
"event.kind": "metric",
"event.action": "evaluate",
"rule.uuid": "9c00495e-0e91-4653-ab7a-30b62298085e",
"rule.id": "apm.transaction_duration_threshold",
"rule.name": "Transaction duration for production services",
"rule.category": "Transaction duration",
"producer": "apm",
"alert.id": "opbeans-java",
"evaluation.value": 300,
"evaluation.threshold": 900,
"evaluation.status": "ok",
"service.name": "opbeans-java",
"service.environment": "production"
}
] |
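Querying the example above for one alert's progression could then look something like the sketch below (hypothetical index pattern; the alert.uuid is the one from the documents above): filter to the evaluation metric documents for that alert and bucket them over time, e.g. to chart the latency and severity values per minute:

```
POST /.alerts-observability-apm-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "alert.uuid": "19d83250-f773-4499-92c7-f7d1346ed747" } },
        { "term": { "event.action": "evaluate" } }
      ]
    }
  },
  "aggs": {
    "over_time": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" },
      "aggs": {
        "latency":  { "max": { "field": "evaluation.value" } },
        "severity": { "max": { "field": "alert.severity.value" } }
      }
    }
  }
}
```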
So as best as I can tell, we've discussed a few different possible document types we could index:

[A] Signal/Alert Document: (

[B] Evaluation Document: (

[C] State-change Document: (

Our indexing options include:
My biggest question right now, I think, is do we really need more than indexing option (1) here? @andrewvc from Uptime has a lot of thoughts about this exact thing, so I'd like him to chime in from another Observability perspective, as well. Thanks! |
We cannot do the severity log. But I want to challenge the idea that pre-aggregating and updating a single document is the simpler approach. From what I can tell there are reports of performance issues where bulk updates of tens of thousands of alerts can take longer than 60s (@spong correct me if I misunderstood that). For Observability alerts this could happen every minute or even at a higher rate. |
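For context, the kind of bulk workflow update being discussed looks roughly like the sketch below. The index name and the signal.* field names are assumptions about the current .siem-signals layout, and the rule id is just the example uuid from earlier in this thread; the point is that closing many alerts at once means an update-by-query (or large _bulk) touching every matching document:

```
POST /.siem-signals-default/_update_by_query?conflicts=proceed
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "signal.status": "open" } },
        { "term": { "signal.rule.id": "9c00495e-0e91-4653-ab7a-30b62298085e" } }
      ]
    }
  },
  "script": {
    "source": "ctx._source.signal.status = params.status",
    "lang": "painless",
    "params": { "status": "closed" }
  }
}
```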
@dgieselaar I had a good talk with @andrewvc yesterday where he talked at length about the simplicity gains of updates over relying on create-only + aggregations. I'll let him elaborate. The Metrics UI is already a massive tangle of giant aggregations for which we are constantly wishing we had more stable and easily queryable documents, so I'm at least open to the idea that we may need to get over our fear of updates, but I'm not deep enough in the weeds to have a solid opinion. @simianhacker @weltenwort do either of you have insights on this specific part of the conversation, re: storing lots of individual documents and aggregating for queries vs updating a single document and querying for that? |
Also, the read would definitely be simpler, in the sense that it's one less level of aggregation to think through (and be limited by in some ways). The update may be more expensive than create/append-only, but I'm not sure to what extent. |
If we don't need to visualize severity or state changes over time, agree. But from what I know we do want to have that.
Me neither. I asked for guidance from the ES folks but from what I understand there are no benchmarks for frequent updates because that is not recommended in general. We should probably chat to them at some point soon. |
What a thread, I'm finally caught up reading the whole thing :). Conclusive answers here are tricky, and while I have a number of open questions at this point I'm leaning toward the 'signal' doc approach as superior (though I'm not 100% certain there). The main question that kept coming into my head reading through this thread is 'what are the actual queries we want to run on this data?'. The challenge here is that we want a schema that works for multiple future UI iterations, where the queries are unknown. Any choice we make here amounts to a bet on the sorts of features we'll develop in the future. There's no one right answer. Given that, if the main thing we want to represent in our UIs is the aggregated data (signals), that points toward using these mutable signal docs IMHO. Both approaches can work, but append only is more limiting in many read-time scenarios.

Synthetics

I have some further specific questions about how this would all play with Uptime/Synthetics. A common question we have to answer is 'give me a list of monitors filtered by up/down status'. It would be great to shift this to a list of monitors by alert active/not active, since that aligns the UI behavior with the alert behavior, and really makes alerts the definition of whether a monitor is good / bad. Some questions about how this impacts us:

One other option we've discussed is embedding the alert definition in the Heartbeat config with Fleet/Yaml, and doing the sliding window status evaluation on the client side, in Heartbeat, embedding the status / state in each heartbeat document. The performance cost here is essentially zero to record both contiguous up and down states. This process would let us look at a monitor's history as a series of state changes, as in the image below: This effect can be achieved by the equivalent of embedding an 'evaluation' doc in each indexed doc, but with very little data overhead since it's just a few fields in each doc. We could probably derive a 'signal' doc in a variety of ways (dual write from the beat, some bg process in Kibana etc), or defer that work till needed. That said, there's a lot to be said for the elegance of doing it all centrally in Kibana: you have much more power at your fingertips in terms of data context, but then you have more perf concerns as well (like scaling Kibana workers).

Performance

Performance is a great thing to be thinking about now, and it's a tough comparison, because while we may save at index time by going append-only we may pay a large cost in software complexity and slow queries doing weird filtering, perhaps being forced to do that on

One thing that I would like us to consider generally is that any system that gets more active as things get worse is somewhat scary. When half your datacenter dies and an alert storm ensues you don't want your ES write load to jump by 1000% etc. One advantage of a system that reports both up and down states (nice for Synthetics use cases) would be that it would have less performance volatility. This also points toward mutable docs since steady states would be less onerous in terms of storage requirements (if you implement a flapping state they can be damped even further). I'll also add that it's hard to compare perf of updates vs append-only if the read-time costs of append-only wind up higher. |
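One way to express Andrew's "list of monitors by alert active/not active" as a query over the proposed alert documents might be the sketch below. The index pattern and the exact status value are assumptions; monitor.id is the usual Heartbeat field. A terms aggregation over currently-open alerts yields the set of monitors that should be shown as down:

```
POST /.alerts-observability-uptime-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.kind": "alert" } },
        { "term": { "alert.status": "open" } }
      ]
    }
  },
  "aggs": {
    "monitors_with_active_alerts": {
      "terms": { "field": "monitor.id", "size": 1000 }
    }
  }
}
```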
Cheers for chiming in @andrewvc, good stuff - your experience with heartbeat/uptime is very valuable in the context of this discussion.
++. But, that's also why I feel we should store more, not less, precisely because we don't know.
I think you'd have to have one changeable alert document per monitor, that you continuously update with the latest status. At least, I don't see how you avoid the issues you mention if you create a new changeable alert document on up/down changes. Having one long-running changeable alert document is somewhat concerning to me though.
I'm not sure what the "alert" is you're referring to here. In the new terminology, is it the rule? E.g., the user can create multiple rules that evaluate the same (Uptime) monitor?
Yep, I'm hoping ES compression will help us here :).
I don't entirely follow :) what documents are you indexing and / or updating here? One document per state change, and update that?
++. I think eventually we want something that can be applied both as a rule in Kibana, or as a rule in a beat (PromQL recording/alerting rules come to mind). I think Security also has something like this where rule definitions are shared between Kibana rules and the endpoint agents. @tsg is that correct?
Agree, and I want to emphasise again that I am not suggesting we repeat the problems that Uptime and Metrics UI are running into. Which is why I think we should re-open elastic/elasticsearch#61349 (comment) :). If ES supports something like give me the top document of value x for field y, and only filter/aggregate on those documents, that would be a huge power-play.
++ on having consistent output (ie, evaluations). But not sure if mutable docs are reasonable here? If a very large percentage of your documents are continuously being updated, will that not create merge pressure on Elasticsearch because it tries to merge segments all the time due to the deleted documents threshold being reached? |
Thanks @andrewvc for chiming in (and for the patience of reading the whole thing :) ).
Interesting. I'm curious about the advantages of this approach of using the alerts data as source of truth versus maintaining your own state in the app and creating alerts to reflect that state. Is it a matter of consolidating the logic in a single place, and that place is the Rule type? I think that can work, I'm just considering if this will bring a new set of requirements on the RAC :) Btw, Heartbeat might have some similarities with Endpoint in this regard. The Endpoint knows already what is an alert as soon as it happens on the edge (e.g. malware model was triggered). It indexes an immutable document in its normal data streams. Then we have a "Promotion Rule", that's supposed to be always on, which "promotes" the immutable alert documents to Alerts with workflow (
I'm thinking the easiest would be that for each transition to down status you would store an Alert (

Then the Alerting view in Obs or Kibana top-level will only show the down status alerts by default, but the custom UI in the Synthetics app can query both down and up status alerts. In other words, the contract is that |
@dgieselaar, @andrewvc, and I just had a short zoom call to discuss observability schema stuff, and I think we settled on:
(For reference, I made up these document letter names here in this comment above.) Observability will start out likely storing [A] and [B]; we expect Security may continue only storing [A], and that should all work fine for all of our needs and create a consistent query API for an alert table. We also talked about further engaging the Elasticsearch team about update performance at scale, among other things, which @dgieselaar and @andrewvc will continue looking into in parallel. |
@tsg this feels like it aligns perfectly with what I wrote in my last comment, but I hadn't seen this bit in yours. If so, then I think we've got good alignment here! cc @spong |
@jasonesc Yes, I think we're aligned 👍 . Because this ticket is so long, I'll write a tl;dr of this conclusion and add it to the description early next week. |
One thing that I want to point out about these [B] and [C] documents: if we store them in the Alerts indices, they will be bound by the Alerts indices mapping. That means ECS + some fields that we agree on. @jasonrhodes @andrewvc @dgieselaar for the use cases that you have in mind, are there fields that you expect to need to sort/filter by that are not in ECS? If yes, can you list them to see if we could still include them in the mapping? As long as they don't risk conflicting with future ECS fields and there are not too many of them, we can probably just add them to the mapping. |
@tsg the plan is to create separate indices for solutions or even specific rule types. The fields needed for a unified experience should be in the root mapping that is shared between all indices. So, any specific fields we need might start out being defined in some of the Observability indices, and they can be "promoted" to the root mapping once we feel they're mature enough and widely useful.

In my head, any kind of event can be stored in these indices. That means state changes, evaluations, but also rule execution events. I do wonder if we should store signal documents in another index that is not ILMed (with the same mappings). From what I know, ILM requires one and only one write index. If we roll over an index on a stack upgrade, presumably this puts the rolled-over index in read mode. How do you then update signal documents from indices that just rolled over? (Added this to the agenda for tomorrow) |
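A sketch of how that "shared root mapping + per-solution/rule-type indices" layout could be wired up with composable templates. All names and the exact field set below are illustrative, not the agreed schema; the mechanism is a component template holding the common root fields, reused by each solution-specific index template via composed_of:

```
PUT /_component_template/alerts-root-mappings
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp":           { "type": "date" },
        "event.kind":           { "type": "keyword" },
        "rule.uuid":            { "type": "keyword" },
        "rule.category":        { "type": "keyword" },
        "alert.uuid":           { "type": "keyword" },
        "alert.status":         { "type": "keyword" },
        "alert.severity.value": { "type": "long" }
      }
    }
  }
}

PUT /_index_template/alerts-observability-apm
{
  "index_patterns": [".alerts-observability-apm-*"],
  "composed_of": ["alerts-root-mappings"],
  "template": {
    "mappings": {
      "properties": {
        "service.name":        { "type": "keyword" },
        "service.environment": { "type": "keyword" }
      }
    }
  }
}
```

Promoting a field would then amount to moving it from a solution's template into the shared component template.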
Ok, yeah, we have individual indices anyway for other reasons, so maybe this is a non-issue. Would still be good to have an overview of the fields so that we get ahead of potential conflicts and confusion. For example, if two rule types use the same field name but with different types, it might be a good idea to resolve one way or the other to avoid future pain.
Are you perhaps thinking of datastreams? I think just ILM doesn't have this restriction and I did a quick test on the signal indices and seems to work. |
++, my idea was to throw an error on startup if two registries try to register the same field. We could also add a precompile step that checks for any incompatibilities between different registries + ecs mappings.
Hmm, maybe? It's mentioned in the ILM docs [1] and the rollover API docs [2]:
I'm not sure what scenarios are applicable here, to be honest. |
Most aspects have already been discussed in this comprehensive thread, so I have little to add. The dual indexing strategy sounds like the most reasonable and least paint-ourselves-into-a-corner solution to me too. As for query complexity, I think a well-designed mapping can alleviate many of the pains. The situation is not really comparable to the Metrics UI IMHO, since we fully control the mapping here and can choose the field semantics to match our queries. |
This ticket is very long and has a lot of complicated parts to it, but I think @tsg has done a good job of summarizing in the actual ticket. For further discussion of RAC alerts as data, let's open a new ticket or refer to another document. I'm locking this for right now, but please feel free to unlock/re-open if anyone needs to add anything. |
Closing in favor of "Alerts as Data" RFC doc. |
This issue is for finalizing the Alerts as Data Schema definition.
The most recent proposal is as follows:
The current Detection Alert Schema is as follows:
.siem-signals schema

Features in the `.siem-signals` schema to take note of:
- `parent`/`parents`/`ancestors`/`depth`/`original_time`/`original_signal` (I believe there's some deprecating we can do here)
- `_meta.version`, because knowing your place in time is a good thing :)
- Rules (i.e. `risk_score` mapping, `severity_mapping`, `rule_name_override`, `timestamp_override`)

Features that don't currently exist that would've been nice in hindsight:

Open questions?
- Does `original_event` have any lingering features tied to it? (PR)

Relevant source files:

Reference docs:
tl;dr on the long debate in the comments here
The contract that we have is that each Rule type, when executed, creates Alert documents for each “bad” thing. A “bad” thing could be a security violation, a metric going above a threshold, a service detected down, an ML anomaly, a move out of a map region, etc. These Alert documents use the ECS schema and have the fields required for workflow (e.g. in progress/close, acknowledge, assign to user). The common Alerts UI displays them in a table, typically one Alert per row. The user can filter and group by any ECS field. This is common for all solutions and rule types.
In addition to these Alert documents, the Rule type code is allowed to add other documents in the Alert indices (with a different `event.kind`), as long as they don't cause mapping conflicts or a fields explosion. These extra documents are typically immutable and provide extra details for the Alert. For example, for a threshold based alert, they can contain the individual measurements (evaluations) over time as well as any state changes (alert is over warning watermark, alert is over critical watermark). These documents will be used by the Alert details fly-out/page, which is Rule type specific, to display a visual timeline for each alert.
Curated UIs, like the Synthetics one, can use both Alerts and the evaluations docs to build the UIs that they need.
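As a closing sketch of that contract (the index pattern is hypothetical and the alert.uuid is taken from the example documents earlier in the thread), a Rule-type-specific detail view could fetch the Alert document together with its extra evaluation documents in one query and render them as a timeline:

```
POST /.alerts-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "alert.uuid": "19d83250-f773-4499-92c7-f7d1346ed747" } },
        { "terms": { "event.kind": ["alert", "metric"] } }
      ]
    }
  },
  "sort": [{ "@timestamp": "asc" }]
}
```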