-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Security Solution][Detections] Migrate from ruleStatusSavedObjectType to Alerting Event Log for Rule Monitoring #91265
Comments
Pinging @elastic/security-detections-response (Team:Detections and Resp) |
Pinging @elastic/security-solution (Team: SecuritySolution) |
Some notes after formalizing our requirements and checking the Our data modelCurrently, we don't use any built-in features in Alerting to represent our detection rule execution statuses. We have custom statuses stored as Saved Objects in a separate SO of type These are attributes of this SO (field types are simplified for readability):
Our needsWe need to be able to store and fetch 2 types of entities: current rule execution status (or state if you will), and rule execution log (history of statuses or even better, a more generic log). Current rule execution status. We need to be able to:
Rule execution log. We need to be able to:
Some things to clarifyAlerting-related questions:
Event log-related questions:
Data model TO BERoughly, as I would imagine at this point:
|
So we talked about all the points mentioned above with the Alerting team and then synced within our Detection and Response team. I tried to reconcile all info about available options, our needs, decisions and action items. Here's a summary. Trying to be concise, still there's a lot to read, sorry about that. If you have any comments, please post here in the ticket. @dhurley14 @FrankHassanabad @gmmorris @mikecote @peluja1012 @pmuellr @spong Option 1: alert stateRule (alert) state can be used for storing the current rule execution status (see above comment). And a list of 5 latest failures, and other data, if needed. We don't use alert state in detection rules. You can read about it in the docs, see "alert type level state". Filtering, sorting, searching over state:
Fetching execution states:
Updating execution states:
Option 2: event logEvent Log is the new API created by the Alerting team:
It can be used to store the rule execution log (status history etc, see above comment). Moreover, in theory, with proper support for aggregations, it could be used to retrieving the current rule execution status and other "current"-like information.
Our needsAs a minimum, now we need:
Ultimately, we need:
We don't need:
Decisions and next stepsAt this point we think that Event Log might be promising for our needs.
We'll put the Alert State-based implementation on hold for now. But we accept to perhaps get back to it later. If not everything will be possible with Event Log, or there will be performance or other considerations. Next steps for me:
|
Regarding:
I'd say effort is high, as we'd have to change how tasks throughout the system are executed. How long do these tasks tend to run in production environments? Is it so long that incrementally updating throughout the task execution is unusually important? |
For the event log:
In progress PR for that here: #91731 |
@pmuellr Oh, that's awesome, thank you for pointing out to this PR. |
@gmmorris So regarding "10s of seconds". On one hand, it's my bad, apparently I looked at a wrong piece of response from https://kibana.siem.estc.dev/api/task_manager/_health 🤦 This is the correct data (today):
This is probably what I was looking at yesterday:
So it's likely just seconds in the dev environment, not tens of seconds or minutes. On the other hand, I have no data from production environments. Some of our rules are much more expensive than the others, and the execution time might depend on the rule type, parameters, and source event indices. We allow creating custom rules, so... @FrankHassanabad is it possible to find some numbers from production envs? Or maybe you can give any general comments on that. |
Just to be clear - Task Manager does support long running tasks. |
Perhaps we could have a constraint like - if you set a task's timeout higher than default, you must provide max concurrency as well. |
@gmmorris @pmuellr thank you, I see. As far as I can see from the I hope 5 minutes should not be an issue in most (if not all) environments. I'll try to figure out if this is possible and how likely it is to happen. |
) **Epic:** #118324 **Tickets:** #119603, #119597, #91265, #118511 ## Summary The legacy rule execution logging implementation is replaced by a new one that introduces a new model for execution-related data, a new saved object and a new, cleaner interface and implementation. - [x] The legacy data model is deleted (`IRuleStatusResponseAttributes`, `IRuleStatusSOAttributes`) - [x] The legacy `siem-detection-engine-rule-status` saved object type is deleted and marked as deleted in `src/core` - [x] A new data model is introduced (`x-pack/plugins/security_solution/common/detection_engine/schemas/common/rule_monitoring.ts`). This data model doesn't contain a mixture of successful and failed statuses, which should simplify client-side code (e.g. the code of Rule Management and Monitoring tables, as well as Rule Details page). - [x] A new `siem-detection-engine-rule-execution-info` saved object is introduced (`x-pack/plugins/security_solution/server/lib/detection_engine/rule_execution_log/rule_execution_info/saved_object.ts`). - [x] This SO has 1:1 association with the rule SO, so every rule can have 0 or 1 execution info associated with it. This SO is used in order to 1) update the last execution status and metrics and 2) fetch execution data for N rules more efficiently comparing to the legacy SO. - [x] The logic of creating or updating this SOs is based on the "upsert" approach (planned in #118511). It does not fetch the SO by rule id before updating it anymore. - [x] Rule execution logging logic is rewritten (see `x-pack/plugins/security_solution/server/lib/detection_engine/rule_execution_log`). The previous rule execution log client is split into two objects: `IRuleExecutionLogClient` for using it from route handlers, and `IRuleExecutionLogger` for writing logs from rule executors. - [x] `IRuleExecutionLogger` instance is scoped to the currently executing rule and space id. There's no need to pass rule id, name, type etc to `.logStatusChange()` every time. - [x] Rule executors and related functions are updated. - [x] API routes are updated, including the rule preview route which uses a special "spy" implementation of `IRuleExecutionLogger`. A rule returned from an API endpoint now has optional `execution_summary` field of type `RuleExecutionSummary`. - [x] UI is updated to use the new data model of `RuleExecutionSummary`: - [x] Rule Management and Monitoring tables - [x] Rule Details page - [x] A new API route is introduced for fetching rule execution events: `/internal/detection_engine/rules/{ruleId}/execution/events`. It is used for rendering the Failure History tab (last 5 failures) and is intended to be used in the coming UI of Rule Execution Log on the Details page. - [x] Rule Details page and Failure History tab are updated to use the new data models and API routes. - [x] I used `react-query` for fetching execution events - [x] See `x-pack/plugins/security_solution/public/detections/containers/detection_engine/rules/use_rule_execution_events.tsx` - [x] The lib is updated to the latest version - [x] Tests and fixed and updated according to all the changes - [x] Components related to rule execution statuses are all moved to `x-pack/plugins/security_solution/public/detections/components/rules/rule_execution_status`. - [x] I left a lot of `// TODO: https://github.com/elastic/kibana/pull/121644` comments in the code which I'm planning to address and remove in a follow-up PR. Lots of clean up work is needed, but I'd like to unblock the work on Rule Execution Log UI. ## In the next episodes - Address and remove `// TODO: https://github.com/elastic/kibana/pull/121644` comments in the code - Make sure that SO id generation for `siem-detection-engine-rule-execution-info` is safe and future-proof. Sync with the Core team. If there are risks, we will need to choose between risks and performance (reading the SO before updating it). It would be easy to submit a fix if needed. - Add APM integration. Use `withSecuritySpan` in methods of `rule_execution_log` citizens. - Add comments to the code and README. - Add test coverage. - Etc... ### Checklist Delete any items that are not applicable to this PR. - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios - [ ] Any UI touched in this PR is usable by keyboard only (learn more about [keyboard accessibility](https://webaim.org/techniques/keyboard/)) - [ ] Any UI touched in this PR does not create any new axe failures (run axe in browser: [FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/), [Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US)) - [x] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker) - [ ] This renders correctly on smaller devices using a responsive layout. (You can test this [in your browser](https://www.browserstack.com/guide/responsive-testing-on-local-server)) - [ ] This was checked for [cross-browser compatibility](https://www.elastic.co/support/matrix#matrix_browsers) ### For maintainers - [x] This was checked for breaking API changes and was [labeled appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
The last piece has been implemented in #121644. With this implementation:
The next major step would be to address #112193 which will give us a chance to get rid of |
Same as: #83235
Currently the Detection Engine uses a few separate SO's for managing additional state, sometimes as a stopgap while other solutions were under development. Details on all SO's managed by Detections are detailed in this comment here: #60053 (comment).
One of these SO's, the
ruleStatusSavedObjectType
, was created to store the last 5 rule failures while support for additional monitoring was added to the alerting framework. This SO is used to display a Rule's most recent failures on the Rule Details page, and is also joined with the Rule SO for display within the Rules and Monitoring tables. The latter of which quickly becomes a performance issue with the more and more rules we add, and with the limitations of the SO client.Since the original implementation, the alerting framework now has a dedicated index called the
Event Log
for writing Rule (alert) state. In recent discussions with the @elastic/kibana-alerting-services team it sounds like we should be able to migrate from our sidecar SO model of managing status, to leveraging theEvent Log
. In doing so we should be able to remove the maintenance burden of managing another SO type, increase performance of the Rules/Monitoring tables by optimizing the join, and better prepare the Detection Engine for the unified alerting architecture.Useful notes from our meeting with the alerting team:
Event Log
's security model requires you provide the SO you're trying to write/retrieve the event for. As of 7.11, bulk retrieval is possible by sending an array of SO's.status
field on the Alert SO so we don't have to do a join when fetching records for the Rules/Monitoring tables.Event Log
can be written to from within the execution context, or outside. Logs are scheduled and written shortly after.ruleStatusSavedObjectType
)ruleStatusSavedObjectType
kibana/x-pack/plugins/security_solution/server/lib/detection_engine/rules/saved_object_mappings.ts
Lines 49 to 54 in 1216b0f
Event Log PR's
https://github.com/elastic/kibana/pulls?q=is%3Apr+label%3AFeature%3AAlerting+Event+Log+is%3Aclosed
How to remove existing SO's (mapping and existing objects on upgrade)
I checked with the platform team and this was our guidance for removing the existing SO:
The text was updated successfully, but these errors were encountered: