[RAC] RAC framework calls to ES should wrap their own errors #106315

pmuellr · 2021-07-20T18:49:12Z

From issue #101753, we realized that the RAC framework makes ES calls during rule execution, and that these calls can throw exceptions which are not caught. Those bubble up to the alerting framework, which assumes the error came from the rule executor. In this case, the rule executor was actually RAC, which did some work, called the original rule executor, and then did some more work before returning control to the alerting framework.

The problem with this is that it can cause some confusion during diagnosis of errors. It would really be best if the RAC ES calls were "wrapped", to catch exceptions, such that:

the RAC code logs an error/warning itself, with its own plugin id (implicit in the logging tags)
the calculated error message gets a prefix describing what the code was doing semantically
the RAC code can feel free to rethrow the exception to the alerting framework

This will help us narrow down where these exceptions occurred when we see them.

As an example, re: #101753, I believe the exception happened when executing the following code:

kibana/x-pack/plugins/rule_registry/server/rule_data_client/index.ts

Lines 125 to 143 in bc4928f

    
    if (!aliasExists) {
      try {
        await clusterClient.indices.create({
          index: concreteIndexName,
          body: {
            aliases: {
              [alias]: {
                is_write_index: true,
              },
            },
          },
        });
      } catch (err) {
        // something might have created the index already, that sounds OK
        if (err?.meta?.body?.error?.type !== 'resource_already_exists_exception') {
          throw err;
        }
      }
    }

For that case, instead of just the throw err, it should probably log a message describing the problem (debug level is fine if it feels like it will be too noisy, but this one seems worthy of a logged message), and then create a new error that carries any details found in the originating error, with an error message stating that the error occurred while trying to create an alerting index.
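A minimal sketch of that "log, then rethrow with a semantic prefix" idea, written as a generic helper. The names `MinimalLogger`, `wrapEsCall`, and the `originalError` field are hypothetical and not part of rule_registry; the real Kibana logger has a richer interface.

```typescript
// Hypothetical minimal logger shape (assumption); Kibana's real Logger has more methods.
interface MinimalLogger {
  error(message: string): void;
}

// Runs an ES call. On failure, logs the problem and throws a new error whose
// message is prefixed with a description of what the caller was trying to do.
async function wrapEsCall<T>(
  logger: MinimalLogger,
  description: string, // e.g. "creating alerting index my-index"
  call: () => Promise<T>
): Promise<T> {
  try {
    return await call();
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    const message = `Error ${description}: ${reason}`;
    logger.error(message);

    // Keep the originating error around so its details are not lost.
    const wrapped = new Error(message) as Error & { originalError?: unknown };
    wrapped.originalError = err;
    throw wrapped;
  }
}
```

In the index-creation case above, the check for resource_already_exists_exception would stay inside the wrapped callback, so that only unexpected failures are logged and rethrown.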

There weren't too many ES calls that I could see, directly using the clusterClient; they seemed to be in the following three files:


elasticmachine · 2021-07-21T08:09:12Z

Pinging @elastic/security-solution (Team: SecuritySolution)

gmmorris · 2021-09-03T08:49:07Z

Looking at the code I can't see a Rule Registry Logger at all.
Feels like the registry should have its own context, separate from the Alerting plugin and the Rule Type's plugin (which is wrapped in the registry).

banderror · 2021-09-03T13:22:30Z

@gmmorris Rule Registry has its own logger and it's used to log some info and debug messages:

kibana/x-pack/plugins/rule_registry/server/plugin.ts

Line 66 in 6f357e0

this.logger = initContext.logger.get();

If you enable it in the dev config, you will see some logs about installation of ES resources like component templates etc:

logging.events:
  {
    log: ['ruleRegistry', 'error', 'fatal'],
    request: ['fatal'],
    error: '*',
    ops: __no-ops__,
  }

Just to make sure I got it correctly, is this exactly what you meant by "the registry should have its own context", or something else?

banderror · 2021-09-03T13:47:36Z

@pmuellr I'd like to clarify the expected behaviour.

the RAC code logs an error/warning itself, with its own plugin id (implicit in the logging tags)

the calculated error message provides a prefix describing what it's doing semantically

can feel free to throw the exception back to the alerting framework

What do you mean by the RAC code? I think we have several different citizens:

a generic indexing implementation (RuleDataService + RuleDataClient) used by all of the RAC-involved plugins
somewhat-generic base executor implementations (lifecycle executor used by Observability plugins, persistence executor used by Security)
specific rule type implementations in the RAC-involved plugins

RuleDataService and RuleDataClient currently write some logs, but they don't wrap any exceptions and have almost no Error classes of their own. I guess what's missing here is that they could and should surface each particular failure as a custom Error with a clear message describing the case. They will be able to handle some of the cases without re-throwing, while most of the exceptions will be either thrown or re-thrown to the outside world (most likely to the "somewhat-generic base executor implementations").

the calculated error message provides a prefix describing what it's doing semantically

Something like that?

} catch (e) {
  this.options.logger.error(e);

  const reason = e?.message || 'Unknown reason';
  throw new Error(`Failure installing ${resources}. ${reason}`);
}
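The same idea could also be sketched as a dedicated Error subclass that keeps the originating error alongside the prefixed message. The class name `RuleRegistryError`, the helper `reasonOf`, and the `originalError` field are hypothetical and only illustrate the shape; they are not existing rule_registry code.

```typescript
// Extracts a readable reason from an unknown thrown value.
function reasonOf(err: unknown): string {
  return err instanceof Error && err.message ? err.message : 'Unknown reason';
}

// Hypothetical error class (assumption): name and fields are illustrative only.
class RuleRegistryError extends Error {
  public readonly originalError: unknown;

  constructor(action: string, originalError: unknown) {
    super(`Failure ${action}. ${reasonOf(originalError)}`);
    // Restore the prototype chain so `instanceof` also works when compiled to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = 'RuleRegistryError';
    this.originalError = originalError;
  }
}
```

A caller could then use `err instanceof RuleRegistryError` to tell these errors apart from ones thrown by a rule type, and still reach the underlying ES error through `originalError`.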

If we talk about the rest of the "citizens", does the same approach to exceptions and logging apply?

Just to double-check, you're fine with exceptions thrown from whatever executors as long as they can be easily identified as "exceptions from rule registry" or "from security solution rule types" etc, right?

And finally, are you concerned only about the code that runs within the Alerting Framework executors, or any code in general that calls ES APIs from "RAC"? For example, we have RBAC and RBAC-related endpoints that call ES outside of the rule types.

gmmorris · 2021-09-06T08:07:41Z

Rule Registry has its own logger and it's used to log some info and debug messages:

Oops, sorry @banderror, I don't know how I missed that!

Just to make sure I got it correctly, is this exactly what you meant by "the registry should have its own context", or something else?

Yeah, I think that should do it - as long as you catch the errors and log them yourself, they should have the context of the RulesRegistry.

gmmorris · 2021-09-06T08:15:03Z

Just to double-check, you're fine with exceptions thrown from whatever executors as long as they can be easily identified as "exceptions from rule registry" or "from security solution rule types" etc, right?

Yes, throwing the error tells the framework that the rule failed, and that's valuable.
What we want to easily identify is whether the exception took place in the rule itself, the rule registry or the framework. This should make it easier to identify the right team to help investigate and reduce our mean time to resolution.
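One way to make that distinction explicit is to tag errors with the layer that raised them. The sketch below is purely illustrative: `ErrorSource`, `SourcedError`, and `getErrorSource` are hypothetical names, and treating untagged errors as coming from the rule is an assumption that mirrors how the framework attributes errors today.

```typescript
// The three layers mentioned above.
type ErrorSource = 'rule' | 'rule_registry' | 'framework';

// Hypothetical error class (assumption) that records which layer raised it.
class SourcedError extends Error {
  public readonly source: ErrorSource;

  constructor(source: ErrorSource, message: string) {
    super(`[${source}] ${message}`);
    // Restore the prototype chain so `instanceof` also works when compiled to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
    this.source = source;
  }
}

// Assumption: errors without a tag are attributed to the rule itself.
function getErrorSource(err: unknown): ErrorSource {
  return err instanceof SourcedError ? err.source : 'rule';
}
```

With something like this, the rule still fails when the error is rethrown, but logs and diagnostics can route the failure to the team that owns the layer it came from.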

elasticmachine · 2021-09-08T10:45:06Z

Pinging @elastic/security-detections-response (Team:Detections and Resp)

banderror · 2021-10-11T18:25:21Z

Hey everyone, FYI ownership of this ticket and other tickets related to rule_registry (like #101016) now goes to the Detection Alerts area (Team:Detection Alerts label). Please ping @peluja1012 and @marshallmain if you have any questions.

marshallmain · 2022-04-19T15:22:51Z

Transferring again to @elastic/response-ops as they now own the rule registry implementation.

pmuellr added the Feature:RAC label obsolete label Jul 20, 2021

botelastic bot added the needs-team Issues missing a team label label Jul 20, 2021

timroes added the Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. label Jul 21, 2021

botelastic bot removed the needs-team Issues missing a team label label Jul 21, 2021

pmuellr added the Theme: rac label obsolete label Jul 27, 2021

banderror mentioned this issue Sep 3, 2021

[RAC][Meta] Consolidate the two indexing implementations in rule_registry plugin #101016

Open

41 tasks

banderror added the Team:Detections and Resp Security Detection Response Team label Sep 8, 2021

banderror added the Team:Detection Alerts Security Detection Alerts Area Team label Oct 11, 2021

gmmorris added this to AppEx: ResponseOps - Execution & Connectors May 11, 2022

gmmorris moved this to Awaiting Triage in AppEx: ResponseOps - Execution & Connectors May 11, 2022

mikecote moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors May 12, 2022

mikecote added Feature:Alerting/Alerts-as-Data Issues related to Alerts-as-data and RuleRegistry and removed Theme: rac label obsolete Feature:RAC label obsolete labels May 16, 2022

ymao1 self-assigned this Jul 8, 2022

ymao1 moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Jul 8, 2022

ymao1 mentioned this issue Jul 12, 2022

[Rule Registry] Additional error logging #136225

Merged

ymao1 moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Jul 12, 2022

ymao1 closed this as completed in #136225 Jul 22, 2022

Repository owner moved this from In Review to Done in AppEx: ResponseOps - Execution & Connectors Jul 22, 2022
