[Security Solution] Instrument rule executors with Elastic APM #117672

Merged: 1 commit into elastic:main from the apm-test branch, Dec 16, 2021

Conversation

Contributor

@xcrzx xcrzx commented Nov 5, 2021

Summary

This PR shows how Elastic APM can help us find performance bottlenecks in Security API routes and rule executors.

Walkthrough

  1. To instrument your local Kibana with APM, create config/apm.dev.js with the following content:
    module.exports = {
      environment: '<use unique name here>', // You will use this name to filter logs from your local environment
      active: true,
    };
  2. Start Kibana as you usually do (yarn start); it'll start sending logs to the shared APM Server.
  3. Navigate to https://ela.st/kibana-ops and select Default space.
  4. Go to Observability > APM > Services, select your environment and click on Kibana service.
    [Screenshot]
  5. Click on the Transactions tab and select the task-run transaction type.
  6. Then find transactions corresponding to the rule type you want to inspect, e.g., siem.queryRule rule execution.
  7. Well done 🙌 Now you can investigate the rule execution timeline.

@xcrzx xcrzx changed the base branch from main to 7.16 November 5, 2021 14:49
@xcrzx xcrzx changed the base branch from 7.16 to main November 8, 2021 10:17
@xcrzx xcrzx force-pushed the apm-test branch 7 times, most recently from 4a29ee2 to d21d723 Compare November 10, 2021 11:41
@xcrzx xcrzx force-pushed the apm-test branch 2 times, most recently from ea464f7 to 6c8aeac Compare November 17, 2021 09:08
@xcrzx xcrzx changed the title from "Instrument rule executors with Elastic APM" to "[Security Solution] Instrument rule executors with Elastic APM" Nov 17, 2021
@xcrzx xcrzx self-assigned this Nov 17, 2021
@xcrzx xcrzx added the Team: SecuritySolution, Team:Detection Rule Management, Team:Detections and Resp, auto-backport, release_note:skip, and v8.1.0 labels Nov 17, 2021
@xcrzx xcrzx force-pushed the apm-test branch 2 times, most recently from dba12d0 to 93bc2ec Compare November 17, 2021 09:28
@elastic elastic deleted a comment from kibanamachine Nov 17, 2021
@xcrzx xcrzx added the v8.0.0 label Nov 17, 2021
@xcrzx xcrzx force-pushed the apm-test branch 4 times, most recently from b21fcfd to 413d0c8 Compare December 15, 2021 14:13
@xcrzx xcrzx marked this pull request as ready for review December 15, 2021 14:20
@xcrzx xcrzx requested a review from a team as a code owner December 15, 2021 14:20
@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

- public deleteCurrentStatus(ruleId: string): Promise<void> {
-   return this.client.deleteCurrentStatus(ruleId);
+ public async deleteCurrentStatus(ruleId: string): Promise<void> {
+   await withSecuritySpan('RuleExecutionLogClient.deleteCurrentStatus', () =>
Member

Why are we await'ing here now and weren't before?

Contributor Author

I removed return, and added async + await. Don't remember why I did that, tbh. But these functions are identical:

const foo = async () => {
  await asyncFnReturningVoid();
};

const bar = () => {
  return asyncFnReturningVoid();
};
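
To make the comparison above concrete, here is a minimal sketch that runs both forms and records what a caller can observe. The `shouldFail` parameter and the `observe` helper are hypothetical additions for illustration; they are not part of the PR.

```typescript
// Hypothetical stand-in for any async function that resolves with no value.
const asyncFnReturningVoid = async (shouldFail = false): Promise<void> => {
  if (shouldFail) {
    throw new Error('boom');
  }
};

// Form 1: async + await, no explicit return.
const foo = async (shouldFail = false): Promise<void> => {
  await asyncFnReturningVoid(shouldFail);
};

// Form 2: return the promise directly.
const bar = (shouldFail = false): Promise<void> => {
  return asyncFnReturningVoid(shouldFail);
};

// Reports what a caller sees: the resolved value or the rejection message.
const observe = async (
  fn: (shouldFail?: boolean) => Promise<void>,
  shouldFail: boolean
): Promise<string> => {
  try {
    const value = await fn(shouldFail);
    return `resolved:${String(value)}`;
  } catch (err) {
    return `rejected:${(err as Error).message}`;
  }
};
```

For an `async` callee like this one, both forms resolve to `undefined` and both propagate the rejection. They would differ only if the wrapped function were a plain function that threw synchronously: `bar` would then throw synchronously, while `foo` would still return a rejected promise.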

for (const [hash, entry] of Object.entries(signalHistory)) {
if (entry.lastSignalTimestamp < tuple.from.valueOf()) {
toDelete.push(hash);
return withSecuritySpan('detectionEngine thresholdExecutor', async () => {
Member

Just thresholdExecutor? Doesn't look like the other executors prefix with detectionEngine?

Suggested change
- return withSecuritySpan('detectionEngine thresholdExecutor', async () => {
+ return withSecuritySpan('thresholdExecutor', async () => {

Contributor Author

@xcrzx xcrzx Dec 16, 2021

I previously used detectionEngine as a prefix but changed to withSecuritySpan later to avoid duplication. Missed that piece during refactoring, thanks!

eventLogService,
logger,
});
agent.setTransactionName(`${options.rule.ruleTypeId} rule execution`);
Member

nit: Probably don't need rule here since all the ruleTypeId's end with Rule anyway?

Suggested change
- agent.setTransactionName(`${options.rule.ruleTypeId} rule execution`);
+ agent.setTransactionName(`${options.rule.ruleTypeId} execution`);

Contributor Author

Agree, changed!


type Span = Exclude<typeof agent.currentSpan, undefined | null>;

export const withSecuritySpan = <T>(
Member

nit: Add JSDoc for when folks should use this function and any necessary pre-req's (does this need to happen within the scope of agent.setTransactionName()?)

Contributor Author

Added doc. As for prerequisites, there aren't any. We can use this method anywhere throughout our codebase. All main code paths are already wrapped in transactions on the framework level.
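
For readers unfamiliar with the pattern, the following is a rough sketch of what a span-wrapping helper of this kind might look like. It is not the PR's `withSecuritySpan` implementation: `SketchSpan`, `recordedSpans`, `startSketchSpan`, and `withSketchSpan` are all hypothetical names, and an in-memory array stands in for the APM agent so the example runs on its own.

```typescript
// Hypothetical, minimal stand-in for a span. The real APM agent exposes a
// richer interface; only the parts this sketch needs are modelled here.
interface SketchSpan {
  name: string;
  outcome: 'success' | 'failure' | 'unknown';
  ended: boolean;
}

// In-memory record of started spans, used instead of a real APM agent.
const recordedSpans: SketchSpan[] = [];

const startSketchSpan = (name: string): SketchSpan => {
  const span: SketchSpan = { name, outcome: 'unknown', ended: false };
  recordedSpans.push(span);
  return span;
};

// Hypothetical helper: runs `cb` inside a named span, records the outcome,
// and always ends the span, whether `cb` resolves or rejects.
const withSketchSpan = async <T>(name: string, cb: () => Promise<T>): Promise<T> => {
  const span = startSketchSpan(name);
  try {
    const result = await cb();
    span.outcome = 'success';
    return result;
  } catch (err) {
    span.outcome = 'failure';
    throw err; // rethrow so callers still see the original error
  } finally {
    span.ended = true;
  }
};
```

A caller would wrap any unit of work by name, for example `withSketchSpan('thresholdExecutor', async () => { /* work */ })`. The helper returns the callback's result unchanged, so it can be added around existing code without altering its return type.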

Comment on lines +186 to +191
const errorMessage = buildRuleMessage(`Check privileges failed to execute ${exc}`);
logger.error(errorMessage);
await ruleStatusClient.logStatusChange({
...basicLogArguments,
message: errorMessage,
newStatus: RuleExecutionStatus['partial failure'],
Member

Should we be setting agent.setTransactionOutcome('failure'); here (and in other failure/success cases) just as you did over in signal_rule_alert_type?

Member

Ahhh, is this covered globally by task_runner?

if (apm.currentTransaction) {
  if (executionStatus.status === 'ok' || executionStatus.status === 'active') {
    apm.currentTransaction.setOutcome('success');
  } else if (executionStatus.status === 'error' || executionStatus.status === 'unknown') {
    apm.currentTransaction.setOutcome('failure');
  }
}

Contributor Author

Yes, they added it recently, so there's no need to set the outcome on our side anymore. Thanks for pointing it out. I also removed setOutcome from signal_rule_alert_type.
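
The status-to-outcome branching quoted from task_runner above can be restated as a small pure function, which makes the mapping easy to check in isolation. This is only a sketch: `outcomeForStatus` and the `Outcome` type are hypothetical names, and the status is typed as a plain `string` because the full set of status values is not shown in this thread.

```typescript
type Outcome = 'success' | 'failure';

// Mirrors the branching in the quoted snippet. Any status not listed there
// returns undefined, meaning the transaction outcome is left untouched.
const outcomeForStatus = (status: string): Outcome | undefined => {
  if (status === 'ok' || status === 'active') {
    return 'success';
  }
  if (status === 'error' || status === 'unknown') {
    return 'failure';
  }
  return undefined;
};
```

Because the mapping is applied once at the framework level, individual rule executors only need to report their execution status and do not call `setOutcome` themselves.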

Member

@spong spong left a comment

Checked out, was able to verify and test locally against a cloud APM instance (presumably a major version mismatch when trying against ops.kibana.dev), and performed code review.

Few nits and questions around async/await usage and setting transaction outcome in create_security_rule_type_wrapper.ts, but other than that LGTM! 👍 Thanks for instrumenting all our rule types @xcrzx! This is going to be extreeeemely helpful in debugging going forward! 😀

@xcrzx xcrzx added the backport:skip label and removed the auto-backport label Dec 16, 2021
@xcrzx xcrzx enabled auto-merge (squash) December 16, 2021 12:48
@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

  • 💔 Build #13610 failed 3d5e03a94085a0aaaf56c688197b45d29bafcca8
  • 💚 Build #13420 succeeded e2c4e933a7375f7e5dc93a593c7f32c3671e2684
  • 💚 Build #12851 succeeded b21fcfd2e684d0c4e4e4bf1de4fa45a26e5bada8
  • 💚 Build #12463 succeeded 1c0a240b900516bd1f25a132c430eb0cd65472ba
  • 💔 Build #12453 failed 9967baaed3e6d9d1dcaf8495e86834f77fb6cec2
  • 💛 Build #7310 was flaky 93bc2eccafdc910b7bd84dd967164ceffa2e7235

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @xcrzx

@xcrzx xcrzx merged commit 7847bc8 into elastic:main Dec 16, 2021
@xcrzx xcrzx deleted the apm-test branch December 16, 2021 13:46
TinLe pushed a commit to TinLe/kibana that referenced this pull request Dec 22, 2021
Labels
backport:skip, performance, release_note:skip, Team:Detection Rule Management, Team:Detections and Resp, Team: SecuritySolution, v8.1.0

5 participants