Flaky alert assignment tests #176930

Merged
merged 9 commits into from
Mar 6, 2024

@e40pud e40pud commented Feb 14, 2024

Summary

Addresses:

Fix flaky alert assignments tests. I split assignments tests into two groups: tests with one assignee available and tests with multiple assignees.

Right now the tests with multiple assignees are flaky. The most likely cause is that we make several login calls in a row to activate different users and make them available for assignment:

// Log in to each account so that it is activated and visible in the user profiles list
login(ROLES.t1_analyst);
login(ROLES.t2_analyst);
login(ROLES.t3_analyst);
login(ROLES.soc_manager);
login(ROLES.detections_admin);
login(ROLES.platform_engineer);

These tests tend to be flaky, and the Kibana operations team may skip them. To make sure we still run basic Cypress verification of the alert assignments feature, we decided to add tests with only one assignee available (the current user), which lets us avoid multiple consecutive login calls.

Also, as part of these changes I removed unnecessary logins and un-skipped #176529

NOTE

After discussing these failures with the team, we decided to remove the tests that are already covered by integration and unit tests. While fixing the flakiness, we realised that we were doing unnecessary work fighting internal Elasticsearch errors on serverless that occur when we perform multiple user logins in a row. Instead we will rely on:

  • integration test coverage of API-related functionality, including RBAC
  • unit test coverage of all assignments UI components
  • Cypress test coverage of basic UI interaction with alert assignments, with only one user available for assignment

cc @yctercero

Checklist

Delete any items that are not applicable to this PR.

@e40pud e40pud added release_note:skip Skip the PR/issue when compiling release notes Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Engine Security Solution Detection Engine Area labels Feb 14, 2024
@e40pud e40pud self-assigned this Feb 14, 2024

e40pud commented Feb 14, 2024

/ci

@e40pud e40pud marked this pull request as ready for review February 15, 2024 08:41
@e40pud e40pud requested review from a team as code owners February 15, 2024 08:41
@elasticmachine

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine

Pinging @elastic/security-detection-engine (Team:Detection Engine)


e40pud commented Feb 16, 2024

@elasticmachine merge upstream

@yctercero yctercero requested review from rylnd and removed request for yctercero February 17, 2024 06:09

e40pud commented Feb 19, 2024

@elasticmachine merge upstream

@rylnd rylnd left a comment

I had a lot of thoughts here, sorry 😅 .

I think the only needed change here is removal of the redundant single-user tests; everything else is a step in the right direction.

I was looking for examples of where these tests failed, to validate the multi-user hypothesis and see if there wasn't some information in the error/failure. However, the original skip PR linked here only shows that ES threw a 503 during the test, and the flaky test runner seemingly only had timeouts and not any legitimate failures.

If there are particular error messages that we're basing this PR on, it would be great to call those out both in this PR and the "skipped test" issue, for posterity.

@@ -77,42 +67,23 @@ describe('Alert user assignment - ESS & Serverless', { tags: ['@ess', '@serverle
});

it('alert with some assignees in alerts table', () => {
-const users = [ROLES.detections_admin, ROLES.t1_analyst];
+const users = [getDefaultUserName()];
updateAssigneesForFirstAlert(users);
alertsTableShowsAssigneesForAlert(users);
});

it(`alert with some assignees in alert's details flyout`, () => {
Contributor

Broader question outside the scope of this particular PR: why is this test not part of the one above it? The script appears to be:

  1. login
  2. create rule, wait for alerts
  3. assign alert to user
  4. make assertions about assignment

Other than violating the "one assertion per test" rule (which I don't believe is relevant to cypress), is there a reason for not consolidating these? I can imagine that having more tests seems like it would make the suite more robust, but given the amount of work that happens before the assertions (that then gets repeated in every independent test), I believe the opposite is true.
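
To put the cost argument in rough numbers, here is a minimal TypeScript sketch. The timings and the failure rate are made-up assumptions, not measurements from this suite; the function names are hypothetical; and the model assumes that setup failures are independent across tests.

```typescript
// Hypothetical cost model: none of these numbers are measured from the real suite.

// Total runtime when every test repeats the full setup
// (login, rule creation, waiting for alerts) before its assertions.
function suiteSeconds(
  testCount: number,
  setupSeconds: number,
  assertionSecondsTotal: number
): number {
  return testCount * setupSeconds + assertionSecondsTotal;
}

// Chance that every setup succeeds, assuming each one fails
// independently at the given rate.
function allSetupsPassProbability(testCount: number, setupFailureRate: number): number {
  return Math.pow(1 - setupFailureRate, testCount);
}

const setupSeconds = 60; // assumed
const assertionSecondsTotal = 10; // assumed, all assertions combined
const setupFailureRate = 0.05; // assumed

const separateSeconds = suiteSeconds(2, setupSeconds, assertionSecondsTotal); // 130
const consolidatedSeconds = suiteSeconds(1, setupSeconds, assertionSecondsTotal); // 70

const separatePass = allSetupsPassProbability(2, setupFailureRate); // about 0.90
const consolidatedPass = allSetupsPassProbability(1, setupFailureRate); // 0.95

console.log({ separateSeconds, consolidatedSeconds, separatePass, consolidatedPass });
```

Under these assumptions, two independent tests spend almost twice as long for the same assertions, and each extra setup is one more chance to hit an unrelated failure.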

Contributor

Scrolling further, it seems as though both of these tests are now redundant with `Updating assignees (single alert) adding new assignees via 'More actions' in alerts table`; that one tests everything the two of these do.

Contributor Author

Agree! I will walk through the tests and consolidate them where possible.

updateAssigneesForFirstAlert(users);
alertsTableShowsAssigneesForAlert(users);
});

it(`alert with some assignees in alert's details flyout`, () => {
-const users = [ROLES.detections_admin, ROLES.t1_analyst];
+const users = [getDefaultUserName()];
updateAssigneesForFirstAlert(users);
expandFirstAlert();
alertDetailsFlyoutShowsAssignees(users);
Contributor

Nit: it would be nice to have some convention (maybe either function naming (assert as a prefix), or folder location (/assertions), or both) to identify these tasks as performing assertions.

I know that for a while @MadameSheema was requesting that we not abstract assertions into helpers at all, but I think if we do so we should make those assertions a bit more discoverable.
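
For illustration only, one shape the `assert` prefix convention could take is sketched below. The helper name `assertAssigneesMatch`, its signature, and the user name `elastic` are hypothetical and not taken from the Kibana codebase; a real Cypress helper would query the page rather than receive the displayed names as an argument.

```typescript
// Hypothetical helper: the `assert` prefix signals that this function makes
// an assertion and throws on mismatch, rather than performing a UI action.
function assertAssigneesMatch(displayed: string[], expected: string[]): void {
  // Compare as sorted copies so that display order does not matter.
  const sortedDisplayed = [...displayed].sort();
  const sortedExpected = [...expected].sort();

  const same =
    sortedDisplayed.length === sortedExpected.length &&
    sortedDisplayed.every((name, index) => name === sortedExpected[index]);

  if (!same) {
    throw new Error(
      `Expected assignees [${sortedExpected.join(', ')}] but found [${sortedDisplayed.join(', ')}]`
    );
  }
}

// Usage with an assumed user name: passes silently when the lists match.
assertAssigneesMatch(['elastic'], ['elastic']);
```

With a prefix like this, a reader scanning a test body could tell at a glance which calls are actions and which are assertions.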

Contributor Author

@MadameSheema any thoughts/preferences on this one?

Member

Hey!! the overall preference is to NOT abstract assertions.


e40pud commented Feb 23, 2024

@elasticmachine merge upstream


e40pud commented Feb 23, 2024

Thank you for the review @rylnd!!

> I think the only needed change here is removal of the redundant single-user tests; everything else is a step in the right direction.

I agree that the update-assignee tests are redundant in the single-user case, and I will update or remove the unnecessary test cases.

> I was looking for examples of where these tests failed, to validate the multi-user hypothesis and see if there wasn't some information in the error/failure. However, the original skip PR linked here only shows that ES threw a 503 during the test, and the flaky test runner seemingly only had timeouts and not any legitimate failures.
>
> If there are particular error messages that we're basing this PR on, it would be great to call those out both in this PR and the "skipped test" issue, for posterity.

Yes, the 503 error is what is causing the issue. It happens within the beforeEach block, on deletion of indices and lists. We do exactly the same steps as in all other tests, except that in our case we make multiple login calls to activate multiple accounts. That is the only cause I can think of for that internal error. While I investigate the issue further, I would like to have at least some stable tests covering the assignments functionality.

@e40pud e40pud requested a review from rylnd February 23, 2024 14:33

e40pud commented Feb 23, 2024

@elasticmachine merge upstream

@rylnd rylnd left a comment

Thank you for the changes and helpful responses. LGTM, let's get these back online and hopefully avoid those 503s in the future 👍 .


e40pud commented Mar 5, 2024

@elasticmachine merge upstream

-it('alert with some assignees in alerts table', () => {
-const users = [ROLES.detections_admin, ROLES.t1_analyst];
+it('alert with some assignees in alerts table & details flyout', () => {
+const users = [getDefaultUserName()];
Member

NIT: Change all the users constants to user since we only have one.

@MadameSheema MadameSheema left a comment

Security Engineering Productivity changes LGTM!! Lots of thanks for addressing the flakiness! :)

NIT: Change all the users constants to user since we only have one.

Doubt: Is there any impact from the functional point of view on doing the testing with just one user instead of more?

Thanks!


rylnd commented Mar 5, 2024

@MadameSheema we discussed the idea of testing one user vs multiple here, and the potential loss of coverage, and it was argued that having:

  • a cypress test to verify that a single user assignment propagates correctly to the UI
  • a cypress test to verify that multiple user assignments propagate correctly to the UI
  • multiple jest tests to test component behavior for the myriad users/roles assigned

would provide the same coverage as all the previous Cypress tests, with much less cost/downside. Do you agree with that?

# Conflicts:
#	x-pack/test/security_solution_cypress/cypress/e2e/detection_response/detection_engine/detection_alerts/assignments/assignments_serverless_complete.cy.ts
@kibana-ci

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Defend Workflows Cypress Tests on Serverless #10 / User Roles for Security Complete PLI with Endpoint Complete addon for role: endpoint_operations_analyst should have access to response action: processes should have access to response action: processes

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @e40pud

@e40pud e40pud merged commit 31cd917 into elastic:main Mar 6, 2024
36 checks passed
@kibanamachine kibanamachine added v8.14.0 backport:skip This commit does not require backporting labels Mar 6, 2024
This was referenced Mar 6, 2024