[Security Solution] Improving test automation reliability #173508

Open · 6 of 11 tasks · Tracked by #153633

MadameSheema opened this issue Dec 18, 2023 · 7 comments
Labels
Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc.

Comments

@MadameSheema
Member

MadameSheema commented Dec 18, 2023

Update

This ticket was created to minimize blocking other teams' PRs because of our tests. That has already been achieved by #173815.

There is still pending work to minimise the flakiness in our automation and to make our tests more reliable.

From now on, this ticket is going to track all the efforts around that work.

Tasks

  1. Team: SecuritySolution release_note:skip v8.12.0 v8.13.0
    MadameSheema charlie-pichette
  2. Team: SecuritySolution
    dkirchan
  3. Team: SecuritySolution Team:Detection Engine Team:Detection Rule Management Team:Detections and Resp Team:Entity Analytics Team:Threat Hunting Team:Threat Hunting:Explore Team:Threat Hunting:Investigations backport:skip release_note:skip v8.12.0 v8.13.0
    MadameSheema
  4. Team: SecuritySolution
  5. Team: SecuritySolution
  6. Team: SecuritySolution
  7. Team: SecuritySolution

Useful information

Flaky test vs blocker test

A flaky test is an automated test with inconsistent behaviour: sometimes it passes, sometimes it fails. In our context, a flaky test is an automated test that has failed at least once in:

  • The on-merge pipeline
  • The Flaky test suite runner execution
  • The execution of a PR

Without blocking the merge of a PR and without breaking the on-merge pipeline.

A blocker test is an automated test that is consistently failing in:

  • The on-merge pipeline
  • The Flaky test suite runner execution
  • The execution of a PR

Blocking the merge of a PR or breaking the on-merge pipeline.

In both cases, we are talking about failing tests.
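
To make the distinction concrete, here is a contrived Cypress sketch in TypeScript (the route and the data-test-subj selector are made up, not taken from the Security Solution suites) of a pattern that tends to produce flaky rather than consistently failing tests: a fixed wait races against asynchronous rendering, so the result depends on how fast the CI worker happens to be.

```ts
// Contrived example: the route and selector are hypothetical.

// Flaky pattern: a hard-coded wait plus the default (short) retry window
// races against asynchronous rendering, so this passes on a fast machine
// and may fail on a loaded CI worker.
it('shows the alerts table (flaky)', () => {
  cy.visit('/app/security/alerts');
  cy.wait(500);
  cy.get('[data-test-subj="alertsTable"]').should('be.visible');
});

// More reliable pattern: give the assertion an explicit, generous timeout
// and let Cypress keep retrying it until the table really is visible.
it('shows the alerts table (more reliable)', () => {
  cy.visit('/app/security/alerts');
  cy.get('[data-test-subj="alertsTable"]', { timeout: 60000 }).should('be.visible');
});
```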

How are we tracking failed tests?

  • When a test fails for the first time on the on-merge pipeline, a GitHub ticket is automatically created in the Kibana repository:

    • The title of the ticket is the path of the failed test
    • The description is the error message of the failure, with a link to the Buildkite execution that failed
    • The ticket is labeled with failed-test and the owner of the test
  • When a test that is already reported on a GitHub ticket fails again:

    • A new comment is added to the ticket with a link to the Buildkite execution
  • When the test has failed several times, the kibana-operations team skips it. (A sketch of this kind of reporting automation is shown below.)
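
For readers unfamiliar with this kind of reporting automation, here is a minimal TypeScript sketch using @octokit/rest. It is not the actual Kibana failed-test reporter; the repository constants and the reportFailure helper are hypothetical, and the sketch only illustrates the create-or-comment flow described above.

```ts
import { Octokit } from '@octokit/rest';

// Hypothetical constants; the real reporter lives in the Kibana CI tooling.
const OWNER = 'elastic';
const REPO = 'kibana';
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Create a failed-test issue the first time a test fails, or comment on the
// existing issue with a link to the failing build on subsequent failures.
async function reportFailure(testPath: string, errorMessage: string, buildUrl: string) {
  const { data } = await octokit.search.issuesAndPullRequests({
    q: `repo:${OWNER}/${REPO} is:issue is:open label:failed-test in:title "${testPath}"`,
  });

  if (data.items.length > 0) {
    await octokit.issues.createComment({
      owner: OWNER,
      repo: REPO,
      issue_number: data.items[0].number,
      body: `New failure: ${buildUrl}`,
    });
  } else {
    await octokit.issues.create({
      owner: OWNER,
      repo: REPO,
      title: testPath,
      body: `${errorMessage}\n\nFirst failure: ${buildUrl}`,
      labels: ['failed-test'],
    });
  }
}
```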

@MadameSheema added the Team: SecuritySolution label on Dec 18, 2023
@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@peluja1012
Contributor

Thanks for putting this ticket together, @MadameSheema! I'm glad we're taking steps towards making further improvements to our flaky test process. I have a couple of questions/comments.

  1. Giving Security Solution the ability to skip tests without waiting for CI is a positive change, but I wonder if we could take it one step further. This change would only save us the time it takes for CI to pass when a PR is opened to skip a test. We could still potentially be causing problems for other teams during the time period between when the test is reported as a failed-test and when we become aware of the problem. In order to mitigate this, could we automatically skip a failed-test reported in the on-merge pipeline?
  2. Which team would we need to work with in order to automatically ping teams on slack when a test is marked as failed-test?
  3. What is "burn functionality"?

@MadameSheema
Member Author

@peluja1012 thanks a lot for reviewing the ticket thoroughly! Here are my insights:

Giving Security Solution the ability to skip tests without waiting for CI is a positive change, but I wonder if we could take it one step further. This change would only save us the time it takes for CI to pass when a PR is open to skip a test. We could still potentially be causing problems for other teams during the time period between when the test is reported as a failed-test and when we become aware of the problem. In order to mitigate this, could we automatically skip a failed-test reported in the on-merge pipeline?

You are completely right about it; we could still potentially be causing problems. The proposed solution in the main ticket is a quick win since it is easy to implement and would save us time. I don't know how complex skipping tests automatically might be; that is something we need to research. We also need to weigh the pros and cons: for instance, sometimes failures are due to problems unrelated to us, or one-off network connection issues, and skipping the test right away would reduce our test coverage. But I really like the idea of automating this process, and we can research whether we can make it possible without compromising our coverage :)

Which team would we need to work with in order to automatically ping teams on slack when a test is marked as failed-test?

I've researched this and it seems to be simpler than we thought. I'll ping you on Slack and provide you with the details.

What is "burn functionality"?

It is functionality that Patryk integrated into our Cypress setup: it checks which tests have been modified or introduced and executes them several times in a row. That would help us get rid of the flaky test suite runner and make the flakiness check part of the PR process. It is a great idea that we need to re-implement, but there is some work to do first. For instance, we need to make sure that our tests can be executed several times on the same instance without producing false failures, and that the process does not impact teams outside Security Solution.
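
As a rough illustration of the idea (this is not Patryk's actual implementation; the repetition count, the yarn script, and the spec-file extension filter are assumptions), a "burn" step could diff the PR against main to find touched Cypress specs and then run each of them several times in a row:

```ts
import { execSync } from 'child_process';

// Hypothetical sketch of a "burn" step: find Cypress specs touched by the PR
// and run each one several times in a row to surface flakiness early.
const BURN_COUNT = 5; // assumed number of repetitions

const changedSpecs = execSync('git diff --name-only origin/main...HEAD')
  .toString()
  .split('\n')
  .filter((file) => file.endsWith('.cy.ts'));

for (const spec of changedSpecs) {
  for (let run = 1; run <= BURN_COUNT; run++) {
    console.log(`Burn run ${run}/${BURN_COUNT} for ${spec}`);
    // The exact yarn script is an assumption; it would map to the suite's
    // existing Cypress run command for a single spec file.
    execSync(`yarn cypress run --spec "${spec}"`, { stdio: 'inherit' });
  }
}
```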

@jaredburgettelastic
Contributor

Awesome writeup, I'm particularly looking forward to "Provid[ing] an extended guide of automation best practices focused on flakiness" coming to fruition.

A couple of clarifying questions:

PRs should not be merged without having rebased new changes in main first.

Is "rebasing" strictly required here? Could we say more broadly:

PRs should not be merged without having rebased on main or merging/pulling from main first.


PRs that introduce new tests or changes to tests should not be merged without having first exercised the tests in the Flaky test suite runner

Is there a way to execute a single test file utilizing the flaky test runner, or do team members only have the option to run tests at the "config file" (test group) level? I've gotten some team feedback that it would be nice to have more granularity within the flaky test runner, by being able to run specific test files as opposed to groups.

Thank you!

@MadameSheema
Member Author

Is "rebasing" strictly required here? Could we say more broadly:

PRs should not be merged without having rebased on main or merging/pulling from main first.

Absolutely! Thanks for the suggestion.

Is there a way to execute a single test file utilizing the flaky test runner, or do team members only have the option to run tests at the "config file" (test group) level? I've gotten some team feedback that it would be nice to have more granularity within the flaky test runner, by being able to run specific test files as opposed to groups.

Yes, there is. It is not 100% ideal because it requires a bit of manual intervention, but it is feasible. For your team, you need to modify line 10 (for ESS tests) and line 29 (for serverless tests) of the package.json located in `x-pack/test/security_solution_cypress` so they point to a specific spec file, for instance:

"cypress:entity_analytics:run:ess": "yarn cypress:ess --spec './cypress/e2e/entity_analytics/dashboards/enable_risk_score_redirect.cy.ts'"` 

You just need to remember to revert the changes before merging the code into main :)

@sophiec20
Contributor

sophiec20 commented Dec 22, 2023

If you are running 200x tests, please make sure that the test groups are small enough so as not to consume too many resources/cost. A good target time would be under 1h.

For example, the test group (group:cypress/security_solution_investigations x 50, group:cypress/security_serverless_investigations x 50) is especially large and ideally needs breaking down. The two groups failed after 10h and 9h respectively for only 50x (this is actually 8 days of parallelised runtime). This would be better split into smaller chunks.
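
(For rough scale, with the per-run figure only an estimate: two groups × 50 runs = 100 executions; at roughly 2 h per execution that is about 200 h of aggregated compute, which is where the ~8 days above comes from, even though the wall-clock time per group stays around 10 h thanks to parallelisation.)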

@MadameSheema changed the title from "[Security Solution] Minimizing blocking other teams PRs" to "[Security Solution] Improving test automation reliability" on Dec 27, 2023
@banderror
Contributor

@MadameSheema I think out of all the tickets in the "Tasks" list, by far the most valuable one for us short-term would be #174892, because it would save ~5-10 minutes every single day for people who monitor flaky tests manually and allow us to delegate the test triage process to any team member.

Probably the 2nd one by priority would be #174902 because it would guarantee that any new or changed Cypress tests are being run through a sort of "flaky test runner" before they get merged. Currently, PR authors need to remember to do that manually, and PR reviewers need to remember to check that it's done.
