[Security Solution] Improving test automation reliability #173508

Open · 6 of 11 tasks · Tracked by #153633

MadameSheema opened this issue Dec 18, 2023 · 7 comments
Labels
Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc.

Comments

@MadameSheema
Member

MadameSheema commented Dec 18, 2023

Update

This ticket was created to minimize blocking other teams' PRs because of our tests. That has already been achieved by #173815.

There is still pending work to minimise the flakiness in our automation and to make our tests more reliable.

From now on, this ticket is going to track all the efforts around that work.

Tasks

  1. Team: SecuritySolution release_note:skip v8.12.0 v8.13.0
    MadameSheema charlie-pichette
  2. Team: SecuritySolution
    dkirchan
  3. Team: SecuritySolution Team:Detection Engine Team:Detection Rule Management Team:Detections and Resp Team:Entity Analytics Team:Threat Hunting Team:Threat Hunting:Explore Team:Threat Hunting:Investigations backport:skip release_note:skip v8.12.0 v8.13.0
    MadameSheema
  4. Team: SecuritySolution
  5. Team: SecuritySolution
  6. Team: SecuritySolution
  7. Team: SecuritySolution

Useful information

Flaky test vs blocker test

A flaky test is an automated test with inconsistent behaviour: sometimes it passes, sometimes it fails. In our context, a flaky test is an automated test that has failed at least once in:

  • The on-merge pipeline
  • The Flaky test suite runner execution
  • The execution of a PR

Without blocking the merge of a PR and without breaking the on-merge pipeline.

A blocker test is an automated test that is consistently failing in:

  • The on-merge pipeline
  • The Flaky test suite runner execution
  • The execution of a PR

Blocking the merge of a PR or breaking the on-merge pipeline.

In both cases, we are talking about failing tests.
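
To make the distinction concrete, here is a contrived Cypress sketch in TypeScript (the route and the data-test-subj selector are made up, not taken from the Security Solution suites) of a pattern that tends to produce flaky rather than consistently failing tests: a fixed wait races against asynchronous rendering, so the result depends on how fast the CI worker happens to be.

```ts
// Contrived example: the route and selector are hypothetical.

// Flaky pattern: a hard-coded wait plus the default (short) retry window
// races against asynchronous rendering, so this passes on a fast machine
// and may fail on a loaded CI worker.
it('shows the alerts table (flaky)', () => {
  cy.visit('/app/security/alerts');
  cy.wait(500);
  cy.get('[data-test-subj="alertsTable"]').should('be.visible');
});

// More reliable pattern: give the assertion an explicit, generous timeout
// and let Cypress keep retrying it until the table really is visible.
it('shows the alerts table (more reliable)', () => {
  cy.visit('/app/security/alerts');
  cy.get('[data-test-subj="alertsTable"]', { timeout: 60000 }).should('be.visible');
});
```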

How are we tracking failed tests?

  • When a test fails for the first time on the on-merge pipeline, a GitHub ticket is automatically created in the Kibana repository:

    • The title of the ticket is the path of the failed test
    • The description is the error message of the failure, with a link to the Buildkite execution that failed
    • The ticket is labeled with failed-test and the owner of the test
  • When a test that is already reported on a GitHub ticket fails again:

    • A new comment is added to the ticket with a link to the Buildkite execution
  • When the test has failed several times, the kibana-operations team skips it. (A sketch of this kind of reporting automation is shown below.)
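
For readers unfamiliar with this kind of reporting automation, here is a minimal TypeScript sketch using @octokit/rest. It is not the actual Kibana failed-test reporter; the repository constants and the reportFailure helper are hypothetical, and the sketch only illustrates the create-or-comment flow described above.

```ts
import { Octokit } from '@octokit/rest';

// Hypothetical constants; the real reporter lives in the Kibana CI tooling.
const OWNER = 'elastic';
const REPO = 'kibana';
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Create a failed-test issue the first time a test fails, or comment on the
// existing issue with a link to the failing build on subsequent failures.
async function reportFailure(testPath: string, errorMessage: string, buildUrl: string) {
  const { data } = await octokit.search.issuesAndPullRequests({
    q: `repo:${OWNER}/${REPO} is:issue is:open label:failed-test in:title "${testPath}"`,
  });

  if (data.items.length > 0) {
    await octokit.issues.createComment({
      owner: OWNER,
      repo: REPO,
      issue_number: data.items[0].number,
      body: `New failure: ${buildUrl}`,
    });
  } else {
    await octokit.issues.create({
      owner: OWNER,
      repo: REPO,
      title: testPath,
      body: `${errorMessage}\n\nFirst failure: ${buildUrl}`,
      labels: ['failed-test'],
    });
  }
}
```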

@MadameSheema added the Team: SecuritySolution label on Dec 18, 2023
@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@peluja1012
Contributor

Thanks for putting this ticket together, @MadameSheema! I'm glad we're taking steps towards making further improvements to our flaky test process. I have a couple of questions/comments.

  1. Giving Security Solution the ability to skip tests without waiting for CI is a positive change, but I wonder if we could take it one step further. This change would only save us the time it takes for CI to pass when a PR is opened to skip a test. We could still potentially be causing problems for other teams during the time period between when the test is reported as a failed-test and when we become aware of the problem. In order to mitigate this, could we automatically skip a failed-test reported in the on-merge pipeline?
  2. Which team would we need to work with in order to automatically ping teams on slack when a test is marked as failed-test?
  3. What is "burn functionality"?

@MadameSheema
Member Author

@peluja1012 thanks a lot for reviewing the ticket thoroughly! Here are my insights:

Giving Security Solution the ability to skip tests without waiting for CI is a positive change, but I wonder if we could take it one step further. This change would only save us the time it takes for CI to pass when a PR is open to skip a test. We could still potentially be causing problems for other teams during the time period between when the test is reported as a failed-test and when we become aware of the problem. In order to mitigate this, could we automatically skip a failed-test reported in the on-merge pipeline?

You are completely right about it; we could still potentially be causing problems. The proposed solution in the main ticket is a quick win since it is easy to implement and would save us time. I don't know how complex skipping tests automatically might be; that is something we need to research. We also need to weigh the pros and cons: for instance, sometimes failures are due to problems unrelated to us, or one-off network connection issues, and skipping the test right away would reduce our test coverage. But I really like the idea of automating this process, and we can research whether we can make it possible without compromising our coverage :)

Which team would we need to work with in order to automatically ping teams on slack when a test is marked as failed-test?

I've researched this and it seems to be simpler than we thought. I'll ping you on Slack and provide you with the details.

What is "burn functionality"?

It is functionality that Patryk integrated into our Cypress setup: it checks which tests have been modified or introduced and executes them several times in a row. That would help us get rid of the flaky test suite runner and make the flakiness check part of the PR process. It is a great idea that we need to re-implement, but there is some work to do first. For instance, we need to make sure that our tests can be executed several times on the same instance without producing false failures, and that the process does not impact teams outside Security Solution.
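
As a rough illustration of the idea (this is not Patryk's actual implementation; the repetition count, the yarn script, and the spec-file extension filter are assumptions), a "burn" step could diff the PR against main to find touched Cypress specs and then run each of them several times in a row:

```ts
import { execSync } from 'child_process';

// Hypothetical sketch of a "burn" step: find Cypress specs touched by the PR
// and run each one several times in a row to surface flakiness early.
const BURN_COUNT = 5; // assumed number of repetitions

const changedSpecs = execSync('git diff --name-only origin/main...HEAD')
  .toString()
  .split('\n')
  .filter((file) => file.endsWith('.cy.ts'));

for (const spec of changedSpecs) {
  for (let run = 1; run <= BURN_COUNT; run++) {
    console.log(`Burn run ${run}/${BURN_COUNT} for ${spec}`);
    // The exact yarn script is an assumption; it would map to the suite's
    // existing Cypress run command for a single spec file.
    execSync(`yarn cypress run --spec "${spec}"`, { stdio: 'inherit' });
  }
}
```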

@jaredburgettelastic
Contributor

Awesome writeup, I'm particularly looking forward to "Provid[ing] an extended guide of automation best practices focused on flakiness" coming to fruition.

A couple of clarifying questions:

PRs should not be merged without having rebased new changes in main first.

Is "rebasing" strictly required here? Could we say more broadly:

PRs should not be merged without having rebased on main or merging/pulling from main first.


PRs that introduce new tests or changes to tests should not be merged without having first exercised the tests in the Flaky test suite runner

Is there a way to execute a single test file utilizing the flaky test runner, or do team members only have the option to run tests at the "config file" (test group) level? I've gotten some team feedback that it would be nice to have more granularity within the flaky test runner, by being able to run specific test files as opposed to groups.

Thank you!

@MadameSheema
Member Author

Is "rebasing" strictly required here? Could we say more broadly:

PRs should not be merged without having rebased on main or merging/pulling from main first.

Absolutely! Thanks for the suggestion.

Is there a way to execute a single test file utilizing the flaky test runner, or do team members only have the option to run tests at the "config file" (test group) level? I've gotten some team feedback that it would be nice to have more granularity within the flaky test runner, by being able to run specific test files as opposed to groups.

Yes, there is. It is not 100% ideal because it requires a bit of manual intervention, but it is feasible. For your team, you need to modify line 10 (for ESS tests) and line 29 (for serverless tests) of the package.json located in `x-pack/test/security_solution_cypress` so they point to a specific spec file, for instance:

"cypress:entity_analytics:run:ess": "yarn cypress:ess --spec './cypress/e2e/entity_analytics/dashboards/enable_risk_score_redirect.cy.ts'"` 

You just need to remember to revert the changes before merging the code into main :)

@sophiec20
Contributor

sophiec20 commented Dec 22, 2023

If you are running 200x tests, please make sure that the test groups are small enough so as not to consume too many resources/cost. A good target time would be under 1h.

For example, the test group (group:cypress/security_solution_investigations x 50, group:cypress/security_serverless_investigations x 50) is especially large and ideally needs breaking down. The two groups failed after 10h and 9h respectively for only 50x (this is actually 8 days of parallelised runtime). This would be better split into smaller chunks.
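
(For rough scale, with the per-run figure only an estimate: two groups × 50 runs = 100 executions; at roughly 2 h per execution that is about 200 h of aggregated compute, which is where the ~8 days above comes from, even though the wall-clock time per group stays around 10 h thanks to parallelisation.)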

@MadameSheema changed the title from "[Security Solution] Minimizing blocking other teams PRs" to "[Security Solution] Improving test automation reliability" on Dec 27, 2023
@banderror
Contributor

@MadameSheema I think out of all the tickets in the "Tasks" list, by far the most valuable one for us short-term would be #174892, because it would save ~5-10 minutes every single day for people who monitor flaky tests manually and allow us to delegate the test triage process to any team member.

Probably the 2nd one by priority would be #174902 because it would guarantee that any new or changed Cypress tests are being run through a sort of "flaky test runner" before they get merged. Currently, PR authors need to remember to do that manually, and PR reviewers need to remember to check that it's done.
