Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Security Solution] Unskip flaky tests in Prebuilt Rules FTR Integration tests #173998

Merged
merged 22 commits into from
Jan 11, 2024

Conversation

jpdjere
Copy link
Contributor

@jpdjere jpdjere commented Dec 27, 2023

Addresses:
#172107
#171380

Summary

Unskip skipped tests in:

  1. x-pack/test/security_solution_api_integration/test_suites/detections_response/default_license/prebuilt_rules/bundled_prebuilt_rules_package/install_latest_bundled_prebuilt_rules.ts
  2. x-pack/test/security_solution_api_integration/test_suites/detections_response/default_license/prebuilt_rules/management/fleet_integration.ts
  • Makes the retryDelay in the RetryService in packages/kbn-ftr-common-functional-services/services/retry/retry.ts a configurable parameter - used in our retry util to shorten the wait period to 200ms.
  • Creates retry wrapper util for our FTR Integration tests, that wraps retry.try from the RetryService, to implement maximum attempts.
  • Uses this retry wrapper in test utils that install the security_detection_engine Fleet package, and asserts that rules have been succesfully installed - and retries if they haven't.
  • Creates refreshSavedObjectIndices reusable util that cleans cache and refreshes indices. Centralizes comment spread around file into this single file.
  • Use this new util to clean the cache in util that install the Fleet package, and utils that read the prebuilt rule status.

Flaky test runner

Before changes:

After changes:

For maintainers

@jpdjere jpdjere changed the title Unskip tests [Security Solution] Unskip flaky tests in Prebuilt Rules FTR Integration tests Dec 28, 2023
@jpdjere
Copy link
Contributor Author

jpdjere commented Dec 28, 2023

@banderror I'd like your opinion on this, whenever you can take a look.

I wasn't able to reproduce the flake reported in either of the two linked tickets - I've run the flaky test runner with 500 runs for each, and all were green. So the issue has been tough to investigate, but I have two theories/proposed solutions.

1. Cache for ES reading requests needs to be cleared

This is what's currently implemented in this PR.

I noticed that the issue in both tickets pops up in test suites where we attempt to read the status of the prebuilt rules before and after installing our prebuilt rules. In both cases, we first check that no rules are available for install, then install prebuilt rule assets, and then read the status again to assert that rules are now available for install.

I'm not completely sure how cacheing in ES works, but I suspect that in very specific cases (one in thousands) the cache for the request is not completely cleared after the first read, and even though we correctly install the prebuilt rule assets and refresh the indices, the subsequent read operations simply responds the cached value - which is still 0 rules to be installed.

If that is the case, we can use await es.indices.clearCache({ index: ALL_SAVED_OBJECT_INDICES }); to clean the request caches in a similar way to how we refresh the indices. This PR adds that logic to our getPrebuiltRulesStatus and the legacy getPrebuiltRulesAndTimelinesStatus helper.

2. Implement retry logic for security_detection_engine package installation

There's a retry tool that is used extensively in the FTR Integration tests that we can make use of in the helpers that install the package. In all cases for these utils, we want to install at least one asset. If no assets are present, that means that the installation failed, and that we should retry it.

So we can modify our helper to something like:

export const installPrebuiltRulesPackageByVersion = async (
  es: Client,
  supertest: SuperTest.SuperTest<SuperTest.Test>,
  version: string,
  retry: RetryService
): Promise<InstallPackageResponse> => {
  return retry.try(async () => {
    const fleetResponse = await supertest
      .post(epmRouteService.getInstallPath('security_detection_engine', version))
      .set('kbn-xsrf', 'xxxx')
      .set('elastic-api-version', '2023-10-31')
      .type('application/json')
      .send({ force: true })
      .expect(200);

    // This new expect clause makes sure that if no assets were installed, then
    // then we should retry - as this assertion will fail and cause the whole block to be retried until it passes
    expect((fleetResponse.body as InstallPackageResponse).items.length).toBeGreaterThan(0);

    await es.indices.refresh({ index: ALL_SAVED_OBJECT_INDICES });

    return fleetResponse.body as InstallPackageResponse;
  });
};

This is not implemented in this PR; I'd like your opinion first.

But my suggestion is: go with option 1 (as the PR stands) and monitor these tests. If they reappear, then the cache was not the problem, and we can try with the retry logic. I can have a PR ready for that case, so we'd only need to review it and merge it as soon (and if) the flake reappears.

@jpdjere jpdjere marked this pull request as ready for review December 28, 2023 16:36
@jpdjere jpdjere requested review from a team as code owners December 28, 2023 16:36
@jpdjere jpdjere self-assigned this Dec 28, 2023
@jpdjere jpdjere added release_note:skip Skip the PR/issue when compiling release notes Team:Detections and Resp Security Detection Response Team Team:Detection Rule Management Security Detection Rule Management Team labels Dec 28, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@jpdjere jpdjere added test Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules area labels Dec 28, 2023
@banderror
Copy link
Contributor

@jpdjere Yea, it seems to be a tough one.

Flaky test runs

First, about that.

I wasn't able to reproduce the flake reported in either of the two linked tickets - I've run the flaky test runner with 500 runs for each, and all were green.

In bundled_prebuilt_rules_package: Serverless: https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4695 (500 runs) you've got a version conflict error, though. So the flake is not fully fixed, and we need to investigate further before unskipping these tests.

I assume this is the same error as in #171428 -- have you got a response from @MadameSheema? Have you reached out to her elsewhere?

Secondly, why in Before changes: and After changes: we have different flaky test runner configurations? Can we have identical ones? Specifically, if we have 4 separate jobs of 500 runs each in Before, it would be great to do the same After the changes.

1. Cache for ES reading requests needs to be cleared

I suspect that in very specific cases (one in thousands) the cache for the request is not completely cleared after the first read, and even though we correctly install the prebuilt rule assets and refresh the indices, the subsequent read operations simply responds the cached value

This is a good hypothesis and we should take it into account, but it doesn't feel convincing to me from the probabilities perspective. Here there was a second test failure that occurred on the same day as the first one. The probability of that would be 1/1000000 in this case.

Another reason why it's not very convincing is that we don't seem to have the same issue with all our other requests and endpoints, e.g. rule management APIs. When we create custom rules in our tests, wouldn't we have the same issue?..

But my suggestion is: go with option 1 (as the PR stands) and monitor these tests. If they reappear, then the cache was not the problem, and we can try with the retry logic.

I would be ok to try this if we didn't get the version conflict error in that flaky test run. The tests are still flaky. I think we need to continue investigating this and doing more flaky test runs before we unskip them.

2. Implement retry logic for security_detection_engine package installation

This idea looks more robust to me and I'd suggest to implement it in this PR.

A few suggestions:

  • Let's first assert that fleetResponse.body.items.length is in the response, and then assert that it's > 0.
  • Let's set a small retry limit, such as 2 (1 initial attempt + 2 retries).
  • Let's set a short delay between the retries, such as 100 ms.
  • Let's refresh the indices after (outside of) the retries -- only if there was a successful attempt.
  • Let's also clear the cache after refreshing the indices.
  • Let's abstract away the two operations above in a reusable function.

@banderror banderror added the Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. label Dec 29, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

Copy link
Contributor

@vitaliidm vitaliidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detection engine area LGTM

+++ on exploring option with retries

@jpdjere jpdjere force-pushed the fix-flake-27-12-2023 branch 3 times, most recently from 040880b to 7de37c5 Compare January 3, 2024 00:00
@jpdjere jpdjere requested review from a team as code owners January 3, 2024 00:00
@jpdjere jpdjere requested a review from banderror January 8, 2024 23:19
@banderror
Copy link
Contributor

@jpdjere The flaky runner doesn't seem to be happy...

Copy link
Contributor

@banderror banderror left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM, thanks 👍
Approving the PR to not being a blocker for it. Please look into the failed flaky test runs.

@jpdjere
Copy link
Contributor Author

jpdjere commented Jan 10, 2024

@jpdjere The flaky runner doesn't seem to be happy...

Ah, seems I hadn't pushed before running them. I'll rerun them and merge if they are green.

Thanks for the review, feedback and discussion.

@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Jest Integration Tests #7 / fake elasticsearch should fail to start Kibana because of the Product Check Error
  • [job] [logs] Jest Integration Tests #7 / fake elasticsearch should return unknown product when it cannot perform the Product check (503 response)
  • [job] [logs] FTR Configs #56 / links panel links panel create and edit creation can create a new by-reference links panel

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
@kbn/ftr-common-functional-services 29 31 +2
Unknown metric groups

API count

id before after diff
@kbn/ftr-common-functional-services 29 31 +2

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @jpdjere

@jpdjere jpdjere merged commit 81d6478 into elastic:main Jan 11, 2024
21 checks passed
@kibanamachine
Copy link
Contributor

💔 All backports failed

Status Branch Result
8.12 Backport failed because of merge conflicts

Manual backport

To create the backport manually run:

node scripts/backport --pr 173998

Questions ?

Please refer to the Backport tool documentation

@jpdjere
Copy link
Contributor Author

jpdjere commented Jan 12, 2024

💚 All backports created successfully

Status Branch Result
8.12

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

jpdjere added a commit to jpdjere/kibana that referenced this pull request Jan 12, 2024
…ion tests (elastic#173998)

**Addresses:**
elastic#172107
elastic#171380

## Summary

Unskip skipped tests in:

1.
`x-pack/test/security_solution_api_integration/test_suites/detections_response/default_license/prebuilt_rules/bundled_prebuilt_rules_package/install_latest_bundled_prebuilt_rules.ts`
2.
`x-pack/test/security_solution_api_integration/test_suites/detections_response/default_license/prebuilt_rules/management/fleet_integration.ts`

- Makes the `retryDelay` in the **RetryService** in
`packages/kbn-ftr-common-functional-services/services/retry/retry.ts` a
configurable parameter - used in our `retry` util to shorten the wait
period to 200ms.
- Creates `retry` wrapper util for our FTR Integration tests, that wraps
`retry.try` from the **RetryService**, to implement maximum attempts.
- Uses this `retry` wrapper in test utils that install the
`security_detection_engine` Fleet package, and asserts that rules have
been succesfully installed - and retries if they haven't.
- Creates `refreshSavedObjectIndices` reusable util that cleans cache
and refreshes indices. Centralizes comment spread around file into this
single file.
- Use this new util to clean the cache in util that install the Fleet
package, and utils that read the prebuilt rule status.

## Flaky test runner

**Before changes:**
- For both `bundled_prebuilt_rules_package` and `management`:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4688
🟢 (250 and 250 runs)
- `bundled_prebuilt_rules_package`:
- ESS:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4805
(500 runs)
- Serverless:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4807
(500 runs)
- `management`
- ESS:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4806
(500 runs)
- Serverless:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4808
(500 runs)

**After changes:**
- `bundled_prebuilt_rules_package`:
- ESS:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4825
🟢 (500 runs)
- Serverless:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4826
🟢 (500 runs)
- `management`
- ESS:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4827
🟢 (500 runs)
- Serverless:
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4828
🟢 (500 runs)

### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: kibanamachine <[email protected]>
(cherry picked from commit 81d6478)

# Conflicts:
#	x-pack/test/security_solution_api_integration/tsconfig.json
jpdjere added a commit that referenced this pull request Jan 12, 2024
…ntegration tests (#173998) (#174761)

# Backport

This will backport the following commits from `main` to `8.12`:
- [[Security Solution] Unskip flaky tests in Prebuilt Rules FTR
Integration tests
(#173998)](#173998)

<!--- Backport version: 8.9.8 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Juan Pablo
Djeredjian","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-01-11T12:35:55Z","message":"[Security
Solution] Unskip flaky tests in Prebuilt Rules FTR Integration tests
(#173998)\n\n**Addresses:**\r\nhttps://github.com//issues/172107\r\nhttps://github.com//issues/171380\r\n\r\n##
Summary\r\n\r\nUnskip skipped tests
in:\r\n\r\n1.\r\n`x-pack/test/security_solution_api_integration/test_suites/detections_response/default_license/prebuilt_rules/bundled_prebuilt_rules_package/install_latest_bundled_prebuilt_rules.ts`\r\n2.\r\n`x-pack/test/security_solution_api_integration/test_suites/detections_response/default_license/prebuilt_rules/management/fleet_integration.ts`\r\n\r\n-
Makes the `retryDelay` in the **RetryService**
in\r\n`packages/kbn-ftr-common-functional-services/services/retry/retry.ts`
a\r\nconfigurable parameter - used in our `retry` util to shorten the
wait\r\nperiod to 200ms.\r\n- Creates `retry` wrapper util for our FTR
Integration tests, that wraps\r\n`retry.try` from the **RetryService**,
to implement maximum attempts.\r\n- Uses this `retry` wrapper in test
utils that install the\r\n`security_detection_engine` Fleet package, and
asserts that rules have\r\nbeen succesfully installed - and retries if
they haven't.\r\n- Creates `refreshSavedObjectIndices` reusable util
that cleans cache\r\nand refreshes indices. Centralizes comment spread
around file into this\r\nsingle file.\r\n- Use this new util to clean
the cache in util that install the Fleet\r\npackage, and utils that read
the prebuilt rule status.\r\n\r\n## Flaky test runner\r\n\r\n**Before
changes:** \r\n- For both `bundled_prebuilt_rules_package` and
`management`:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4688\r\n🟢
(250 and 250 runs)\r\n- `bundled_prebuilt_rules_package`:\r\n-
ESS:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4805\r\n(500
runs)\r\n-
Serverless:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4807\r\n(500
runs)\r\n- `management` \r\n-
ESS:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4806\r\n(500
runs)\r\n-
Serverless:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4808\r\n(500
runs)\r\n \r\n**After changes:** \r\n-
`bundled_prebuilt_rules_package`:\r\n-
ESS:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4825\r\n🟢
(500 runs)\r\n-
Serverless:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4826\r\n🟢
(500 runs)\r\n- `management` \r\n-
ESS:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4827\r\n🟢
(500 runs)\r\n-
Serverless:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4828\r\n🟢
(500 runs)\r\n\r\n### For maintainers\r\n\r\n- [ ] This was checked for
breaking API changes and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by:
kibanamachine
<[email protected]>","sha":"81d6478beedeceedd4ae193c7a5ba0ee874cbf12","branchLabelMapping":{"^v8.13.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["test","release_note:skip","Team:Detections
and Resp","Team: SecuritySolution","Team:Detection Rule
Management","Feature:Prebuilt Detection
Rules","v8.12.0","v8.12.1","v8.13.0"],"number":173998,"url":"https://github.com/elastic/kibana/pull/173998","mergeCommit":{"message":"[Security
Solution] Unskip flaky tests in Prebuilt Rules FTR Integration tests
(#173998)\n\n**Addresses:**\r\nhttps://github.com//issues/172107\r\nhttps://github.com//issues/171380\r\n\r\n##
Summary\r\n\r\nUnskip skipped tests
in:\r\n\r\n1.\r\n`x-pack/test/security_solution_api_integration/test_suites/detections_response/default_license/prebuilt_rules/bundled_prebuilt_rules_package/install_latest_bundled_prebuilt_rules.ts`\r\n2.\r\n`x-pack/test/security_solution_api_integration/test_suites/detections_response/default_license/prebuilt_rules/management/fleet_integration.ts`\r\n\r\n-
Makes the `retryDelay` in the **RetryService**
in\r\n`packages/kbn-ftr-common-functional-services/services/retry/retry.ts`
a\r\nconfigurable parameter - used in our `retry` util to shorten the
wait\r\nperiod to 200ms.\r\n- Creates `retry` wrapper util for our FTR
Integration tests, that wraps\r\n`retry.try` from the **RetryService**,
to implement maximum attempts.\r\n- Uses this `retry` wrapper in test
utils that install the\r\n`security_detection_engine` Fleet package, and
asserts that rules have\r\nbeen succesfully installed - and retries if
they haven't.\r\n- Creates `refreshSavedObjectIndices` reusable util
that cleans cache\r\nand refreshes indices. Centralizes comment spread
around file into this\r\nsingle file.\r\n- Use this new util to clean
the cache in util that install the Fleet\r\npackage, and utils that read
the prebuilt rule status.\r\n\r\n## Flaky test runner\r\n\r\n**Before
changes:** \r\n- For both `bundled_prebuilt_rules_package` and
`management`:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4688\r\n🟢
(250 and 250 runs)\r\n- `bundled_prebuilt_rules_package`:\r\n-
ESS:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4805\r\n(500
runs)\r\n-
Serverless:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4807\r\n(500
runs)\r\n- `management` \r\n-
ESS:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4806\r\n(500
runs)\r\n-
Serverless:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4808\r\n(500
runs)\r\n \r\n**After changes:** \r\n-
`bundled_prebuilt_rules_package`:\r\n-
ESS:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4825\r\n🟢
(500 runs)\r\n-
Serverless:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4826\r\n🟢
(500 runs)\r\n- `management` \r\n-
ESS:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4827\r\n🟢
(500 runs)\r\n-
Serverless:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4828\r\n🟢
(500 runs)\r\n\r\n### For maintainers\r\n\r\n- [ ] This was checked for
breaking API changes and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by:
kibanamachine
<[email protected]>","sha":"81d6478beedeceedd4ae193c7a5ba0ee874cbf12"}},"sourceBranch":"main","suggestedTargetBranches":["8.12"],"targetPullRequestStates":[{"branch":"8.12","label":"v8.12.0","labelRegex":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"main","label":"v8.13.0","labelRegex":"^v8.13.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/173998","number":173998,"mergeCommit":{"message":"[Security
Solution] Unskip flaky tests in Prebuilt Rules FTR Integration tests
(#173998)\n\n**Addresses:**\r\nhttps://github.com//issues/172107\r\nhttps://github.com//issues/171380\r\n\r\n##
Summary\r\n\r\nUnskip skipped tests
in:\r\n\r\n1.\r\n`x-pack/test/security_solution_api_integration/test_suites/detections_response/default_license/prebuilt_rules/bundled_prebuilt_rules_package/install_latest_bundled_prebuilt_rules.ts`\r\n2.\r\n`x-pack/test/security_solution_api_integration/test_suites/detections_response/default_license/prebuilt_rules/management/fleet_integration.ts`\r\n\r\n-
Makes the `retryDelay` in the **RetryService**
in\r\n`packages/kbn-ftr-common-functional-services/services/retry/retry.ts`
a\r\nconfigurable parameter - used in our `retry` util to shorten the
wait\r\nperiod to 200ms.\r\n- Creates `retry` wrapper util for our FTR
Integration tests, that wraps\r\n`retry.try` from the **RetryService**,
to implement maximum attempts.\r\n- Uses this `retry` wrapper in test
utils that install the\r\n`security_detection_engine` Fleet package, and
asserts that rules have\r\nbeen succesfully installed - and retries if
they haven't.\r\n- Creates `refreshSavedObjectIndices` reusable util
that cleans cache\r\nand refreshes indices. Centralizes comment spread
around file into this\r\nsingle file.\r\n- Use this new util to clean
the cache in util that install the Fleet\r\npackage, and utils that read
the prebuilt rule status.\r\n\r\n## Flaky test runner\r\n\r\n**Before
changes:** \r\n- For both `bundled_prebuilt_rules_package` and
`management`:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4688\r\n🟢
(250 and 250 runs)\r\n- `bundled_prebuilt_rules_package`:\r\n-
ESS:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4805\r\n(500
runs)\r\n-
Serverless:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4807\r\n(500
runs)\r\n- `management` \r\n-
ESS:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4806\r\n(500
runs)\r\n-
Serverless:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4808\r\n(500
runs)\r\n \r\n**After changes:** \r\n-
`bundled_prebuilt_rules_package`:\r\n-
ESS:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4825\r\n🟢
(500 runs)\r\n-
Serverless:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4826\r\n🟢
(500 runs)\r\n- `management` \r\n-
ESS:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4827\r\n🟢
(500 runs)\r\n-
Serverless:\r\nhttps://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/4828\r\n🟢
(500 runs)\r\n\r\n### For maintainers\r\n\r\n- [ ] This was checked for
breaking API changes and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by:
kibanamachine
<[email protected]>","sha":"81d6478beedeceedd4ae193c7a5ba0ee874cbf12"}}]}]
BACKPORT-->

---------

Co-authored-by: kibanamachine <[email protected]>
@mistic mistic removed the v8.12.0 label Jan 17, 2024
maryam-saeidi added a commit that referenced this pull request Mar 19, 2024
…hod from retryService (#178515)

Related to #176401, #175776

## Summary

This PR:

- Improves logging (I've added debug logs to the helpers that does an
API request such as creating a data view)
- Uses retryService instead of pRetry
- In case of throwing an error in pRetry, when we have 10 retries, it
does not log the retry attempts and we end up in the situation that is
mentioned in this [comment, item
3](#176401 (comment))
    
|Before|After|
|---|---|

|![image](https://github.com/elastic/kibana/assets/12370520/576146f2-09da-4221-a570-6d47e047f229)|![image](https://github.com/elastic/kibana/assets/12370520/0a0897a3-0bd3-4d44-9b79-8f99fb580b4a)|
- Attempts to fix flakiness in rate reason message due to having
different data

![image](https://github.com/elastic/kibana/assets/12370520/dff48ac1-a9bf-4b93-addb-fd40acae382e)


### Flaky test runner
#### Current (after adding refresh index and adjusting timeout)
- [25]
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5463
✅
- [200]
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5465
✅

#### Old
- [25]
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5452
✅
- [200]
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5454
[1 Failed : 25 Canceled: 174 Passed ]
  ##### After checking data is generated in metric threshold
- [25]
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5460
✅
- [200]
https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/5462
[1 Failed : 199 Canceled ]

Inspired by #173998, special
thanks to @jpdjere and @dmlemeshko for their support and knowledge
sharing.
dmlemeshko added a commit that referenced this pull request Mar 19, 2024
## Summary

I took the retry wrapper created by
[jpdjere](https://github.com/jpdjere) in #173998 and extended
`retryForSuccess<T>` with required capabilities to provide the same
functionality.

This PR:
1) extends retry service with new function `tryWithRetries<T> => :
Promise<T>` to retry block `options.retryCount` times within
`options.timeout` period and return block result

```ts
  const response = await retry.tryWithRetries<SearchResponse>(
   'search request',
    async () => {
        const response = await supertest
          .post(`/internal/search/es`)
          .set(ELASTIC_HTTP_VERSION_HEADER, '1')
          .send({
            params: {
              body: {
                query: {
                  match_all: {},
                },
              },
            },
          })
          .expect(200);
        return response.body as SearchResponse;
    },
    {
      retryCount: 4,
      retryDelay: 100, // optional
      timeout: 30000, // optional
    }
```

2) removes `utils/retry.ts` wrapper and migrate tests to FTR Retry
service
3) Adds descriptions to Retry service functions explaining the default
use case

How the failures look like:

- when reached timeout before retry count limit 
```
 Error: retry.tryWithRetries reached timeout 11000 ms waiting for 'run console request'
 Error: expected '# Click the Variables button, above, to create your own variables.\nGET ${exampleVariable1} // _search\n{\n  "query": {\n    "${exampleVariable2}": {} // match_all\n  }\n}' to sort of equal '5'
     at Assertion.assert (expect.js:100:11)
     at Assertion.eql (expect.js:244:8)
     at _console.ts:47:41
     at processTicksAndRejections (node:internal/process/task_queues:95:5)
     at runAttempt (retry_with_retries.ts:29:15)
     at retryWithRetries (retry_with_retries.ts:97:21)
     at RetryService.tryForTime (retry.ts:38:12)
     at Context.<anonymous> (_console.ts:44:7)
     at Object.apply (wrap_function.js:73:16)
```
- when reached retry count limit before timeout
```
 Error: retry.tryWithRetries reached the limit of attempts waiting for 'run console request': 2 out of 2
 Error: expected '# Click the Variables button, above, to create your own variables.\nGET ${exampleVariable1} // _search\n{\n  "query": {\n    "${exampleVariable2}": {} // match_all\n  }\n}' to sort of equal '5'
     at Assertion.assert (expect.js:100:11)
     at Assertion.eql (expect.js:244:8)
     at _console.ts:47:41
     at processTicksAndRejections (node:internal/process/task_queues:95:5)
     at runAttempt (retry_for_success.ts:29:15)
     at retryForSuccess (retry_for_success.ts:97:21)
     at RetryService.tryWithRetries (retry.ts:115:12)
     at Context.<anonymous> (_console.ts:44:7)
     at Object.apply (wrap_function.js:73:16)
```
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Mar 19, 2024
…8660)

## Summary

I took the retry wrapper created by
[jpdjere](https://github.com/jpdjere) in elastic#173998 and extended
`retryForSuccess<T>` with required capabilities to provide the same
functionality.

This PR:
1) extends retry service with new function `tryWithRetries<T> => :
Promise<T>` to retry block `options.retryCount` times within
`options.timeout` period and return block result

```ts
  const response = await retry.tryWithRetries<SearchResponse>(
   'search request',
    async () => {
        const response = await supertest
          .post(`/internal/search/es`)
          .set(ELASTIC_HTTP_VERSION_HEADER, '1')
          .send({
            params: {
              body: {
                query: {
                  match_all: {},
                },
              },
            },
          })
          .expect(200);
        return response.body as SearchResponse;
    },
    {
      retryCount: 4,
      retryDelay: 100, // optional
      timeout: 30000, // optional
    }
```

2) removes `utils/retry.ts` wrapper and migrate tests to FTR Retry
service
3) Adds descriptions to Retry service functions explaining the default
use case

How the failures look like:

- when reached timeout before retry count limit
```
 Error: retry.tryWithRetries reached timeout 11000 ms waiting for 'run console request'
 Error: expected '# Click the Variables button, above, to create your own variables.\nGET ${exampleVariable1} // _search\n{\n  "query": {\n    "${exampleVariable2}": {} // match_all\n  }\n}' to sort of equal '5'
     at Assertion.assert (expect.js:100:11)
     at Assertion.eql (expect.js:244:8)
     at _console.ts:47:41
     at processTicksAndRejections (node:internal/process/task_queues:95:5)
     at runAttempt (retry_with_retries.ts:29:15)
     at retryWithRetries (retry_with_retries.ts:97:21)
     at RetryService.tryForTime (retry.ts:38:12)
     at Context.<anonymous> (_console.ts:44:7)
     at Object.apply (wrap_function.js:73:16)
```
- when reached retry count limit before timeout
```
 Error: retry.tryWithRetries reached the limit of attempts waiting for 'run console request': 2 out of 2
 Error: expected '# Click the Variables button, above, to create your own variables.\nGET ${exampleVariable1} // _search\n{\n  "query": {\n    "${exampleVariable2}": {} // match_all\n  }\n}' to sort of equal '5'
     at Assertion.assert (expect.js:100:11)
     at Assertion.eql (expect.js:244:8)
     at _console.ts:47:41
     at processTicksAndRejections (node:internal/process/task_queues:95:5)
     at runAttempt (retry_for_success.ts:29:15)
     at retryForSuccess (retry_for_success.ts:97:21)
     at RetryService.tryWithRetries (retry.ts:115:12)
     at Context.<anonymous> (_console.ts:44:7)
     at Object.apply (wrap_function.js:73:16)
```

(cherry picked from commit 277b3fb)
kibanamachine added a commit that referenced this pull request Mar 19, 2024
…8660) (#178978)

# Backport

This will backport the following commits from `main` to `8.13`:
- [[kbn-ftr-common-functional-services] extend retry service
(#178660)](#178660)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Dzmitry
Lemechko","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-03-19T15:32:59Z","message":"[kbn-ftr-common-functional-services]
extend retry service (#178660)\n\n## Summary\r\n\r\nI took the retry
wrapper created by\r\n[jpdjere](https://github.com/jpdjere) in #173998
and extended\r\n`retryForSuccess<T>` with required capabilities to
provide the same\r\nfunctionality.\r\n\r\nThis PR:\r\n1) extends retry
service with new function `tryWithRetries<T> => :\r\nPromise<T>` to
retry block `options.retryCount` times within\r\n`options.timeout`
period and return block result\r\n\r\n```ts\r\n const response = await
retry.tryWithRetries<SearchResponse>(\r\n 'search request',\r\n async ()
=> {\r\n const response = await supertest\r\n
.post(`/internal/search/es`)\r\n .set(ELASTIC_HTTP_VERSION_HEADER,
'1')\r\n .send({\r\n params: {\r\n body: {\r\n query: {\r\n match_all:
{},\r\n },\r\n },\r\n },\r\n })\r\n .expect(200);\r\n return
response.body as SearchResponse;\r\n },\r\n {\r\n retryCount: 4,\r\n
retryDelay: 100, // optional\r\n timeout: 30000, // optional\r\n
}\r\n```\r\n\r\n2) removes `utils/retry.ts` wrapper and migrate tests to
FTR Retry\r\nservice\r\n3) Adds descriptions to Retry service functions
explaining the default\r\nuse case\r\n\r\nHow the failures look
like:\r\n\r\n- when reached timeout before retry count limit \r\n```\r\n
Error: retry.tryWithRetries reached timeout 11000 ms waiting for 'run
console request'\r\n Error: expected '# Click the Variables button,
above, to create your own variables.\\nGET ${exampleVariable1} //
_search\\n{\\n \"query\": {\\n \"${exampleVariable2}\": {} //
match_all\\n }\\n}' to sort of equal '5'\r\n at Assertion.assert
(expect.js:100:11)\r\n at Assertion.eql (expect.js:244:8)\r\n at
_console.ts:47:41\r\n at processTicksAndRejections
(node:internal/process/task_queues:95:5)\r\n at runAttempt
(retry_with_retries.ts:29:15)\r\n at retryWithRetries
(retry_with_retries.ts:97:21)\r\n at RetryService.tryForTime
(retry.ts:38:12)\r\n at Context.<anonymous> (_console.ts:44:7)\r\n at
Object.apply (wrap_function.js:73:16)\r\n```\r\n- when reached retry
count limit before timeout\r\n```\r\n Error: retry.tryWithRetries
reached the limit of attempts waiting for 'run console request': 2 out
of 2\r\n Error: expected '# Click the Variables button, above, to create
your own variables.\\nGET ${exampleVariable1} // _search\\n{\\n
\"query\": {\\n \"${exampleVariable2}\": {} // match_all\\n }\\n}' to
sort of equal '5'\r\n at Assertion.assert (expect.js:100:11)\r\n at
Assertion.eql (expect.js:244:8)\r\n at _console.ts:47:41\r\n at
processTicksAndRejections (node:internal/process/task_queues:95:5)\r\n
at runAttempt (retry_for_success.ts:29:15)\r\n at retryForSuccess
(retry_for_success.ts:97:21)\r\n at RetryService.tryWithRetries
(retry.ts:115:12)\r\n at Context.<anonymous> (_console.ts:44:7)\r\n at
Object.apply
(wrap_function.js:73:16)\r\n```","sha":"277b3fbc24889bedd512f23674e768d2f4c43294","branchLabelMapping":{"^v8.14.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","FTR","v8.14.0","v8.13.1"],"title":"[kbn-ftr-common-functional-services]
extend retry
service","number":178660,"url":"https://github.com/elastic/kibana/pull/178660","mergeCommit":{"message":"[kbn-ftr-common-functional-services]
extend retry service (#178660)\n\n## Summary\r\n\r\nI took the retry
wrapper created by\r\n[jpdjere](https://github.com/jpdjere) in #173998
and extended\r\n`retryForSuccess<T>` with required capabilities to
provide the same\r\nfunctionality.\r\n\r\nThis PR:\r\n1) extends retry
service with new function `tryWithRetries<T> => :\r\nPromise<T>` to
retry block `options.retryCount` times within\r\n`options.timeout`
period and return block result\r\n\r\n```ts\r\n const response = await
retry.tryWithRetries<SearchResponse>(\r\n 'search request',\r\n async ()
=> {\r\n const response = await supertest\r\n
.post(`/internal/search/es`)\r\n .set(ELASTIC_HTTP_VERSION_HEADER,
'1')\r\n .send({\r\n params: {\r\n body: {\r\n query: {\r\n match_all:
{},\r\n },\r\n },\r\n },\r\n })\r\n .expect(200);\r\n return
response.body as SearchResponse;\r\n },\r\n {\r\n retryCount: 4,\r\n
retryDelay: 100, // optional\r\n timeout: 30000, // optional\r\n
}\r\n```\r\n\r\n2) removes `utils/retry.ts` wrapper and migrate tests to
FTR Retry\r\nservice\r\n3) Adds descriptions to Retry service functions
explaining the default\r\nuse case\r\n\r\nHow the failures look
like:\r\n\r\n- when reached timeout before retry count limit \r\n```\r\n
Error: retry.tryWithRetries reached timeout 11000 ms waiting for 'run
console request'\r\n Error: expected '# Click the Variables button,
above, to create your own variables.\\nGET ${exampleVariable1} //
_search\\n{\\n \"query\": {\\n \"${exampleVariable2}\": {} //
match_all\\n }\\n}' to sort of equal '5'\r\n at Assertion.assert
(expect.js:100:11)\r\n at Assertion.eql (expect.js:244:8)\r\n at
_console.ts:47:41\r\n at processTicksAndRejections
(node:internal/process/task_queues:95:5)\r\n at runAttempt
(retry_with_retries.ts:29:15)\r\n at retryWithRetries
(retry_with_retries.ts:97:21)\r\n at RetryService.tryForTime
(retry.ts:38:12)\r\n at Context.<anonymous> (_console.ts:44:7)\r\n at
Object.apply (wrap_function.js:73:16)\r\n```\r\n- when reached retry
count limit before timeout\r\n```\r\n Error: retry.tryWithRetries
reached the limit of attempts waiting for 'run console request': 2 out
of 2\r\n Error: expected '# Click the Variables button, above, to create
your own variables.\\nGET ${exampleVariable1} // _search\\n{\\n
\"query\": {\\n \"${exampleVariable2}\": {} // match_all\\n }\\n}' to
sort of equal '5'\r\n at Assertion.assert (expect.js:100:11)\r\n at
Assertion.eql (expect.js:244:8)\r\n at _console.ts:47:41\r\n at
processTicksAndRejections (node:internal/process/task_queues:95:5)\r\n
at runAttempt (retry_for_success.ts:29:15)\r\n at retryForSuccess
(retry_for_success.ts:97:21)\r\n at RetryService.tryWithRetries
(retry.ts:115:12)\r\n at Context.<anonymous> (_console.ts:44:7)\r\n at
Object.apply
(wrap_function.js:73:16)\r\n```","sha":"277b3fbc24889bedd512f23674e768d2f4c43294"}},"sourceBranch":"main","suggestedTargetBranches":["8.13"],"targetPullRequestStates":[{"branch":"main","label":"v8.14.0","branchLabelMappingKey":"^v8.14.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/178660","number":178660,"mergeCommit":{"message":"[kbn-ftr-common-functional-services]
extend retry service (#178660)\n\n## Summary\r\n\r\nI took the retry
wrapper created by\r\n[jpdjere](https://github.com/jpdjere) in #173998
and extended\r\n`retryForSuccess<T>` with required capabilities to
provide the same\r\nfunctionality.\r\n\r\nThis PR:\r\n1) extends retry
service with new function `tryWithRetries<T> => :\r\nPromise<T>` to
retry block `options.retryCount` times within\r\n`options.timeout`
period and return block result\r\n\r\n```ts\r\n const response = await
retry.tryWithRetries<SearchResponse>(\r\n 'search request',\r\n async ()
=> {\r\n const response = await supertest\r\n
.post(`/internal/search/es`)\r\n .set(ELASTIC_HTTP_VERSION_HEADER,
'1')\r\n .send({\r\n params: {\r\n body: {\r\n query: {\r\n match_all:
{},\r\n },\r\n },\r\n },\r\n })\r\n .expect(200);\r\n return
response.body as SearchResponse;\r\n },\r\n {\r\n retryCount: 4,\r\n
retryDelay: 100, // optional\r\n timeout: 30000, // optional\r\n
}\r\n```\r\n\r\n2) removes `utils/retry.ts` wrapper and migrate tests to
FTR Retry\r\nservice\r\n3) Adds descriptions to Retry service functions
explaining the default\r\nuse case\r\n\r\nHow the failures look
like:\r\n\r\n- when reached timeout before retry count limit \r\n```\r\n
Error: retry.tryWithRetries reached timeout 11000 ms waiting for 'run
console request'\r\n Error: expected '# Click the Variables button,
above, to create your own variables.\\nGET ${exampleVariable1} //
_search\\n{\\n \"query\": {\\n \"${exampleVariable2}\": {} //
match_all\\n }\\n}' to sort of equal '5'\r\n at Assertion.assert
(expect.js:100:11)\r\n at Assertion.eql (expect.js:244:8)\r\n at
_console.ts:47:41\r\n at processTicksAndRejections
(node:internal/process/task_queues:95:5)\r\n at runAttempt
(retry_with_retries.ts:29:15)\r\n at retryWithRetries
(retry_with_retries.ts:97:21)\r\n at RetryService.tryForTime
(retry.ts:38:12)\r\n at Context.<anonymous> (_console.ts:44:7)\r\n at
Object.apply (wrap_function.js:73:16)\r\n```\r\n- when reached retry
count limit before timeout\r\n```\r\n Error: retry.tryWithRetries
reached the limit of attempts waiting for 'run console request': 2 out
of 2\r\n Error: expected '# Click the Variables button, above, to create
your own variables.\\nGET ${exampleVariable1} // _search\\n{\\n
\"query\": {\\n \"${exampleVariable2}\": {} // match_all\\n }\\n}' to
sort of equal '5'\r\n at Assertion.assert (expect.js:100:11)\r\n at
Assertion.eql (expect.js:244:8)\r\n at _console.ts:47:41\r\n at
processTicksAndRejections (node:internal/process/task_queues:95:5)\r\n
at runAttempt (retry_for_success.ts:29:15)\r\n at retryForSuccess
(retry_for_success.ts:97:21)\r\n at RetryService.tryWithRetries
(retry.ts:115:12)\r\n at Context.<anonymous> (_console.ts:44:7)\r\n at
Object.apply
(wrap_function.js:73:16)\r\n```","sha":"277b3fbc24889bedd512f23674e768d2f4c43294"}},{"branch":"8.13","label":"v8.13.1","branchLabelMappingKey":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Dzmitry Lemechko <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules area release_note:skip Skip the PR/issue when compiling release notes Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. test v8.12.1 v8.13.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants