
Improve logging in API integration test and replace pRetry with a method from retryService #178515

Merged

Conversation

maryam-saeidi (Member) commented on Mar 12, 2024

Related to #176401, #175776

Summary

This PR:

  • Improves logging (I've added debug logs to the helpers that make API requests, such as creating a data view; see the sketch after this list)
  • Uses retryService instead of pRetry
    • When pRetry throws an error with, say, 10 retries, it does not log the individual retry attempts, and we end up in the situation mentioned in this comment, item 3
[Before/After screenshots of the retry log output]
  • Attempts to fix flakiness in the rate reason message due to having different data
    [screenshot]
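
As a rough illustration of the logging change, here is a minimal sketch of an API helper with debug logs. This is not the PR's actual code; the helper shape is an assumption, using Kibana's data views API:

```ts
import type { ToolingLog } from '@kbn/tooling-log';
import type { SuperTest, Test } from 'supertest';

// Hypothetical helper: debug logs before and after the request make it
// easier to tell from CI output which setup step a failure belongs to.
export const createDataView = async (
  supertest: SuperTest<Test>,
  { title, logger }: { title: string; logger: ToolingLog }
) => {
  logger.debug(`Creating data view: ${title}`);
  const { body } = await supertest
    .post('/api/data_views/data_view')
    .set('kbn-xsrf', 'true')
    .send({ data_view: { title } })
    .expect(200);
  logger.debug(`Created data view: ${title}`);
  return body;
};
```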

Flaky test runner

Current (after adding refresh index and adjusting timeout)

Old

Inspired by #173998, special thanks to @jpdjere and @dmlemeshko for their support and knowledge sharing.

maryam-saeidi added the release_note:skip (Skip the PR/issue when compiling release notes) label on Mar 12, 2024
apmmachine (Contributor) commented:
🤖 GitHub comments


Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • /oblt-deploy-serverless : Deploy a serverless Kibana instance using the Observability test environments.
  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

```ts
start: 'now-10m',
end: 'now+5m',
metrics: [
  { name: 'system.network.in.bytes', method: 'linear', start: 0, end: 54000000 },
```
maryam-saeidi (Member, Author) commented:

I've adjusted the data generation schema and related time range (start, end) to fix the following issue:

[screenshot of the flaky rate reason message]

Special thanks to @simianhacker for helping with the math aspect of it!
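
For context on the math: the window above ('now-10m' to 'now+5m') spans 15 minutes, and the metric ramps linearly from 0 to 54,000,000 bytes. Assuming the rate condition evaluates bytes per second over the window (my reading, not stated in the thread), the expected value is:

$$\text{rate} = \frac{54{,}000{,}000\ \text{bytes}}{15 \times 60\ \text{s}} = 60{,}000\ \text{bytes/s}$$

A constant, predictable rate like this should make the reason message deterministic across runs.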

@maryam-saeidi maryam-saeidi marked this pull request as ready for review March 12, 2024 16:17
@maryam-saeidi maryam-saeidi requested review from a team as code owners March 12, 2024 16:17
```ts
import type { ToolingLog } from '@kbn/tooling-log';

/**
 * Copied from x-pack/test/security_solution_api_integration/test_suites/detections_response/utils/retry.ts
```
maryam-saeidi (Member, Author) commented:
Create a ticket to share this logic via retryService or a package or ...
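
For reference, a minimal sketch of the kind of wrapper that retry.ts provides, assuming the FTR retry service's tryForTime method; the option names here are illustrative, not a verbatim copy:

```ts
import type { ToolingLog } from '@kbn/tooling-log';
import type { RetryService } from '@kbn/ftr-common-functional-services';

interface RetryOptions<T> {
  test: () => Promise<T>; // the operation to retry
  utilityName: string; // makes the log lines traceable to a helper
  retryService: RetryService;
  timeout?: number; // total time budget in ms
  logger: ToolingLog;
}

// Unlike a bare pRetry call, every failed attempt is logged here, so the
// CI output shows why each retry happened instead of one opaque failure.
export const retry = async <T>({
  test,
  utilityName,
  retryService,
  timeout = 60_000,
  logger,
}: RetryOptions<T>): Promise<T> =>
  retryService.tryForTime(timeout, async () => {
    try {
      return await test();
    } catch (error) {
      logger.error(`Retrying ${utilityName}: ${error}`);
      throw error;
    }
  });
```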

```ts
});
await esDeleteAllIndices([ALERT_ACTION_INDEX, ...dataForgeIndices]);
await cleanup({ client: esClient, config: dataForgeConfig, logger });
});

// FLAKY: https://github.com/elastic/kibana/issues/175360
describe.skip('Rule creation', () => {
```
maryam-saeidi (Member, Author) commented:
I think this was fixed in #175479

benakansara (Contributor) left a comment:

I checked the test locally and it passes. 👍

In case of a test failure, I don't get the message about the maximum retry count being reached, as mentioned in the PR. Has that changed?

Also, in this case, 120 retries are made, one every 0.5 seconds. I think we could optimize this by retrying every 2-3 seconds and limiting retries to 10-20. Wdyt?

[screen recording: Screen.Recording.2024-03-18.at.12.49.11.mov]

```ts
import type { Client } from '@elastic/elasticsearch';
import { ALL_SAVED_OBJECT_INDICES } from '@kbn/core-saved-objects-server';

export const refreshSavedObjectIndices = async (es: Client) => {
  // Refresh indices to prevent a race condition between a write and subsequent read
  // operation. To fix it deterministically we refresh saved object indices and wait.
  await es.indices.refresh({ index: ALL_SAVED_OBJECT_INDICES });
};
```
benakansara (Contributor) commented:
Out of curiosity: does this only apply to saved object indices, or could we also add kbn-data-forge indices to make sure test data is available after indexing when the tests run?

maryam-saeidi (Member, Author) replied:
This only applies to the saved objects. When we update the rule SO after execution, we set refresh: false, so I added this refresh to ensure the data is searchable.
I am not sure whether similar logic applies to kbn-data-forge. It is worth checking, but let's do that outside of this PR :)
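
As a usage sketch of where the refresh fits (the createRule and getRuleStatus helpers below are hypothetical stand-ins, not the suite's actual helpers):

```ts
import expect from '@kbn/expect';

// Hypothetical test flow: refresh the saved object indices after the rule
// executes, so the read below sees the rule SO written with refresh: false.
it('reports the rule as active', async () => {
  const ruleId = await createRule({ /* ... */ }); // hypothetical helper
  await refreshSavedObjectIndices(es);
  const status = await getRuleStatus(ruleId); // hypothetical helper
  expect(status).to.eql('active');
});
```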

pmuellr (Member) left a comment:

LGTM, but I left a comment about stringifying the error message in the new retry function.

```ts
    retryAttempt - 1
  }/${retries}`;
  logger.error(errorMessage);
  return new Error(JSON.stringify(errorMessage));
```
pmuellr (Member) commented:
Probably don't need to JSON.stringify() errorMessage, do we? Seems like it can only be a string at this point ...
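
A sketch of the simplification being suggested: since errorMessage is already a string at this point, it can be passed straight through (JSON.stringify would only wrap it in escaped quotes):

```ts
logger.error(errorMessage);
return new Error(errorMessage);
```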

maryam-saeidi (Member, Author) replied:

@benakansara

> I checked the test locally and it passes. 👍
>
> In case of a test failure, I don't get the message about the maximum retry count being reached, as mentioned in the PR. Has that changed?

I will check the message.

> Also, in this case, 120 retries are made, one every 0.5 seconds. I think we could optimize this by retrying every 2-3 seconds and limiting retries to 10-20. Wdyt?

This is actually intentional: if we wait 2-3 seconds between attempts, it adds to the overall test runtime. In the normal case the result comes back within about 2 seconds, so we will not actually make that many attempts. In the alertingApi client the waiting time is 2 minutes with a retry every 0.5 seconds, which seemed a bit too much to me, so I've set the waiting time to 1 minute instead.
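
For the arithmetic behind those numbers (assuming the attempt count is roughly timeout divided by delay):

```ts
// 1-minute budget with a 0.5 s delay between attempts:
const timeoutMs = 60_000;
const retryDelayMs = 500;
const maxAttempts = timeoutMs / retryDelayMs; // 120 attempts
// The alertingApi client's 2-minute budget at the same delay would allow 240.
```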

benakansara (Contributor) replied:

> This is actually intentional: if we wait 2-3 seconds between attempts, it adds to the overall test runtime. In the normal case the result comes back within about 2 seconds, so we will not actually make that many attempts.

If we increase the wait time between retries, we need to decrease the total number of retries so that it doesn't increase the overall test runtime. In a flaky failure like "rule is active", the API call returns successfully (in 1-2 seconds or maybe less), but the rule status is "ok" and never changes to "active" (as in the screen recording in my previous comment). Tbh, 120 retries seems a bit much, but I assume this only happens for flaky tests, and I don't know if there is any downside to having many retries. So I'll leave it up to you whether this should be optimized.

maryam-saeidi (Member, Author) replied:

@benakansara The different message you saw was because the timeout was reached before the maximum number of retries, so I've adjusted the timeout a bit in 535dab2

About changing the delay: the 2 seconds refers to when the rule is ready, not to receiving the first API call, and I don't think it will exceed 5 seconds in total, so I wouldn't worry about it. I want the tests to run as fast as they can, so let's keep it as it is and adjust it in the future if the need arises. (I will also create a ticket to replace our utilities with alertingApi, which would change this to a 0.5s delay for 2 minutes 🙈, but maybe we can discuss it with ResponseOps and come up with an agreement.)

@maryam-saeidi maryam-saeidi enabled auto-merge (squash) March 19, 2024 11:34
benakansara (Contributor) left a comment:

LGTM

kibana-ci (Collaborator) commented:

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Labels
backport:skip (This commit does not require backporting), release_note:skip (Skip the PR/issue when compiling release notes), test-failure-flaky, v8.14.0