Improve logging in API integration test and replace pRetry with a method from retryService #178515
Conversation
start: 'now-10m',
end: 'now+5m',
metrics: [
  { name: 'system.network.in.bytes', method: 'linear', start: 0, end: 54000000 },
I've adjusted the data generation schema and related time range (start, end) to fix the following issue:
Special thanks to @simianhacker for helping with the math aspect of it!
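A quick back-of-the-envelope check of those numbers (a sketch of my own, assuming kbn-data-forge's `linear` method ramps the counter evenly from `start` to `end` across the window, which is worth verifying against the data-forge docs):

```ts
// Rough sanity check (not from the PR): with a linear ramp of the counter across the
// 15-minute window ('now-10m' to 'now+5m'), the derived bytes-per-second rate is constant.
const windowSeconds = 15 * 60;
const startBytes = 0;
const endBytes = 54_000_000;

const bytesPerSecond = (endBytes - startBytes) / windowSeconds; // 60_000 bytes/s

console.log(`expected network-in rate: ${bytesPerSecond} bytes/s`);
```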
import type { ToolingLog } from '@kbn/tooling-log';

/**
 * Copied from x-pack/test/security_solution_api_integration/test_suites/detections_response/utils/retry.ts
Create a ticket to share this logic via retryService or a package or ...
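For what it's worth, a rough sketch of what consuming the shared logic through the FTR retry service could look like (my own sketch, not code from this PR; `getRuleExecutionStatus` is a hypothetical helper, and the exact retry-service method names should be double-checked):

```ts
// Sketch only: lean on the FTR services instead of keeping a local copy of the pRetry helper.
// The FtrProviderContext import path varies per test suite.
import type { FtrProviderContext } from '../ftr_provider_context';

// Hypothetical helper, declared here only to keep the sketch self-contained.
declare function getRuleExecutionStatus(): Promise<string>;

export default function ({ getService }: FtrProviderContext) {
  const retry = getService('retry');
  const log = getService('log');

  it('waits for the rule to become active', async () => {
    // retry.tryForTime re-runs the block until it stops throwing or the timeout elapses.
    await retry.tryForTime(60_000, async () => {
      const status = await getRuleExecutionStatus();
      log.debug(`rule execution status: ${status}`);
      if (status !== 'active') {
        throw new Error(`expected rule status 'active', got '${status}'`);
      }
    });
  });
}
```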
});
await esDeleteAllIndices([ALERT_ACTION_INDEX, ...dataForgeIndices]);
await cleanup({ client: esClient, config: dataForgeConfig, logger });
});

// FLAKY: https://github.com/elastic/kibana/issues/175360
describe.skip('Rule creation', () => {
I think this was fixed in #175479
I checked the test locally and it passes. 👍
In case of a test failure, I don't get the message about the maximum retry count being reached that is mentioned in the PR. Has that changed?
Also, in this case, 120 retries are made every 0.5 seconds. I think we can optimize this by retrying every 2-3 seconds and limiting retries to 10-20. Wdyt?
Screen.Recording.2024-03-18.at.12.49.11.mov
export const refreshSavedObjectIndices = async (es: Client) => {
  // Refresh indices to prevent a race condition between a write and subsequent read operation. To
  // fix it deterministically we have to refresh saved object indices and wait until it's done.
  await es.indices.refresh({ index: ALL_SAVED_OBJECT_INDICES });
For my curiosity - does this only apply to saved object indices, or could we also add the kbn-data-forge indices to make sure test data is available after indexing when the tests run?
This only applies to the saved objects: when we update the rule SO after execution, we set refresh: false, so I added this refresh to ensure the data is searchable.
I am not sure whether similar logic is applicable to kbn-data-forge. It is worth checking, but let's do it outside of this PR :)
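If someone does pick this up later, a minimal sketch (mine, not from this PR) could mirror refreshSavedObjectIndices for the generated indices; `dataForgeIndices` is assumed to be the same list used in the cleanup hook above:

```ts
// Hypothetical follow-up (outside this PR): refresh the kbn-data-forge indices so that
// freshly indexed documents are searchable before the assertions run.
import type { Client } from '@elastic/elasticsearch';

export const refreshDataForgeIndices = async (es: Client, dataForgeIndices: string[]) => {
  // ignore_unavailable avoids failures if an index has not been created yet
  await es.indices.refresh({ index: dataForgeIndices, ignore_unavailable: true });
};
```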
LGTM, but left a comment about stringifying the error message in the new retry function.
    retryAttempt - 1
  }/${retries}`;
  logger.error(errorMessage);
  return new Error(JSON.stringify(errorMessage));
Probably don't need to JSON.stringify() errorMessage, do we? Seems like it can only be a string at this point ...
I will check the message.
This is actually intentional: if we wait 2-3 seconds between attempts, it will add to the time it takes to run the tests. By default, the result should come back within 2 seconds, so we will not have that many attempts. I saw that in the alertApi client the waiting time is 2 min with a retry every 0.5 seconds, which seemed a bit too much to me, so I set the waiting time to 1 min instead.
If we increase the wait time between retries, we need to decrease the total retries so that it doesn't increase the overall time of running the tests. In the case of a flaky test failure like "rule is active", the API call returns successfully (in 1-2 seconds or maybe less), but the rule status is "ok" and it never changes to "active" (as in the screen recording in my previous comment). Tbh 120 retries seems a bit much, but then I assume this will only happen for a flaky test, and I don't know if there is any downside to having many retries. So I'll leave it up to you whether this should be optimized.
@benakansara The different message that you saw was due to reaching the timeout first instead of the number of retries, so I've adjusted the timeout a bit in 535dab2. About changing the delay: the 2 seconds was about when the rule is ready, not about receiving the first API call, and I don't think it will exceed 5 seconds in total, so I wouldn't worry about it. I want to make sure the tests run as fast as they can, so let's keep it as it is and adjust it in the future if the need arises. (I will also create a ticket to replace our utilities with alertingApi, which would then change this to a 0.5s delay for 2 minutes 🙈 but maybe we can discuss with responseOps and come to an agreement.)
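To make the trade-off in this thread concrete (a sketch with assumed constant names, not code from the PR), the delay, attempt count, and overall timeout are tied together like this:

```ts
// Assumed names, for illustration only.
const RETRY_DELAY_MS = 500; // delay between attempts
const MAX_RETRIES = 120;    // attempts before giving up
// The overall timeout should be at least RETRY_DELAY_MS * MAX_RETRIES (~60s here),
// otherwise the timeout fires first and the "maximum retries reached" message never surfaces.
const TIMEOUT_MS = 60_000;

// With a fixed time budget, a longer delay simply means fewer attempts:
const attemptsFor = (delayMs: number, budgetMs: number = TIMEOUT_MS) =>
  Math.floor(budgetMs / delayMs);

console.log(attemptsFor(500));  // 120 - current setting
console.log(attemptsFor(3000)); // 20  - the "every 2-3 seconds, 10-20 retries" alternative
```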
LGTM
💛 Build succeeded, but was flaky
Related to #176401, #175776
Summary
This PR:
Flaky test runner
Current (after adding refresh index and adjusting timeout)
Old
After checking data is generated in metric threshold
Inspired by #173998, special thanks to @jpdjere and @dmlemeshko for their support and knowledge sharing.