
Improve logging in API integration test and replace pRetry with a method from retryService #178515

Merged

Conversation

maryam-saeidi (Member) commented on Mar 12, 2024

Related to #176401, #175776

Summary

This PR:

  • Improves logging (I've added debug logs to the helpers that make API requests, such as creating a data view; see the sketch after this list)
  • Uses retryService instead of pRetry
    • When pRetry throws an error with, say, 10 retries, it does not log the individual retry attempts, and we end up in the situation mentioned in this comment, item 3
[Before/After screenshots of the retry log output]
  • Attempts to fix flakiness in the rate reason message due to having different data
    [screenshot]
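
As a rough illustration of the logging change, here is a minimal sketch of an API helper with debug logs. This is not the PR's actual code; the helper shape is an assumption, using Kibana's data views API:

```ts
import type { ToolingLog } from '@kbn/tooling-log';
import type { SuperTest, Test } from 'supertest';

// Hypothetical helper: debug logs before and after the request make it
// easier to tell from CI output which setup step a failure belongs to.
export const createDataView = async (
  supertest: SuperTest<Test>,
  { title, logger }: { title: string; logger: ToolingLog }
) => {
  logger.debug(`Creating data view: ${title}`);
  const { body } = await supertest
    .post('/api/data_views/data_view')
    .set('kbn-xsrf', 'true')
    .send({ data_view: { title } })
    .expect(200);
  logger.debug(`Created data view: ${title}`);
  return body;
};
```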

Flaky test runner

Current (after adding refresh index and adjusting timeout)

Old

Inspired by #173998, special thanks to @jpdjere and @dmlemeshko for their support and knowledge sharing.

maryam-saeidi added the release_note:skip (Skip the PR/issue when compiling release notes) label on Mar 12, 2024
apmmachine (Contributor) commented:
🤖 GitHub comments


Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • /oblt-deploy-serverless : Deploy a serverless Kibana instance using the Observability test environments.
  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

```ts
start: 'now-10m',
end: 'now+5m',
metrics: [
  { name: 'system.network.in.bytes', method: 'linear', start: 0, end: 54000000 },
```
maryam-saeidi (Member, Author) commented:

I've adjusted the data generation schema and related time range (start, end) to fix the following issue:

[screenshot of the flaky rate reason message]

Special thanks to @simianhacker for helping with the math aspect of it!
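
For context on the math: the window above ('now-10m' to 'now+5m') spans 15 minutes, and the metric ramps linearly from 0 to 54,000,000 bytes. Assuming the rate condition evaluates bytes per second over the window (my reading, not stated in the thread), the expected value is:

$$\text{rate} = \frac{54{,}000{,}000\ \text{bytes}}{15 \times 60\ \text{s}} = 60{,}000\ \text{bytes/s}$$

A constant, predictable rate like this should make the reason message deterministic across runs.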

@maryam-saeidi maryam-saeidi marked this pull request as ready for review March 12, 2024 16:17
@maryam-saeidi maryam-saeidi requested review from a team as code owners March 12, 2024 16:17
```ts
import type { ToolingLog } from '@kbn/tooling-log';

/**
 * Copied from x-pack/test/security_solution_api_integration/test_suites/detections_response/utils/retry.ts
```
maryam-saeidi (Member, Author) commented:
Create a ticket to share this logic via retryService or a package or ...
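
For reference, a minimal sketch of the kind of wrapper that retry.ts provides, assuming the FTR retry service's tryForTime method; the option names here are illustrative, not a verbatim copy:

```ts
import type { ToolingLog } from '@kbn/tooling-log';
import type { RetryService } from '@kbn/ftr-common-functional-services';

interface RetryOptions<T> {
  test: () => Promise<T>; // the operation to retry
  utilityName: string; // makes the log lines traceable to a helper
  retryService: RetryService;
  timeout?: number; // total time budget in ms
  logger: ToolingLog;
}

// Unlike a bare pRetry call, every failed attempt is logged here, so the
// CI output shows why each retry happened instead of one opaque failure.
export const retry = async <T>({
  test,
  utilityName,
  retryService,
  timeout = 60_000,
  logger,
}: RetryOptions<T>): Promise<T> =>
  retryService.tryForTime(timeout, async () => {
    try {
      return await test();
    } catch (error) {
      logger.error(`Retrying ${utilityName}: ${error}`);
      throw error;
    }
  });
```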

```ts
});
await esDeleteAllIndices([ALERT_ACTION_INDEX, ...dataForgeIndices]);
await cleanup({ client: esClient, config: dataForgeConfig, logger });
});

// FLAKY: https://github.com/elastic/kibana/issues/175360
describe.skip('Rule creation', () => {
```
maryam-saeidi (Member, Author) commented:
I think this was fixed in #175479

benakansara (Contributor) left a comment:

I checked the test locally and it passes. 👍

In case of a test failure, I don't get the message about the maximum retry count being reached, as mentioned in the PR. Has that changed?

Also, in this case, 120 retries are made, one every 0.5 seconds. I think we could optimize this by retrying every 2-3 seconds and limiting retries to 10-20. Wdyt?

[screen recording: Screen.Recording.2024-03-18.at.12.49.11.mov]

```ts
import type { Client } from '@elastic/elasticsearch';
import { ALL_SAVED_OBJECT_INDICES } from '@kbn/core-saved-objects-server';

export const refreshSavedObjectIndices = async (es: Client) => {
  // Refresh indices to prevent a race condition between a write and subsequent read
  // operation. To fix it deterministically we refresh saved object indices and wait.
  await es.indices.refresh({ index: ALL_SAVED_OBJECT_INDICES });
};
```
benakansara (Contributor) commented:
Out of curiosity: does this only apply to saved object indices, or could we also add kbn-data-forge indices to make sure test data is available after indexing when the tests run?

maryam-saeidi (Member, Author) replied:
This only applies to the saved objects. When we update the rule SO after execution, we set refresh: false, so I added this refresh to ensure the data is searchable.
I am not sure whether similar logic applies to kbn-data-forge. It is worth checking, but let's do that outside of this PR :)
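
As a usage sketch of where the refresh fits (the createRule and getRuleStatus helpers below are hypothetical stand-ins, not the suite's actual helpers):

```ts
import expect from '@kbn/expect';

// Hypothetical test flow: refresh the saved object indices after the rule
// executes, so the read below sees the rule SO written with refresh: false.
it('reports the rule as active', async () => {
  const ruleId = await createRule({ /* ... */ }); // hypothetical helper
  await refreshSavedObjectIndices(es);
  const status = await getRuleStatus(ruleId); // hypothetical helper
  expect(status).to.eql('active');
});
```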

pmuellr (Member) left a comment:

LGTM, but I left a comment about stringifying the error message in the new retry function.

```ts
    retryAttempt - 1
  }/${retries}`;
  logger.error(errorMessage);
  return new Error(JSON.stringify(errorMessage));
```
pmuellr (Member) commented:
Probably don't need to JSON.stringify() errorMessage, do we? Seems like it can only be a string at this point ...
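
A sketch of the simplification being suggested: since errorMessage is already a string at this point, it can be passed straight through (JSON.stringify would only wrap it in escaped quotes):

```ts
logger.error(errorMessage);
return new Error(errorMessage);
```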

maryam-saeidi (Member, Author) replied:

@benakansara

> I checked the test locally and it passes. 👍
>
> In case of a test failure, I don't get the message about the maximum retry count being reached, as mentioned in the PR. Has that changed?

I will check the message.

> Also, in this case, 120 retries are made, one every 0.5 seconds. I think we could optimize this by retrying every 2-3 seconds and limiting retries to 10-20. Wdyt?

This is actually intentional: if we wait 2-3 seconds between attempts, it adds to the overall test runtime. In the normal case the result comes back within about 2 seconds, so we will not actually make that many attempts. In the alertingApi client the waiting time is 2 minutes with a retry every 0.5 seconds, which seemed a bit too much to me, so I've set the waiting time to 1 minute instead.
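
For the arithmetic behind those numbers (assuming the attempt count is roughly timeout divided by delay):

```ts
// 1-minute budget with a 0.5 s delay between attempts:
const timeoutMs = 60_000;
const retryDelayMs = 500;
const maxAttempts = timeoutMs / retryDelayMs; // 120 attempts
// The alertingApi client's 2-minute budget at the same delay would allow 240.
```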

benakansara (Contributor) replied:

> This is actually intentional: if we wait 2-3 seconds between attempts, it adds to the overall test runtime. In the normal case the result comes back within about 2 seconds, so we will not actually make that many attempts.

If we increase the wait time between retries, we need to decrease the total number of retries so that it doesn't increase the overall test runtime. In a flaky failure like "rule is active", the API call returns successfully (in 1-2 seconds or maybe less), but the rule status is "ok" and never changes to "active" (as in the screen recording in my previous comment). Tbh, 120 retries seems a bit much, but I assume this only happens for flaky tests, and I don't know if there is any downside to having many retries. So I'll leave it up to you whether this should be optimized.

maryam-saeidi (Member, Author) replied:

@benakansara The different message you saw was because the timeout was reached before the maximum number of retries, so I've adjusted the timeout a bit in 535dab2

About changing the delay: the 2 seconds refers to when the rule is ready, not to receiving the first API call, and I don't think it will exceed 5 seconds in total, so I wouldn't worry about it. I want the tests to run as fast as they can, so let's keep it as it is and adjust it in the future if the need arises. (I will also create a ticket to replace our utilities with alertingApi, which would change this to a 0.5s delay for 2 minutes 🙈, but maybe we can discuss it with ResponseOps and come up with an agreement.)

@maryam-saeidi maryam-saeidi enabled auto-merge (squash) March 19, 2024 11:34
benakansara (Contributor) left a comment:

LGTM

kibana-ci (Collaborator) commented:

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Labels
backport:skip (This commit does not require backporting), release_note:skip (Skip the PR/issue when compiling release notes), test-failure-flaky, v8.14.0