[kbn-ftr-common-functional-services] extend retry service #178660

Merged
merged 13 commits into from
Mar 19, 2024
Original file line number Diff line number Diff line change
@@ -10,17 +10,31 @@ import { FtrService } from '../ftr_provider_context';
import { retryForSuccess } from './retry_for_success';
import { retryForTruthy } from './retry_for_truthy';

interface TryWithRetriesOptions {
retryCount: number;
retryDelay?: number;
timeout?: number;
}

export class RetryService extends FtrService {
private readonly config = this.ctx.getService('config');
private readonly log = this.ctx.getService('log');

/**
* Use to retry block within {timeout} period and return block result.
* @param timeout retrying timeout
* @param block retriable action
* @param onFailureBlock optional action to run before the new retriable action attempt
* @param retryDelay optional delay before the new attempt
* @returns result from retriable action
*/
public async tryForTime<T>(
timeout: number,
block: () => Promise<T>,
onFailureBlock?: () => Promise<T>,
retryDelay?: number
) {
return await retryForSuccess(this.log, {
return await retryForSuccess<T>(this.log, {
timeout,
methodName: 'retry.tryForTime',
block,
@@ -43,6 +57,13 @@ export class RetryService extends FtrService {
});
}

/**
* Use to wait for block condition to be true
* @param description description for retriable action
* @param timeout retrying timeout
* @param block retriable action
* @param onFailureBlock optional action to run before the new retriable action attempt
*/
public async waitForWithTimeout(
description: string,
timeout: number,
@@ -71,4 +92,31 @@ export class RetryService extends FtrService {
onFailureBlock,
});
}

/**
* Use to retry block {options.retryCount} times within {options.timeout} period and return block result
* @param description description for retriable action
* @param block retriable action
* @param options options.retryCount for how many attempts to retry
* @param onFailureBlock optional action to run before the new retriable action attempt
* @returns result from retriable action
*/
public async tryWithRetries<T>(
description: string,
block: () => Promise<T>,
options: TryWithRetriesOptions,
onFailureBlock?: () => Promise<T>
): Promise<T> {
const { retryCount, timeout = this.config.get('timeouts.try'), retryDelay = 200 } = options;

return await retryForSuccess<T>(this.log, {
description,
timeout,
methodName: 'retry.tryWithRetries',
block,
onFailureBlock,
retryDelay,
retryCount,
});
}
}
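
To make the contract of `tryWithRetries` concrete, here is a minimal, self-contained sketch of a retry loop bounded by both an attempt count and an overall timeout. It is not the FTR implementation: `tryWithRetriesSketch`, `sleep`, and `demo` are hypothetical stand-ins, and the 10000 ms default timeout is an assumption (the real default comes from the `timeouts.try` config value). Only the option names and the 200 ms default delay are taken from the diff above.

```typescript
interface TryWithRetriesOptions {
  retryCount: number;
  retryDelay?: number;
  timeout?: number;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for RetryService.tryWithRetries, not the FTR code.
async function tryWithRetriesSketch<T>(
  description: string,
  block: () => Promise<T>,
  options: TryWithRetriesOptions
): Promise<T> {
  // Assumed default timeout; the real service reads it from the FTR config.
  const { retryCount, retryDelay = 200, timeout = 10000 } = options;
  const start = Date.now();
  let lastError: unknown;

  for (let attempt = 1; attempt <= retryCount; attempt++) {
    // Abort once the overall time budget is spent.
    if (Date.now() - start > timeout) {
      throw new Error(
        `${description}: reached timeout ${timeout} ms (last error: ${String(lastError)})`
      );
    }
    try {
      return await block();
    } catch (error) {
      lastError = error;
      // Wait before the next attempt, but not after the final one.
      if (attempt < retryCount) await sleep(retryDelay);
    }
  }
  throw new Error(
    `${description}: reached the limit of attempts: ${retryCount} out of ${retryCount} ` +
      `(last error: ${String(lastError)})`
  );
}

// Usage: a block that fails twice and succeeds on the third attempt.
async function demo(): Promise<string> {
  let calls = 0;
  return tryWithRetriesSketch(
    'flaky install',
    async () => {
      calls += 1;
      if (calls < 3) throw new Error(`attempt ${calls} failed`);
      return `succeeded after ${calls} attempts`;
    },
    { retryCount: 5, retryDelay: 10 }
  );
}
```

Unlike the real service, this sketch treats `retryCount` as mandatory and stops immediately when it is `0`; in the PR the attempt limit is opt-in inside `retryForSuccess`.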
Original file line number Diff line number Diff line change
@@ -13,9 +13,9 @@ const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

const returnTrue = () => true;

const defaultOnFailure = (methodName: string) => (lastError: Error | undefined) => {
const defaultOnFailure = (methodName: string) => (lastError: Error | undefined, reason: string) => {
throw new Error(
`${methodName} timeout${lastError ? `: ${lastError.stack || lastError.message}` : ''}`
`${methodName} ${reason}\n${lastError ? `${lastError.stack || lastError.message}` : ''}`
);
};
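
The updated `defaultOnFailure` takes a `reason` string and puts the last error on its own line. The sketch below shows the resulting message shape. `buildFailureMessage` is a hypothetical helper that mirrors the template from the diff but returns the string instead of throwing, and the description and error text are made up for illustration.

```typescript
// Hypothetical helper mirroring the message template in the diff above.
// It returns the message instead of throwing so the format is easy to inspect.
function buildFailureMessage(methodName: string, reason: string, lastError?: Error): string {
  const details = lastError ? `${lastError.stack || lastError.message}` : '';
  return `${methodName} ${reason}\n${details}`;
}

// A plain object without a stack keeps the output deterministic.
const lastError: Error = { name: 'Error', message: 'socket hang up' };

const message = buildFailureMessage(
  'retry.tryWithRetries',
  "reached the limit of attempts waiting for 'install package': 2 out of 2",
  lastError
);

console.log(message);
// retry.tryWithRetries reached the limit of attempts waiting for 'install package': 2 out of 2
// socket hang up
```

Note that with this template the message ends with a trailing newline when there is no last error.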

@@ -44,32 +44,51 @@ interface Options<T> {
onFailureBlock?: () => Promise<T>;
onFailure?: ReturnType<typeof defaultOnFailure>;
accept?: (v: T) => boolean;
description?: string;
retryDelay?: number;
retryCount?: number;
}

export async function retryForSuccess<T>(log: ToolingLog, options: Options<T>) {
Contributor

Have you investigated whether p-retry is applicable here? Additionally, an exponential backoff may work better.

Member Author
dmlemeshko, Mar 19, 2024

We had a chat with @maryam-saeidi about p-retry and the issues she faced while using it. This PR is actually meant to help migrate away from p-retry.
I think the p-retry implementation is more complex than the FTR retry service, and since we have never had stability issues with FTR retry (folks question the logging and interface, understood :) ), I would keep the logic as simple as possible.
As for logging and capabilities, I still believe we can collaborate on PRs like this one to make improvements.

Contributor

@dmlemeshko Could you share the issues you found with p-retry?

My observation is that p-retry is quite a popular library, with 14M weekly downloads, so a lot of bugs should have been fixed already. On top of that, p-retry provides flexible configuration, such as an exponential backoff for the retry delay, while it can still be configured linearly. It also allows aborting retries. Checking Kibana's codebase, it's easy to see that it's used in multiple packages and plugins.

A comment on retryForSuccess that explains it and its improvements over p-retry would be really helpful for future maintenance.

Member

@maximpn Here is the ticket where I explained the challenge we faced. In our case, we didn't have any logs about the retry attempts, which made it hard to tell whether a single request had timed out or whether there were multiple requests and the overall attempt failed because of the exponential backoff. (We tested it locally by throwing an error in the test function and setting the retry count to 10.)

In general, I think it would be better to only rely on one library/package for retry purposes, for easier maintenance and a better understanding of the tool.

Contributor

@maryam-saeidi thanks for the details 👍

> we didn't have any log about the retry attempts

Yes, that sounds like a bug. Or the exponential backoff delay was too high, so the test was interrupted by a timeout. By the way, you could configure p-retry to use a constant retry interval.
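
The trade-off discussed in this thread, a constant retry interval versus an exponential backoff, can be sketched by computing both delay schedules. This is an illustration only: `backoffSchedule` is a hypothetical helper, the numbers are made up, and the formula `minTimeout * factor^attempt` capped at `maxTimeout` is assumed to match the shape documented by the `retry` package that p-retry builds on (randomization is omitted).

```typescript
interface BackoffOptions {
  retries: number;
  minTimeout: number; // delay before the first retry, in ms
  factor: number; // 1 gives a constant interval, 2 doubles the delay each time
  maxTimeout?: number; // optional cap on a single delay
}

// Hypothetical helper: returns the delay before each retry, in ms.
function backoffSchedule(options: BackoffOptions): number[] {
  const { retries, minTimeout, factor, maxTimeout = Infinity } = options;
  return Array.from({ length: retries }, (_, attempt) =>
    Math.min(minTimeout * Math.pow(factor, attempt), maxTimeout)
  );
}

const total = (delays: number[]) => delays.reduce((sum, d) => sum + d, 0);

const constant = backoffSchedule({ retries: 5, minTimeout: 200, factor: 1 });
const exponential = backoffSchedule({ retries: 5, minTimeout: 200, factor: 2 });

console.log(constant, total(constant));
// constant delays: 200, 200, 200, 200, 200 (1000 ms in total)
console.log(exponential, total(exponential));
// exponential delays: 200, 400, 800, 1600, 3200 (6200 ms in total)
```

With exponential growth, the waiting time alone can exceed a test timeout even when each request is fast, and without per-attempt logs that is hard to distinguish from a single slow request.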

const {
description,
timeout,
methodName,
block,
onFailureBlock,
onFailure = defaultOnFailure(methodName),
accept = returnTrue,
retryDelay = 502,
retryCount,
} = options;
const { onFailure = defaultOnFailure(methodName) } = options;

const start = Date.now();
const criticalWebDriverErrors = ['NoSuchSessionError', 'NoSuchWindowError'];
let lastError;
let attemptCounter = 0;
const addText = (str: string | undefined) => (str ? ` waiting for '${str}'` : '');

while (true) {
// Aborting if no retry attempts are left (opt-in)
if (retryCount && ++attemptCounter > retryCount) {
onFailure(
lastError,
// optionally extend error message with description
`reached the limit of attempts${addText(description)}: ${
attemptCounter - 1
} out of ${retryCount}`
);
}
// Aborting if timeout is reached
if (Date.now() - start > timeout) {
await onFailure(lastError);
throw new Error('expected onFailure() option to throw an error');
} else if (lastError && criticalWebDriverErrors.includes(lastError.name)) {
// Aborting retry since WebDriver session is invalid or browser window is closed
onFailure(lastError, `reached timeout ${timeout} ms${addText(description)}`);
}
// Aborting if WebDriver session is invalid or browser window is closed
if (lastError && criticalWebDriverErrors.includes(lastError.name)) {
throw new Error('WebDriver session is invalid, retry was aborted');
} else if (lastError && onFailureBlock) {
}
// Run opt-in onFailureBlock before the next attempt
if (lastError && onFailureBlock) {
const before = await runAttempt(onFailureBlock);
if ('error' in before) {
log.debug(`--- onRetryBlock error: ${before.error.message}`);
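
The opt-in attempt limit in the loop above relies on the guard `retryCount && ++attemptCounter > retryCount`. The following sketch simulates only that guard to show how many attempts it allows. `attemptsBeforeAbort` is a hypothetical helper that assumes every attempt fails and ignores the timeout check.

```typescript
// Hypothetical simulation of the opt-in guard from the loop above:
//   if (retryCount && ++attemptCounter > retryCount) { ...abort... }
// Returns how many attempts run before the guard aborts, or `maxIterations`
// if the guard never trips.
function attemptsBeforeAbort(retryCount: number | undefined, maxIterations: number): number {
  let attemptCounter = 0;
  let attemptsRun = 0;
  for (let i = 0; i < maxIterations; i++) {
    if (retryCount && ++attemptCounter > retryCount) {
      return attemptsRun; // the real loop calls onFailure here, which throws
    }
    attemptsRun++; // the retriable block would run here
  }
  return attemptsRun;
}

console.log(attemptsBeforeAbort(3, 10)); // 3
console.log(attemptsBeforeAbort(undefined, 10)); // 10
console.log(attemptsBeforeAbort(0, 10)); // 10
```

Because the guard short-circuits on a falsy `retryCount`, both `undefined` and `0` leave the limit disabled, so only the timeout bounds the loop in those cases.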
Original file line number Diff line number Diff line change
@@ -61,8 +61,7 @@ export default ({ getService }: FtrProviderContext): void => {
es,
supertest,
'99.0.0',
retry,
log
retry
);

// As opposed to "registry"
Original file line number Diff line number Diff line change
@@ -49,8 +49,7 @@ export default ({ getService }: FtrProviderContext): void => {
const fleetPackageInstallationResponse = await installPrebuiltRulesPackageViaFleetAPI(
es,
supertest,
retry,
log
retry
);

expect(fleetPackageInstallationResponse.items.length).toBe(1);
Original file line number Diff line number Diff line change
@@ -49,7 +49,6 @@ export default ({ getService }: FtrProviderContext): void => {
supertest,
overrideExistingPackage: true,
retryService: retry,
log,
});

// Verify that status is updated after package installation
Original file line number Diff line number Diff line change
@@ -107,8 +107,7 @@ export default ({ getService }: FtrProviderContext): void => {
es,
supertest,
previousVersion,
retry,
log
retry
);

expect(installPreviousPackageResponse._meta.install_source).toBe('registry');
@@ -161,8 +160,7 @@ export default ({ getService }: FtrProviderContext): void => {
es,
supertest,
currentVersion,
retry,
log
retry
);
expect(installLatestPackageResponse.items.length).toBeGreaterThanOrEqual(0);

Original file line number Diff line number Diff line change
@@ -22,5 +22,4 @@ export * from './wait_for_index_to_populate';
export * from './get_stats';
export * from './get_detection_metrics_from_body';
export * from './get_stats_url';
export * from './retry';
export * from './combine_to_ndjson';

This file was deleted.

Original file line number Diff line number Diff line change
@@ -9,9 +9,7 @@ import type SuperTest from 'supertest';
import { InstallPackageResponse } from '@kbn/fleet-plugin/common/types';
import { epmRouteService } from '@kbn/fleet-plugin/common';
import { RetryService } from '@kbn/ftr-common-functional-services';
import type { ToolingLog } from '@kbn/tooling-log';
import expect from 'expect';
import { retry } from '../../retry';
import { refreshSavedObjectIndices } from '../../refresh_index';

const MAX_RETRIES = 2;
@@ -29,11 +27,11 @@ const ATTEMPT_TIMEOUT = 120000;
export const installPrebuiltRulesPackageViaFleetAPI = async (
es: Client,
supertest: SuperTest.SuperTest<SuperTest.Test>,
retryService: RetryService,
log: ToolingLog
retryService: RetryService
): Promise<InstallPackageResponse> => {
const fleetResponse = await retry<InstallPackageResponse>({
test: async () => {
const fleetResponse = await retryService.tryWithRetries<InstallPackageResponse>(
installPrebuiltRulesPackageViaFleetAPI.name,
async () => {
const testResponse = await supertest
.post(`/api/fleet/epm/packages/security_detection_engine`)
.set('kbn-xsrf', 'xxxx')
@@ -46,12 +44,11 @@ export const installPrebuiltRulesPackageViaFleetAPI = async (

return testResponse.body;
},
utilityName: installPrebuiltRulesPackageViaFleetAPI.name,
retryService,
retries: MAX_RETRIES,
timeout: ATTEMPT_TIMEOUT,
log,
});
{
retryCount: MAX_RETRIES,
timeout: ATTEMPT_TIMEOUT,
}
);

await refreshSavedObjectIndices(es);

@@ -71,11 +68,11 @@ export const installPrebuiltRulesPackageByVersion = async (
es: Client,
supertest: SuperTest.SuperTest<SuperTest.Test>,
version: string,
retryService: RetryService,
log: ToolingLog
retryService: RetryService
): Promise<InstallPackageResponse> => {
const fleetResponse = await retry<InstallPackageResponse>({
test: async () => {
const fleetResponse = await retryService.tryWithRetries<InstallPackageResponse>(
installPrebuiltRulesPackageByVersion.name,
async () => {
const testResponse = await supertest
.post(epmRouteService.getInstallPath('security_detection_engine', version))
.set('kbn-xsrf', 'xxxx')
@@ -88,12 +85,11 @@ export const installPrebuiltRulesPackageByVersion = async (

return testResponse.body;
},
utilityName: installPrebuiltRulesPackageByVersion.name,
retryService,
retries: MAX_RETRIES,
timeout: ATTEMPT_TIMEOUT,
log,
});
{
retryCount: MAX_RETRIES,
timeout: ATTEMPT_TIMEOUT,
}
);

await refreshSavedObjectIndices(es);
