Add jitter/backoff times to job retries #116

Merged
aef- merged 1 commit into develop from aef-/add-randomness-and-longertimes-to-job-retries
Dec 17, 2020

Collaborator

@aef- aef- commented Dec 15, 2020

ACCESS may be submitting tasks by the hundreds. Jobs are failing arbitrarily because they cannot be submitted to LSF ("Failed to submit job"). This change prevents the failed jobs from all retrying at once. It may require larger backoff times: currently the delays should be 1 minute, 2 minutes, and 4 minutes. Maybe we want 2, 4, and 8 minutes instead, or to extend the max retries to 4 or 5. Jitter adds randomness to the backoff delays, so if 20 jobs fail at once, they won't all retry at the same time.
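To make the schedules discussed above concrete, here is a minimal sketch of exponential backoff with jitter. It is not the code from this PR: the function name `backoff_delays`, its parameters, the 600-second cap, and the "full jitter" scheme (a uniform draw between 0 and the nominal delay) are all assumptions chosen for illustration.

```python
import random


def backoff_delays(base_seconds=60, max_retries=3, cap_seconds=600,
                   jitter=True, rng=None):
    """Return one delay (in seconds) per retry attempt.

    Hypothetical helper for illustration only. The nominal delay doubles
    on each attempt and is capped at cap_seconds. With jitter enabled,
    each delay is drawn uniformly from [0, nominal delay], so jobs that
    fail together are unlikely to retry together.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap_seconds, base_seconds * 2 ** attempt)
        delays.append(rng.randint(0, ceiling) if jitter else ceiling)
    return delays


# Nominal schedules without jitter, in seconds.
print(backoff_delays(jitter=False))                    # [60, 120, 240]
print(backoff_delays(base_seconds=120, jitter=False))  # [120, 240, 480]

# With jitter, 20 jobs failing at once get different first-retry delays.
first_retry = [backoff_delays(max_retries=1)[0] for _ in range(20)]
print(sorted(first_retry))
```

Note that under this kind of jitter the nominal values act as upper bounds, not exact waits, which is one reason a larger base (2, 4, 8 minutes) or more retries might be worth considering.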

@ionox0
Member

ionox0 commented Dec 15, 2020

I think retry_jitter is applied by default when using retry_backoff:

"By default, this exponential backoff will also introduce random jitter to avoid having all the tasks run at the same moment. It will also cap the maximum backoff delay to 10 minutes. All these settings can be customized via options documented below."
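The option names in this comment match Celery's task retry options, so a configuration along the lines discussed might look like the sketch below. This is an assumption, not the contents of the merged commit: `LSFSubmitError`, `RETRY_OPTIONS`, and the `submit_job` task in the comment are hypothetical names, and the values simply mirror the 1, 2, 4 minute schedule from the PR description.

```python
class LSFSubmitError(Exception):
    """Hypothetical stand-in for a 'Failed to submit job' error."""


# Keyword arguments for a Celery task decorator (illustrative values).
RETRY_OPTIONS = {
    "autoretry_for": (LSFSubmitError,),  # retry automatically on this error
    "retry_backoff": 60,        # delay factor in seconds: 60, 120, 240, ...
    "retry_backoff_max": 600,   # cap of 10 minutes, matching the quoted default
    "retry_jitter": True,       # stated explicitly; reportedly already the default
    "max_retries": 3,           # raise to 4 or 5 to keep retrying longer
}

# Applied to a task, assuming `app` is an existing celery.Celery instance:
#
#   @app.task(**RETRY_OPTIONS)
#   def submit_job(job_id):
#       ...
```

If jitter is indeed on by default, setting `retry_jitter` explicitly changes nothing at runtime, but it does document the intent for future readers.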

@ionox0
Member

ionox0 commented Dec 15, 2020

But either way this LGTM

@aef- aef- merged commit 92bb2b3 into develop Dec 17, 2020
@aef- aef- deleted the aef-/add-randomness-and-longertimes-to-job-retries branch December 17, 2020 18:40
@nikhil nikhil restored the aef-/add-randomness-and-longertimes-to-job-retries branch December 21, 2020 20:52
@nikhil nikhil deleted the aef-/add-randomness-and-longertimes-to-job-retries branch December 21, 2020 20:52
nikhil added a commit that referenced this pull request Dec 21, 2020
3 participants