
Do not fail retry of scale up operation #511

Merged: 3 commits, Mar 14, 2024

deepthidevaki
Contributor

A scale up operation can sometimes take longer, depending on the size of the data in each partition. By default the job is locked for 5 minutes and retried afterwards. If the scale up operation does not complete within these 5 minutes, the retry fails because there is already an operation in progress. This results in an incident after 3 retries. A manual retry to resolve the incident also fails, because the command fails when the clusterSize is already the requested one.

This PR fixes this as follows:

  1. Do not fail the scale up command if clusterSize is already the requested one.
  2. Do not wait for the operation to complete in the scale up command. Instead, run two commands separately: scale and wait. This way only wait has to be retried. Since it is a query, it can be safely retried.
  3. To allow using wait in chaos experiments and e2e tests, allow it to run without specifying a changeId. When no changeId is specified, it reads the changeId from the pendingChange or lastChange.
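
To illustrate the three points above, here is a minimal Go sketch. It is not the code from this PR: the `Topology` and `ClusterChange` types, the `"COMPLETED"` status string, and all function names are hypothetical and only model the behaviour described here (skip the scale request when the size already matches, and let wait resolve the changeId from pendingChange or lastChange).

```go
package main

import (
	"errors"
	"fmt"
)

// ClusterChange is a hypothetical record of a topology change.
type ClusterChange struct {
	ID     int64
	Status string // assumed values: "IN_PROGRESS", "COMPLETED"
}

// Topology is a hypothetical, simplified view of the cluster topology.
type Topology struct {
	ClusterSize   int
	PendingChange *ClusterChange
	LastChange    *ClusterChange
}

var errNoChange = errors.New("no pendingChange or lastChange found")

// needsScaleRequest reports whether a scale request has to be sent.
// When clusterSize is already the requested one, the command succeeds
// without sending anything, so a retry after completion does not fail.
func needsScaleRequest(t Topology, requested int) bool {
	return t.ClusterSize != requested
}

// resolveChangeID returns the explicit changeID if one was given (> 0).
// Otherwise it falls back to pendingChange, then lastChange.
func resolveChangeID(t Topology, changeID int64) (int64, error) {
	if changeID > 0 {
		return changeID, nil
	}
	if t.PendingChange != nil {
		return t.PendingChange.ID, nil
	}
	if t.LastChange != nil {
		return t.LastChange.ID, nil
	}
	return 0, errNoChange
}

// isChangeCompleted is a pure query on the topology, so the wait step
// can call it repeatedly and be retried safely.
func isChangeCompleted(t Topology, changeID int64) bool {
	if t.PendingChange != nil && t.PendingChange.ID == changeID {
		return false
	}
	return t.LastChange != nil &&
		t.LastChange.ID == changeID &&
		t.LastChange.Status == "COMPLETED"
}

func main() {
	// Topology while a scale up to 6 brokers is still running.
	running := Topology{
		ClusterSize:   3,
		PendingChange: &ClusterChange{ID: 7, Status: "IN_PROGRESS"},
	}
	id, err := resolveChangeID(running, 0)
	fmt.Println("send scale request:", needsScaleRequest(running, 6))
	fmt.Println("wait for change:", id, "error:", err)
	fmt.Println("completed:", isChangeCompleted(running, id))
}
```

Because the wait step only reads the topology, retrying it after the 5 minute job lock expires does not trigger a second scale request.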

Add the wait as a separate step, because it can take longer than the default 5 minutes. When the whole operation is retried,
the scale up request fails because there is an ongoing change. By splitting it up into two steps, only the query has to be retried.
@deepthidevaki deepthidevaki removed the request for review from ChrisKujawa March 14, 2024 11:11
Member

@lenaschoenburg lenaschoenburg left a comment

🎉 Makes sense to me.

@deepthidevaki deepthidevaki merged commit c62497a into main Mar 14, 2024
2 checks passed
@deepthidevaki deepthidevaki deleted the dd-do-not-fail-retry branch March 14, 2024 12:09
2 participants