Handle error on disk.dump Salt call #2493
Labels: complexity:medium (something that requires one or a few days to fix), kind:bug (something isn't working), topic:storage (issues related to storage)

Comments
slaperche-scality added the complexity:medium, kind:bug, and topic:storage labels on Apr 30, 2020
I've updated the original message with the information we gathered during troubleshooting.
slaperche-scality changed the title from "Automatic error handler for Salt" to "Handle error on disk.dump Salt call" on Apr 30, 2020
slaperche-scality added a commit that referenced this issue on May 14, 2020:
Until now, we only had a single Salt async job per step (one to prepare, one to finalize), so we knew which job was running from the step we were in. This is going to change: since we want to call `disk.dump` asynchronously, the prepare step will have two async jobs. To tell them apart, we now carry the job name alongside its ID and bundle both into a JobHandle. This commit only introduces the JobHandle and uses it in place of the bare job ID; making `disk.dump` async will be done in another commit.
Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
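As an illustration, a JobHandle along these lines could be a small value type pairing the Salt job ID with a logical name. The Go sketch below is an assumption about its shape (the field and example names are made up), not the operator's actual code:

```go
package salt

import "fmt"

// JobHandle identifies an in-flight Salt async job.
// Carrying a logical name alongside the Salt job ID lets a step that
// spawns several jobs (e.g. "prepare" with both a format job and a
// disk.dump job) tell their results apart.
type JobHandle struct {
	Name string // logical name, e.g. "PrepareVolume" or "GetVolumeSize" (hypothetical)
	ID   string // Salt job ID (JID) returned by the async call
}

func (h JobHandle) String() string {
	return fmt.Sprintf("%s/%s", h.Name, h.ID)
}
```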
slaperche-scality added a commit that referenced this issue on May 14, 2020:
This allows acting on the result of the job. Unused for now, but we will exploit it when `disk.dump` is made async.
Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
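A sketch of what "acting on the result" could look like, building on the JobHandle above. The Client type, its fetchJob helper, and the error value are all assumptions rather than the operator's actual API:

```go
package salt

import (
	"context"
	"errors"
)

// ErrJobStillRunning is returned while the Salt job has not finished yet.
var ErrJobStillRunning = errors.New("salt job still running")

// jobStatus is the (assumed) answer of a job lookup against the Salt API.
type jobStatus struct {
	Done   bool
	Result map[string]interface{}
}

// Client wraps access to the Salt API; fetchJob stands in for the real
// query against /jobs/<jid> (details omitted in this sketch).
type Client struct {
	fetchJob func(ctx context.Context, jid string) (jobStatus, error)
}

// PollJob checks an async Salt job and, once it has completed, returns its
// result payload so the caller can act on it (e.g. read the size reported
// by disk.dump).
func (c *Client) PollJob(ctx context.Context, job JobHandle) (map[string]interface{}, error) {
	status, err := c.fetchJob(ctx, job.ID)
	if err != nil {
		return nil, err
	}
	if !status.Done {
		return nil, ErrJobStillRunning
	}
	return status.Result, nil
}
```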
slaperche-scality added a commit that referenced this issue on May 15, 2020:
TODO
Closes: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue on May 25, 2020:
The sync call to `disk.dump` was flaky for two reasons:
- we had a 1s timeout but no logic to retry on a timeout error, so volume creation could fail on "slow" platforms
- the Salt API does not handle errors triggered by concurrent execution of Salt states well when the call is synchronous (an error 500 with no details is returned), so we could not retry
For those reasons, we now use an async call to retrieve the volume size.
Closes: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
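Assuming the rest_cherrypy flavour of the Salt API, the async submission could look roughly like the sketch below: the request returns a job ID immediately instead of blocking. The function name, target handling, and error handling here are illustrative only.

```go
package salt

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// asyncResponse mirrors the (assumed) shape of a rest_cherrypy answer to a
// local_async call: {"return": [{"jid": "...", "minions": [...]}]}.
type asyncResponse struct {
	Return []struct {
		JID     string   `json:"jid"`
		Minions []string `json:"minions"`
	} `json:"return"`
}

// SubmitDiskDump starts `disk.dump <device>` on the target node through the
// Salt API "local_async" client and returns a handle on the job. The request
// returns immediately with a JID, so a busy master or a concurrently running
// state no longer surfaces as an opaque HTTP 500 on a blocking call: we can
// keep polling (and retrying) until the result shows up.
func SubmitDiskDump(ctx context.Context, apiURL, token, node, device string) (*JobHandle, error) {
	body, err := json.Marshal(map[string]interface{}{
		"client": "local_async",
		"tgt":    node,
		"fun":    "disk.dump",
		"arg":    []string{device},
	})
	if err != nil {
		return nil, err
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, apiURL+"/", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Auth-Token", token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("salt-api returned HTTP %d", resp.StatusCode)
	}

	var parsed asyncResponse
	if err := json.NewDecoder(resp.Body).Decode(&parsed); err != nil {
		return nil, err
	}
	if len(parsed.Return) == 0 || parsed.Return[0].JID == "" {
		return nil, fmt.Errorf("no job ID in salt-api response")
	}
	return &JobHandle{Name: "GetVolumeSize", ID: parsed.Return[0].JID}, nil
}
```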
slaperche-scality added a commit that referenced this issue on May 25, 2020:
The Salt client code already handles the retcode; the caller is only interested in the returned payload.
Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
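A minimal sketch of that separation of concerns, assuming a per-minion result shape made of a retcode plus a payload (the JSON keys and names here are assumptions):

```go
package salt

import "fmt"

// minionResult is the per-minion answer shape assumed in this sketch:
// a retcode plus the payload returned by the Salt function.
type minionResult struct {
	Retcode int                    `json:"retcode"`
	Return  map[string]interface{} `json:"ret"`
}

// payload keeps retcode handling inside the Salt client: callers either get
// the returned data or an error, never a raw retcode to interpret.
func payload(res minionResult) (map[string]interface{}, error) {
	if res.Retcode != 0 {
		return nil, fmt.Errorf("salt command failed (retcode=%d): %v", res.Retcode, res.Return)
	}
	return res.Return, nil
}
```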
slaperche-scality added a commit that referenced this issue on Jun 4, 2020:
Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
Component: operator

What happened:
Sometimes, Volume creation fails with:
If we enable debug logging on the Salt API side, we get:
The problem may be related to a busy Salt master and/or more than one state trying to run at once.

What was expected:
The Operator should retry this operation and the Volume should eventually be created.

Steps to reproduce:
@gdemonet managed to reproduce it by creating 10 sparseloop volumes at once on the same node.

Resolution proposal:
Use an async call for `disk.dump` (it's currently sync); that way we benefit from the auto-retry logic we've already implemented.
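To illustrate how an async `disk.dump` could plug into the operator's existing requeue-based retry logic, here is a hedged sketch. Every type and field name below is hypothetical, the `getsize64` key is assumed to be part of `disk.dump`'s output, and the SubmitDiskDump/PollJob methods correspond only roughly to the earlier sketches (which also define JobHandle and ErrJobStillRunning).

```go
package salt

import (
	"context"
	"errors"
	"fmt"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// saltClient is a stand-in for the operator's Salt client.
type saltClient interface {
	SubmitDiskDump(ctx context.Context, node, device string) (*JobHandle, error)
	PollJob(ctx context.Context, job JobHandle) (map[string]interface{}, error)
}

// volume is a minimal stand-in for the Volume custom resource.
type volume struct {
	NodeName string
	Device   string
	Job      *JobHandle // pending async job, remembered between reconciliations
	Size     string
}

type volumeReconciler struct {
	salt saltClient
}

// checkVolumeSize leans on the reconcile loop for retries: the async
// disk.dump job is submitted once, remembered, and polled on each requeue
// until it completes, instead of blocking on a 1s sync call with no retry.
func (r *volumeReconciler) checkVolumeSize(ctx context.Context, vol *volume) (reconcile.Result, error) {
	if vol.Job == nil {
		job, err := r.salt.SubmitDiskDump(ctx, vol.NodeName, vol.Device)
		if err != nil {
			// Transient failure (busy master, concurrent state run): retry later.
			return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
		}
		vol.Job = job
		return reconcile.Result{Requeue: true}, nil
	}

	result, err := r.salt.PollJob(ctx, *vol.Job)
	if errors.Is(err, ErrJobStillRunning) {
		return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
	}
	if err != nil {
		// The job itself failed: forget it so the next reconcile resubmits it.
		vol.Job = nil
		return reconcile.Result{RequeueAfter: 10 * time.Second}, nil
	}
	// disk.dump is assumed to expose the device size under "getsize64".
	vol.Size = fmt.Sprintf("%v", result["getsize64"])
	return reconcile.Result{}, nil
}
```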