Handle error on disk.dump Salt call #2493

Closed
ftalbart opened this issue Apr 27, 2020 · 1 comment · Fixed by #2571
Assignees: slaperche-scality
Labels: complexity:medium, kind:bug, topic:storage

Comments

ftalbart commented Apr 27, 2020

Component:

operator

What happened:

Sometimes, Volume creation fails with:

Message:               call to GetVolumeSize failed: disk.dump failed (target=worker-3, path=/dev/disk/by-uuid/567c960e-dcba-4cef-b5b2-4d8bb0a3446e): Salt API failed with code 500: {"status": 500, "return": "An unexpected error occurred"}

If we enable debug logging on the Salt API side, we get:

[DEBUG   ] Trying to connect to: tcp://192.168.1.40:4506
[DEBUG   ] Error while processing request for: /
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/salt/netapi/rest_cherrypy/app.py", line 858, in hypermedia_handler
    ret = cherrypy.serving.request._hypermedia_inner_handler(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 60, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/salt/netapi/rest_cherrypy/app.py", line 1300, in POST
    token=cherrypy.session.get('token')))
  File "/usr/lib/python2.7/site-packages/salt/netapi/rest_cherrypy/app.py", line 1192, in exec_lowstate
    ret = self.api.run(chunk)
  File "/usr/lib/python2.7/site-packages/salt/netapi/__init__.py", line 75, in run
    return l_fun(*f_call.get('args', ()), **f_call.get('kwargs', {}))
  File "/usr/lib/python2.7/site-packages/salt/netapi/__init__.py", line 103, in local
    return local.cmd(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/salt/client/__init__.py", line 754, in cmd
    **kwargs)
  File "/usr/lib/python2.7/site-packages/salt/client/__init__.py", line 358, in run_job
    raise SaltClientError(general_exception)
SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
[INFO    ] 10.233.132.74 - - [27/Apr/2020:15:56:05] "POST /?timeout=1 HTTP/1.1" 500 57 "" "Go-http-client/1.1"

The problem may be related to a busy Salt master and/or more than one state trying to run at once.

What was expected:

The Operator should retry this operation and eventually the Volume should be created.

Steps to reproduce:

@gdemonet managed to reproduce it by creating 10 sparseloop volumes at once on the same node.

Resolution proposal:

Use an async call for `disk.dump` (it is currently sync); that way we benefit from the auto-retry logic we have already implemented.
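
A minimal sketch of what such an async call could look like from the operator side (illustrative only: the `submitDiskDumpAsync` helper, the API URL and the token handling are assumptions, not the operator's actual code; the `local_async` client and the `POST /` / `GET /jobs/<jid>` layout come from Salt's rest_cherrypy module). The request returns a job ID immediately instead of blocking until the minion answers:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// submitDiskDumpAsync posts a `local_async` lowstate chunk to Salt API and
// returns the job ID (jid), which can later be polled via GET /jobs/<jid>.
func submitDiskDumpAsync(apiURL, token, target, device string) (string, error) {
	payload := map[string]interface{}{
		"client": "local_async", // async: Salt API answers with a jid right away
		"tgt":    target,
		"fun":    "disk.dump",
		"arg":    []string{device},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		return "", err
	}

	req, err := http.NewRequest(http.MethodPost, apiURL, bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Auth-Token", token) // assumes an already-authenticated session token

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	// rest_cherrypy wraps async submissions as {"return": [{"jid": "...", "minions": [...]}]}.
	var result struct {
		Return []struct {
			Jid string `json:"jid"`
		} `json:"return"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return "", err
	}
	if len(result.Return) == 0 || result.Return[0].Jid == "" {
		return "", fmt.Errorf("no jid in Salt API response (HTTP %d)", resp.StatusCode)
	}
	return result.Return[0].Jid, nil
}

func main() {
	jid, err := submitDiskDumpAsync(
		"https://salt-api:8000", "TOKEN",
		"worker-3", "/dev/disk/by-uuid/567c960e-dcba-4cef-b5b2-4d8bb0a3446e",
	)
	if err != nil {
		fmt.Println("submit failed:", err)
		return
	}
	fmt.Println("submitted disk.dump, jid:", jid)
}
```

The returned jid can then be polled on each reconcile until the minion reports a result, which is where the existing auto-retry logic applies.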

@slaperche-scality added the complexity:medium, kind:bug and topic:storage labels on Apr 30, 2020
@slaperche-scality self-assigned this on Apr 30, 2020
@slaperche-scality (Contributor) commented:

I've updated the original message using the information we got during the troubleshooting.

@slaperche-scality changed the title from "Automatic error handler for Salt" to "Handle error on disk.dump Salt call" on Apr 30, 2020
slaperche-scality added a commit that referenced this issue May 14, 2020
Until now, we only had a single Salt async job per step (one to prepare,
one to finalize), so we knew which job was running from the step we were
in.

This is going to change: we want to call `disk.dump` asynchronously, so
there will be two async jobs for the prepare step. To tell which is
which, we now carry the job name alongside its ID and bundle both into a
JobHandle.

This commit only introduces the JobHandle and uses it instead of the
JobID. Making `disk.dump` async will be done in another commit.

Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
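
A rough sketch of the JobHandle idea described in that commit (the type, field and method names here are illustrative assumptions, not the operator's actual definitions):

```go
package salt

// JobHandle identifies an in-flight Salt async job by both its purpose
// (e.g. "prepare", "finalize", "disk.dump") and the jid returned by Salt API,
// so the reconciliation loop can tell which result belongs to which call.
type JobHandle struct {
	Name string // logical name of the job (which step/call it belongs to)
	ID   string // Salt job ID (jid) to poll for the result
}

// newJob builds a handle for a freshly submitted job.
func newJob(name, jid string) JobHandle {
	return JobHandle{Name: name, ID: jid}
}

// InProgress reports whether a job is currently tracked by this handle.
func (h JobHandle) InProgress() bool {
	return h.ID != ""
}
```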
slaperche-scality added a commit that referenced this issue May 14, 2020
This allows acting on the result of the job.
It is unused for now, but we will use it once `disk.dump` is made async.

Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
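
To illustrate what "acting on the result" might look like, here is a hedged sketch of polling an async job through Salt API's `GET /jobs/<jid>` endpoint (the `pollJob` helper and its parameters are assumptions for illustration; only the endpoint and the response envelope come from Salt's rest_cherrypy module):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// pollJob fetches the result of a Salt async job. It returns (nil, nil) while
// the targeted minion has not answered yet, so callers can simply retry later.
func pollJob(apiURL, token, jid, minion string) (json.RawMessage, error) {
	req, err := http.NewRequest(http.MethodGet, apiURL+"/jobs/"+jid, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Auth-Token", token) // assumes an authenticated session token

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// rest_cherrypy answers with {"info": [...], "return": [{"<minion>": <ret>}]}.
	var result struct {
		Return []map[string]json.RawMessage `json:"return"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return nil, err
	}
	if len(result.Return) == 0 {
		return nil, fmt.Errorf("no return data for job %s", jid)
	}
	ret, ok := result.Return[0][minion]
	if !ok {
		return nil, nil // minion has not returned yet: retry on the next reconcile
	}
	return ret, nil
}

func main() {
	// Hypothetical jid/minion values, e.g. from a previously submitted disk.dump job.
	ret, err := pollJob("https://salt-api:8000", "TOKEN", "20200527100000000000", "worker-3")
	fmt.Println(string(ret), err)
}
```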
slaperche-scality added a commit that referenced this issue May 15, 2020
TODO

Closes: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue May 25, 2020
The sync call to `disk.dump` was flaky for two reasons:
- we had a 1s timeout but no retry logic on timeout errors, so volume
  creation could fail on "slow" platforms
- Salt API doesn't handle errors triggered by concurrent execution of
  Salt states well when the call is synchronous (it returns an error 500
  with no details), so we couldn't retry.

For those reasons, we now use an async call to retrieve the volume size.

Closes: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue May 25, 2020
The Salt client code already handles the retcode; the caller is only
interested in the returned payload.

Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
@bert-e closed this as completed in 09e77ea on May 27, 2020
slaperche-scality added a commit that referenced this issue Jun 4, 2020