Handle error on disk.dump Salt call #2493

Closed
ftalbart opened this issue Apr 27, 2020 · 1 comment · Fixed by #2571
Assignees: slaperche-scality
Labels: complexity:medium, kind:bug, topic:storage

Comments

ftalbart commented Apr 27, 2020

Component:

operator

What happened:

Sometimes, Volume creation fails with:

Message:               call to GetVolumeSize failed: disk.dump failed (target=worker-3, path=/dev/disk/by-uuid/567c960e-dcba-4cef-b5b2-4d8bb0a3446e): Salt API failed with code 500: {"status": 500, "return": "An unexpected error occurred"}

If we enable debug logging on the Salt API side, we get:

[DEBUG   ] Trying to connect to: tcp://192.168.1.40:4506
[DEBUG   ] Error while processing request for: /
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/salt/netapi/rest_cherrypy/app.py", line 858, in hypermedia_handler
    ret = cherrypy.serving.request._hypermedia_inner_handler(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py", line 60, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/salt/netapi/rest_cherrypy/app.py", line 1300, in POST
    token=cherrypy.session.get('token')))
  File "/usr/lib/python2.7/site-packages/salt/netapi/rest_cherrypy/app.py", line 1192, in exec_lowstate
    ret = self.api.run(chunk)
  File "/usr/lib/python2.7/site-packages/salt/netapi/__init__.py", line 75, in run
    return l_fun(*f_call.get('args', ()), **f_call.get('kwargs', {}))
  File "/usr/lib/python2.7/site-packages/salt/netapi/__init__.py", line 103, in local
    return local.cmd(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/salt/client/__init__.py", line 754, in cmd
    **kwargs)
  File "/usr/lib/python2.7/site-packages/salt/client/__init__.py", line 358, in run_job
    raise SaltClientError(general_exception)
SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
[INFO    ] 10.233.132.74 - - [27/Apr/2020:15:56:05] "POST /?timeout=1 HTTP/1.1" 500 57 "" "Go-http-client/1.1"

The problem may be related to a busy Salt master and/or more than one state trying to run at once.

What was expected:

The Operator should retry this operation and eventually the Volume should be created.

Steps to reproduce:

@gdemonet managed to reproduce it by creating 10 sparseloop volumes at once on the same node.

Resolution proposal:

Use an async call for `disk.dump` (it is currently sync); that way we benefit from the auto-retry logic we have already implemented.
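
A minimal sketch of what such an async call could look like from the operator side (illustrative only: the `submitDiskDumpAsync` helper, the API URL and the token handling are assumptions, not the operator's actual code; the `local_async` client and the `POST /` / `GET /jobs/<jid>` layout come from Salt's rest_cherrypy module). The request returns a job ID immediately instead of blocking until the minion answers:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// submitDiskDumpAsync posts a `local_async` lowstate chunk to Salt API and
// returns the job ID (jid), which can later be polled via GET /jobs/<jid>.
func submitDiskDumpAsync(apiURL, token, target, device string) (string, error) {
	payload := map[string]interface{}{
		"client": "local_async", // async: Salt API answers with a jid right away
		"tgt":    target,
		"fun":    "disk.dump",
		"arg":    []string{device},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		return "", err
	}

	req, err := http.NewRequest(http.MethodPost, apiURL, bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Auth-Token", token) // assumes an already-authenticated session token

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	// rest_cherrypy wraps async submissions as {"return": [{"jid": "...", "minions": [...]}]}.
	var result struct {
		Return []struct {
			Jid string `json:"jid"`
		} `json:"return"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return "", err
	}
	if len(result.Return) == 0 || result.Return[0].Jid == "" {
		return "", fmt.Errorf("no jid in Salt API response (HTTP %d)", resp.StatusCode)
	}
	return result.Return[0].Jid, nil
}

func main() {
	jid, err := submitDiskDumpAsync(
		"https://salt-api:8000", "TOKEN",
		"worker-3", "/dev/disk/by-uuid/567c960e-dcba-4cef-b5b2-4d8bb0a3446e",
	)
	if err != nil {
		fmt.Println("submit failed:", err)
		return
	}
	fmt.Println("submitted disk.dump, jid:", jid)
}
```

The returned jid can then be polled on each reconcile until the minion reports a result, which is where the existing auto-retry logic applies.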

@slaperche-scality added the complexity:medium, kind:bug and topic:storage labels on Apr 30, 2020
@slaperche-scality self-assigned this on Apr 30, 2020
@slaperche-scality (Contributor) commented:

I've updated the original message using the information we got during the troubleshooting.

@slaperche-scality changed the title from "Automatic error handler for Salt" to "Handle error on disk.dump Salt call" on Apr 30, 2020
slaperche-scality added a commit that referenced this issue May 14, 2020
Until now, we only had a single Salt async job per step (one to prepare,
one to finalize), so we knew which job was running from the step we were
in.

This is going to change: we want to call `disk.dump` asynchronously, so
there will be two async jobs for the prepare step. To tell which is
which, we now carry the job name alongside its ID and bundle both into a
JobHandle.

This commit only introduces the JobHandle and uses it instead of the
JobID. Making `disk.dump` async will be done in another commit.

Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
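
A rough sketch of the JobHandle idea described in that commit (the type, field and method names here are illustrative assumptions, not the operator's actual definitions):

```go
package salt

// JobHandle identifies an in-flight Salt async job by both its purpose
// (e.g. "prepare", "finalize", "disk.dump") and the jid returned by Salt API,
// so the reconciliation loop can tell which result belongs to which call.
type JobHandle struct {
	Name string // logical name of the job (which step/call it belongs to)
	ID   string // Salt job ID (jid) to poll for the result
}

// newJob builds a handle for a freshly submitted job.
func newJob(name, jid string) JobHandle {
	return JobHandle{Name: name, ID: jid}
}

// InProgress reports whether a job is currently tracked by this handle.
func (h JobHandle) InProgress() bool {
	return h.ID != ""
}
```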
slaperche-scality added a commit that referenced this issue May 14, 2020
This allows acting on the result of the job.
It is unused for now, but we will use it once `disk.dump` is made async.

Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
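
To illustrate what "acting on the result" might look like, here is a hedged sketch of polling an async job through Salt API's `GET /jobs/<jid>` endpoint (the `pollJob` helper and its parameters are assumptions for illustration; only the endpoint and the response envelope come from Salt's rest_cherrypy module):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// pollJob fetches the result of a Salt async job. It returns (nil, nil) while
// the targeted minion has not answered yet, so callers can simply retry later.
func pollJob(apiURL, token, jid, minion string) (json.RawMessage, error) {
	req, err := http.NewRequest(http.MethodGet, apiURL+"/jobs/"+jid, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Auth-Token", token) // assumes an authenticated session token

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// rest_cherrypy answers with {"info": [...], "return": [{"<minion>": <ret>}]}.
	var result struct {
		Return []map[string]json.RawMessage `json:"return"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return nil, err
	}
	if len(result.Return) == 0 {
		return nil, fmt.Errorf("no return data for job %s", jid)
	}
	ret, ok := result.Return[0][minion]
	if !ok {
		return nil, nil // minion has not returned yet: retry on the next reconcile
	}
	return ret, nil
}

func main() {
	// Hypothetical jid/minion values, e.g. from a previously submitted disk.dump job.
	ret, err := pollJob("https://salt-api:8000", "TOKEN", "20200527100000000000", "worker-3")
	fmt.Println(string(ret), err)
}
```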
slaperche-scality added a commit that referenced this issue May 15, 2020
TODO

Closes: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue May 25, 2020
The sync call to `disk.dump` was flaky for two reasons:
- we had a 1s timeout but no retry logic on timeout errors, so volume
  creation could fail on "slow" platforms
- Salt API doesn't handle errors triggered by concurrent execution of
  Salt states well when the call is synchronous (it returns an error 500
  with no details), so we couldn't retry.

For those reasons, we now use an async call to retrieve the volume size.

Closes: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
slaperche-scality added a commit that referenced this issue May 25, 2020
The Salt client code already handles the retcode; the caller is only
interested in the returned payload.

Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
@bert-e closed this as completed in 09e77ea on May 27, 2020
slaperche-scality added a commit that referenced this issue Jun 4, 2020