Quickly completed commands sometimes stall #344

Closed
rward-ddn opened this issue Apr 25, 2022 · 12 comments · Fixed by #353

@rward-ddn

rward-ddn commented Apr 25, 2022

I've done some work to convert from paramiko to parallel-ssh, but have hit an issue where I'm sometimes seeing very short commands stall and take several minutes to complete. In this case, we're running cat on a small fio config file. I've been able to reproduce this with a simple script, getting results like this:

Started at 2022-04-25 09:01:52.893054, ended at 2022-04-25 09:01:53.672184, total time is 0:00:00.779130
Started at 2022-04-25 09:01:54.592475, ended at 2022-04-25 09:01:55.372624, total time is 0:00:00.780149
Started at 2022-04-25 09:01:56.312288, ended at 2022-04-25 09:01:57.041410, total time is 0:00:00.729122
Started at 2022-04-25 09:01:57.896660, ended at 2022-04-25 09:04:58.563031, total time is 0:03:00.666371

I'm running this on an Ubuntu 20.04 system with the target also being an Ubuntu 20.04 system. I have not seen this issue with commands that take longer to run.

Script:

#!/usr/bin/env python3
'''
Quick script to try to reproduce stall with parallel-ssh
'''

from pssh.clients.native import SSHClient
from pssh import exceptions
from datetime import datetime

hostname = "<target host>"
cmd = "cat /tmp/red-bdev-rand-rw.fio"
stdout = ""
stderr = ""
cmd_timeout = 180.0
login = "<user>"
password = "<password>"
port_num = 22
connect_retry_count = 3
keyfile = "<keyfile>"

client = SSHClient(host=hostname, user=login, password=password, port=port_num,
                   num_retries=connect_retry_count, allow_agent=False, identity_auth=False,
                   pkey=keyfile, timeout=cmd_timeout)

start = datetime.now()
host_out = client.run_command(cmd, use_pty=True, timeout=cmd_timeout)
client.wait_finished(host_output=host_out)

try:
    for line in host_out.stdout:
        stdout += line
    for line in host_out.stderr:
        stderr += line
    retcode = host_out.exit_code
except exceptions.Timeout as err:
    # May as well pull all available output
    for line in host_out.stdout:
        stdout += line
    for line in host_out.stderr:
        stderr += line
    retcode = host_out.exit_code
    raise AssertionError(f"Command {cmd} timed out on host {hostname} after {cmd_timeout} seconds. "
                                f"Partial output: {stdout} stderr: {stderr}") from err
except Exception as err:
    raise AssertionError(f"Failed in rtfutils with error {err}") from err
finally:
    client.close_channel(channel=host_out.channel)
    done = datetime.now()

print(f"Started at {start}, ended at {done}, total time is {done - start}")

Contents of the red-bdev-rand-rw.fio file:

#red-bdev-rand-rw test
[global]
name=red-bdev-rand-rw
ioengine=${IOENGINE}
filename=${FILENAME}
size=${SIZE}
direct=1
group_reporting=1
thread=1
time_based=1
runtime=90
blocksize_range=4k:3m
rw=randrw
[file1]
iodepth=16
numjobs=1
[file2]
iodepth=16
numjobs=2
[file3]
iodepth=16
numjobs=4
[file4]
iodepth=8
numjobs=8
[file5]
iodepth=4
numjobs=16
[file6]
iodepth=4
numjobs=32
@pkittenis
Member

pkittenis commented Apr 26, 2022

Hi there,

Thanks for the interest.

It sounds like the client is doing connection retries after a failed authentication rather than the command itself stalling.

Retries can be disabled with SSHClient(<..>, num_retries=1) to see what the error from the server is. The delay between retries can also be adjusted with retry_delay. The above script runs fine for me; I ran it 100 times with no stalls:

$ for (( i=0; i<100; i++ )); do python test.py; done
Total time 0.043349 seconds
Total time 0.043384 seconds
Total time 0.043329 seconds
Total time 0.043166 seconds
Total time 0.043997 seconds
<..>
Total time 0.043954 seconds

SSH servers also limit the number of sessions they allow, as well as the number of simultaneous connection startups. See the MaxStartups, MaxSessions and MaxAuthTries entries in man sshd_config. If that is the case, check your server logs to see why the server is not allowing a login. PAM may also have its own limits; check the authentication logs.
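
A minimal sketch of the num_retries / retry_delay settings mentioned above (host, user and key path are placeholders):

from pssh.clients.native import SSHClient

# num_retries=1 disables retrying, so any connection/authentication error from the
# server surfaces immediately; retry_delay (seconds) only matters when retrying.
client = SSHClient("<target host>", user="<user>", pkey="<keyfile>",
                   num_retries=1, retry_delay=1)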

@rward-ddn
Author

Hmm, it's not the connection retries; I've tried running with enable_debug_logger() and it only shows a single connection.

Connecting to <ip>
Proceeding with private key authentication
Private key is provided as str, loading from private key file path
Authentication completed successfully - setting session to non-blocking mode
Executing command 'b'cat /tmp/red-bdev-rand-rw.fio''
Closing channel
Reading from stdout buffer, timeout=180.0

The stall happens between Executing command and Closing channel. I'll poke around our environment stuff and see if anything seems relevant.

@pkittenis
Member

pkittenis commented Apr 28, 2022

Thank you for the debug output.

The library does not do anything between Executing command <..> and Closing channel other than wait for the server to respond that the command has finished executing.

In your authentication/SSH server logs, you should see lines like (this will vary depending on the system)

Opening session for <username>
Session closed for <username>

PAM logs will have similar lines for the opening of a PTY and there are limits on those as well.

I expect running the same command with ssh -t <hostname> "cat /tmp/red-bdev-rand-rw.fio" behaves the same, in which case it is an environment issue.

To be closed unless an issue specific to the library can be reproduced.

@rward-ddn
Author

Closing looks good. I'm currently thinking the native client is somehow tickling a driver issue. I've tried again with the other client and haven't been able to hit the stall, but when running with the native client I see log messages related to a driver we have some known issues with, although since that driver has nothing to do with networking I'm not sure how it would be involved.

@pkittenis
Member

I've tried again with the other client

Do you mean pssh.clients.ssh.SSHClient or another library?

@rward-ddn
Author

With pssh.clients.ssh.SSHClient and no other changes.

@pkittenis
Member

pkittenis commented May 3, 2022

Thanks for the feedback.

That is a difference in behaviour between the two clients, which I consider a bug. The underlying libraries are different, but the purpose of the clients is to normalise that difference in behaviour.

I have a good idea of the cause and it should be possible to handle it in the native client as well. The two libraries handle the stdout/stderr streams differently, and it looks like that can cause a race condition in the native client when the streams are combined, as they are when a PTY is used.

Thanks for reporting.

@tatokis

tatokis commented May 23, 2022

I believe I am encountering the same issue, however

as they are when a PTY is used

I am not using a pty, and I read stdout and stderr separately.

It seems related to setting any kind of timeout in the SSHClient constructor; if timeout is not specified, I do not observe any stalling. However, since I require a timeout (in my use case the target can stop responding completely), I dug into the source and hacked together a way to set a timeout without triggering the issue (unfortunately I do not have the time to create a proper patch that fixes the actual issue).

My "solution" is to keep specifying timeout=None in the constructor, but manually forcing the timeout here:

with GTimeout(seconds=self.timeout):

For example, with GTimeout(seconds=15):.

This successfully works around the stalls for me, while still preserving the general timeout in case the ssh server stops responding.
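
For a similar stopgap without editing the library source, here is a rough application-side sketch of the same idea, assuming the gevent event loop parallel-ssh already uses: pass timeout=None to the constructor and impose the deadline around the blocking calls instead. Host, key path and the 15-second deadline are illustrative; this is an approximation of the in-library change described above, not a tested fix.

from gevent import Timeout as GTimeout
from pssh.clients.native import SSHClient

# No constructor timeout, so the stall described in this issue is not triggered.
client = SSHClient("<target host>", user="<user>", pkey="<keyfile>", timeout=None)

# Enforce the deadline externally; gevent.Timeout raises once 15 seconds elapse,
# e.g. if the server stops responding entirely.
with GTimeout(seconds=15):
    host_out = client.run_command("cat /tmp/red-bdev-rand-rw.fio")
    client.wait_finished(host_out)
    for line in host_out.stdout:
        print(line)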

Hope this helps in some way. If it is actually a different problem, I can open a new issue.

@pkittenis
Member

That's very helpful for trying to reproduce this; it sounds like the same issue, thank you.

@SvanT

SvanT commented Jun 20, 2022

After playing around a bit with this (really nice library btw :) ) I think this issue comes from this line:
https://github.com/ParallelSSH/parallel-ssh/blob/master/pssh/clients/base/single.py#L699

Changing this to
poller.poll(10)
seems to make the stalls go away, and socket timeouts still seem to be triggered in my limited testing. I haven't dug deeper for the root cause, but maybe it causes some kind of deadlock: if that line waits for up to 60s, it may be waiting for data that is fetched in the same greenlet.

On a side note, if you are doing really short calls, like touching a file over a local network, the 100ms read sleeps are very high. I get much higher requests-per-second throughput by changing this line
https://github.com/ParallelSSH/parallel-ssh/blob/master/pssh/clients/reader.py#L80
to sleep(.001).
That might be a CPU/latency tradeoff, but I don't see any high CPU usage when setting these values a bit lower (there are also two such sleeps in https://github.com/ParallelSSH/parallel-ssh/blob/master/pssh/clients/ssh/single.py).

@pkittenis
Member

On a side note, if you are doing really short calls, like touching a file over a local network, the 100ms read sleeps are very high. I get much higher requests-per-second throughput by changing this line
https://github.com/ParallelSSH/parallel-ssh/blob/master/pssh/clients/reader.py#L80
to sleep(.001).

Yes, you are right, it is a CPU/latency tradeoff. I have a branch with performance enhancements I am experimenting with and the above is one of the changes it makes. There are some very promising results, but I want to do a lot more testing with real-world environments before merging those changes. Watch out for updates soon.

On this issue, thank you for the investigation; I have been able to replicate it.

pkittenis added a commit that referenced this issue Jul 31, 2022
…is used and running short lived commands.

Resolves #344.
pkittenis added a commit that referenced this issue Jul 31, 2022 (#353)

* Fixed race condition with native clients when global timeout setting is used and running short lived commands.
Resolves #344.
* Updated changelog
* Updated default log formatter set by `pssh.utils` enable logger functions.
@pkittenis
Member

2.11.1 resolves this issue. It was indeed a race condition when a timeout was used in the native client.

Thanks for the investigation @SvanT
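
If anyone else hit this, upgrading past that release should be all that is needed, e.g.:

$ pip install --upgrade "parallel-ssh>=2.11.1"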
