This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

SSH locks up after a few hours #909

Closed
vaijab opened this issue Oct 7, 2015 · 21 comments

Comments

@vaijab

vaijab commented Oct 7, 2015

I found that sshd locks up after a machine has been up on EC2 for a few hours. I noticed this started happening quite a few versions back.

$ cat /etc/os-release 
NAME=CoreOS
ID=coreos
VERSION=815.0.0
VERSION_ID=815.0.0
BUILD_ID=
PRETTY_NAME="CoreOS 815.0.0"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

If I already have an ssh connection open it stays up, but when I try to establish a new one, ssh just hangs. At the TCP level I can see that a new connection gets established, but it seems there is nothing to hand the new ssh connection over to.

I believe that systemd accepts the new connection and spawns a new sshd instance for it, and that step seems to be failing.
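
For reference, the socket-activated setup works roughly like this (a minimal sketch of the usual systemd pattern, not necessarily the exact units CoreOS ships): sshd.socket listens on port 22 with Accept=yes, and systemd spawns one [email protected] instance per incoming connection.

# sshd.socket (sketch)
[Socket]
ListenStream=22
Accept=yes

# [email protected] (sketch)
[Service]
ExecStart=-/usr/sbin/sshd -i
StandardInput=socket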

$ ssh -l core ec2-52-30-1-206.eu-west-1.compute.amazonaws.com -vvv
OpenSSH_7.1p1, OpenSSL 1.0.2d-fips 9 Jul 2015
debug1: Reading configuration data /home/vaijab/.ssh/config
debug1: /home/vaijab/.ssh/config line 1: Applying options for *
debug1: /home/vaijab/.ssh/config line 20: Applying options for *.amazonaws.com
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 56: Applying options for *
debug2: ssh_connect: needpriv 0
debug1: Connecting to ec2-52-30-1-206.eu-west-1.compute.amazonaws.com [52.30.1.206] port 22.
debug1: Connection established.
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.1

... and it just hangs forever.

The same happens from inside the instance itself:

$ ssh 127.0.0.1 -vvv
OpenSSH_6.9p1, OpenSSL 1.0.1p 9 Jul 2015
debug1: Reading configuration data /etc/ssh/ssh_config
debug2: ssh_connect: needpriv 0
debug1: Connecting to 127.0.0.1 [127.0.0.1] port 22.
debug1: Connection established.
debug1: key_load_public: No such file or directory
debug1: identity file /home/core/.ssh/id_rsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/core/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/core/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/core/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/core/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/core/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_6.9

Just to add: rebooting an instance temporarily fixes the issue, but only for a few hours.

@vaijab
Author

vaijab commented Oct 7, 2015

/cc @danharvey - I believe you experienced something similar?

@danharvey

Yes, I've noticed something similar on EC2 instances too. We are running 766.4.0 now, but we also saw this on 766.3.0. We have some instances that, after a few hours (sometimes less), would no longer allow new SSH connections. All other TCP connections still work fine and EC2 reports the instance as healthy.

I've never had a connection already open to an instance when it failed, so I can't provide more details, sorry.

We also had an issue where overly keen health checks caused the load on our 4-core instances to rise to around 6-8, which we thought might have been triggering it. Now that we've fixed that, we no longer have issues with SSH, and since we've not changed anything else it might be related. It's been fine for around two days now.

What's the load on the instances you are seeing issues with?

@vaijab
Author

vaijab commented Oct 7, 2015

What's the load on the instances you are seeing issues with?

There is almost zero load.

@vaijab
Author

vaijab commented Oct 8, 2015

I kept an existing ssh connection open so that I could investigate further. I waited a few hours, and now when I try to establish a new connection I can see that my ssh client connects and then hangs waiting for something.

On the server side I see the following in the logs:

Oct 08 13:48:59 ip-10-50-2-5.eu-west-1.compute.internal systemd[1]: Started OpenSSH per-connection server daemon (81.101.6.153:51228).

ps output

$ ps aux | grep sshd
root      4344  0.0  0.0 185272  1948 ?        Ss   13:48   0:00 (sshd)                                                            
root      4477  0.0  0.0  26432  4252 ?        Ss   11:21   0:00 sshd: core [priv]   
core      4479  0.0  0.0  26432  2696 ?        S    11:21   0:00 sshd: core@pts/0    
core     13474  0.0  0.0   4404   776 pts/0    S+   13:54   0:00 grep --colour=auto sshd
root     18774  0.0  0.0 185272  1944 ?        Ss   13:38   0:00 (sshd)                                                            
root     18920  0.0  0.0 185272  1944 ?        Ss   13:38   0:00 (sshd)                                                            
root     20677  0.0  0.0 185272  1944 ?        Ss   13:39   0:00 (sshd)                                                            
root     21410  0.0  0.0 185272  1944 ?        Ss   13:40   0:00 (sshd)              

Every time I try to connect via ssh now, a new (sshd) process gets created and it never exits.
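
A quick way to watch these pile up (a sketch, assuming the stock socket-activated setup; the parenthesized (sshd) name is typically what systemd shows for a child it has forked but not yet exec'd):

$ systemctl list-units 'sshd@*'      # one per-connection instance per incoming ssh connection
$ ps aux | awk '$11 == "(sshd)"'     # stuck children that never reach exec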

@vaijab
Author

vaijab commented Oct 8, 2015

AHA! strace to the rescue. It's great that strace is included in CoreOS nowadays!

I ran strace on one of the (sshd) PIDs:

$ sudo strace -p 32480
Process 32480 attached
connect(3, {sa_family=AF_LOCAL, sun_path="/run/systemd/journal/stdout"}, 29^CProcess 32480 detached
 <detached ...>

As you can see, sshd is blocked waiting on the journal socket, which never becomes available. I just did a sudo systemctl restart systemd-journald.service and voila! SSH works again.

Now the question is: why does the journal get locked up?
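
For anyone else debugging this, a minimal way to confirm the journald side and apply the workaround (a sketch; the socket interpretation is an assumption on my part):

$ sudo ss -xl | grep journal/stdout          # journald's stdout stream socket; new connections may queue here when it is wedged
$ sudo systemctl restart systemd-journald    # workaround: the stuck (sshd) children then proceed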

@danharvey

Interestingly, it was also journald that was eating up the CPU before (and driving up the load), far more than it should have been. It was handling around ~200 lines/s, which journald should easily be able to cope with. Again, we saw that on 766.4.0.

@pmoust

pmoust commented Oct 9, 2015

We experience the same issue across 6xx to 766.4.0.

On hosts with very 'noisy' systemd services (streaming lots of output to the journal), spawning an ssh connection is a no-go.

In most cases I was able to confirm via journalctl --verify that the journal checks were failing.
The workaround for me was to rm the parts reported as FAIL and then systemctl restart systemd-journald.
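
Roughly, the cleanup looked like this (a sketch; the journal directory and file names are placeholders for whatever --verify flags as FAIL on your machine):

$ sudo journalctl --verify                             # corrupt journal files are reported as FAIL
$ sudo rm /var/log/journal/<machine-id>/<failed-file>  # remove only the files flagged FAIL above
$ sudo systemctl restart systemd-journald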

The behaviour is reproducible: create a unit that echoes noise and wait for journald to fail on you (see the sketch below).

After fiddling with my units and piping stdout/stderr to /dev/null, the issue did not occur again.
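
A minimal sketch of both steps (the unit name and command here are hypothetical): a unit that floods the journal to reproduce the lockup, and the output settings that silence it as a mitigation.

# /etc/systemd/system/noise.service -- hypothetical unit that floods the journal
[Service]
ExecStart=/usr/bin/yes "journal noise"

# mitigation referenced above: add these to the noisy unit so its output is dropped
# instead of being streamed to journald
#   StandardOutput=null
#   StandardError=null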

vaijab added a commit to UKHomeOffice/application-container-platform that referenced this issue Oct 12, 2015
Well, this is a workaround. kube proxy spits out way too many messages for the
journal to handle (it's possible that the journal has bugs), which makes the
journal lock up.

See coreos/bugs#909

I believe that kube proxy logging issues have been resolved in the next
release.
@vcaputo

vcaputo commented Oct 26, 2015

While it's concerning to see a journald malfunction interfering with sshd, the journald issue is probably related to #322, which is now fixed in alpha.

@antoineco

I'm also experiencing this on the current beta, 835.2.0.

@vaijab how did you take control of your EC2 instance when this happened to you? You mentioned debugging "from inside".

@vaijab
Author

vaijab commented Nov 6, 2015

@antoineco existing ssh sessions do not get closed when journal starts to fail.

@vcaputo

vcaputo commented Nov 10, 2015

This issue I filed the other day may be related: #966

Upon review of this issue, I'm pretty certain it's a product of a systemd bug combined with the socket activation of sshd. We're going to switch sshd to a persistent, non-socket-activated service to make it more available in the face of a malfunctioning systemd (and hopefully so we can all better identify what systemd is doing in these situations).
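
A persistent, non-socket-activated sshd unit would look roughly like this (a sketch of the general approach, not the exact unit CoreOS later shipped):

# sshd.service (sketch)
[Unit]
Description=OpenSSH server daemon
After=network.target

[Service]
ExecStart=/usr/sbin/sshd -D
Restart=always

[Install]
WantedBy=multi-user.target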

@mischief

mischief commented Dec 7, 2015

can anyone reproduce this on the latest alpha (884.0.0)? my ec2 instance has been up for ~2.5hrs, using a dummy load of:

for i in `seq 1 100`; do docker run -d --name nginx-$i nginx; done
while sleep 1; do docker ps --format "{{.Names}}" | xargs docker inspect -f "{{.NetworkSettings.IPAddress}}" | xargs curl -s | logger; done

and:

while ssh [email protected] systemctl status --no-pager; do sleep 5; done; xmessage ssh broke

it hasn't failed yet.

@mischief mischief removed their assignment Dec 9, 2015
@crawford crawford removed this from the CoreOS 891.0.0 milestone Dec 9, 2015
@vaijab
Author

vaijab commented Jan 28, 2016

This has been resolved in recent CoreOS versions.

@vaijab vaijab closed this as completed Jan 28, 2016
@tsuna

tsuna commented Feb 11, 2016

Experiencing this with CoreOS 835.11.0. Any idea when the fix will make it to the stable channel?

@catalin-ilea

We're experiencing this with CoreOS 899.13.0. Any updates on this?

@davidquarles

We're experiencing this with 899.15.0. Any updates on this? We're stuck rebooting production servers in the meantime.

@crawford crawford reopened this Apr 13, 2016
@crawford
Contributor

Can you provide logs from the previous boot after you reboot the machine? journalctl --unit=sshd.socket --boot=-1 and journalctl --identifier=sshd --boot=-1

@davidquarles

Sure thing.
journalctl -u sshd.socket: https://gobin.io/JuwC
journalctl --identifier=sshd: https://gobin.io/GYfk

@crawford crawford added this to the CoreOS 1038.0.0 milestone Apr 21, 2016
@crawford crawford modified the milestones: CoreOS 1038.0.0, CoreOS 1052.0.0 May 6, 2016
@crawford crawford removed this from the CoreOS 1066.0.0 milestone Jun 2, 2016
@vaijab
Author

vaijab commented Jun 8, 2016

I have not noticed this issue for a long time in recent versions of CoreOS. Maybe we should close this issue?

@crawford
Contributor

OK. Let's reopen this if it happens again.

@tstivers1990

I'm experiencing this issue on a DigitalOcean droplet running the latest release. I haven't investigated much, so I don't know if it's the same issue or not, but after a few days I can no longer connect to the CoreOS instance via SSH.
