This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

SSH locks up after a few hours #909

Closed
vaijab opened this issue Oct 7, 2015 · 21 comments

Comments

@vaijab

vaijab commented Oct 7, 2015

I found that sshd locks up after a machine has been up on EC2 for a few hours. I noticed this started happening quite a few versions back.

$ cat /etc/os-release 
NAME=CoreOS
ID=coreos
VERSION=815.0.0
VERSION_ID=815.0.0
BUILD_ID=
PRETTY_NAME="CoreOS 815.0.0"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

If I already have an ssh connection open it stays up, but when I try to establish a new one, ssh just hangs. At the TCP level I can see that a new connection gets established, but it seems there is nothing to hand the new ssh connection over to.

I believe that systemd accepts the new connection and spawns a new sshd instance for it, and that step seems to be failing.
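
For reference, the socket-activated setup works roughly like this (a minimal sketch of the usual systemd pattern, not necessarily the exact units CoreOS ships): sshd.socket listens on port 22 with Accept=yes, and systemd spawns one [email protected] instance per incoming connection.

# sshd.socket (sketch)
[Socket]
ListenStream=22
Accept=yes

# [email protected] (sketch)
[Service]
ExecStart=-/usr/sbin/sshd -i
StandardInput=socket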

$ ssh -l core ec2-52-30-1-206.eu-west-1.compute.amazonaws.com -vvv
OpenSSH_7.1p1, OpenSSL 1.0.2d-fips 9 Jul 2015
debug1: Reading configuration data /home/vaijab/.ssh/config
debug1: /home/vaijab/.ssh/config line 1: Applying options for *
debug1: /home/vaijab/.ssh/config line 20: Applying options for *.amazonaws.com
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 56: Applying options for *
debug2: ssh_connect: needpriv 0
debug1: Connecting to ec2-52-30-1-206.eu-west-1.compute.amazonaws.com [52.30.1.206] port 22.
debug1: Connection established.
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.1

... and it just hangs forever.

The same happens from inside the instance itself:

$ ssh 127.0.0.1 -vvv
OpenSSH_6.9p1, OpenSSL 1.0.1p 9 Jul 2015
debug1: Reading configuration data /etc/ssh/ssh_config
debug2: ssh_connect: needpriv 0
debug1: Connecting to 127.0.0.1 [127.0.0.1] port 22.
debug1: Connection established.
debug1: key_load_public: No such file or directory
debug1: identity file /home/core/.ssh/id_rsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/core/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/core/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/core/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/core/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/core/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_6.9

Just to add: rebooting an instance temporarily fixes the issue, but only for a few hours.

@vaijab
Author

vaijab commented Oct 7, 2015

/cc @danharvey - I believe you experienced something similar?

@danharvey

Yes, I've noticed something similar on EC2 instances too. We are running 766.4.0 now, but we also saw this on 766.3.0. We have some instances that, after a few hours (sometimes less), would no longer allow new SSH connections. All other TCP connections still work fine and EC2 reports the instance as healthy.

I've never had a connection already open to an instance when it failed, so I can't provide more details, sorry.

We also had an issue where overly keen health checks caused the load on our 4-core instances to rise to around 6-8, which we thought might have been triggering it. Now that we've fixed that, we no longer have issues with SSH, and since we've not changed anything else it might be related. It's been fine for around two days now.

What's the load on the instances you are seeing issues with?

@vaijab
Author

vaijab commented Oct 7, 2015

What's the load on the instances you are seeing issues with?

There is almost zero load.

@vaijab
Author

vaijab commented Oct 8, 2015

I kept an existing ssh connection open so that I could investigate further. I waited a few hours, and now when I try to establish a new connection I can see that my ssh client connects and then hangs waiting for something.

On the server side I see the following in the logs:

Oct 08 13:48:59 ip-10-50-2-5.eu-west-1.compute.internal systemd[1]: Started OpenSSH per-connection server daemon (81.101.6.153:51228).

ps output

$ ps aux | grep sshd
root      4344  0.0  0.0 185272  1948 ?        Ss   13:48   0:00 (sshd)                                                            
root      4477  0.0  0.0  26432  4252 ?        Ss   11:21   0:00 sshd: core [priv]   
core      4479  0.0  0.0  26432  2696 ?        S    11:21   0:00 sshd: core@pts/0    
core     13474  0.0  0.0   4404   776 pts/0    S+   13:54   0:00 grep --colour=auto sshd
root     18774  0.0  0.0 185272  1944 ?        Ss   13:38   0:00 (sshd)                                                            
root     18920  0.0  0.0 185272  1944 ?        Ss   13:38   0:00 (sshd)                                                            
root     20677  0.0  0.0 185272  1944 ?        Ss   13:39   0:00 (sshd)                                                            
root     21410  0.0  0.0 185272  1944 ?        Ss   13:40   0:00 (sshd)              

Every time I try to connect via ssh now, a new (sshd) process gets created and it never exits.
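
A quick way to watch these pile up (a sketch, assuming the stock socket-activated setup; the parenthesized (sshd) name is typically what systemd shows for a child it has forked but not yet exec'd):

$ systemctl list-units 'sshd@*'      # one per-connection instance per incoming ssh connection
$ ps aux | awk '$11 == "(sshd)"'     # stuck children that never reach exec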

@vaijab
Author

vaijab commented Oct 8, 2015

AHA! strace to the rescue. It's great that strace is included in CoreOS nowadays!

I ran strace on one of the (sshd) PIDs:

$ sudo strace -p 32480
Process 32480 attached
connect(3, {sa_family=AF_LOCAL, sun_path="/run/systemd/journal/stdout"}, 29^CProcess 32480 detached
 <detached ...>

As you can see, sshd is blocked waiting on the journal socket, which never becomes available. I just did a sudo systemctl restart systemd-journald.service and voila! SSH works again.

Now the question is: why does the journal get locked up?
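
For anyone else debugging this, a minimal way to confirm the journald side and apply the workaround (a sketch; the socket interpretation is an assumption on my part):

$ sudo ss -xl | grep journal/stdout          # journald's stdout stream socket; new connections may queue here when it is wedged
$ sudo systemctl restart systemd-journald    # workaround: the stuck (sshd) children then proceed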

@danharvey

Interestingly, it was also journald that was eating up the CPU before (and driving up the load), far more than it should have been. It was handling around ~200 lines/s, which journald should easily be able to cope with. Again, we saw that on 766.4.0.

@pmoust

pmoust commented Oct 9, 2015

We experience the same issue across 6xx to 766.4.0.

On hosts with very 'noisy' systemd services (streaming lots of output to the journal), spawning an ssh connection is a no-go.

In most cases I was able to confirm via journalctl --verify that the journal checks were failing.
The workaround for me was to rm the parts reported as FAIL and then systemctl restart systemd-journald.
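
Roughly, the cleanup looked like this (a sketch; the journal directory and file names are placeholders for whatever --verify flags as FAIL on your machine):

$ sudo journalctl --verify                             # corrupt journal files are reported as FAIL
$ sudo rm /var/log/journal/<machine-id>/<failed-file>  # remove only the files flagged FAIL above
$ sudo systemctl restart systemd-journald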

The behaviour is reproducible: create a unit that echoes noise and wait for journald to fail on you (see the sketch below).

After fiddling with my units and piping stdout/stderr to /dev/null, the issue did not occur again.
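
A minimal sketch of both steps (the unit name and command here are hypothetical): a unit that floods the journal to reproduce the lockup, and the output settings that silence it as a mitigation.

# /etc/systemd/system/noise.service -- hypothetical unit that floods the journal
[Service]
ExecStart=/usr/bin/yes "journal noise"

# mitigation referenced above: add these to the noisy unit so its output is dropped
# instead of being streamed to journald
#   StandardOutput=null
#   StandardError=null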

vaijab added a commit to UKHomeOffice/application-container-platform that referenced this issue Oct 12, 2015
Well, this is a workaround. kube proxy spits out way too many messages for the
journal to handle (it's possible that the journal has bugs), which makes the
journal lock up.

See coreos/bugs#909

I believe that kube proxy logging issues have been resolved in the next
release.
@vcaputo

vcaputo commented Oct 26, 2015

While it's concerning to see a journald malfunction interfering with sshd, the journald issue is probably related to #322, which is now fixed in alpha.

@antoineco

I'm also experiencing this on the current beta, 835.2.0.

@vaijab how did you take control of your EC2 instance when this happened to you? You mentioned debugging "from inside".

@vaijab
Author

vaijab commented Nov 6, 2015

@antoineco existing ssh sessions do not get closed when journal starts to fail.

@vcaputo

vcaputo commented Nov 10, 2015

This issue I filed the other day may be related: #966

Upon review of this issue, I'm pretty certain it's a product of a systemd bug combined with the socket activation of sshd. We're going to switch sshd to a persistent, non-socket-activated service to make it more available in the face of a malfunctioning systemd (and hopefully so we can all better identify what systemd is doing in these situations).
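
A persistent, non-socket-activated sshd unit would look roughly like this (a sketch of the general approach, not the exact unit CoreOS later shipped):

# sshd.service (sketch)
[Unit]
Description=OpenSSH server daemon
After=network.target

[Service]
ExecStart=/usr/sbin/sshd -D
Restart=always

[Install]
WantedBy=multi-user.target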

@mischief

mischief commented Dec 7, 2015

can anyone reproduce this on the latest alpha (884.0.0)? my ec2 instance has been up for ~2.5hrs, using a dummy load of:

for i in `seq 1 100`; do docker run -d --name nginx-$i nginx; done
while sleep 1; do docker ps --format "{{.Names}}" | xargs docker inspect -f "{{.NetworkSettings.IPAddress}}" | xargs curl -s | logger; done

and:

while ssh [email protected] systemctl status --no-pager; do sleep 5; done; xmessage ssh broke

it hasn't failed yet.

@mischief mischief removed their assignment Dec 9, 2015
@crawford crawford removed this from the CoreOS 891.0.0 milestone Dec 9, 2015
@vaijab
Author

vaijab commented Jan 28, 2016

This has been resolved in recent CoreOS versions.

@vaijab vaijab closed this as completed Jan 28, 2016
@tsuna

tsuna commented Feb 11, 2016

Experiencing this with CoreOS 835.11.0. Any idea when the fix will make it to the stable channel?

@catalin-ilea

We're experiencing this with CoreOS 899.13.0. Any updates on this?

@davidquarles

We're experiencing this with 899.15.0. Any updates on this? We're stuck rebooting production servers in the meantime.

@crawford crawford reopened this Apr 13, 2016
@crawford
Contributor

Can you provide logs from the previous boot after you reboot the machine? journalctl --unit=sshd.socket --boot=-1 and journalctl --identifier=sshd --boot=-1

@davidquarles

Sure thing.
journalctl -u sshd.socket: https://gobin.io/JuwC
journalctl --identifier=sshd: https://gobin.io/GYfk

@crawford crawford added this to the CoreOS 1038.0.0 milestone Apr 21, 2016
@crawford crawford modified the milestones: CoreOS 1038.0.0, CoreOS 1052.0.0 May 6, 2016
@crawford crawford removed this from the CoreOS 1066.0.0 milestone Jun 2, 2016
@vaijab
Author

vaijab commented Jun 8, 2016

I have not noticed this issue for a long time in recent versions of CoreOS. Maybe we should close this issue?

@crawford
Contributor

OK. Let's reopen this if it happens again.

@tstivers1990

I'm experiencing this issue on a DigitalOcean droplet running the latest release. I haven't investigated much, so I don't know if it's the same issue or not, but after a few days I can no longer connect to the CoreOS instance via SSH.
