SSH locks up after a few hours #909
/cc @danharvey - I believe you experienced something similar?
Yes, I've noticed something similar on EC2 instances. We are running 766.4.0 now but also saw this on 766.3.0. Some of our instances would, after a few hours (sometimes less), no longer accept new SSH connections; all other TCP connections still worked fine and EC2 reported the instances as healthy. I've never had an existing connection open to one that failed, so I can't provide more details, sorry. We also had an issue where overly keen health checks caused the load on our 4-core instances to rise to around 6-8, which we thought might have been triggering it. Now that we've fixed that we no longer have issues with SSH, and since we haven't changed anything else, it might be related. It's been fine for around two days now. What's the load on the instances you are seeing issues with?
There is almost zero load.
I kept an existing ssh connection open so that I could investigate further. I waited a few hours, and now when I try to establish a new connection, my ssh client opens the connection and then hangs waiting for something. On the server side I see the following:
ps output:

```
$ ps aux | grep sshd
root      4344  0.0  0.0 185272  1948 ?        Ss   13:48   0:00 (sshd)
root      4477  0.0  0.0  26432  4252 ?        Ss   11:21   0:00 sshd: core [priv]
core      4479  0.0  0.0  26432  2696 ?        S    11:21   0:00 sshd: core@pts/0
core     13474  0.0  0.0   4404   776 pts/0    S+   13:54   0:00 grep --colour=auto sshd
root     18774  0.0  0.0 185272  1944 ?        Ss   13:38   0:00 (sshd)
root     18920  0.0  0.0 185272  1944 ?        Ss   13:38   0:00 (sshd)
root     20677  0.0  0.0 185272  1944 ?        Ss   13:39   0:00 (sshd)
root     21410  0.0  0.0 185272  1944 ?        Ss   13:40   0:00 (sshd)
```

Every time I try to connect via ssh now, a new `(sshd)` process shows up in this list.
AHA! Did a strace on one of the stuck `(sshd)` processes:
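The strace output itself did not survive in the thread. For reference, a minimal sketch of the command, attaching to one of the stuck PIDs from the ps listing above (the PID is only an example):

```sh
# Attach to one of the stuck (sshd) processes and follow any children it forks.
# 18774 is taken from the ps output above; substitute a PID from your own host.
sudo strace -f -p 18774
```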
As you can see, sshd is waiting on the journal socket, which never becomes available. Now the question is: why does the journal get locked up?
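The exact recovery command from this comment was lost in the thread; as an assumption on my part (not something the original text confirms), the usual way to check and recover would be to inspect journald and restart it:

```sh
# Check whether journald itself is wedged, then restart it; new ssh
# connections should start working again if journald was the blocker.
systemctl status systemd-journald --no-pager
sudo systemctl restart systemd-journald
```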
Interestingly, it was also journald that was eating up the CPU before (and driving the load up), far more than it should have been. The rate was only around ~200 lines/s, which journald should easily be able to handle. Again, we saw that on 766.4.0.
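For anyone wanting to sanity-check a similar rate on their own host, a rough sketch (the unit name is a placeholder):

```sh
# Rough ingest rate: count the journal lines a unit produced in the last minute.
# Replace noisy.service with the unit you suspect.
journalctl -u noisy.service --since "1 min ago" --no-pager | wc -l
```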
We experience the same issue on everything from the 6xx series up to 766.4.0. On hosts with very 'noisy' systemd services (streaming lots of output to the journal), spawning an ssh connection is a no-go. In most cases I was able to verify that the journal checks were failing. The behaviour is reproducible: create a unit that echoes noise and wait for your journald to fail you. After fiddling with my units and piping stderr/stdout to /dev/null, the issue did not occur again.
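A minimal sketch of that workaround, assuming a hypothetical noisy.service (both the unit name and the drop-in path are placeholders):

```sh
# Silence a noisy unit's stdout/stderr so it stops flooding journald.
sudo mkdir -p /etc/systemd/system/noisy.service.d
sudo tee /etc/systemd/system/noisy.service.d/10-quiet.conf <<'EOF' >/dev/null
[Service]
StandardOutput=null
StandardError=null
EOF
sudo systemctl daemon-reload
sudo systemctl restart noisy.service
```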
Well, this is a workaround. kube-proxy spits out far too many messages for the journal to handle (it's possible that the journal has bugs), which makes the journal lock up. See coreos/bugs#909. I believe the kube-proxy logging issues have been resolved in the next release.
While it's concerning to see a malfunctioning journald interfere with sshd, the journald issue is probably related to #322, which is now fixed in alpha.
I'm also experiencing this on the current beta, 835.2.0. @vaijab how did you take control of your EC2 instance when this happened to you? You mentioned debugging "from inside".
@antoineco existing ssh sessions do not get closed when the journal starts to fail.
This issue I filed the other day may be related: #966. Upon review of this issue, I'm pretty certain it's a product of a systemd bug combined with the socket activation of sshd. We're going to switch sshd to a persistent, non-socket-activated service to make it more available in the face of a malfunctioning systemd (and hopefully so we can all better identify what systemd is doing in these situations).
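On a single host, that mitigation would look roughly like the sketch below, assuming the stock CoreOS unit names (sshd.socket and sshd.service); the eventual image-level change may differ:

```sh
# Stop accepting ssh connections via socket activation and run sshd as a
# persistent daemon instead.
sudo systemctl disable sshd.socket
sudo systemctl stop sshd.socket
sudo systemctl enable sshd.service
sudo systemctl start sshd.service
```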
Can anyone reproduce this on the latest alpha (884.0.0)? My EC2 instance has been up for ~2.5 hrs, using a dummy load of:

```
for i in `seq 1 100`; do docker run -d --name nginx-$i nginx; done
while sleep 1; do docker ps --format "{{.Names}}" | xargs docker inspect -f "{{.NetworkSettings.IPAddress}}" | xargs curl -s | logger; done
```

and:

```
while ssh [email protected] systemctl status --no-pager; do sleep 5; done; xmessage ssh broke
```

It hasn't failed yet.
This has been resolved in recent CoreOS versions.
Experiencing this with CoreOS 835.11.0. Any idea when the fix will make it to the stable channel?
We're experiencing this with CoreOS 899.13.0. Any updates on this?
We're experiencing this with 899.15.0. Any updates on this? We're stuck rebooting production servers in the meantime.
Can you provide logs from the previous boot after you reboot the machine?
Sure thing.
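For anyone else asked for the same thing, a sketch of how to pull the previous boot's journal (assumes persistent journal storage):

```sh
# Dump the previous boot's journal to a file that can be attached to the issue.
sudo journalctl -b -1 --no-pager > previous-boot.log
```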
I have not noticed this issue for a long time on recent versions of CoreOS. Maybe we should close this issue?
OK. Let's reopen this if it happens again.
I'm experiencing this issue on a DigitalOcean droplet running the latest release. I haven't investigated much, so I don't know whether it's the same issue or not, but after a few days I can no longer connect to the CoreOS instance via SSH.
I found that sshd locks up a few hours after a machine has been up on EC2. I noticed this started happening quite a few versions back.
If I have an ssh connection up it stays up, but when I try to establish a new one, ssh just hangs. I can see that at the TCP level a new connection gets established, but it seems there is nothing to hand the new ssh connection over to.
I believe that systemd accepts the new connection and spawns a new sshd for it, and that spawn is what seems to be failing.
... and it just hangs forever.
The same happens from inside the instance itself:
Just to add that rebooting an instance only fixes the issue temporarily; it locks up again a few hours later.
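For context, sshd on these images is socket-activated rather than running as a persistent daemon. A quick way to see that from an already-open session (a sketch, assuming the stock CoreOS unit names):

```sh
# sshd.socket owns the listening port; each accepted connection should
# spawn a per-connection sshd instance.
systemctl status sshd.socket --no-pager
systemctl list-units 'sshd@*' --no-pager
```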