sshd.socket stops working after a while #2181
Comments
@euank, sure, I already made the switch on some nodes that are more problematic than others, thanks a lot! I published this because I spent a good couple of days trying to pinpoint the exact cause of that weird shell-like 'too many arguments' message but failed horribly. I just thought I was bringing something new, but I guess my searching skills plain suck!
@mrrandrade I haven't seen that specific error from the sshd socket before. Thanks for filing the issue!
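For anyone who wants the workaround mentioned above, here is a minimal sketch of switching from socket activation to the standalone daemon, assuming the node ships both sshd.socket and sshd.service as stock Container Linux does:

```
# Switch from socket-activated SSH to the persistent sshd daemon.
# Do this from an existing session or the console, since port 22 briefly
# has no listener while the units are swapped.
sudo systemctl disable --now sshd.socket
sudo systemctl enable --now sshd.service

# Confirm the daemon is up and listening again.
systemctl status sshd.service
```

This sidesteps the socket unit entirely, at the cost of keeping a long-running sshd process on every node.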
We are experiencing this as well. We can switch to sshd.service, but I'm not particularly happy about doing that without figuring out the root cause. Saying that "systemd is in a broken state" just makes me want to run away and start a life as a gardener or something.

I've tried to get to the bottom of this, but I haven't had any success, partly because reproducing this issue takes a day. What I have noticed so far:

We are only experiencing this on our test cluster, which is identical to our production clusters in most ways (except the name). We do not use the etcd proxy, and have configured locksmith and Kubernetes to access the etcd cluster directly. However, we have still configured etcd-proxy and just never bothered making it work (largely because we want end-to-end encryption and certificate authentication). I suspect an ever-failing etcd-member.service isn't helping the systemd state. I will disable it on one of the test boxen and see.

I also noticed something alarming today: I wanted to enable debugging on sshd and needed to do a daemon-reload of systemd. "systemctl daemon-reload" triggered a "No buffer space" error and required two attempts to complete. I suspect this speaks to a really bad state for systemd.

The more I look at this, the more it seems like sshd.socket is just a symptom of a bigger problem. I will keep investigating.
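A short diagnostic sketch for narrowing down whether systemd itself is wedged; the unit names below are simply the ones mentioned in this thread:

```
# Any permanently failed units (e.g. an ever-failing etcd-member.service)?
systemctl list-units --state=failed

# Current state of the socket unit and how many connections it has accepted.
systemctl status sshd.socket

# Does a manager reload complete cleanly, or does it error out
# (as with the "No buffer space" message above)?
sudo systemctl daemon-reload

# Recent log lines for the socket unit.
journalctl -u sshd.socket --since -1h
```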
Hi, I have the exact same issue on many bare-metal hosts too. [edit] Those hosts are Kubernetes nodes.
It's very likely linked to the limit bumped here: systemd/systemd@5ddda46. Reproduction steps to try (I can't right now): systemd/systemd#4068
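If the bumped limit is the socket unit's activation rate limit (TriggerLimitIntervalSec= and TriggerLimitBurst=, available since systemd 230), which is an assumption here since the linked commit isn't quoted, a drop-in could relax it on affected nodes without replacing the shipped unit:

```
# Hypothetical drop-in: relax the activation rate limit on sshd.socket.
# Only relevant if the bumped limit is in fact the trigger rate limit.
sudo mkdir -p /etc/systemd/system/sshd.socket.d
sudo tee /etc/systemd/system/sshd.socket.d/10-trigger-limit.conf <<'EOF'
[Socket]
TriggerLimitIntervalSec=1s
TriggerLimitBurst=10000
EOF
sudo systemctl daemon-reload
sudo systemctl restart sshd.socket
```

Reverting is just a matter of deleting the drop-in and reloading.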
Same issue on my side: the SSH socket doesn't start sshd. I have this situation on all 12 nodes of a Kubernetes cluster deployed a month ago. journalctl is full of
I am wondering how Kubernetes / kubelet can still run properly under this condition, but so far it seems to keep running.
@johnmarcou I think it works by retrying enough. You can try to run a
I will try that next time, thanks for the trick.
Issue Report
Bug
Container Linux Version
Environment
Hosted on VMware.
Expected Behavior
SSH connections work fine, with the sshd daemon triggered on demand by sshd.socket.
Actual Behavior
Connection refused (port closed).
Reproduction Steps
Unfortunately, I can't tie this to anything specific. On some of my machines, sshd.socket just stops working after a while.
Other Information
sshd.socket has systemd listen on port 22 and wait for incoming connections. There is an alternative sshd.service, but it's disabled.
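On stock Container Linux this socket-activated setup pairs sshd.socket (which owns port 22) with per-connection sshd@.service instances; assuming those stock unit names, the exact wiring on a node can be inspected like this:

```
# Show the socket unit that listens on port 22.
systemctl cat sshd.socket

# Show the per-connection sshd template it activates.
systemctl cat sshd@.service

# List any per-connection instances currently running.
systemctl list-units 'sshd@*'
```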
After a while, some of my machines, for unknown reasons, just lose their sshd capabilities.
If I try to SSH to them, I can see the following message in journalctl:
These machines are Kubernetes nodes running Docker and a rather old calico-node version. It is usual for them to have 70+ network interfaces; I wonder if this might be messing with D-Bus.
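A quick way to gauge whether the interface count is a plausible factor; these are generic checks, not specific to this bug:

```
# How many network interfaces does the node have right now?
ip -o link | wc -l

# Each interface is tracked by systemd as a .device unit, so a large or
# churning interface count also inflates systemd's unit list.
systemctl list-units --type=device --all | wc -l

# Is D-Bus itself healthy?
systemctl status dbus.service
busctl status
```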