This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

sshd.socket stops working after a while #2181

Open
ghost opened this issue Oct 4, 2017 · 9 comments

Comments

@ghost

ghost commented Oct 4, 2017

Issue Report

Bug

Container Linux Version

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1465.6.0
VERSION_ID=1465.6.0
BUILD_ID=2017-08-16-0012
PRETTY_NAME="Container Linux by CoreOS 1465.6.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"


Environment

Hosted on VMware.

Expected Behavior

SSH connections work, with the sshd daemon triggered on demand by sshd.socket.

Actual Behavior

Connections are refused (the port appears closed).

Reproduction Steps

Unfortunately, I can't tie this to anything specific. On some of my machines, sshd.socket just stops working after a while.

Other Information

sshd.socket has systemd listen on port 22 and wait for incoming connections. There is an alternative sshd.service, but it is disabled.
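For context, the socket-activated setup looks roughly like this; this is a sketch from memory, not the exact unit files shipped with Container Linux:

# sshd.socket: systemd owns port 22 and accepts connections itself
[Socket]
ListenStream=22
Accept=yes

# sshd@.service: a template unit, one instance spawned per accepted connection
[Service]
ExecStart=-/usr/sbin/sshd -i
StandardInput=socket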

After a while, some of my machines, for unknown reasons, just lose their sshd capabilities.

If I try to SSH to them, I can see the following message on journalctl:

# journalctl -fu sshd.socket
...
systemd[1]: sshd.socket: Failed to queue service startup job (Maybe the service file is missing or not a template unit?): Argument list too long
systemd[1]: sshd.socket: Unit entered failed state.
systemd[1]: sshd.socket: Failed to listen on sockets: Argument list too long
systemd[1]: Failed to listen on OpenSSH Server Socket.
systemd[1]: Failed to listen on OpenSSH Server Socket.

These machines are Kubernetes nodes using Docker and a rather old calico-node version. It is usual for them to have 70+ network interfaces; I wonder if this might be messing with dbus.
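A couple of quick checks I use to get a feel for how much state systemd and dbus are tracking on these nodes (just diagnostics, not proof of anything):

# count network interfaces on the node
ip -o link | wc -l
# count units systemd is currently tracking (interface/mount churn shows up here)
systemctl list-units --all --no-legend | wc -l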

@ghost changed the title from "sshd.socket stops working after" to "sshd.socket stops working after a while" on Oct 5, 2017
@euank
Contributor

euank commented Oct 11, 2017

Would switching your machines to use a non-socket-activated sshd, per the documentation here, work for you?
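Roughly, the switch looks like the following; treat this as a sketch and prefer the exact steps in the linked docs (on provisioning you would normally do this via a Container Linux Config instead):

# stop handing port 22 to systemd and run the standalone daemon instead
systemctl disable --now sshd.socket
systemctl enable --now sshd.service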

We've been meaning to switch the default for a while (#966). I think as a result of this, and some other problems we've seen, we'll re-prioritize switching away from socket-activated sshd.

@ghost
Author

ghost commented Oct 12, 2017

@euank, sure, I already made the switch on some nodes that are more problematic than others. Thanks a lot!

I published this because I spent a good couple of days trying to pinpoint the exact cause of that weird shell-like 'too many arguments' message, but failed horribly.

I just thought I was bringing something new, but I guess my searching skills just plain suck!

@euank
Contributor

euank commented Oct 12, 2017

@mrrandrade I haven't seen that specific error from the sshd socket before. Thanks for filing the issue!

@KristianLyng

We are experiencing this as well. We can switch to sshd.service, but I'm not particularly happy about doing that without figuring out the root cause.

Saying that "systemd is in a broken state" just makes me want to run away and start a life as a gardener or something.

I've tried to get to the bottom of this, but I haven't had any success, partially because reproducing this issue takes a day. But what I have noticed is:

We are only experiencing this on our test cluster, which is identical to our production clusters in most ways (except the name).

We do not use the etcd proxy, and have configured locksmith and Kubernetes to access the etcd cluster directly. However, we still have etcd-proxy configured; we just never bothered making it work (largely because we want end-to-end encryption and certificate authentication). I suspect an ever-failing etcd-member.service isn't helping systemd's state. I will disable it on one of the test boxen and see.

I also noticed something alarming today. I wanted to enable debugging on sshd, which required a daemon-reload of systemd; "systemctl daemon-reload" triggered a "No buffer space" error and took two attempts to complete. I suspect this points to systemd being in a really bad state.

The more I look at this, the more it is starting to seem like sshd.socket is just a symptom of a bigger problem. I will keep investigating.

@mcluseau

mcluseau commented Jan 4, 2018

Hi, I have the exact same issue on many bare-metal hosts too (sshd.socket can't be started, and systemctl daemon-reload throws "No buffer space available").

[edit] those hosts are Kubernetes nodes

@mcluseau

mcluseau commented Jan 4, 2018

It's very likely linked to the limit bumped here: systemd/systemd@5ddda46

Reproduction steps to try (I can't right now): systemd/systemd#4068

@bgilbert added the jira label (Makes a copy of an issue onto a Jira card.) and removed the maintenance label on Jun 14, 2018
@johnmarcou

johnmarcou commented Jul 17, 2018

Same issue on Container Linux by CoreOS stable (1745.4.0), same symptoms.

The SSH socket doesn't start sshd.
On the local console, systemctl gives "Failed to list units: No buffer space available". Disk usage and memory usage are both below 10%.

I have this situation on all 12 nodes of a Kubernetes cluster, deployed a month ago.

journalctl is full of "Argument list too long":

Jul 17 01:04:51 pm-09 systemd[1]: Failed to set up mount unit: Argument list too long
Jul 17 01:04:51 pm-09 systemd[1]: Failed to set up mount unit: Argument list too long
Jul 17 01:04:51 pm-09 systemd[1]: Failed to set up mount unit: Argument list too long
Jul 17 01:04:51 pm-09 systemd[1]: Failed to set up mount unit: Argument list too long
Jul 17 01:04:51 pm-09 systemd[1]: Failed to set up mount unit: Argument list too long
Jul 17 01:04:51 pm-09 systemd[1]: Failed to set up mount unit: Argument list too long
Jul 17 01:04:51 pm-09 systemd-logind[1645]: Failed to start user slice user-0.slice, ignoring: Argument list too long (System.Error.E2BIG)
Jul 17 01:04:51 pm-09 systemd-logind[1645]: Failed to start user service, ignoring: Argument list too long
Jul 17 01:04:51 pm-09 systemd-logind[1645]: Failed to start session scope session-12.scope: Argument list too long
Jul 17 01:04:51 pm-09 su[54787]: pam_systemd(su:session): Failed to create session: Argument list too long
Jul 17 01:04:51 pm-09 systemd-logind[1645]: Failed to stop user service: Argument list too long
Jul 17 01:04:51 pm-09 systemd-logind[1645]: Failed to stop user slice: Argument list too long
Jul 17 01:04:51 pm-09 systemd[1]: Failed to set up mount unit: Invalid argument
Jul 17 01:04:51 pm-09 systemd[1]: Failed to set up mount unit: Invalid argument
Jul 17 01:04:51 pm-09 systemd[1]: Failed to set up mount unit: Invalid argument
Jul 17 01:04:51 pm-09 systemd[1]: Failed to set up mount unit: Argument list too long
Jul 17 01:04:51 pm-09 systemd[1]: Failed to set up mount unit: Argument list too long
Jul 17 01:04:51 pm-09 systemd[1]: Failed to set up mount unit: Argument list too long
Jul 17 01:04:51 pm-09 systemd[1]: Failed to set up mount unit: Argument list too long

I am wondering how Kubernetes / kubelet can still run properly under these conditions, but so far it seems to.

@mcluseau

@johnmarcou I think it works by retrying enough. You can try running "while sleep 1; do systemctl daemon-reload; done", which should get a slot in the wait queue at some point.
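A slightly friendlier variant of the same idea, stopping as soon as a reload succeeds (untested in the broken state, so take it as a sketch):

until systemctl daemon-reload; do sleep 1; done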

@johnmarcou

I will try that next time, thanks for the trick.
