Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consul is dying from SIGPIPE under systemd with no log output #1688

Closed
lattwood opened this issue Feb 5, 2016 · 15 comments
Closed

consul is dying from SIGPIPE under systemd with no log output #1688

lattwood opened this issue Feb 5, 2016 · 15 comments

Comments

@lattwood
Copy link

lattwood commented Feb 5, 2016

We're having this happen in a multi-vm vagrant environment backed by the vmware-fusion plugin randomly. This happened on 0.5.0, and 0.6.0 on CentOS 7. Currently in the process of upgrading to 0.6.3 locally, but I don't have any way to reliably reproduce it.

I noticed there's nothing in the signal handing code for dealing with SIGPIPE.

func (c *Command) handleSignals(config *Config, retryJoin <-chan struct{}, retryJoinWan <-chan struct{}) int {

@lattwood
Copy link
Author

lattwood commented Feb 5, 2016

Update: Still happens with 0.6.3

@lattwood
Copy link
Author

lattwood commented Feb 5, 2016

Here's the output from strace when it happens. Looks like systemd is closing stdout on consul and consul isn't handling it well.

[pid  2820] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  2822] write(19, "\t\202\244Algo\0\243Buf\332\2F\6\6\221R\362&\215\233a\245\234\274!Sf\316\0"..., 597) = 597
[pid  2822] write(1, "    2016/02/05 15:25:32 [INFO] s"..., 90) = -1 EPIPE (Broken pipe)
[pid  2822] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2819, si_uid=999} ---
[pid  2822] rt_sigreturn()              = -1 EPIPE (Broken pipe)
[pid  2822] write(1, "\n", 1)           = -1 EPIPE (Broken pipe)
[pid  2822] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2819, si_uid=999} ---
[pid  2822] rt_sigreturn()              = -1 EPIPE (Broken pipe)
[pid  2822] rt_sigaction(SIGPIPE, {SIG_DFL, ~[], SA_RESTORER|SA_STACK|SA_SIGINFO, 0x4635e0},  <unfinished ...>
[pid  2825] <... epoll_wait resumed> {{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=3547076704, u64=140255704127584}}}, 128, -1) = 1
[pid  2822] <... rt_sigaction resumed> NULL, 8) = 0
[pid  2820] <... select resumed> )      = 0 (Timeout)
[pid  2825] epoll_wait(5,  <unfinished ...>
[pid  2822] gettid()                    = 2822
[pid  2820] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  2822] tkill(2822, SIGPIPE)        = 0
[pid  2822] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_TKILL, si_pid=2819, si_uid=999} ---
[pid  3667] +++ killed by SIGPIPE +++
[pid  2825] +++ killed by SIGPIPE +++
[pid  2824] +++ killed by SIGPIPE +++
[pid  2823] +++ killed by SIGPIPE +++
[pid  2822] +++ killed by SIGPIPE +++
[pid  2821] +++ killed by SIGPIPE +++
[pid  2820] +++ killed by SIGPIPE +++
+++ killed by SIGPIPE +++

@lattwood
Copy link
Author

lattwood commented Feb 5, 2016

Figured out the "cause", IMO it should be worked around in consul/golang. The new docker RPMs for EL7 required an update of systemd. This makes journald restart. EL7's journald doesn't do the restart gracefully. There's a fix for that in systemd that hasn't been backported: https://bugs.freedesktop.org/show_bug.cgi?id=84923#c9

Then it was harder to track down, because consul was still letting other servers join the cluster. That's fixed by this golang/go#11845

Man, I thought today was Friday, not Monday.

(resolution, make sure yum update is run before starting consul)

@slackpad
Copy link
Contributor

slackpad commented Feb 5, 2016

Hi @lattwood if I understand what happened, our next release on Go 1.6 should fix this by causing Consul to exit immediately. Does that seem correct? Thanks!

@lattwood
Copy link
Author

lattwood commented Feb 6, 2016

According to the change https://go-review.googlesource.com/#/c/18151/ you'll need to handle SIGPIPE or die with 1.6. Should a loss of stdout (and the ability to log) result in a crash?

I'm looking into what's involved in getting the systemd patch backported, so journald doesn't close every fd on reload.

@slackpad
Copy link
Contributor

slackpad commented Feb 6, 2016

It's kind of a weird situation and letting Consul die and restart (assuming you have a process monitor) will probably heal the problem. I'd hate to silently lose logging and soldier on.

@lattwood
Copy link
Author

lattwood commented Feb 6, 2016

Unfortunately we're using expect 1

@lattwood
Copy link
Author

lattwood commented Feb 6, 2016

(Dev environment)

@justinclayton
Copy link
Contributor

We're running into this too. It should be noted that we do set Restart=on-failure, but systemd doesn't seem to be considering it failed enough to restart. Here's our output if it's useful:

# /bin/systemctl status  consul.service
● consul.service - Consul Agent
   Loaded: loaded (/etc/systemd/system/consul.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2016-02-10 19:34:19 UTC; 5min ago
 Main PID: 11979 (code=killed, signal=PIPE)
 ...

@joemiller
Copy link

@justinclayton SIGPIPE is one of the 4 signals that systemd considers as OK when Restart=on-failure is set. You could use restart=always here.

https://www.freedesktop.org/software/systemd/man/systemd.exec.html

@johnjelinek
Copy link

johnjelinek commented Apr 22, 2016

I'm getting this too on 0.6.4, is modifying the systemd service file to Restart=always the recommended solution?

tarjei added a commit to tarjei/puppet-prometheus that referenced this issue May 8, 2016
See here for a discussion on some of the issues: hashicorp/consul#1688 (comment) 

I got this problem running Centos 7 with Prom 0.18.0.
@jameshartig
Copy link
Contributor

It looks like you could also use RestartForceExitStatus=SIGPIPE to cause a restart if a SIGPIPE is received. See: https://www.freedesktop.org/software/systemd/man/systemd.service.html

@slackpad
Copy link
Contributor

I'm not much of a systemd expert - do we need any changes on the Consul side to help here?

@slackpad
Copy link
Contributor

slackpad commented May 2, 2017

Closing this as we never heard back. Please let us know if you still need help.

@slackpad slackpad closed this as completed May 2, 2017
@ss75710541
Copy link

ss75710541 commented May 23, 2017

I also encountered this problem

Consul v0.7.0

Just start the consul

For the first time to use the curl registered service can be a problem

After the restart the consul, curl registration service is normal

curl -L -X PUT -d '{"id":"mysql-master-192.168.1.242","name": "mysql-master","address": "192.168.1.242" ,"tags": ["mysql"],"port": 3306, "check": {"script": "mysql -h192.168.1.242 --port=3306 --user=root --password=123456 -e \"show databases\" |grep \"mysql\"|| exit 2", "interval": "10s","timeout": "1s"}}' 192.168.1.242:8500/v1/agent/service/register

curl: (52) Empty reply from server

Redirecting to /bin/systemctl status  consul.service
● consul.service - consul agent
   Loaded: loaded (/etc/systemd/system/consul.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

May 23 16:02:46 localhost.localdomain consul[3327]: 2017/05/23 16:02:46 [INFO] consul: Adding WAN server localhost.localdomain.dataman (Addr: tcp/192.168.1.242:8300) (DC: dataman)
May 23 16:02:49 localhost.localdomain consul[3327]: ==> Failed to check for updates: Get https://checkpoint-api.hashicorp.com/v1/check/consul?arch=amd64&os=linux&signature=7904c3...tion refused
May 23 16:02:53 localhost.localdomain consul[3327]: 2017/05/23 16:02:53 [WARN] raft: Heartbeat timeout from "" reached, starting election
May 23 16:02:53 localhost.localdomain consul[3327]: 2017/05/23 16:02:53 [INFO] raft: Node at 192.168.1.242:8300 [Candidate] entering Candidate state in term 2
May 23 16:02:53 localhost.localdomain consul[3327]: 2017/05/23 16:02:53 [INFO] raft: Election won. Tally: 1
May 23 16:02:53 localhost.localdomain consul[3327]: 2017/05/23 16:02:53 [INFO] raft: Node at 192.168.1.242:8300 [Leader] entering Leader state
May 23 16:02:53 localhost.localdomain consul[3327]: 2017/05/23 16:02:53 [INFO] consul: cluster leadership acquired
May 23 16:02:53 localhost.localdomain consul[3327]: 2017/05/23 16:02:53 [INFO] consul: New leader elected: localhost.localdomain
May 23 16:02:53 localhost.localdomain consul[3327]: 2017/05/23 16:02:53 [INFO] consul: member 'localhost.localdomain' joined, marking health alive
May 23 16:02:54 localhost.localdomain consul[3327]: 2017/05/23 16:02:54 [INFO] agent: Synced service 'consul'
Hint: Some lines were ellipsized, use -l to show in full.

service consul start

curl -L -X PUT -d '{"id":"mysql-master-192.168.1.242","name": "mysql-master","address": "192.168.1.242" ,"tags": ["mysql"],"port": 3306, "check": {"script": "mysql -h192.168.1.242 --port=3306 --user=root --password=123456 -e \"show databases\" |grep \"mysql\"|| exit 2", "interval": "10s","timeout": "1s"}}' 192.168.1.242:8500/v1/agent/service/register
Redirecting to /bin/systemctl status  consul.service
● consul.service - consul agent
   Loaded: loaded (/etc/systemd/system/consul.service; disabled; vendor preset: disabled)
   Active: active (running) since Tue 2017-05-23 16:07:43 CST; 4min 21s ago
 Main PID: 7321 (consul)
   Memory: 6.0M
   CGroup: /system.slice/consul.service
           └─7321 /usr/local/sbin/consul agent -server -config-dir=/consul.d -bootstrap-expect 1 -data-dir=/data/consul/consul_data -ui-dir=/data/consul/web_ui -bind=192.168.1.242 -client=19...

May 23 16:11:14 srymaster1 consul[7321]: 2017/05/23 16:11:14 [WARN] agent: Check 'service:mysql-master-192.168.1.242' is now critical
May 23 16:11:15 srymaster1 consul[7321]: 2017/05/23 16:11:15 [WARN] agent: Check 'service:mysql-vip-192.168.1.242' is now critical
May 23 16:11:24 srymaster1 consul[7321]: 2017/05/23 16:11:24 [WARN] agent: Check 'service:mysql-master-192.168.1.242' is now critical
May 23 16:11:25 srymaster1 consul[7321]: 2017/05/23 16:11:25 [WARN] agent: Check 'service:mysql-vip-192.168.1.242' is now critical
May 23 16:11:34 srymaster1 consul[7321]: 2017/05/23 16:11:34 [WARN] agent: Check 'service:mysql-master-192.168.1.242' is now critical
May 23 16:11:35 srymaster1 consul[7321]: 2017/05/23 16:11:35 [WARN] agent: Check 'service:mysql-vip-192.168.1.242' is now critical
May 23 16:11:44 srymaster1 consul[7321]: 2017/05/23 16:11:44 [WARN] agent: Check 'service:mysql-master-192.168.1.242' is now critical
May 23 16:11:45 srymaster1 consul[7321]: 2017/05/23 16:11:45 [WARN] agent: Check 'service:mysql-vip-192.168.1.242' is now critical
May 23 16:11:54 srymaster1 consul[7321]: 2017/05/23 16:11:54 [WARN] agent: Check 'service:mysql-master-192.168.1.242' is now critical
May 23 16:11:55 srymaster1 consul[7321]: 2017/05/23 16:11:55 [WARN] agent: Check 'service:mysql-vip-192.168.1.242' is now critical

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants