
[0.10.0] Nodes don't re-join cluster after a cluster-wide service restart #5464

Closed
rossmcdonald opened this issue Jan 27, 2016 · 5 comments

@rossmcdonald (Contributor)

Version:

version 0.10.0.n1453795235, branch master, commit 9d3c9329a69c1dbd2ed7058acd8c2711c781d571

When a service start is followed shortly (< 5 seconds) by a service restart on the cluster seed node, the meta service seems to have trouble shutting down cleanly. Log output:

[run] 2016/01/27 20:33:33 Signal received, initializing clean shutdown...
[run] 2016/01/27 20:33:33 Waiting for clean shutdown...
[cluster] 2016/01/27 20:33:33 cluster service accept error: network connection closed
[snapshot] 2016/01/27 20:33:33 snapshot listener closed
[copier] 2016/01/27 20:33:33 copier listener closed
[shard-precreation] 2016/01/27 20:33:33 Precreation service terminating
[continuous_querier] 2016/01/27 20:33:33 continuous query service terminating
[retention] 2016/01/27 20:33:33 retention policy enforcement terminating
[monitor] 2016/01/27 20:33:33 shutting down monitor system
[monitor] 2016/01/27 20:33:33 terminating storage of statistics
[handoff] 2016/01/27 20:33:33 shutting down hh service
[subscriber] 2016/01/27 20:33:33 closed service
[meta] 2016/01/27 20:33:33 server closed
[meta] 2016/01/27 20:33:33 172.28.128.49 - - [27/Jan/2016:20:33:31 +0000] GET /?index=9 HTTP/1.1 500 25 - Go 1.1 package http 3b01e799-c535-11e5-801b-000000000000 1.605537495s
[meta] 2016/01/27 20:33:33 server closed
[meta] 2016/01/27 20:33:33 127.0.0.1 - - [27/Jan/2016:20:33:31 +0000] GET /?index=9 HTTP/1.1 500 25 - Go 1.1 package http 3b01f350-c535-11e5-801c-000000000000 1.605878792s
[meta] 2016/01/27 20:33:33 server closed
[meta] 2016/01/27 20:33:33 172.28.128.48 - - [27/Jan/2016:20:33:31 +0000] GET /?index=9 HTTP/1.1 500 25 - Go 1.1 package http 3b01f462-c535-11e5-801d-000000000000 1.607085805s
[metaclient] 2016/01/27 20:33:33 failure getting snapshot from influx1:8091: meta server returned non-200: 500 Internal Server Error
[run] 2016/01/27 20:33:33 server shutdown completed
2016/01/27 20:33:34 InfluxDB starting, version 0.10.0.n1453795235, branch master, commit 9d3c9329a69c1dbd2ed7058acd8c2711c781d571, built 2016-01-26T08:01:53.793059

The meta service is still receiving requests after it has supposedly stopped (and returning 500s). Once this issue occurs, the service fails to start properly afterwards, stopping at:

2016/01/27 20:33:34 InfluxDB starting, version 0.10.0.n1453795235, branch master, commit 9d3c9329a69c1dbd2ed7058acd8c2711c781d571, built 2016-01-26T08:01:53.793059
2016/01/27 20:33:34 Go version go1.4.3, GOMAXPROCS set to 1
2016/01/27 20:33:34 Using configuration at: /etc/influxdb/influxdb.conf
[meta] 2016/01/27 20:33:34 Starting meta service
[meta] 2016/01/27 20:33:34 Listening on HTTP: [::]:8091
[metastore] 2016/01/27 20:33:34 Using data dir: /var/lib/influxdb/meta
[metastore] 2016/01/27 20:33:34 Node at influx1:8088 [Follower]

The service never binds to port 8086.
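
For reference, the sequence that triggers this on the seed node is roughly the following (a sketch against the stock init script; the exact service name and timing may vary on other setups):

sudo service influxdb start
sleep 3    # anything under ~5 seconds reproduces the issue
sudo service influxdb restart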

@rossmcdonald (Contributor, Author)

/cc @dgnorton @corylanou

@rossmcdonald (Contributor, Author)

/cc @e-dard

More information on this issue: I believe this actually has more to do with a service restart across every member of the cluster. If I run a service restart against a single node, that node re-joins the cluster with no issues. If I restart all nodes in the cluster at once, I can't get the cluster to re-form.
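
By "restart all nodes at once" I mean roughly the following (a sketch; influx2 and influx3 are placeholder hostnames for the two non-seed nodes):

for host in influx1 influx2 influx3; do
    ssh "$host" sudo service influxdb restart &
done
wait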

On my three-node cluster, the logs on nodes 2 and 3 (the non-seed nodes) stop at:

[run] 2016/02/02 16:32:35 server shutdown completed
2016/02/02 16:40:45 InfluxDB starting, version 0.10.0.n1453795235, branch master, commit 9d3c9329a69c1dbd2ed7058acd8c2711c781d571, built 2016-01-26T08:01:53.793059
2016/02/02 16:40:45 Go version go1.4.3, GOMAXPROCS set to 1
2016/02/02 16:40:45 Using configuration at: /etc/influxdb/influxdb.conf
[meta] 2016/02/02 16:40:45 Starting meta service
[meta] 2016/02/02 16:40:45 Listening on HTTP: [::]:8091
[metastore] 2016/02/02 16:40:45 Using data dir: /var/lib/influxdb/meta

On the seed node, the logs stop at:

2016/02/02 16:33:10 InfluxDB starting, version 0.10.0.n1453795235, branch master, commit 9d3c9329a69c1dbd2ed7058acd8c2711c781d571, built 2016-01-26T08:01:53.793059
2016/02/02 16:33:10 Go version go1.4.3, GOMAXPROCS set to 1
2016/02/02 16:33:10 Using configuration at: /etc/influxdb/influxdb.conf
[meta] 2016/02/02 16:33:10 Starting meta service
[meta] 2016/02/02 16:33:10 Listening on HTTP: [::]:8091
[metastore] 2016/02/02 16:33:10 Using data dir: /var/lib/influxdb/meta
[metastore] 2016/02/02 16:33:10 Node at influx1:8088 [Follower]

The service on all three servers then sits idle, never continuing with the rest of the startup process.
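
One quick way to confirm a node is hung at this point (a sketch, assuming the default HTTP API port of 8086):

# the HTTP API listener never comes up
sudo netstat -tlnp | grep 8086        # returns nothing
curl -s http://localhost:8086/ping    # connection refused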

Prior to the service restart, these messages start appearing on the non-seed nodes:

[cluster] 2016/02/02 16:27:12 accept remote connection from 172.28.128.92:34294
[metaclient] 2016/02/02 16:27:43 failure getting snapshot from 172.28.128.92:8091: meta server returned non-200: 500 Internal Server Error
[cluster] 2016/02/02 16:27:43 close remote connection from 172.28.128.92:34294
[write] 2016/02/02 16:27:49 write failed for shard 3 on node 2: read message type: EOF
[monitor] 2016/02/02 16:27:49 failed to store statistics: write failed: read message type: EOF
[meta] 2016/02/02 16:27:53 172.28.128.93 - - [02/Feb/2016:16:27:44 +0000] GET /?index=39 HTTP/1.1 200 204 - Go 1.1 package http e3564126-c9c9-11e5-800c-000000000000 9.485829756s
[cluster] 2016/02/02 16:27:54 accept remote connection from 172.28.128.92:34300
[metaclient] 2016/02/02 16:27:54 failure getting snapshot from 172.28.128.92:8091: meta server returned non-200: 500 Internal Server Error
[cluster] 2016/02/02 16:27:54 close remote connection from 172.28.128.92:34300
[write] 2016/02/02 16:27:59 write failed for shard 3 on node 2: read message type: EOF
[monitor] 2016/02/02 16:27:59 failed to store statistics: write failed: read message type: EOF
[meta] 2016/02/02 16:28:08 172.28.128.93 - - [02/Feb/2016:16:27:55 +0000] GET /?index=41 HTTP/1.1 200 204 - Go 1.1 package http ea2be387-c9c9-11e5-800d-000000000000 12.594775415s
[cluster] 2016/02/02 16:28:08 accept remote connection from 172.28.128.92:34307
[run] 2016/02/02 16:32:34 Signal received, initializing clean shutdown...
[run] 2016/02/02 16:32:34 Waiting for clean shutdown...
[copier] 2016/02/02 16:32:34 copier listener closed
[cluster] 2016/02/02 16:32:34 cluster service accept error: network connection closed
[snapshot] 2016/02/02 16:32:34 snapshot listener closed
[cluster] 2016/02/02 16:32:34 unable to read type-length-value read message type: read tcp 172.28.128.93:49875: use of closed network connection
[cluster] 2016/02/02 16:32:34 close remote connection from 172.28.128.93:49875
[cluster] 2016/02/02 16:32:34 unable to read type-length-value read message type: read tcp 172.28.128.92:34307: use of closed network connection
[cluster] 2016/02/02 16:32:34 close remote connection from 172.28.128.92:34307
[shard-precreation] 2016/02/02 16:32:34 Precreation service terminating
[continuous_querier] 2016/02/02 16:32:34 continuous query service terminating
[retention] 2016/02/02 16:32:34 retention policy enforcement terminating
[monitor] 2016/02/02 16:32:34 shutting down monitor system
[monitor] 2016/02/02 16:32:34 terminating storage of statistics
[handoff] 2016/02/02 16:32:34 shutting down hh service
[metaclient] 2016/02/02 16:32:34 failure getting snapshot from 172.28.128.92:8091: meta server returned non-200: 500 Internal Server Error
[subscriber] 2016/02/02 16:32:35 closed service
[run] 2016/02/02 16:32:35 server shutdown completed

@rossmcdonald rossmcdonald changed the title [0.10.0] Meta service doesn't shut down correctly [0.10.0] Nodes don't re-join cluster after a cluster-wide service restart Feb 2, 2016
@jwilder jwilder modified the milestone: 0.11.0 Feb 3, 2016
@rossmcdonald (Contributor, Author)

@grvr If you restart the nodes individually, are they still unable to cluster properly? If so, can you paste the contents of the logs since the last restart?

@corylanou (Contributor)

This should be fixed by PR #5602.

@rossmcdonald (Contributor, Author)

Closing, as this was fixed by #5602.
