
[0.10.0] Nodes don't re-join cluster after a cluster-wide service restart #5464

Closed
rossmcdonald opened this issue Jan 27, 2016 · 5 comments

@rossmcdonald (Contributor)

Version:

version 0.10.0.n1453795235, branch master, commit 9d3c9329a69c1dbd2ed7058acd8c2711c781d571

When a service start is followed shortly (< 5 seconds) by a service restart on the cluster seed node, the meta service seems to have trouble shutting down cleanly. Log output:

[run] 2016/01/27 20:33:33 Signal received, initializing clean shutdown...
[run] 2016/01/27 20:33:33 Waiting for clean shutdown...
[cluster] 2016/01/27 20:33:33 cluster service accept error: network connection closed
[snapshot] 2016/01/27 20:33:33 snapshot listener closed
[copier] 2016/01/27 20:33:33 copier listener closed
[shard-precreation] 2016/01/27 20:33:33 Precreation service terminating
[continuous_querier] 2016/01/27 20:33:33 continuous query service terminating
[retention] 2016/01/27 20:33:33 retention policy enforcement terminating
[monitor] 2016/01/27 20:33:33 shutting down monitor system
[monitor] 2016/01/27 20:33:33 terminating storage of statistics
[handoff] 2016/01/27 20:33:33 shutting down hh service
[subscriber] 2016/01/27 20:33:33 closed service
[meta] 2016/01/27 20:33:33 server closed
[meta] 2016/01/27 20:33:33 172.28.128.49 - - [27/Jan/2016:20:33:31 +0000] GET /?index=9 HTTP/1.1 500 25 - Go 1.1 package http 3b01e799-c535-11e5-801b-000000000000 1.605537495s
[meta] 2016/01/27 20:33:33 server closed
[meta] 2016/01/27 20:33:33 127.0.0.1 - - [27/Jan/2016:20:33:31 +0000] GET /?index=9 HTTP/1.1 500 25 - Go 1.1 package http 3b01f350-c535-11e5-801c-000000000000 1.605878792s
[meta] 2016/01/27 20:33:33 server closed
[meta] 2016/01/27 20:33:33 172.28.128.48 - - [27/Jan/2016:20:33:31 +0000] GET /?index=9 HTTP/1.1 500 25 - Go 1.1 package http 3b01f462-c535-11e5-801d-000000000000 1.607085805s
[metaclient] 2016/01/27 20:33:33 failure getting snapshot from influx1:8091: meta server returned non-200: 500 Internal Server Error
[run] 2016/01/27 20:33:33 server shutdown completed
2016/01/27 20:33:34 InfluxDB starting, version 0.10.0.n1453795235, branch master, commit 9d3c9329a69c1dbd2ed7058acd8c2711c781d571, built 2016-01-26T08:01:53.793059

The meta service is still receiving requests after it has supposedly stopped (and returning 500s). Once this issue occurs, the service fails to start properly afterwards, stopping at:

2016/01/27 20:33:34 InfluxDB starting, version 0.10.0.n1453795235, branch master, commit 9d3c9329a69c1dbd2ed7058acd8c2711c781d571, built 2016-01-26T08:01:53.793059
2016/01/27 20:33:34 Go version go1.4.3, GOMAXPROCS set to 1
2016/01/27 20:33:34 Using configuration at: /etc/influxdb/influxdb.conf
[meta] 2016/01/27 20:33:34 Starting meta service
[meta] 2016/01/27 20:33:34 Listening on HTTP: [::]:8091
[metastore] 2016/01/27 20:33:34 Using data dir: /var/lib/influxdb/meta
[metastore] 2016/01/27 20:33:34 Node at influx1:8088 [Follower]

The service never binds to port 8086.
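
For reference, the sequence that triggers this on the seed node is roughly the following (a sketch against the stock init script; the exact service name and timing may vary on other setups):

sudo service influxdb start
sleep 3    # anything under ~5 seconds reproduces the issue
sudo service influxdb restart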

@rossmcdonald (Contributor, Author)

/cc @dgnorton @corylanou

@rossmcdonald (Contributor, Author)

/cc @e-dard

More information on this issue: I believe this actually has more to do with a service restart across every member of the cluster. If I run a service restart against a single node, that node re-joins the cluster with no issues. If I restart all nodes in the cluster at once, I can't get the cluster to re-form.
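
By "restart all nodes at once" I mean roughly the following (a sketch; influx2 and influx3 are placeholder hostnames for the two non-seed nodes):

for host in influx1 influx2 influx3; do
    ssh "$host" sudo service influxdb restart &
done
wait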

On my three-node cluster, the logs on nodes 2 and 3 (the non-seed nodes) stop at:

[run] 2016/02/02 16:32:35 server shutdown completed
2016/02/02 16:40:45 InfluxDB starting, version 0.10.0.n1453795235, branch master, commit 9d3c9329a69c1dbd2ed7058acd8c2711c781d571, built 2016-01-26T08:01:53.793059
2016/02/02 16:40:45 Go version go1.4.3, GOMAXPROCS set to 1
2016/02/02 16:40:45 Using configuration at: /etc/influxdb/influxdb.conf
[meta] 2016/02/02 16:40:45 Starting meta service
[meta] 2016/02/02 16:40:45 Listening on HTTP: [::]:8091
[metastore] 2016/02/02 16:40:45 Using data dir: /var/lib/influxdb/meta

On the seed node, the logs stop at:

2016/02/02 16:33:10 InfluxDB starting, version 0.10.0.n1453795235, branch master, commit 9d3c9329a69c1dbd2ed7058acd8c2711c781d571, built 2016-01-26T08:01:53.793059
2016/02/02 16:33:10 Go version go1.4.3, GOMAXPROCS set to 1
2016/02/02 16:33:10 Using configuration at: /etc/influxdb/influxdb.conf
[meta] 2016/02/02 16:33:10 Starting meta service
[meta] 2016/02/02 16:33:10 Listening on HTTP: [::]:8091
[metastore] 2016/02/02 16:33:10 Using data dir: /var/lib/influxdb/meta
[metastore] 2016/02/02 16:33:10 Node at influx1:8088 [Follower]

The service on all three servers then sits idle, never continuing with the rest of the startup process.
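
One quick way to confirm a node is hung at this point (a sketch, assuming the default HTTP API port of 8086):

# the HTTP API listener never comes up
sudo netstat -tlnp | grep 8086        # returns nothing
curl -s http://localhost:8086/ping    # connection refused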

Prior to the service restart, these messages start appearing on the non-seed nodes:

[cluster] 2016/02/02 16:27:12 accept remote connection from 172.28.128.92:34294
[metaclient] 2016/02/02 16:27:43 failure getting snapshot from 172.28.128.92:8091: meta server returned non-200: 500 Internal Server Error
[cluster] 2016/02/02 16:27:43 close remote connection from 172.28.128.92:34294
[write] 2016/02/02 16:27:49 write failed for shard 3 on node 2: read message type: EOF
[monitor] 2016/02/02 16:27:49 failed to store statistics: write failed: read message type: EOF
[meta] 2016/02/02 16:27:53 172.28.128.93 - - [02/Feb/2016:16:27:44 +0000] GET /?index=39 HTTP/1.1 200 204 - Go 1.1 package http e3564126-c9c9-11e5-800c-000000000000 9.485829756s
[cluster] 2016/02/02 16:27:54 accept remote connection from 172.28.128.92:34300
[metaclient] 2016/02/02 16:27:54 failure getting snapshot from 172.28.128.92:8091: meta server returned non-200: 500 Internal Server Error
[cluster] 2016/02/02 16:27:54 close remote connection from 172.28.128.92:34300
[write] 2016/02/02 16:27:59 write failed for shard 3 on node 2: read message type: EOF
[monitor] 2016/02/02 16:27:59 failed to store statistics: write failed: read message type: EOF
[meta] 2016/02/02 16:28:08 172.28.128.93 - - [02/Feb/2016:16:27:55 +0000] GET /?index=41 HTTP/1.1 200 204 - Go 1.1 package http ea2be387-c9c9-11e5-800d-000000000000 12.594775415s
[cluster] 2016/02/02 16:28:08 accept remote connection from 172.28.128.92:34307
[run] 2016/02/02 16:32:34 Signal received, initializing clean shutdown...
[run] 2016/02/02 16:32:34 Waiting for clean shutdown...
[copier] 2016/02/02 16:32:34 copier listener closed
[cluster] 2016/02/02 16:32:34 cluster service accept error: network connection closed
[snapshot] 2016/02/02 16:32:34 snapshot listener closed
[cluster] 2016/02/02 16:32:34 unable to read type-length-value read message type: read tcp 172.28.128.93:49875: use of closed network connection
[cluster] 2016/02/02 16:32:34 close remote connection from 172.28.128.93:49875
[cluster] 2016/02/02 16:32:34 unable to read type-length-value read message type: read tcp 172.28.128.92:34307: use of closed network connection
[cluster] 2016/02/02 16:32:34 close remote connection from 172.28.128.92:34307
[shard-precreation] 2016/02/02 16:32:34 Precreation service terminating
[continuous_querier] 2016/02/02 16:32:34 continuous query service terminating
[retention] 2016/02/02 16:32:34 retention policy enforcement terminating
[monitor] 2016/02/02 16:32:34 shutting down monitor system
[monitor] 2016/02/02 16:32:34 terminating storage of statistics
[handoff] 2016/02/02 16:32:34 shutting down hh service
[metaclient] 2016/02/02 16:32:34 failure getting snapshot from 172.28.128.92:8091: meta server returned non-200: 500 Internal Server Error
[subscriber] 2016/02/02 16:32:35 closed service
[run] 2016/02/02 16:32:35 server shutdown completed

@rossmcdonald rossmcdonald changed the title [0.10.0] Meta service doesn't shut down correctly [0.10.0] Nodes don't re-join cluster after a cluster-wide service restart Feb 2, 2016
@jwilder jwilder modified the milestone: 0.11.0 Feb 3, 2016
@rossmcdonald (Contributor, Author)

@grvr If you restart the nodes individually, are they still unable to cluster properly? If so, can you paste the contents of the logs since the last restart?

@corylanou (Contributor)

This should be fixed by PR #5602.

@rossmcdonald (Contributor, Author)

Closing, as this was fixed by #5602.
