[BUG] ES 6.0 - Duplicate Alias after rollover timeout #28720

Closed
shaharmor opened this issue Feb 18, 2018 · 9 comments
Labels: :Data Management/Indices APIs (APIs to create and manage indices and templates)

Comments

shaharmor commented Feb 18, 2018

Elasticsearch version (bin/elasticsearch --version):
Version: 6.0.0, Build: 8f0685b/2017-11-10T18:41:22.859Z, JVM: 1.8.0_151

Plugins installed: [repository-azure]

JVM version (java -version):
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux xxx-001 4.11.0-1014-azure #14-Ubuntu SMP Tue Oct 17 12:10:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
We have a cron job that runs the rollover command on an alias named customers every hour, at one minute past the hour.
The job had been running flawlessly for the past few months.
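
For context, each run issues a rollover request roughly like the following (a sketch, not our exact job; the condition value here is only illustrative):

curl -XPOST 'localhost:9200/customers/_rollover' -H 'Content-Type: application/json' -d '
{
  "conditions": {
    "max_age": "1h"
  }
}'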

About an hour ago we had a critical issue that caused Elasticsearch to become unwritable.

The cluster architecture is as follows:
3 master nodes
3 hot data nodes
3 cold data nodes

The alias points to an index on the hot data nodes only.

What happened is that 3 seconds before the cron job started, one of the cold data nodes suddenly rebooted (not related to Elasticsearch).
Relevant log:

[2018-02-18T15:00:58,703][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elasticsearch-master-001] failed to execute on node [qbtMFuBaSfCWbVztvvi6Kg]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-data-cold-003][10.0.0.52:9300][cluster:monitor/nodes/stats[n]] request_id [40478282] timed out after [15001ms]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:953) [elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-6.0.0.jar:6.0.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]

3 seconds later the cron job started and tried to roll over the index, as seen in these logs:

[2018-02-18T15:01:01,831][INFO ][o.e.c.m.MetaDataCreateIndexService] [elasticsearch-master-001] [customers-raw-003282] creating index, cause [rollover_index], templates [customers-raw], shards [6]/[1], mappings [_default_]

30 seconds later, with no other logs in between, the rollover timed out because of the crashed server:

[2018-02-18T15:01:34,026][WARN ][o.e.d.z.PublishClusterStateAction] [elasticsearch-master-001] timed out waiting for all nodes to process published state [474491] (timeout [30s], pending nodes: [{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold}])
[2018-02-18T15:01:34,107][WARN ][o.e.c.s.MasterService    ] [elasticsearch-master-001] cluster state update task [create-index [customers-raw-003282], cause [rollover_index]] took [32.2s] above the warn threshold of 30s
[2018-02-18T15:01:56,881][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [elasticsearch-master-001] failed to execute [indices:monitor/stats] on node [qbtMFuBaSfCWbVztvvi6Kg]
org.elasticsearch.transport.NodeDisconnectedException: [elasticsearch-data-cold-003][10.0.0.52:9300][indices:monitor/stats[n]] disconnected
[2018-02-18T15:01:56,881][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [elasticsearch-master-001] failed to execute [indices:monitor/stats] on node [qbtMFuBaSfCWbVztvvi6Kg]
org.elasticsearch.transport.NodeDisconnectedException: [elasticsearch-data-cold-003][10.0.0.52:9300][indices:monitor/stats[n]] disconnected
[2018-02-18T15:01:56,899][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elasticsearch-master-001] failed to execute on node [qbtMFuBaSfCWbVztvvi6Kg]
org.elasticsearch.transport.NodeDisconnectedException: [elasticsearch-data-cold-003][10.0.0.52:9300][cluster:monitor/nodes/stats[n]] disconnected
[2018-02-18T15:01:58,094][WARN ][o.e.c.NodeConnectionsService] [elasticsearch-master-001] failed to connect to node {elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elasticsearch-data-cold-003][10.0.0.52:9300] connect_timeout[30s]
	at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:284) ~[?:?]
	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:591) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:495) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:334) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:321) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:154) [elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:183) [elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.0.0.jar:6.0.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 10.0.0.52/10.0.0.52:9300
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
	... 1 more
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
	... 1 more

Following that are some errors about timeouts and the removal of the crashed node from the cluster:

[2018-02-18T15:01:59,964][INFO ][o.e.c.s.MasterService    ] [elasticsearch-master-001] zen-disco-node-failed({elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold}), reason(transport disconnected)[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold} transport disconnected], zen-disco-node-failed({elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold},}
[2018-02-18T15:02:26,908][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elasticsearch-master-001] failed to execute on node [qbtMFuBaSfCWbVztvvi6Kg]
org.elasticsearch.transport.NodeNotConnectedException: [elasticsearch-data-cold-003][10.0.0.52:9300] Node not connected
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:657) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:121) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:532) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:508) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.start(TransportNodesAction.java:197) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:89) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:52) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:81) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:405) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.execute(AbstractClient.java:712) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.nodesStats(AbstractClient.java:808) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.updateNodeStats(InternalClusterInfoService.java:254) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.refresh(InternalClusterInfoService.java:290) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.maybeRefresh(InternalClusterInfoService.java:275) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.access$500(InternalClusterInfoService.java:68) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService$SubmitReschedulingClusterInfoUpdatedJob.lambda$run$0(InternalClusterInfoService.java:222) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-6.0.0.jar:6.0.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2018-02-18T15:02:26,933][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [elasticsearch-master-001] failed to execute [indices:monitor/stats] on node [qbtMFuBaSfCWbVztvvi6Kg]
org.elasticsearch.transport.NodeNotConnectedException: [elasticsearch-data-cold-003][10.0.0.52:9300] Node not connected
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:657) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:121) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:532) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:495) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.sendNodeRequest(TransportBroadcastByNodeAction.java:322) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.start(TransportBroadcastByNodeAction.java:311) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:234) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:79) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:81) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:405) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.execute(AbstractClient.java:1253) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.stats(AbstractClient.java:1574) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.updateIndicesStats(InternalClusterInfoService.java:268) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.refresh(InternalClusterInfoService.java:319) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.maybeRefresh(InternalClusterInfoService.java:275) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.access$500(InternalClusterInfoService.java:68) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService$SubmitReschedulingClusterInfoUpdatedJob.lambda$run$0(InternalClusterInfoService.java:222) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-6.0.0.jar:6.0.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2018-02-18T15:02:29,994][WARN ][o.e.d.z.PublishClusterStateAction] [elasticsearch-master-001] timed out waiting for all nodes to process published state [474493] (timeout [30s], pending nodes: [{elasticsearch-master-003}{vLonCmnMQFCn4TZMNtYb6w}{V2lQPN1kSP2RnWzmpIvAYg}{10.0.0.45}{10.0.0.45:9300}, {elasticsearch-data-cold-002}{6cMfsHpHTxuQN_wcZ8FC6w}{BbOL9NBSRBSTYk3egr4Ijg}{10.0.0.38}{10.0.0.38:9300}{type=cold}, {elasticsearch-master-002}{0pqpLYZdQK-0O26N1Uts2Q}{CXWV6KDqQiGnrXkMk5fdXg}{10.0.0.44}{10.0.0.44:9300}, {elasticsearch-data-cold-001}{B93cHubSRa2i3ogHi7NoMA}{2MaiGuzrRiuEwjZcPiGGKw}{10.0.0.37}{10.0.0.37:9300}{type=cold}])
[2018-02-18T15:02:29,994][INFO ][o.e.c.s.ClusterApplierService] [elasticsearch-master-001] removed {{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold},}, reason: apply cluster state (from master [master {elasticsearch-master-001}{KGg45JFdTDaySNwmNIX3_w}{TWQ_nJjXS-aQqrIxaPlQeA}{10.0.0.43}{10.0.0.43:9300} committed version [474493] source [zen-disco-node-failed({elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold}), reason(transport disconnected)[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold} transport disconnected], zen-disco-node-failed({elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold} failed to ping, tried [3] times, each with maximum [30s] timeout]]])
[2018-02-18T15:02:56,544][INFO ][o.e.c.r.DelayedAllocationService] [elasticsearch-master-001] scheduling reroute for delayed shards in [459ms] (2691 delayed shards)
[2018-02-18T15:02:56,578][WARN ][o.e.c.s.MasterService    ] [elasticsearch-master-001] cluster state update task [zen-disco-node-failed({elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold}), reason(transport disconnected)[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold} transport disconnected], zen-disco-node-failed({elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold} failed to ping, tried [3] times, each with maximum [30s] timeout]] took [59.5s] above the warn threshold of 30s
[2018-02-18T15:03:26,737][INFO ][o.e.c.s.MasterService    ] [elasticsearch-master-001] zen-disco-node-join[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{hJMBK9RrRTGXlsPTbBxqlg}{10.0.0.52}{10.0.0.52:9300}{type=cold}], reason: added {{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{hJMBK9RrRTGXlsPTbBxqlg}{10.0.0.52}{10.0.0.52:9300}{type=cold},}
[2018-02-18T15:03:29,381][INFO ][o.e.c.s.ClusterApplierService] [elasticsearch-master-001] added {{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{hJMBK9RrRTGXlsPTbBxqlg}{10.0.0.52}{10.0.0.52:9300}{type=cold},}, reason: apply cluster state (from master [master {elasticsearch-master-001}{KGg45JFdTDaySNwmNIX3_w}{TWQ_nJjXS-aQqrIxaPlQeA}{10.0.0.43}{10.0.0.43:9300} committed version [474495] source [zen-disco-node-join[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{hJMBK9RrRTGXlsPTbBxqlg}{10.0.0.52}{10.0.0.52:9300}{type=cold}]]])

What happened after all of that is that Elasticsearch added the alias to the new index (as it should have, because we called the rollover command), but it did not remove it from the old index, so the customers alias ended up pointing to two different indices, as can be seen here:

# curl -s 'localhost:9200/_cat/aliases?v' | sort
alias     index                filter routing.index routing.search
customers customers-raw-003281     -      -             -
customers customers-raw-003282     -      -             -

customers-raw-003281 is the old index that was supposed to be rolled over to customers-raw-003282 by the cron job. Apparently the rollover created the new index but did not remove the alias from the old one.
This made the entire cluster unwritable for the customers alias, because Elasticsearch didn't know which index to write the data to (I guess).

We had to manually delete the old alias reference so that the cluster would be writable again.
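
(In case it helps anyone who ends up in the same state: the cleanup amounts to a single alias-actions call along these lines, a sketch using the index names from the _cat/aliases output above.)

curl -XPOST 'localhost:9200/_aliases' -H 'Content-Type: application/json' -d '
{
  "actions": [
    { "remove": { "index": "customers-raw-003281", "alias": "customers" } }
  ]
}'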

There were no logs whatsoever about the problem; we had to figure it out by ourselves.

Elasticsearch appeared to be operating normally (cluster health green), with nothing in the logs.

Logstash only logged that it had hit a retryable issue and kept retrying forever, again with no specific details about the error.

dnhatn commented Feb 18, 2018

@shaharmor, thank you for reporting the issue. I am not sure that the data node crash was the root cause. Are you using an index template for the customers indices? If yes, is there any alias in that template?

dnhatn added the feedback_needed and :Data Management/Indices APIs labels on Feb 18, 2018
dnhatn self-assigned this on Feb 18, 2018
shaharmor (Author):

There is no customers index; customers is the alias.
The indices are named customers-raw-000001, customers-raw-000002, etc.

I initially created the first customers-raw-000001 index using this template:

{
  "index_patterns": [
    "customers-raw-*"
  ],
  "settings": {
    "index": {
      "refresh_interval": "30s",
      "number_of_shards": "6",
      "number_of_replicas": "1",
      "routing.allocation.include.type": "hot"
    }
  },
  "mappings": {
  // some mapping here
  },
  "aliases": {
    "customers":  {}
  }
}

dnhatn commented Feb 19, 2018

Thank you for your feedback @shaharmor. Rollover consists of two separate steps: (1) create a new index and wait for that index to be ready; (2) update the alias to point to the new index. If the index template's aliases contain the rollover alias, the rollover alias will point to two indices between step 1 and step 2. Unfortunately, this is your case; you should remove the customers alias from the index template. Starting in 6.2.0 we will fail a rollover request if the rollover alias is found in the index template (see #28110). I am closing this, but feel free to comment or re-open.
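
A minimal sketch of the suggested setup (illustrative, reusing the names from this issue): drop the "aliases" section from the customers-raw template and attach the rollover alias only when bootstrapping the first index:

curl -XPUT 'localhost:9200/customers-raw-000001' -H 'Content-Type: application/json' -d '
{
  "aliases": {
    "customers": {}
  }
}'

That way the alias is only ever moved by the rollover's alias update (step 2), instead of also being added by the template when the new index is created (step 1).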

shaharmor commented Feb 19, 2018

@dnhatn So how come it worked until today? And how come it keeps working as we speak?

shaharmor (Author):

Also, correct me if I'm wrong, but your guide (https://www.elastic.co/blog/managing-time-based-indices-efficiently) specifically says to add the alias to the index template.

dnhatn commented Feb 19, 2018

Unfortunately we only became aware of this issue recently. We should update the blog.

shaharmor (Author):

I also think that this change (#28110) should be marked as a breaking change for ES 6.2.

dnhatn commented Feb 19, 2018

@shaharmor, sorry I missed your comment.

So how come it worked until today? And how come it keeps working as we speak?

It works because the window between step 1 and step 2 is very small, and there is no problem as long as no write operation happens in that window.

shaharmor (Author):

So basically we were lucky all this time 😁
