[BUG] ES 6.0 - Duplicate Alias after rollover timeout #28720

Closed
shaharmor opened this issue Feb 18, 2018 · 9 comments
Labels: :Data Management/Indices APIs (APIs to create and manage indices and templates)

Comments

shaharmor commented Feb 18, 2018

Elasticsearch version (bin/elasticsearch --version):
Version: 6.0.0, Build: 8f0685b/2017-11-10T18:41:22.859Z, JVM: 1.8.0_151

Plugins installed: [repository-azure]

JVM version (java -version):
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux xxx-001 4.11.0-1014-azure #14-Ubuntu SMP Tue Oct 17 12:10:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
We have a cron job that runs the rollover command on an alias named customers every hour, at one minute past the hour.
The job had been running flawlessly for the past few months.
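
For context, each run issues a rollover request roughly like the following (a sketch, not our exact job; the condition value here is only illustrative):

curl -XPOST 'localhost:9200/customers/_rollover' -H 'Content-Type: application/json' -d '
{
  "conditions": {
    "max_age": "1h"
  }
}'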

About an hour ago we had a critical issue that caused Elasticsearch to become unwritable.

The cluster architecture is as follows:
3 master nodes
3 hot data nodes
3 cold data nodes

The alias points to an index on the hot data nodes only.

What happened is that 3 seconds before the cron job started, one of the cold data nodes suddenly rebooted (not related to Elasticsearch).
Relevant log:

[2018-02-18T15:00:58,703][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elasticsearch-master-001] failed to execute on node [qbtMFuBaSfCWbVztvvi6Kg]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-data-cold-003][10.0.0.52:9300][cluster:monitor/nodes/stats[n]] request_id [40478282] timed out after [15001ms]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:953) [elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-6.0.0.jar:6.0.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]

3 seconds later the cron job started and tried to roll over the index, as seen in these logs:

[2018-02-18T15:01:01,831][INFO ][o.e.c.m.MetaDataCreateIndexService] [elasticsearch-master-001] [customers-raw-003282] creating index, cause [rollover_index], templates [customers-raw], shards [6]/[1], mappings [_default_]

30 seconds later, with no other logs in between, the rollover timed out because of the crashed server:

[2018-02-18T15:01:34,026][WARN ][o.e.d.z.PublishClusterStateAction] [elasticsearch-master-001] timed out waiting for all nodes to process published state [474491] (timeout [30s], pending nodes: [{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold}])
[2018-02-18T15:01:34,107][WARN ][o.e.c.s.MasterService    ] [elasticsearch-master-001] cluster state update task [create-index [customers-raw-003282], cause [rollover_index]] took [32.2s] above the warn threshold of 30s
[2018-02-18T15:01:56,881][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [elasticsearch-master-001] failed to execute [indices:monitor/stats] on node [qbtMFuBaSfCWbVztvvi6Kg]
org.elasticsearch.transport.NodeDisconnectedException: [elasticsearch-data-cold-003][10.0.0.52:9300][indices:monitor/stats[n]] disconnected
[2018-02-18T15:01:56,881][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [elasticsearch-master-001] failed to execute [indices:monitor/stats] on node [qbtMFuBaSfCWbVztvvi6Kg]
org.elasticsearch.transport.NodeDisconnectedException: [elasticsearch-data-cold-003][10.0.0.52:9300][indices:monitor/stats[n]] disconnected
[2018-02-18T15:01:56,899][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elasticsearch-master-001] failed to execute on node [qbtMFuBaSfCWbVztvvi6Kg]
org.elasticsearch.transport.NodeDisconnectedException: [elasticsearch-data-cold-003][10.0.0.52:9300][cluster:monitor/nodes/stats[n]] disconnected
[2018-02-18T15:01:58,094][WARN ][o.e.c.NodeConnectionsService] [elasticsearch-master-001] failed to connect to node {elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elasticsearch-data-cold-003][10.0.0.52:9300] connect_timeout[30s]
	at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:284) ~[?:?]
	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:591) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:495) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:334) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:321) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:154) [elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:183) [elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.0.0.jar:6.0.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 10.0.0.52/10.0.0.52:9300
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
	... 1 more
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
	... 1 more

Following that are some errors about timeouts and the removal of the crashed node from the cluster:

[2018-02-18T15:01:59,964][INFO ][o.e.c.s.MasterService    ] [elasticsearch-master-001] zen-disco-node-failed({elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold}), reason(transport disconnected)[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold} transport disconnected], zen-disco-node-failed({elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold} failed to ping, tried [3] times, each with maximum [30s] timeout], reason: removed {{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold},}
[2018-02-18T15:02:26,908][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elasticsearch-master-001] failed to execute on node [qbtMFuBaSfCWbVztvvi6Kg]
org.elasticsearch.transport.NodeNotConnectedException: [elasticsearch-data-cold-003][10.0.0.52:9300] Node not connected
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:657) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:121) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:532) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:508) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.start(TransportNodesAction.java:197) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:89) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:52) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:81) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:405) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.execute(AbstractClient.java:712) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.nodesStats(AbstractClient.java:808) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.updateNodeStats(InternalClusterInfoService.java:254) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.refresh(InternalClusterInfoService.java:290) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.maybeRefresh(InternalClusterInfoService.java:275) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.access$500(InternalClusterInfoService.java:68) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService$SubmitReschedulingClusterInfoUpdatedJob.lambda$run$0(InternalClusterInfoService.java:222) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-6.0.0.jar:6.0.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2018-02-18T15:02:26,933][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [elasticsearch-master-001] failed to execute [indices:monitor/stats] on node [qbtMFuBaSfCWbVztvvi6Kg]
org.elasticsearch.transport.NodeNotConnectedException: [elasticsearch-data-cold-003][10.0.0.52:9300] Node not connected
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:657) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:121) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:532) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:495) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.sendNodeRequest(TransportBroadcastByNodeAction.java:322) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.start(TransportBroadcastByNodeAction.java:311) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:234) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:79) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:81) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:405) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.execute(AbstractClient.java:1253) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.stats(AbstractClient.java:1574) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.updateIndicesStats(InternalClusterInfoService.java:268) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.refresh(InternalClusterInfoService.java:319) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.maybeRefresh(InternalClusterInfoService.java:275) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService.access$500(InternalClusterInfoService.java:68) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.cluster.InternalClusterInfoService$SubmitReschedulingClusterInfoUpdatedJob.lambda$run$0(InternalClusterInfoService.java:222) ~[elasticsearch-6.0.0.jar:6.0.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-6.0.0.jar:6.0.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2018-02-18T15:02:29,994][WARN ][o.e.d.z.PublishClusterStateAction] [elasticsearch-master-001] timed out waiting for all nodes to process published state [474493] (timeout [30s], pending nodes: [{elasticsearch-master-003}{vLonCmnMQFCn4TZMNtYb6w}{V2lQPN1kSP2RnWzmpIvAYg}{10.0.0.45}{10.0.0.45:9300}, {elasticsearch-data-cold-002}{6cMfsHpHTxuQN_wcZ8FC6w}{BbOL9NBSRBSTYk3egr4Ijg}{10.0.0.38}{10.0.0.38:9300}{type=cold}, {elasticsearch-master-002}{0pqpLYZdQK-0O26N1Uts2Q}{CXWV6KDqQiGnrXkMk5fdXg}{10.0.0.44}{10.0.0.44:9300}, {elasticsearch-data-cold-001}{B93cHubSRa2i3ogHi7NoMA}{2MaiGuzrRiuEwjZcPiGGKw}{10.0.0.37}{10.0.0.37:9300}{type=cold}])
[2018-02-18T15:02:29,994][INFO ][o.e.c.s.ClusterApplierService] [elasticsearch-master-001] removed {{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold},}, reason: apply cluster state (from master [master {elasticsearch-master-001}{KGg45JFdTDaySNwmNIX3_w}{TWQ_nJjXS-aQqrIxaPlQeA}{10.0.0.43}{10.0.0.43:9300} committed version [474493] source [zen-disco-node-failed({elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold}), reason(transport disconnected)[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold} transport disconnected], zen-disco-node-failed({elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold} failed to ping, tried [3] times, each with maximum [30s] timeout]]])
[2018-02-18T15:02:56,544][INFO ][o.e.c.r.DelayedAllocationService] [elasticsearch-master-001] scheduling reroute for delayed shards in [459ms] (2691 delayed shards)
[2018-02-18T15:02:56,578][WARN ][o.e.c.s.MasterService    ] [elasticsearch-master-001] cluster state update task [zen-disco-node-failed({elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold}), reason(transport disconnected)[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold} transport disconnected], zen-disco-node-failed({elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{FRoU1ZGfTZuR2VTfkCD3WQ}{10.0.0.52}{10.0.0.52:9300}{type=cold} failed to ping, tried [3] times, each with maximum [30s] timeout]] took [59.5s] above the warn threshold of 30s
[2018-02-18T15:03:26,737][INFO ][o.e.c.s.MasterService    ] [elasticsearch-master-001] zen-disco-node-join[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{hJMBK9RrRTGXlsPTbBxqlg}{10.0.0.52}{10.0.0.52:9300}{type=cold}], reason: added {{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{hJMBK9RrRTGXlsPTbBxqlg}{10.0.0.52}{10.0.0.52:9300}{type=cold},}
[2018-02-18T15:03:29,381][INFO ][o.e.c.s.ClusterApplierService] [elasticsearch-master-001] added {{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{hJMBK9RrRTGXlsPTbBxqlg}{10.0.0.52}{10.0.0.52:9300}{type=cold},}, reason: apply cluster state (from master [master {elasticsearch-master-001}{KGg45JFdTDaySNwmNIX3_w}{TWQ_nJjXS-aQqrIxaPlQeA}{10.0.0.43}{10.0.0.43:9300} committed version [474495] source [zen-disco-node-join[{elasticsearch-data-cold-003}{qbtMFuBaSfCWbVztvvi6Kg}{hJMBK9RrRTGXlsPTbBxqlg}{10.0.0.52}{10.0.0.52:9300}{type=cold}]]])

What happened after all of that is that Elasticsearch added the alias to the new index (as it should have, because we called the rollover command), but it did not remove it from the old index, so the customers alias ended up pointing to two different indices, as can be seen here:

# curl -s 'localhost:9200/_cat/aliases?v' | sort
alias     index                filter routing.index routing.search
customers customers-raw-003281     -      -             -
customers customers-raw-003282     -      -             -

customers-raw-003281 is the old index that was supposed to be rolled over to customers-raw-003282 by the cron job. Apparently the rollover created the new index but did not remove the alias from the old one.
This made the entire cluster unwritable for the customers alias, because Elasticsearch didn't know which index to write the data to (I guess).

We had to manually delete the old alias reference so that the cluster would be writable again.
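
(In case it helps anyone who ends up in the same state: the cleanup amounts to a single alias-actions call along these lines, a sketch using the index names from the _cat/aliases output above.)

curl -XPOST 'localhost:9200/_aliases' -H 'Content-Type: application/json' -d '
{
  "actions": [
    { "remove": { "index": "customers-raw-003281", "alias": "customers" } }
  ]
}'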

There were no logs whatsoever about the problem; we had to figure it out by ourselves.

Elasticsearch appeared to be operating normally (cluster health green), with nothing in the logs.

Logstash only logged that it had hit a retryable issue and kept retrying forever, again with no specific details about the error.

dnhatn commented Feb 18, 2018

@shaharmor, thank you for reporting the issue. I am not sure that the data node crash was the root cause. Are you using an index template for the customers indices? If yes, is there any alias in that template?

dnhatn added the feedback_needed and :Data Management/Indices APIs labels on Feb 18, 2018
dnhatn self-assigned this on Feb 18, 2018
shaharmor (Author):

There is no customers index; customers is the alias.
The indices are named customers-raw-000001, customers-raw-000002, etc.

I initially created the first customers-raw-000001 index using this template:

{
  "index_patterns": [
    "customers-raw-*"
  ],
  "settings": {
    "index": {
      "refresh_interval": "30s",
      "number_of_shards": "6",
      "number_of_replicas": "1",
      "routing.allocation.include.type": "hot"
    }
  },
  "mappings": {
  // some mapping here
  },
  "aliases": {
    "customers":  {}
  }
}

dnhatn commented Feb 19, 2018

Thank you for your feedback @shaharmor. Rollover consists of two separate steps: (1) create a new index and wait for that index to be ready; (2) update the alias to point to the new index. If the index template's aliases contain the rollover alias, the rollover alias will point to two indices between step 1 and step 2. Unfortunately, this is your case; you should remove the customers alias from the index template. Starting in 6.2.0 we will fail a rollover request if the rollover alias is found in the index template (see #28110). I am closing this, but feel free to comment or re-open.
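
A minimal sketch of the suggested setup (illustrative, reusing the names from this issue): drop the "aliases" section from the customers-raw template and attach the rollover alias only when bootstrapping the first index:

curl -XPUT 'localhost:9200/customers-raw-000001' -H 'Content-Type: application/json' -d '
{
  "aliases": {
    "customers": {}
  }
}'

That way the alias is only ever moved by the rollover's alias update (step 2), instead of also being added by the template when the new index is created (step 1).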

shaharmor commented Feb 19, 2018

@dnhatn So how come it worked until today? And how come it keeps working as we speak?

shaharmor (Author):

Also, correct me if I'm wrong, but your guide (https://www.elastic.co/blog/managing-time-based-indices-efficiently) specifically says to add the alias to the index template.

dnhatn commented Feb 19, 2018

Unfortunately we only became aware of this issue recently. We should update the blog.

shaharmor (Author):

I also think that this change (#28110) should be marked as a breaking change for ES 6.2.

dnhatn commented Feb 19, 2018

@shaharmor, sorry I missed your comment.

So how come it worked until today? And how come it keeps working as we speak?

It works because the window between step 1 and step 2 is very small, and there is no problem as long as no write operation happens in that window.

shaharmor (Author):

So basically we were lucky all this time 😁
