
[CI] ClusterDisruptionIT.testSendingShardFailure #70407

Closed · davidkyle opened this issue Mar 15, 2021 · 3 comments · Fixed by #70506

davidkyle (Member)

Build scan:
https://gradle-enterprise.elastic.co/s/3dxawzvupxuu4

Repro line:

./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.discovery.ClusterDisruptionIT.testSendingShardFailure" \
  -Dtests.seed=1DDEF5F86C208A64 \
  -Dtests.security.manager=true \
  -Dtests.locale=ko \
  -Dtests.timezone=UCT \
  -Druntime.java=15 \
  -Dtests.fips.enabled=true

Reproduces locally?:
Nope

Applicable branches:
7.x, 8.0

Failure history:
Failing regularly since March 5th, although without digging into the logs I can't say whether the root cause is the same for all of those failures.

https://build-stats.elastic.co/goto/93418579fb15be36cbf29906167594a3

Failure excerpt:
Mostly this manifests as a failure to form a cluster because the master is missing, but the most interesting error is the "not a transport thread" assertion:

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=3170, name=elasticsearch[node_t1][transport_worker][T#2], state=RUNNABLE, group=TGRP-ClusterDisruptionIT]
Caused by: java.lang.AssertionError: Expected current thread [Thread[elasticsearch[node_t1][transport_worker][T#2],5,TGRP-ClusterDisruptionIT]] to not be a transport thread. Reason: [reading changes snapshot may involve slow IO]
	at __randomizedtesting.SeedInfo.seed([1DDEF5F86C208A64]:0)
	at org.elasticsearch.transport.Transports.assertNotTransportThread(Transports.java:49)
	at org.elasticsearch.index.engine.LuceneChangesSnapshot.assertAccessingThread(LuceneChangesSnapshot.java:155)
	at org.elasticsearch.index.engine.LuceneChangesSnapshot.close(LuceneChangesSnapshot.java:117)
	at org.elasticsearch.index.shard.PrimaryReplicaSyncer$1.close(PrimaryReplicaSyncer.java:91)
	at org.elasticsearch.index.shard.PrimaryReplicaSyncer$2.onFailure(PrimaryReplicaSyncer.java:124)
	at org.elasticsearch.index.shard.PrimaryReplicaSyncer$3.onFailure(PrimaryReplicaSyncer.java:165)
	at org.elasticsearch.index.shard.PrimaryReplicaSyncer$SnapshotSender.onFailure(PrimaryReplicaSyncer.java:227)
	at org.elasticsearch.action.resync.TransportResyncReplicationAction$1.handleException(TransportResyncReplicationAction.java:177)
	at org.elasticsearch.transport.TransportService$5.handleException(TransportService.java:738)
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1283)
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1392)
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1366)
	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:50)
	at org.elasticsearch.transport.TransportChannel.sendErrorResponse(TransportChannel.java:45)
	at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:40)
	at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66)
	at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onFailure(ActionListener.java:371)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onFailure(TransportReplicationAction.java:435)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.handleException(TransportReplicationAction.java:429)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$runWithPrimaryShardReference$3(TransportReplicationAction.java:414)
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:142)
	at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66)
	at org.elasticsearch.action.support.replication.ReplicationOperation.finishAsFailed(ReplicationOperation.java:342)
	at org.elasticsearch.action.support.replication.ReplicationOperation.onNoLongerPrimary(ReplicationOperation.java:287)
	at org.elasticsearch.action.support.replication.ReplicationOperation.access$1100(ReplicationOperation.java:46)
	at org.elasticsearch.action.support.replication.ReplicationOperation$2.lambda$onFailure$2(ReplicationOperation.java:218)
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:142)
	at org.elasticsearch.action.ActionListener.onFailure(ActionListener.java:285)
	at org.elasticsearch.action.ResultDeduplicator$CompositeListener.onFailure(ResultDeduplicator.java:105)
	at org.elasticsearch.cluster.action.shard.ShardStateAction$1.handleException(ShardStateAction.java:162)
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1283)
	at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:317)
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:215)
	at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:315)
	at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:307)
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:126)
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:84)
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:693)
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:129)
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:104)
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:69)
	at org.elasticsearch.transport.nio.MockNioTransport$MockTcpReadWriteHandler.consumeReads(MockNioTransport.java:311)
	at org.elasticsearch.nio.SocketChannelContext.handleReadBytes(SocketChannelContext.java:217)
	at org.elasticsearch.nio.BytesChannelContext.read(BytesChannelContext.java:29)
	at org.elasticsearch.nio.EventHandler.handleRead(EventHandler.java:128)
	at org.elasticsearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:140)
	at org.elasticsearch.nio.NioSelector.handleRead(NioSelector.java:409)
	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:235)
	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:163)
	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:120)
	at java.base/java.lang.Thread.run(Thread.java:832)
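
For anyone not familiar with this assertion: it is a debug-only check that certain potentially slow operations never run on a transport_worker thread (the threads named elasticsearch[...][transport_worker][T#n] in the trace). A minimal sketch of the shape of that check, illustrative only and not the actual Transports/LuceneChangesSnapshot source:

// Illustrative sketch; the real check is
// org.elasticsearch.transport.Transports#assertNotTransportThread.
final class TransportThreadAssertions {

    // Transport worker threads are named "...[transport_worker][T#n]".
    private static boolean isTransportThread(Thread t) {
        return t.getName().contains("[transport_worker]");
    }

    // Returns true so it can be used inside an `assert` statement and
    // compiled away when assertions are disabled.
    static boolean assertNotTransportThread(String reason) {
        assert isTransportThread(Thread.currentThread()) == false
            : "Expected current thread [" + Thread.currentThread()
                + "] to not be a transport thread. Reason: [" + reason + "]";
        return true;
    }
}

LuceneChangesSnapshot#close then does something along the lines of assertNotTransportThread("reading changes snapshot may involve slow IO"), which is exactly what trips in the trace above because the failure path of the resync closes the snapshot inline on a transport_worker thread.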
davidkyle added the >test-failure (Triaged test failures from CI) and :Distributed Indexing/Engine (Anything around managing Lucene and the Translog in an open shard) labels Mar 15, 2021
elasticmachine added the Team:Distributed (Obsolete) (Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.) label Mar 15, 2021
elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

vamuzumd (Contributor) commented Mar 15, 2021

@davidkyle, the build scan link is inaccessible to me (it requires Gradle Enterprise credentials); could you attach the log files, please? I wanted to fix this issue and tried reproducing it locally, but the test passes.

DaveCTurner self-assigned this Mar 17, 2021
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 17, 2021
We assert that the snapshot isn't closed on a transport thread, but we
close it without forking off the transport thread in case of a failure.
With this commit we fork on failure too.

Relates elastic#69949
Closes elastic#70407
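
The gist of the fix, as the commit message describes, is to fork the snapshot close off the transport thread in the failure path as well, rather than running it inline in the failure handler. A rough sketch of that pattern, with hypothetical names and a plain ExecutorService standing in for Elasticsearch's thread pool (illustrative only, not the actual PrimaryReplicaSyncer change):

import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.ExecutorService;

// Hypothetical sketch of the "fork on failure" pattern described in the
// commit message above; not the actual PrimaryReplicaSyncer code.
final class ResyncListener {

    private final Closeable snapshot;        // e.g. the changes snapshot
    private final ExecutorService generic;   // a non-transport thread pool

    ResyncListener(Closeable snapshot, ExecutorService generic) {
        this.snapshot = snapshot;
        this.generic = generic;
    }

    void onFailure(Exception e) {
        // Before the fix: snapshot.close() ran right here, which is fine when
        // onFailure is invoked from a generic thread but trips the assertion
        // when the failure is delivered on a transport_worker thread.
        // After the fix: fork the close off the calling thread.
        generic.execute(() -> {
            try {
                snapshot.close();
            } catch (IOException ex) {
                // log and swallow; the operation has already failed
            }
        });
    }
}

Presumably the success path already forked ("we fork on failure too"), which is why only the failure path trips the assertion in the trace above.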
DaveCTurner (Contributor)

Thanks for taking a look @vamuzumd. I don't think you need any more logs than the ones included in the OP to see the problem. I opened #70506 to suggest a fix.

DaveCTurner added a commit that referenced this issue Mar 17, 2021
We assert that the snapshot isn't closed on a transport thread, but we
close it without forking off the transport thread in case of a failure.
With this commit we fork on failure too.

Relates #69949
Closes #70407
DaveCTurner added two further commits that referenced this issue Mar 17, 2021, with the same message as above.