
[CI] ClusterDisruptionIT.testSendingShardFailure #70407

Closed · davidkyle opened this issue Mar 15, 2021 · 3 comments · Fixed by #70506

davidkyle (Member)

Build scan:
https://gradle-enterprise.elastic.co/s/3dxawzvupxuu4

Repro line:

./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.discovery.ClusterDisruptionIT.testSendingShardFailure" \
  -Dtests.seed=1DDEF5F86C208A64 \
  -Dtests.security.manager=true \
  -Dtests.locale=ko \
  -Dtests.timezone=UCT \
  -Druntime.java=15 \
  -Dtests.fips.enabled=true

Reproduces locally?:
Nope

Applicable branches:
7.x, 8.0

Failure history:
Failing regularly since March 5th, although without digging into the logs I can't say whether the root cause is the same for all of those failures.

https://build-stats.elastic.co/goto/93418579fb15be36cbf29906167594a3

Failure excerpt:
Mostly this manifests as a failure to form a cluster because the master is missing, but the most interesting error is the "not a transport thread" assertion:

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=3170, name=elasticsearch[node_t1][transport_worker][T#2], state=RUNNABLE, group=TGRP-ClusterDisruptionIT]
Caused by: java.lang.AssertionError: Expected current thread [Thread[elasticsearch[node_t1][transport_worker][T#2],5,TGRP-ClusterDisruptionIT]] to not be a transport thread. Reason: [reading changes snapshot may involve slow IO]
	at __randomizedtesting.SeedInfo.seed([1DDEF5F86C208A64]:0)
	at org.elasticsearch.transport.Transports.assertNotTransportThread(Transports.java:49)
	at org.elasticsearch.index.engine.LuceneChangesSnapshot.assertAccessingThread(LuceneChangesSnapshot.java:155)
	at org.elasticsearch.index.engine.LuceneChangesSnapshot.close(LuceneChangesSnapshot.java:117)
	at org.elasticsearch.index.shard.PrimaryReplicaSyncer$1.close(PrimaryReplicaSyncer.java:91)
	at org.elasticsearch.index.shard.PrimaryReplicaSyncer$2.onFailure(PrimaryReplicaSyncer.java:124)
	at org.elasticsearch.index.shard.PrimaryReplicaSyncer$3.onFailure(PrimaryReplicaSyncer.java:165)
	at org.elasticsearch.index.shard.PrimaryReplicaSyncer$SnapshotSender.onFailure(PrimaryReplicaSyncer.java:227)
	at org.elasticsearch.action.resync.TransportResyncReplicationAction$1.handleException(TransportResyncReplicationAction.java:177)
	at org.elasticsearch.transport.TransportService$5.handleException(TransportService.java:738)
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1283)
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1392)
	at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1366)
	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:50)
	at org.elasticsearch.transport.TransportChannel.sendErrorResponse(TransportChannel.java:45)
	at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:40)
	at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66)
	at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onFailure(ActionListener.java:371)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onFailure(TransportReplicationAction.java:435)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.handleException(TransportReplicationAction.java:429)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$runWithPrimaryShardReference$3(TransportReplicationAction.java:414)
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:142)
	at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66)
	at org.elasticsearch.action.support.replication.ReplicationOperation.finishAsFailed(ReplicationOperation.java:342)
	at org.elasticsearch.action.support.replication.ReplicationOperation.onNoLongerPrimary(ReplicationOperation.java:287)
	at org.elasticsearch.action.support.replication.ReplicationOperation.access$1100(ReplicationOperation.java:46)
	at org.elasticsearch.action.support.replication.ReplicationOperation$2.lambda$onFailure$2(ReplicationOperation.java:218)
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:142)
	at org.elasticsearch.action.ActionListener.onFailure(ActionListener.java:285)
	at org.elasticsearch.action.ResultDeduplicator$CompositeListener.onFailure(ResultDeduplicator.java:105)
	at org.elasticsearch.cluster.action.shard.ShardStateAction$1.handleException(ShardStateAction.java:162)
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1283)
	at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:317)
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:215)
	at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:315)
	at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:307)
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:126)
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:84)
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:693)
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:129)
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:104)
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:69)
	at org.elasticsearch.transport.nio.MockNioTransport$MockTcpReadWriteHandler.consumeReads(MockNioTransport.java:311)
	at org.elasticsearch.nio.SocketChannelContext.handleReadBytes(SocketChannelContext.java:217)
	at org.elasticsearch.nio.BytesChannelContext.read(BytesChannelContext.java:29)
	at org.elasticsearch.nio.EventHandler.handleRead(EventHandler.java:128)
	at org.elasticsearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:140)
	at org.elasticsearch.nio.NioSelector.handleRead(NioSelector.java:409)
	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:235)
	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:163)
	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:120)
	at java.base/java.lang.Thread.run(Thread.java:832)
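
For anyone not familiar with this assertion: it is a debug-only check that certain potentially slow operations never run on a transport_worker thread (the threads named elasticsearch[...][transport_worker][T#n] in the trace). A minimal sketch of the shape of that check, illustrative only and not the actual Transports/LuceneChangesSnapshot source:

// Illustrative sketch; the real check is
// org.elasticsearch.transport.Transports#assertNotTransportThread.
final class TransportThreadAssertions {

    // Transport worker threads are named "...[transport_worker][T#n]".
    private static boolean isTransportThread(Thread t) {
        return t.getName().contains("[transport_worker]");
    }

    // Returns true so it can be used inside an `assert` statement and
    // compiled away when assertions are disabled.
    static boolean assertNotTransportThread(String reason) {
        assert isTransportThread(Thread.currentThread()) == false
            : "Expected current thread [" + Thread.currentThread()
                + "] to not be a transport thread. Reason: [" + reason + "]";
        return true;
    }
}

LuceneChangesSnapshot#close then does something along the lines of assertNotTransportThread("reading changes snapshot may involve slow IO"), which is exactly what trips in the trace above because the failure path of the resync closes the snapshot inline on a transport_worker thread.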
davidkyle added the >test-failure (Triaged test failures from CI) and :Distributed Indexing/Engine (Anything around managing Lucene and the Translog in an open shard) labels Mar 15, 2021
elasticmachine added the Team:Distributed (Obsolete) (Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.) label Mar 15, 2021
elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

vamuzumd (Contributor) commented Mar 15, 2021

@davidkyle, the build scan link is inaccessible to me (it requires Gradle Enterprise credentials); could you attach the log files, please? I wanted to fix this issue and tried reproducing it locally, but the test passes.

DaveCTurner self-assigned this Mar 17, 2021
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 17, 2021
We assert that the snapshot isn't closed on a transport thread, but we
close it without forking off the transport thread in case of a failure.
With this commit we fork on failure too.

Relates elastic#69949
Closes elastic#70407
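
The gist of the fix, as the commit message describes, is to fork the snapshot close off the transport thread in the failure path as well, rather than running it inline in the failure handler. A rough sketch of that pattern, with hypothetical names and a plain ExecutorService standing in for Elasticsearch's thread pool (illustrative only, not the actual PrimaryReplicaSyncer change):

import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.ExecutorService;

// Hypothetical sketch of the "fork on failure" pattern described in the
// commit message above; not the actual PrimaryReplicaSyncer code.
final class ResyncListener {

    private final Closeable snapshot;        // e.g. the changes snapshot
    private final ExecutorService generic;   // a non-transport thread pool

    ResyncListener(Closeable snapshot, ExecutorService generic) {
        this.snapshot = snapshot;
        this.generic = generic;
    }

    void onFailure(Exception e) {
        // Before the fix: snapshot.close() ran right here, which is fine when
        // onFailure is invoked from a generic thread but trips the assertion
        // when the failure is delivered on a transport_worker thread.
        // After the fix: fork the close off the calling thread.
        generic.execute(() -> {
            try {
                snapshot.close();
            } catch (IOException ex) {
                // log and swallow; the operation has already failed
            }
        });
    }
}

Presumably the success path already forked ("we fork on failure too"), which is why only the failure path trips the assertion in the trace above.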
DaveCTurner (Contributor)

Thanks for taking a look @vamuzumd. I don't think you need any more logs than the ones included in the OP to see the problem. I opened #70506 to suggest a fix.

DaveCTurner added a commit that referenced this issue Mar 17, 2021
We assert that the snapshot isn't closed on a transport thread, but we
close it without forking off the transport thread in case of a failure.
With this commit we fork on failure too.

Relates #69949
Closes #70407
DaveCTurner added two further commits that referenced this issue Mar 17, 2021, with the same message as above.