[CI] CorruptedFileIT.testReplicaCorruption failure #41899

Closed
matriv opened this issue May 7, 2019 · 3 comments · Fixed by #47136

matriv commented May 7, 2019

Failed for 7.0: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+intake/859/console

Reproduction line (does not reproduce locally):

./gradlew :server:integTest --tests "org.elasticsearch.index.store.CorruptedFileIT.testReplicaCorruption" \
  -Dtests.seed=56970DA58EAFB57B \
  -Dtests.security.manager=true \
  -Dtests.locale=sv \
  -Dtests.timezone=Africa/Casablanca \
  -Dcompiler.java=12 \
  -Druntime.java=8

Example relevant log:

org.elasticsearch.index.store.CorruptedFileIT > testReplicaCorruption FAILED
    java.lang.AssertionError: timed out waiting for green state
        at org.junit.Assert.fail(Assert.java:88)
        at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:980)
        at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:936)
        at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:925)
        at org.elasticsearch.index.store.CorruptedFileIT.testReplicaCorruption(CorruptedFileIT.java:602)

    java.lang.AssertionError: not all translog generations have been released

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=1401, name=elasticsearch[node_td1][generic][T#2], state=RUNNABLE, group=TGRP-CorruptedFileIT]

        Caused by:
        java.lang.AssertionError: wtf? file=corrupted_bNYq4690The4_a_bS3D1eg
 1> org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=r76dkc actual=17kuudq (resource=name [_0.cfs], length [11566], checksum [r76dkc], writtenBy [8.0.0]) (resource=VerifyingIndexOutput(_0.cfs))
  1> 	at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1237) ~[main/:?]
  1> 	at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1215) ~[main/:?]
  1> 	at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1245) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.MultiFileWriter.innerWriteFileChunk(MultiFileWriter.java:120) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.MultiFileWriter.access$000(MultiFileWriter.java:43) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.MultiFileWriter$FileChunkWriter.writeChunk(MultiFileWriter.java:200) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.MultiFileWriter.writeFileChunk(MultiFileWriter.java:68) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.RecoveryTarget.writeFileChunk(RecoveryTarget.java:459) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:632) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:606) ~[main/:?]
  1> 	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [main/:?]
  1> 	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1077) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
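
(For reference: the expected/actual values in the checksum failure above are CRC32 checksums printed in base 36, which is how Elasticsearch's Store renders file digests in messages like this. The following is a minimal, stand-alone illustration of that comparison, not the actual LuceneVerifyingIndexOutput code; the class and method names are made up for the sketch.)

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Stand-alone illustration only: recompute a CRC32 over received bytes and compare it
// with the checksum recorded in the source shard's file metadata, printing both in
// base 36 as in the log line above. This mirrors the kind of check the verifying
// output performs during recovery, not its actual implementation.
public final class ChecksumComparisonSketch {

    static String digestToString(long digest) {
        return Long.toString(digest, Character.MAX_RADIX); // base 36, e.g. "r76dkc"
    }

    static void verify(byte[] received, long expectedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(received, 0, received.length);
        long actual = crc.getValue();
        if (actual != expectedChecksum) {
            throw new IllegalStateException("checksum failed: expected="
                    + digestToString(expectedChecksum) + " actual=" + digestToString(actual));
        }
    }

    public static void main(String[] args) {
        byte[] chunk = "segment file bytes".getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        verify(chunk, crc.getValue()); // passes; corrupt a byte in 'chunk' to trigger the failure
    }
}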

Frequency

Not often; the last occurrence was April 3rd, 2019.

matriv added the >test-failure (Triaged test failures from CI) and :Distributed Indexing/Recovery (Anything around constructing a new shard, either from a local or a remote source) labels May 7, 2019
@elasticmachine

Pinging @elastic/es-distributed

dnhatn self-assigned this May 17, 2019

ywelsch commented Jun 5, 2019

Logs are gone and there have been no recent failures of this. Closing this while waiting for the next failure.

ywelsch closed this as completed Jun 5, 2019

ywelsch commented Sep 25, 2019

This has failed again; here are the logs: https://gradle-enterprise.elastic.co/s/i4zmfbd3xj4vg

ywelsch reopened this Sep 25, 2019
dnhatn added commits that referenced this issue on Sep 25–26, 2019, all with the following message:

We can have a large number of shard copies in this test. For example, the two recent failures had 24 and 27 copies respectively, and all replicas have to copy segment files because their stores are corrupted. Our CI needs more than 30 seconds to start all these copies.

Note that in the two recent failures, the cluster went green just after the cluster health check timed out.

Closes #41899
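
The commit message above attributes the failure to the default 30-second cluster-health wait being too short for CI. As a rough illustration only, and not necessarily the actual change made in #47136, a test can pass an explicit, larger timeout to ESIntegTestCase#ensureGreen. The TimeValue overload exists in the test framework, but the class name, the 60-second value, and the "test" index name below are assumptions for the sketch:

import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.test.ESIntegTestCase;

// Hypothetical sketch class, not part of the Elasticsearch code base.
public class CorruptedFileTimeoutSketchIT extends ESIntegTestCase {

    public void testReplicaCorruptionWaitsLongerForGreen() throws Exception {
        // ... create the index, corrupt the replica stores, and restart nodes,
        // exactly as CorruptedFileIT#testReplicaCorruption already does (omitted) ...

        // ensureGreen(String...) waits 30 seconds by default; per the commit message,
        // starting ~25 corrupted replica copies can take longer than that on CI, so
        // wait with an explicit, larger timeout. The 60s value here is illustrative only.
        ensureGreen(TimeValue.timeValueSeconds(60), "test");
    }
}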