[CI] CorruptedFileIT.testReplicaCorruption failure #41899

Closed
matriv opened this issue May 7, 2019 · 3 comments · Fixed by #47136

matriv commented May 7, 2019

Failed for 7.0: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+intake/859/console

Reproduction line (does not reproduce locally):

./gradlew :server:integTest --tests "org.elasticsearch.index.store.CorruptedFileIT.testReplicaCorruption" \
  -Dtests.seed=56970DA58EAFB57B \
  -Dtests.security.manager=true \
  -Dtests.locale=sv \
  -Dtests.timezone=Africa/Casablanca \
  -Dcompiler.java=12 \
  -Druntime.java=8

Example relevant log:

org.elasticsearch.index.store.CorruptedFileIT > testReplicaCorruption FAILED
    java.lang.AssertionError: timed out waiting for green state
        at org.junit.Assert.fail(Assert.java:88)
        at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:980)
        at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:936)
        at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:925)
        at org.elasticsearch.index.store.CorruptedFileIT.testReplicaCorruption(CorruptedFileIT.java:602)

    java.lang.AssertionError: not all translog generations have been released

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=1401, name=elasticsearch[node_td1][generic][T#2], state=RUNNABLE, group=TGRP-CorruptedFileIT]

        Caused by:
        java.lang.AssertionError: wtf? file=corrupted_bNYq4690The4_a_bS3D1eg
 1> org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=r76dkc actual=17kuudq (resource=name [_0.cfs], length [11566], checksum [r76dkc], writtenBy [8.0.0]) (resource=VerifyingIndexOutput(_0.cfs))
  1> 	at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1237) ~[main/:?]
  1> 	at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1215) ~[main/:?]
  1> 	at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1245) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.MultiFileWriter.innerWriteFileChunk(MultiFileWriter.java:120) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.MultiFileWriter.access$000(MultiFileWriter.java:43) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.MultiFileWriter$FileChunkWriter.writeChunk(MultiFileWriter.java:200) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.MultiFileWriter.writeFileChunk(MultiFileWriter.java:68) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.RecoveryTarget.writeFileChunk(RecoveryTarget.java:459) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:632) ~[main/:?]
  1> 	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:606) ~[main/:?]
  1> 	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [main/:?]
  1> 	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1077) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
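
(For reference: the expected/actual values in the checksum failure above are CRC32 checksums printed in base 36, which is how Elasticsearch's Store renders file digests in messages like this. The following is a minimal, stand-alone illustration of that comparison, not the actual LuceneVerifyingIndexOutput code; the class and method names are made up for the sketch.)

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Stand-alone illustration only: recompute a CRC32 over received bytes and compare it
// with the checksum recorded in the source shard's file metadata, printing both in
// base 36 as in the log line above. This mirrors the kind of check the verifying
// output performs during recovery, not its actual implementation.
public final class ChecksumComparisonSketch {

    static String digestToString(long digest) {
        return Long.toString(digest, Character.MAX_RADIX); // base 36, e.g. "r76dkc"
    }

    static void verify(byte[] received, long expectedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(received, 0, received.length);
        long actual = crc.getValue();
        if (actual != expectedChecksum) {
            throw new IllegalStateException("checksum failed: expected="
                    + digestToString(expectedChecksum) + " actual=" + digestToString(actual));
        }
    }

    public static void main(String[] args) {
        byte[] chunk = "segment file bytes".getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        verify(chunk, crc.getValue()); // passes; corrupt a byte in 'chunk' to trigger the failure
    }
}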

Frequency

Not often; the last occurrence was April 3rd, 2019.

matriv added the >test-failure (Triaged test failures from CI) and :Distributed Indexing/Recovery (Anything around constructing a new shard, either from a local or a remote source) labels May 7, 2019
@elasticmachine

Pinging @elastic/es-distributed

dnhatn self-assigned this May 17, 2019

ywelsch commented Jun 5, 2019

Logs are gone and there have been no recent failures of this. Closing this while waiting for the next failure.

ywelsch closed this as completed Jun 5, 2019

ywelsch commented Sep 25, 2019

This has failed again; here are the logs: https://gradle-enterprise.elastic.co/s/i4zmfbd3xj4vg

ywelsch reopened this Sep 25, 2019
dnhatn added commits that referenced this issue on Sep 25–26, 2019, all with the following message:

We can have a large number of shard copies in this test. For example, the two recent failures had 24 and 27 copies respectively, and all replicas have to copy segment files because their stores are corrupted. Our CI needs more than 30 seconds to start all these copies.

Note that in the two recent failures, the cluster went green just after the cluster health check timed out.

Closes #41899
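
The commit message above attributes the failure to the default 30-second cluster-health wait being too short for CI. As a rough illustration only, and not necessarily the actual change made in #47136, a test can pass an explicit, larger timeout to ESIntegTestCase#ensureGreen. The TimeValue overload exists in the test framework, but the class name, the 60-second value, and the "test" index name below are assumptions for the sketch:

import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.test.ESIntegTestCase;

// Hypothetical sketch class, not part of the Elasticsearch code base.
public class CorruptedFileTimeoutSketchIT extends ESIntegTestCase {

    public void testReplicaCorruptionWaitsLongerForGreen() throws Exception {
        // ... create the index, corrupt the replica stores, and restart nodes,
        // exactly as CorruptedFileIT#testReplicaCorruption already does (omitted) ...

        // ensureGreen(String...) waits 30 seconds by default; per the commit message,
        // starting ~25 corrupted replica copies can take longer than that on CI, so
        // wait with an explicit, larger timeout. The 60s value here is illustrative only.
        ensureGreen(TimeValue.timeValueSeconds(60), "test");
    }
}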