
One of the shards couldn't recover from the rolling upgrade, 1.1.1 -> 1.4.2 #9406

Closed
yangou opened this issue Jan 25, 2015 · 4 comments

Comments


yangou commented Jan 25, 2015

We started upgrading our two-node ES cluster from 1.1.1 to 1.4.2, using the rolling upgrade procedure.
After restarting one of the upgraded nodes, we found that one of the shards couldn't recover. The log output gives this error:

Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=x56z8s actual=1h6zri0 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@393b946e)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Also, the expected size of that shard is around 170G; however, the recovery directory grew to more than 650G.

I checked the hardware, which has no issues at all. But to rule out a hardware problem, I added a brand new node to the cluster, and that particular shard reproduces the same issue on the new machine too.

I deleted the directory manually as mentioned in #9302, but the cluster didn't automatically recreate the replica.
So I used the reroute API to try to move that primary shard from the old-version node to a new-version node (a sketch of the call is below). It seemed promising, because when the move finished, the size of the directory was correct. However, it turned out that after the move, the old shard on the old-version node didn't get removed, and the newly created shard on the new-version node became just a copy of the old shard, not even a replica, because the cluster never allocated it.
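For reference, the move command looks roughly like this with the cluster reroute API (the index name, shard number, and node names here are placeholders, not the actual values from this cluster):

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "move": {
        "index": "logstash-2013.08.24",
        "shard": 0,
        "from_node": "old-version-node",
        "to_node": "new-version-node"
      }
    }
  ]
}'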

Any idea about how to fix this issue?

@yangou
Copy link
Author

yangou commented Jan 25, 2015

Something even weirder: the old-version node keeps replicating the shard to the other two nodes, constantly. That's why the size of that shard on those two nodes kept growing.
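For reference, the repeated recovery attempts can be watched with the cat recovery API (a generic call; nothing here is specific to this cluster):

curl -XGET 'localhost:9200/_cat/recovery?v'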

I used this script to check my primary shard, which is on the old-version node:

# Run Lucene's CheckIndex directly against the shard's on-disk index directory,
# with assertions enabled for the org.apache.lucene packages.
ES_HOME=/usr/share/elasticsearch
ES_CLASSPATH=$ES_CLASSPATH:$ES_HOME/lib/elasticsearch-0.90.3.jar:$ES_HOME/lib/*:$ES_HOME/lib/sigar/*
INDEXPATH=/data/logstash/data/elasticsearch/nodes/0/indices/logstash-2013.08.24/0/index/
sudo -u logstash java -cp $ES_CLASSPATH -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex $INDEXPATH

It indicates that my primary shard is [OK].
I've really run out of ideas for solving this.

@clintongormley
Contributor

Hi @yangou

It looks like you're running into a bug fixed by this PR: #9142

You have a very old segment in that shard, which has a corrupt checksum. We only started checking the legacy checksums in 1.4.2, not realising that old segments were sometimes written with corrupt checksums.

You have three options:

  • go to 1.4.1 instead and run the upgrade command, which will upgrade all segments to the latest version (a sketch of the call follows this list)
  • wait for version 1.4.3 to be released (coming soon)
  • delete the shard and hope it recovers from a non-corrupt replica (if one exists), or risk losing the data in that shard
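For the first option, the call would look roughly like this; it assumes the indices upgrade endpoint (_upgrade) documented for the 1.x series, and the index name is just an example, so double-check the reference docs for the exact version you install:

curl -XPOST 'localhost:9200/logstash-2013.08.24/_upgrade'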

Closing as a duplicate of #9140

@yangou
Copy link
Author

yangou commented Jan 26, 2015

Can I upgrade it to 1.4.1 first and then to 1.4.2?
Two of the nodes in the cluster have already been upgraded to 1.4.2.

@clintongormley
Contributor

@yangou I think you can, because both versions use Lucene 4.10.2... but I'm not 100% sure.
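As a quick sanity check before continuing the rolling upgrade, the cat nodes API can show which Elasticsearch version each node is running (a generic call; the column names are as documented for the cat nodes API):

curl -XGET 'localhost:9200/_cat/nodes?v&h=name,version'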
