
One of the shards couldn't recover from the rolling upgrade, 1.1.1 -> 1.4.2 #9406

Closed
yangou opened this issue Jan 25, 2015 · 4 comments

Comments


yangou commented Jan 25, 2015

We started upgrading our two-node ES cluster from 1.1.1 to 1.4.2, using the rolling upgrade procedure.
After restarting one of the upgraded nodes, we found that one of the shards couldn't recover. The log output gives this error:

Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=x56z8s actual=1h6zri0 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@393b946e)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Also, the expected size of that shard is around 170G; however, the recovery directory grew to more than 650G.

I checked the hardware, which has no issues at all. But to rule out a hardware problem, I added a brand new node to the cluster, and that particular shard reproduces the same issue on the new machine too.

I deleted the directory manually as mentioned in #9302, but the cluster didn't automatically recreate the replica.
So I used the reroute API to try to move that primary shard from the old-version node to a new-version node (a sketch of the call is below). It seemed promising, because when the move finished, the size of the directory was correct. However, it turned out that after the move, the old shard on the old-version node didn't get removed, and the newly created shard on the new-version node became just a copy of the old shard, not even a replica, because the cluster never allocated it.
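For reference, the move command looks roughly like this with the cluster reroute API (the index name, shard number, and node names here are placeholders, not the actual values from this cluster):

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "move": {
        "index": "logstash-2013.08.24",
        "shard": 0,
        "from_node": "old-version-node",
        "to_node": "new-version-node"
      }
    }
  ]
}'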

Any idea about how to fix this issue?

@yangou
Copy link
Author

yangou commented Jan 25, 2015

Something even weirder: the old-version node keeps replicating the shard to the other two nodes, constantly. That's why the size of that shard on those two nodes kept growing.
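For reference, the repeated recovery attempts can be watched with the cat recovery API (a generic call; nothing here is specific to this cluster):

curl -XGET 'localhost:9200/_cat/recovery?v'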

I used this script to check my primary shard, which is on the old-version node:

# Run Lucene's CheckIndex directly against the shard's on-disk index directory,
# with assertions enabled for the org.apache.lucene packages.
ES_HOME=/usr/share/elasticsearch
ES_CLASSPATH=$ES_CLASSPATH:$ES_HOME/lib/elasticsearch-0.90.3.jar:$ES_HOME/lib/*:$ES_HOME/lib/sigar/*
INDEXPATH=/data/logstash/data/elasticsearch/nodes/0/indices/logstash-2013.08.24/0/index/
sudo -u logstash java -cp $ES_CLASSPATH -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex $INDEXPATH

It indicates that my primary shard is [OK].
I've really run out of ideas for solving this.

@clintongormley
Contributor

Hi @yangou

It looks like you're running into a bug fixed by this PR: #9142

You have a very old segment in that shard, which has a corrupt checksum. We only started checking the legacy checksums in 1.4.2, not realising that old segments were sometimes written with corrupt checksums.

You have three options:

  • go to 1.4.1 instead and run the upgrade command, which will upgrade all segments to the latest version (a sketch of the call follows this list)
  • wait for version 1.4.3 to be released (coming soon)
  • delete the shard and hope it recovers from a non-corrupt replica (if one exists), or risk losing the data in that shard
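For the first option, the call would look roughly like this; it assumes the indices upgrade endpoint (_upgrade) documented for the 1.x series, and the index name is just an example, so double-check the reference docs for the exact version you install:

curl -XPOST 'localhost:9200/logstash-2013.08.24/_upgrade'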

Closing as a duplicate of #9140

@yangou
Copy link
Author

yangou commented Jan 26, 2015

Can I upgrade it to 1.4.1 first and then to 1.4.2?
Two of the nodes in the cluster have already been upgraded to 1.4.2.

@clintongormley
Contributor

@yangou I think you can, because both versions use Lucene 4.10.2... but I'm not 100% sure.
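As a quick sanity check before continuing the rolling upgrade, the cat nodes API can show which Elasticsearch version each node is running (a generic call; the column names are as documented for the cat nodes API):

curl -XGET 'localhost:9200/_cat/nodes?v&h=name,version'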
