One of the shards couldn't recover from the rolling upgrade, 1.1.1 -> 1.4.2 #9406
Something even weirder: the old-version node keeps replicating the shard to the other two nodes, constantly. That's why the size of that shard on those two nodes kept increasing. I used this script to check my primary shard, which is on the old-version node:
It indicates that my primary shard is [OK]
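The original script isn't shown, but a quick way to eyeball primary-shard state is to read the `_cat/shards` output. Below is only a sketch: the index name, sizes, and sample output are made up for illustration, not taken from the cluster in this report.

```python
# Sketch: parse Elasticsearch `_cat/shards` output and report the state of
# primary shards. The sample text below is illustrative, not real output
# from the affected cluster. Columns: index shard prirep state docs store ip node
sample = """\
myindex 0 p STARTED    12345678 170gb 10.0.0.1 old-node
myindex 0 r RECOVERING        0 650gb 10.0.0.2 new-node
"""

def primary_states(cat_shards_text):
    """Return {(index, shard): state} for primary shards only."""
    states = {}
    for line in cat_shards_text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[2] == "p":  # 'p' marks a primary shard
            states[(parts[0], parts[1])] = parts[3]
    return states

print(primary_states(sample))  # {('myindex', '0'): 'STARTED'}
```

A primary can report STARTED while a replica keeps cycling through RECOVERING, which matches the symptom described above.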
Hi @yangou, it looks like you're running into a bug fixed by this PR: #9142. You have a very old segment in that shard with a corrupt checksum. We only started checking the legacy checksums in 1.4.2, not realising that old segments were sometimes written with corrupt checksums. You have three options:
Closing as a duplicate of #9140
Can I upgrade it to 1.4.1 first and then to 1.4.2?
@yangou I think you can, because both versions use Lucene 4.10.2... but I'm not 100% sure.
We started upgrading our two-node ES cluster from 1.1.1 to 1.4.2, using a rolling upgrade.
After restarting one of the upgraded nodes, we found that one of the shards couldn't recover. The log output gives this error:
Also, the expected size of that shard should be around 170G; however, the recovery directory grew to more than 650G.
I checked the hardware and found no issues at all. But to rule out a hardware problem, I added a brand-new node to the cluster, and that particular shard reproduces the issue on the new machine too.
I deleted the directory manually as mentioned in #9302, but the cluster didn't automatically recreate the replica.
So I used the reroute API to try to move that primary shard from the old-version node to a new-version node. It seemed promising: when the move finished, the size of the directory was correct. However, after the move, the old shard on the old-version node didn't get removed, and the newly created shard on the new-version node was just a copy of the old shard, not even a replica, because the cluster didn't allocate it.
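For reference, a reroute move like the one described boils down to a `POST /_cluster/reroute` request with a `move` command. This sketch only builds the request body; the index and node names are placeholders, not the actual ones from this cluster.

```python
import json

# Sketch: build the request body for an Elasticsearch `_cluster/reroute`
# "move" command. Index and node names below are placeholders.
def move_shard_body(index, shard, from_node, to_node):
    return json.dumps({
        "commands": [
            {"move": {
                "index": index,
                "shard": shard,
                "from_node": from_node,
                "to_node": to_node,
            }}
        ]
    })

body = move_shard_body("myindex", 0, "old-node", "new-node")
print(body)
# then, roughly: curl -XPOST 'localhost:9200/_cluster/reroute' -d "$body"
```

Note that `move` relocates the shard copy but, as the report shows, it doesn't by itself clean up the source copy or promote the target into a properly allocated primary when recovery is failing.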
Any ideas on how to fix this issue?