Fail node startup if shards index format is too old / new #9515
Conversation
Today if a shard contains a segment that is from Lucene 3.x and therefore throws an `IndexFormatTooOldException`, the node goes into a wild allocation loop if the index is directly recovered from the gateway. If the problematic shard is allocated later for other reasons, the shard will fail allocation, and downgrading the cluster might be impossible since new segments in other indices have already been written.

This commit adds a sanity check to the `GatewayMetaState` that tries to read the `SegmentsInfo` for every shard on the node and fails if a shard is corrupted, the index is too new, etc.

With the new data_path-per-index feature, nodes might not have enough information unless they are master eligible, since we used to not persist the index and global state on nodes that are not master eligible. This commit changes this behavior and writes the state on all nodes that hold data. This is an enhancement in itself since data nodes that are not master eligible are not self-contained today.

This change also fixes the issue seen in elastic#8823 since metadata is written on all data nodes now.

Closes elastic#8823
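For illustration, the core of such a per-shard check boils down to reading the latest segments commit and letting Lucene's format exceptions surface. This is only a minimal sketch, not the actual `GatewayMetaState` code, and it assumes a recent Lucene API (`SegmentInfos.readLatestCommit`, `FSDirectory.open(Path)`):

```java
import java.io.IOException;
import java.nio.file.Path;

import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexFormatTooNewException;
import org.apache.lucene.index.IndexFormatTooOldException;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public final class ShardFormatCheck {

    /**
     * Reads the latest segments commit of a shard's index directory and turns
     * Lucene's format exceptions into a hard failure so the node refuses to start.
     */
    public static void checkShardIndexFormat(Path shardIndexPath) throws IOException {
        try (Directory dir = FSDirectory.open(shardIndexPath)) {
            // throws if the index format is too old, too new, or the commit is corrupted
            SegmentInfos.readLatestCommit(dir);
        } catch (IndexFormatTooOldException | IndexFormatTooNewException ex) {
            throw new IllegalStateException("index format of shard [" + shardIndexPath
                    + "] is not supported by this node", ex);
        } catch (CorruptIndexException ex) {
            throw new IllegalStateException("shard [" + shardIndexPath + "] is corrupted", ex);
        }
    }
}
```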
        throw ex;
      }
    } finally {
      IOUtils.close(dirs);
Should this be a `.closeWhileHandlingException`, or do we want to make sure we can close them properly also?
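For reference, the two variants differ in how close failures are reported; a quick sketch assuming Lucene's `org.apache.lucene.util.IOUtils` and a hypothetical `dirs` list:

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.store.Directory;
import org.apache.lucene.util.IOUtils;

class CloseExample {

    // Propagates the first exception thrown while closing (remaining resources
    // are still closed), so a failure to close surfaces to the caller.
    void closeStrictly(List<Directory> dirs) throws IOException {
        IOUtils.close(dirs);
    }

    // Swallows any exception thrown while closing; typically used when another
    // exception is already in flight and must not be masked.
    void closeQuietly(List<Directory> dirs) {
        IOUtils.closeWhileHandlingException(dirs);
    }
}
```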
I think that we need to be smarter in writing the index metadata. With this change, if I read it correctly, we write all the indices' metadata on all data nodes, regardless of whether a shard for that index is allocated on them or not (the same logic as we use on the master nodes). I think that we should only write the index metadata on a data node if a shard for that index is allocated on it.

I like the fact that once we do, and the master nodes' data is lost, the index will automatically be imported as a dangling index. I think it would be great to have a test for that though: start a dedicated master node and a dedicated data node, clean the master node's state on disk, and verify that the index is imported.

Also, I wonder if we should separate this into 2 PRs: one to write the index metadata on data nodes that hold the relevant shard, and one that holds the logic explained in the title of this PR, which is failing node startup if it holds an index that is too old.
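A rough sketch of the "only write metadata for indices with a local shard" idea suggested above; the wrapper class and the `writeIndexState` hook are hypothetical, and it assumes older (1.x/2.x-era) routing APIs where `ShardRouting.index()` returns the index name as a String:

```java
import java.util.HashSet;
import java.util.Set;

import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.metadata.IndexMetaData;
import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRouting;

class LocalIndexMetaDataWriter {

    // Persist metadata only for indices that actually have a shard on this node.
    void writeLocalIndexStates(ClusterState clusterState, String localNodeId) {
        Set<String> indicesWithLocalShards = new HashSet<>();
        RoutingNode routingNode = clusterState.getRoutingNodes().node(localNodeId);
        if (routingNode != null) {
            for (ShardRouting shard : routingNode) {
                indicesWithLocalShards.add(shard.index());
            }
        }
        for (String index : indicesWithLocalShards) {
            IndexMetaData indexMetaData = clusterState.metaData().index(index);
            if (indexMetaData != null) {
                writeIndexState(indexMetaData); // hypothetical on-disk persistence hook
            }
        }
    }

    void writeIndexState(IndexMetaData indexMetaData) {
        // placeholder for the actual state write
    }
}
```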
+1 to what @kimchy said about not writing index states to nodes that don't hold a shard of that index.
I don't think we solve this problem by preventing a node from starting. I'm thinking about a cluster with many indices, a couple of which still have segments from 3.x. Depending on node join order we might start allocating shards and starting them before the node with the 3.x segments joins. Also, it might be that one node not joining is not a show stopper (we typically advise people to use the expected_nodes settings) and the cluster will continue without the shards in question, potentially go to a RED state, but accept indexing on other indices/shards. I don't see how to solve this in a watertight way (might be wrong, of course).

I thought of another approach I would like to propose. We change the LocalGatewayAllocator to reach out to all the nodes of the cluster before allocating shards (we currently do it on a shard-by-shard basis). The nodes will then scan the local data dirs for older segments and respond with that information. If any of the nodes respond with a too-old index, we shut down the cluster via a cluster block or something similar, which means it will still respond to API calls and can say what happened. We might consider just closing all the indices in the cluster.

It can still be the case that shards have already been allocated when a node with old shards joins the cluster. In this case we can't know for sure no indexing was done. We can potentially try to close the index in question. This is also tricky as that specific index may already have assigned shards. To avoid this we need to only allocate an index's shards when all of them can be allocated. Closing indices also has the upside of:

The last idea I wanted to suggest here is potentially just offering an upgrade tool to upgrade problematic shards without a full cluster downgrade. This can be done while the indices are closed. Sorry for the long comment. This is tricky.
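On the upgrade-tool idea: Lucene already ships `org.apache.lucene.index.IndexUpgrader`, which rewrites all segments with the current codec (it can only go back one major version, and the index must not be open for writing). A minimal sketch of what running it per shard directory could look like; the wrapper class is hypothetical and this is only the Lucene-level piece, not the cluster coordination:

```java
import java.io.IOException;
import java.nio.file.Path;

import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public final class ShardUpgradeTool {

    /**
     * Rewrites every segment of the shard's index with the current Lucene codec
     * so a newer node can read it. The shard's index must be closed while this runs.
     */
    public static void upgradeShard(Path shardIndexPath) throws IOException {
        try (Directory dir = FSDirectory.open(shardIndexPath)) {
            new IndexUpgrader(dir).upgrade();
        }
    }

    // Lucene also exposes the same functionality on the command line:
    //   java org.apache.lucene.index.IndexUpgrader -delete-prior-commits <shard index dir>
    // (lucene-core and lucene-backward-codecs need to be on the classpath)
}
```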
Moving out here - it seems like the progress over perfection approach is not something that should be applied here. If somebody wants to pick this up, feel free.