Fail node startup if shards index format is too old / new #9515
Conversation
Today if a shard contains a segment that is from Lucene 3.x and therefore throws an `IndexFormatTooOldException`, the node goes into a wild allocation loop if the index is directly recovered from the gateway. If the problematic shard is allocated later for other reasons, the shard will fail allocation, and downgrading the cluster might be impossible since new segments in other indices have already been written.

This commit adds a sanity check to the `GatewayMetaState` that tries to read the `SegmentsInfo` for every shard on the node and fails if a shard is corrupted, the index is too new, etc.

With the new data_path-per-index feature, nodes might not have enough information unless they are master eligible, since we used to not persist the index and global state on nodes that are not master eligible. This commit changes this behavior and writes the state on all nodes that hold data. This is an enhancement in itself since data nodes that are not master eligible are not self-contained today.

This change also fixes the issue seen in elastic#8823 since metadata is written on all data nodes now.

Closes elastic#8823
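For illustration, the core of such a per-shard check boils down to reading the latest segments commit and letting Lucene's format exceptions surface. This is only a minimal sketch, not the actual `GatewayMetaState` code, and it assumes a recent Lucene API (`SegmentInfos.readLatestCommit`, `FSDirectory.open(Path)`):

```java
import java.io.IOException;
import java.nio.file.Path;

import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexFormatTooNewException;
import org.apache.lucene.index.IndexFormatTooOldException;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public final class ShardFormatCheck {

    /**
     * Reads the latest segments commit of a shard's index directory and turns
     * Lucene's format exceptions into a hard failure so the node refuses to start.
     */
    public static void checkShardIndexFormat(Path shardIndexPath) throws IOException {
        try (Directory dir = FSDirectory.open(shardIndexPath)) {
            // throws if the index format is too old, too new, or the commit is corrupted
            SegmentInfos.readLatestCommit(dir);
        } catch (IndexFormatTooOldException | IndexFormatTooNewException ex) {
            throw new IllegalStateException("index format of shard [" + shardIndexPath
                    + "] is not supported by this node", ex);
        } catch (CorruptIndexException ex) {
            throw new IllegalStateException("shard [" + shardIndexPath + "] is corrupted", ex);
        }
    }
}
```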
        throw ex;
      }
    } finally {
      IOUtils.close(dirs);
Should this be a `.closeWhileHandlingException`, or do we want to make sure we can close them properly also?
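For reference, the two variants differ in how close failures are reported; a quick sketch assuming Lucene's `org.apache.lucene.util.IOUtils` and a hypothetical `dirs` list:

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.store.Directory;
import org.apache.lucene.util.IOUtils;

class CloseExample {

    // Propagates the first exception thrown while closing (remaining resources
    // are still closed), so a failure to close surfaces to the caller.
    void closeStrictly(List<Directory> dirs) throws IOException {
        IOUtils.close(dirs);
    }

    // Swallows any exception thrown while closing; typically used when another
    // exception is already in flight and must not be masked.
    void closeQuietly(List<Directory> dirs) {
        IOUtils.closeWhileHandlingException(dirs);
    }
}
```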
I think that we need to be smarter in writing the index metadata. With this change, if I read it correctly, we write all the indices' metadata on all data nodes, regardless of whether a shard for that index is allocated on them or not (the same logic as we use on the master nodes). I think that we should only write the index metadata on a data node if a shard for that index is allocated on it.

I like the fact that once we do, and the master nodes' data is lost, the index will automatically be imported as a dangling index. I think it would be great to have a test for that though: start a dedicated master node and a dedicated data node, clean the master node's state on disk, and verify that the index is imported.

Also, I wonder if we should separate this into 2 PRs: one to write the index metadata on data nodes that hold the relevant shard, and one that holds the logic explained in the title of this PR, which is failing node startup if it holds an index that is too old.
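A rough sketch of the "only write metadata for indices with a local shard" idea suggested above; the wrapper class and the `writeIndexState` hook are hypothetical, and it assumes older (1.x/2.x-era) routing APIs where `ShardRouting.index()` returns the index name as a String:

```java
import java.util.HashSet;
import java.util.Set;

import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.metadata.IndexMetaData;
import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRouting;

class LocalIndexMetaDataWriter {

    // Persist metadata only for indices that actually have a shard on this node.
    void writeLocalIndexStates(ClusterState clusterState, String localNodeId) {
        Set<String> indicesWithLocalShards = new HashSet<>();
        RoutingNode routingNode = clusterState.getRoutingNodes().node(localNodeId);
        if (routingNode != null) {
            for (ShardRouting shard : routingNode) {
                indicesWithLocalShards.add(shard.index());
            }
        }
        for (String index : indicesWithLocalShards) {
            IndexMetaData indexMetaData = clusterState.metaData().index(index);
            if (indexMetaData != null) {
                writeIndexState(indexMetaData); // hypothetical on-disk persistence hook
            }
        }
    }

    void writeIndexState(IndexMetaData indexMetaData) {
        // placeholder for the actual state write
    }
}
```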
+1 to what @kimchy said about not writing index states to nodes that don't hold a shard of that index.
I don't think we solve this problem by preventing a node from starting. I'm thinking about a cluster with many indices, a couple of which still have segments from 3.x. Depending on node join order we might start allocating shards and starting them before the node with the 3.x segments joins. Also, it might be that one node not joining is not a show stopper (we typically advise people to use the expected_nodes settings) and the cluster will continue without the shards in question, potentially go to a RED state, but accept indexing on other indices/shards. I don't see how to solve this in a watertight way (might be wrong, of course).

I thought of another approach I would like to propose. We change the LocalGatewayAllocator to reach out to all the nodes of the cluster before allocating shards (we currently do it on a shard-by-shard basis). The nodes will then scan the local data dirs for older segments and respond with that information. If any of the nodes respond with a too-old index, we shut down the cluster via a cluster block or something similar, which means it will still respond to API calls and can say what happened. We might consider just closing all the indices in the cluster.

It can still be the case that shards have already been allocated when a node with old shards joins the cluster. In this case we can't know for sure no indexing was done. We can potentially try to close the index in question. This is also tricky as that specific index may already have assigned shards. To avoid this we need to only allocate an index's shards when all of them can be allocated. Closing indices also has the upside of:

The last idea I wanted to suggest here is potentially just offering an upgrade tool to upgrade problematic shards without a full cluster downgrade. This can be done while the indices are closed. Sorry for the long comment. This is tricky.
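On the upgrade-tool idea: Lucene already ships `org.apache.lucene.index.IndexUpgrader`, which rewrites all segments with the current codec (it can only go back one major version, and the index must not be open for writing). A minimal sketch of what running it per shard directory could look like; the wrapper class is hypothetical and this is only the Lucene-level piece, not the cluster coordination:

```java
import java.io.IOException;
import java.nio.file.Path;

import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public final class ShardUpgradeTool {

    /**
     * Rewrites every segment of the shard's index with the current Lucene codec
     * so a newer node can read it. The shard's index must be closed while this runs.
     */
    public static void upgradeShard(Path shardIndexPath) throws IOException {
        try (Directory dir = FSDirectory.open(shardIndexPath)) {
            new IndexUpgrader(dir).upgrade();
        }
    }

    // Lucene also exposes the same functionality on the command line:
    //   java org.apache.lucene.index.IndexUpgrader -delete-prior-commits <shard index dir>
    // (lucene-core and lucene-backward-codecs need to be on the classpath)
}
```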
Moving out here - it seems like the progress over perfection approach is not something that should be applied here. If somebody wants to pick this up, feel free.