
Fail node startup if shards index format is too old / new #9515

Closed
s1monw wants to merge 1 commit

Conversation

s1monw commented Jan 30, 2015

Today, if a shard contains a segment from Lucene 3.x and therefore throws an IndexFormatTooOldException, the node goes into a wild allocation loop if the index is recovered directly from the gateway. If the problematic shard is allocated later for other reasons, the shard will fail allocation, and downgrading the cluster might be impossible since new segments in other indices have already been written.

This commit adds sanity checks to GatewayMetaState that try to read the SegmentInfos for every shard on the node and fail if a shard is corrupted, the index format is too new, etc.
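
As an illustration, a minimal sketch of such a per-shard check using Lucene's SegmentInfos API (method signatures follow recent Lucene versions; ShardFormatCheck and checkShard are hypothetical names, not the actual GatewayMetaState code):

```java
import java.nio.file.Path;

import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexFormatTooNewException;
import org.apache.lucene.index.IndexFormatTooOldException;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

final class ShardFormatCheck {

    /** Throws if the shard's segments are unreadable or written by an unsupported Lucene version. */
    static void checkShard(Path shardIndexPath) throws Exception {
        try (Directory dir = FSDirectory.open(shardIndexPath)) {
            // Reading the latest commit forces Lucene to validate the segments_N file, which
            // throws IndexFormatTooOldException / IndexFormatTooNewException for unsupported formats.
            SegmentInfos.readLatestCommit(dir);
        } catch (IndexFormatTooOldException | IndexFormatTooNewException | CorruptIndexException e) {
            throw new IllegalStateException("cannot open shard at [" + shardIndexPath + "]", e);
        }
    }
}
```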

With the new per-index data_path feature, nodes might not have enough information unless they are master-eligible, since we used to not persist the index and global state on nodes that are not master-eligible. This commit changes that behavior and writes the state on all nodes that hold data. This is an enhancement in itself, since data nodes that are not master-eligible are not self-contained today.

This change also fixes the issue seen in #8823 since metadata is now written on all data nodes.

Closes #8823

throw ex;
}
} finally {
IOUtils.close(dirs);

Should this be a .closeWhileHandlingException, or do we want to make sure we can close them properly also?
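
For reference, a sketch of the difference between the two IOUtils variants in question (not code from this PR):

```java
import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.util.IOUtils;

class CloseVariants {

    // IOUtils.close propagates the first close failure to the caller.
    static void closeStrict(Directory... dirs) throws IOException {
        IOUtils.close(dirs);
    }

    // IOUtils.closeWhileHandlingException is best-effort cleanup: close failures are swallowed.
    static void closeQuietly(Directory... dirs) {
        IOUtils.closeWhileHandlingException(dirs);
    }
}
```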

kimchy commented Feb 1, 2015

I think that we need to be smarter in writing the index metadata. With this change, if I read it correctly, we write all indices' metadata on all data nodes, regardless of whether a shard for that index is allocated on them or not (the same logic as we use on the master nodes). I think that we should only write the index metadata on a data node if there is a shard for that index allocated on it.
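
A rough sketch of that decision (illustrative only; method names approximate the 1.x API, and writeIndexState is a hypothetical stand-in for the GatewayMetaState on-disk write):

```java
import java.util.HashSet;
import java.util.Set;

import org.elasticsearch.cluster.metadata.IndexMetaData;
import org.elasticsearch.cluster.metadata.MetaData;
import org.elasticsearch.cluster.routing.ShardRouting;

class LocalIndexStateWriter {

    void writeLocallyRelevantIndices(MetaData metaData, Iterable<ShardRouting> shardsOnLocalNode) {
        // Collect the names of indices that actually have a shard allocated on this node.
        Set<String> localIndices = new HashSet<>();
        for (ShardRouting shard : shardsOnLocalNode) {
            localIndices.add(shard.index());
        }
        // Persist metadata only for those indices, instead of for every index in the cluster state.
        for (IndexMetaData indexMetaData : metaData) {
            if (localIndices.contains(indexMetaData.index())) {
                writeIndexState(indexMetaData);
            }
        }
    }

    void writeIndexState(IndexMetaData indexMetaData) {
        // hypothetical stand-in for GatewayMetaState's write of the index state to disk
    }
}
```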

I like the fact that once we do, and the master nodes' data is lost, it will automatically be imported as dangling indices. I think it would be great to have a test for it though: start a dedicated master node and a dedicated data node, clean the master node's state on disk, and verify that the index is imported.

Also, I wonder if we should separate it into 2 PRs: one to write the index metadata on data nodes that hold the relevant shard, and one with the logic explained in the title of this PR, which is to fail node startup if it holds an index that is too old.

bleskes commented Feb 2, 2015

+1 to what @kimchy said about not writing index states to nodes that don't hold a shard of that index.

If the problematic shard is allocated later due to other reasons the shard
will fail allocation and downgrading the cluster might be impossible since
new segments in other indices have already been written.

I don't think we solve this problem by preventing a node from starting. I'm thinking about a cluster with many indices, a couple of which still have segments from 3.x. Depending on node join order we might start allocating shards and starting them before the node with the 3.x segments joins. Also, one node not joining might not be a show stopper (we typically advise people to use the expected_nodes settings), and the cluster will continue without the shards in question, potentially ending up in a RED state but accepting indexing on other indices/shards.

I don't see how to solve this in a watertight way (I might be wrong, of course). I thought of another approach I would like to propose. We change the LocalGatewayAllocator to reach out to all the nodes of the cluster before allocating shards (we currently do it on a shard-by-shard basis). The nodes will then scan the local data dirs for older segments and respond with that information. If any of the nodes responds with a too-old index, we shut down the cluster via a cluster block or something similar, which means it will still respond to API calls and can say what happened. We might consider just closing all the indices in the cluster.
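
Schematically, the allocator-side decision could look like this (blockCluster and the node-id-to-result map are hypothetical; the real change would live in LocalGatewayAllocator):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class PreAllocationFormatCheck {

    /**
     * Each node reports the names of indices for which it found segments that are too old.
     * If any node reports such an index, refuse to allocate and block (or close) instead.
     */
    boolean mayAllocate(Map<String, Set<String>> tooOldIndicesByNodeId) {
        Set<String> tooOldIndices = new HashSet<>();
        for (Set<String> perNode : tooOldIndicesByNodeId.values()) {
            tooOldIndices.addAll(perNode);
        }
        if (!tooOldIndices.isEmpty()) {
            blockCluster(tooOldIndices); // hypothetical: apply a cluster block or close the indices
            return false;
        }
        return true;
    }

    void blockCluster(Set<String> indices) {
        // placeholder for applying a cluster block / closing the affected indices
    }
}
```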

It can still be the case that shards have already been allocated when a node with old shards joins the cluster. In this case we can't know for sure that no indexing was done. We can potentially try to close the index in question. This is also tricky, as that specific index may already have assigned shards. To avoid this we need to only allocate an index's shards when all of its shards can be allocated.

Closing indices also has the upside that:

  1. People can choose to just delete the problematic indices (think about time-based data where it may be easier to discard older data rather than upgrade it).
  2. People can downgrade their cluster even after indices were touched by the newer Lucene version. This will cause the newer indices to be closed but will still allow upgrading the older indices.

The last idea I wanted to suggest here is to potentially just offer an upgrade tool that upgrades problematic shards without a full cluster downgrade. This can be done while the indices are closed.
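
Lucene already ships a tool along these lines, IndexUpgrader, which rewrites all segments with the current codec; a sketch of pointing it at a closed shard's index directory (the path and the wrapper class are illustrative, and signatures follow recent Lucene versions):

```java
import java.nio.file.Paths;

import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

class ShardUpgrade {
    public static void main(String[] args) throws Exception {
        // args[0]: path to a shard's Lucene index directory, e.g. .../nodes/0/indices/<index>/<shard>/index
        try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
            // Rewrites old-format segments so the current Lucene version can read them again.
            new IndexUpgrader(dir).upgrade();
        }
    }
}
```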

Sorry for the long comment. This is tricky.

s1monw commented Feb 17, 2015

Moving on here - it seems like the progress-over-perfection approach should not be applied here. If somebody wants to pick this up, feel free.

s1monw closed this Feb 17, 2015