
Concurrent deletion of indices and master failure can cause indices to be reimported #11665

Closed
brwe opened this issue Jun 15, 2015 · 5 comments
Assignees
Labels
>bug :Distributed Indexing/Distributed help wanted adoptme v2.3.0 v5.0.0-alpha1

Comments

@brwe
Contributor

brwe commented Jun 15, 2015

Currently, a data node deletes indices by evaluating the cluster state. When a new cluster state comes in, it is compared to the last known cluster state; if the new state does not contain an index that was present in the last known state, that index is deleted.
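The diff described above can be sketched roughly as follows. This is an illustrative reduction, not Elasticsearch's actual API; the class and method names are made up, and the real implementation compares full index metadata rather than plain index names.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the deletion-by-diff logic: any index present in
// the previously known cluster state but absent from the new one is
// treated as deleted on the data node.
class IndexDiff {
    static Set<String> indicesToDelete(Set<String> previousState, Set<String> newState) {
        Set<String> toDelete = new HashSet<>(previousState);
        toDelete.removeAll(newState);
        return toDelete;
    }
}
```

Under this scheme, an empty incoming cluster state makes every local index a deletion candidate, which is exactly the failure mode described next.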

This could cause data to be deleted if the data folder of all master nodes was lost (#8823):

All master nodes of a cluster go down at the same time and their data folders cannot be recovered.
A new master is brought up but it does not have any indices in its cluster state because the data was lost.
Because all other nodes are data nodes, the new master cannot recover the cluster state from them either, and therefore sends a cluster state without any indices to the data nodes. The data nodes then delete all their data.

On the master branch we prevent this now by checking if the current cluster state comes from a different master than the previous one and if so, we keep the indices and import them as dangling (see #9952, ClusterChangedEvent).
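The master-change guard from #9952 can be sketched as below. This is a hedged simplification with illustrative names; the real check lives in ClusterChangedEvent and operates on node identity within the cluster state.

```java
// Sketch of the guard described above: deletions are only honored when the
// new cluster state comes from the same master as the previous one;
// otherwise the locally present indices are kept and imported as dangling.
class MasterChangeGuard {
    static boolean safeToActOnDeletions(String previousMasterId, String newMasterId) {
        // A null previous master means we have no baseline to trust.
        return previousMasterId != null && previousMasterId.equals(newMasterId);
    }
}
```

The trade-off is visible directly in this predicate: every legitimate deletion that happens to coincide with a master change is also suppressed, which is the problem the scenario below demonstrates.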

While this prevents the deletion, it also means that we might in other cases not delete indices although we should.

Example:

  1. Two master-eligible nodes, of which m1 is the elected master, and one data node (d).
  2. m1, m2, and d are all on cluster state version 1, which contains an index.
  3. The index is deleted through the API, causing m1 to send cluster state 2, which does not contain the index, to m2 and d; receiving it should trigger the actual index deletion.
  4. m1 goes down.
  5. m2 receives the new cluster state, but d does not (network issues etc.).
  6. m2 is elected master and sends cluster state 3, which again does not contain the index, to d.
  7. Because cluster state 3 comes from a different master than cluster state 1 (the last state d knows of), d does not delete the index and instead imports it back into the cluster.

Currently there is no way for a data node to decide whether an index should actually be deleted when the cluster state that triggers the deletion comes from a new master. We had to choose between: (1) deleting all data whenever a node receives an empty cluster state, or (2) running the risk of keeping indices around that should actually have been deleted.

We decided on (2) in #9952. I'm opening this issue so that this behavior is documented.

@clintongormley
Contributor

@brwe what about making the delete index request wait for responses from the data nodes? then the request can report success/failure?

@bleskes
Contributor

bleskes commented Jun 15, 2015

@clintongormley the delete index API does wait for data nodes to confirm the deletion. The above scenario will cause the call to time out (it waits for an ack from the data node that never comes). If people then check the cluster state, they will see that the index was deleted. However, at a later stage, once the data node rejoins the cluster under the new master, the index will be reimported.

@clintongormley
Contributor

Ok understood. +1

brwe added a commit to brwe/elasticsearch that referenced this issue Jun 15, 2015
Some of the tests for metadata are redundant. Also, since they
somewhat test service disruptions (starting a master with an empty
data folder), we might move them to DiscoveryWithServiceDisruptionsTests.
This commit also adds a test for
elastic#11665
brwe added a commit to brwe/elasticsearch that referenced this issue Jul 27, 2015
Some of the tests for metadata are redundant. Also, since they
somewhat test service disruptions (starting a master with an empty
data folder), we might move them to DiscoveryWithServiceDisruptionsTests.
This commit also adds a test for
elastic#11665
@clintongormley
Contributor

@bleskes is this still an issue?

@bleskes
Contributor

bleskes commented Jan 19, 2016

Sadly it is. However, thinking about it again, I realized that we can easily detect the “new empty master” danger by comparing cluster UUIDs - a new master will generate a new one. Agreed with marking as adoptme. Although it sounds scary, it's quite an easy fix and a good entry point into the cluster state universe. If anyone wants to pick this up, please ping me :)
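The cluster-UUID idea can be sketched like this. Names are illustrative, not the actual Elasticsearch implementation: the point is only that a freshly bootstrapped master generates a new cluster UUID, so a data node can tell a genuine deletion (same UUID, index gone) from a "reset" master that simply never knew about the index (different UUID).

```java
// Hedged sketch of the fix proposed above: honor a deletion only when the
// incoming cluster state carries the same cluster UUID the node already
// knows; a different UUID signals a freshly started ("reset") master whose
// empty state must not be trusted for deletions.
class ClusterUuidCheck {
    static boolean isResetMaster(String knownClusterUuid, String incomingClusterUuid) {
        return !knownClusterUuid.equals(incomingClusterUuid);
    }

    static boolean shouldDeleteMissingIndex(String knownClusterUuid, String incomingClusterUuid) {
        // Same UUID: the master that deleted the index is part of the same
        // cluster incarnation, so the deletion is genuine.
        return !isResetMaster(knownClusterUuid, incomingClusterUuid);
    }
}
```

This resolves the dilemma from the issue description: scenario step 6 (same cluster, new master) now correctly deletes, while the lost-data-folder case (new cluster UUID) still imports the indices as dangling.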

On 18 Jan 2016, at 21:28, Clinton Gormley [email protected] wrote:

@bleskes is this still an issue?



@abeyad abeyad self-assigned this Feb 21, 2016
abeyad pushed a commit to abeyad/elasticsearch that referenced this issue Mar 1, 2016
If a node was isolated from the cluster while a delete was happening,
the node used to ignore the delete operation when rejoining, because we
couldn't detect whether the new master genuinely deleted the indices or
is a fresh "reset" master that was started without the old data folder.
We can now detect these reset masters and actually delete the indices
on the node when it is not the reset-master case.

Note that this new protection doesn't hold if the node was shut down. In
that case its indices will still be imported as dangling indices.

Closes elastic#11665
@abeyad abeyad closed this as completed in 83d1e09 Mar 1, 2016
@clintongormley clintongormley added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Cluster labels Feb 13, 2018
fixmebot bot referenced this issue in VectorXz/elasticsearch Apr 22, 2021
fixmebot bot referenced this issue in VectorXz/elasticsearch May 28, 2021
fixmebot bot referenced this issue in VectorXz/elasticsearch Aug 4, 2021