
Avoid loading shard metadata while closing #29140

Conversation

DaveCTurner
Contributor

If `ShardStateMetaData.FORMAT.loadLatestState` is called while a shard is
closing, the shard metadata directory may be deleted after its existence has
been checked but before the Lucene `Directory` has been created. When the
`Directory` is created, the just-deleted directory is brought back into
existence.

There are three places where `loadLatestState` is called in a manner that
leaves it open to this race. This change ensures that these calls occur either
under a `ShardLock` or else while holding a reference to the existing `Store`.
In either case, this protects the shard metadata directory from concurrent
deletion.

Cf #19338, #21463, #25335 and https://issues.apache.org/jira/browse/LUCENE-7375
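A minimal sketch of the check-then-act race described above (names and layout are illustrative, not the actual Elasticsearch code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of the race: the existence check and the
// directory-opening step are not atomic, so a concurrent shard close can
// delete the directory in between, and opening it brings it back.
public class CheckThenCreateRace {

    static boolean loadStateFrom(Path shardDir) throws IOException {
        if (Files.exists(shardDir)) {              // existence check
            // ... a concurrent close may delete shardDir right here ...
            Files.createDirectories(shardDir);     // stands in for opening a
                                                   // Lucene Directory, which
                                                   // recreates the deleted dir
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("shard-meta");
        System.out.println(loadStateFrom(dir));    // prints true
    }
}
```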

@DaveCTurner DaveCTurner added >test-failure Triaged test failures from CI :Distributed Indexing/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. v7.0.0 v6.3.0 labels Mar 19, 2018
@DaveCTurner DaveCTurner requested review from bleskes and ywelsch March 19, 2018 15:24
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor Author

Note to reviewers: I have assumed a certain amount of consistency between IndicesService, IndexService, IndexShard and so on. I'm not sure how safe this is. Please tread carefully.

I also don't have a good plan for testing this. Pointers appreciated.

@DaveCTurner
Contributor Author

@bleskes, any thoughts here?

@bleskes
Contributor

bleskes commented Mar 26, 2018

Maybe it's a naive solution, but isn't it enough to just make sure all access in TransportNodesListGatewayStartedShards is done under the shard lock? I.e., first try to get an IndexShard and ask it to do what's needed; if there isn't one, acquire the shard lock and do what's needed. We already get the lock when validating the store.

@DaveCTurner
Contributor Author

We discussed this on Zoom, and decided that it'd be more appropriate to ask the IndexShard not to close the shard while we're calling ShardStateMetaData.FORMAT.loadLatestState(), instead of using the refcount on the Store directly. The reason is that we're not actually touching the Store, only the metadata folder, so locking on the Store is inappropriate.

NB the IndexService currently uses the Store#onClose event to trigger the deletion of the shard's directory. I was mistakenly interpreting this to mean that the Store owns the directory: in fact it's just the last thing to close.

Within IndexShard we could still use the Store's refcount to protect against concurrent deletion, but it'd be simpler just to use its mutex.

@DaveCTurner
Contributor Author

I tried this. I don't particularly like having the call to loadLatestState within IndexShard and would have preferred to pass in a lambda, but the IOException it throws makes that ugly. The other alternative I investigated was exposing the held mutex as an AutoCloseable so it could be used in a try-with-resources block at the caller, but this isn't obviously possible.
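For what it's worth, an AutoCloseable guard is straightforward with a ReentrantLock, though not with the plain synchronized monitor that IndexShard uses, which may be why the try-with-resources approach "isn't obviously possible" here. A hypothetical sketch:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch (not from the PR): an AutoCloseable lock guard works with a
// ReentrantLock, because lock() and unlock() need not be lexically scoped
// the way a synchronized block must be.
public class LockGuard implements AutoCloseable {
    private final ReentrantLock lock;

    private LockGuard(ReentrantLock lock) {
        this.lock = lock;
    }

    public static LockGuard acquire(ReentrantLock lock) {
        lock.lock();
        return new LockGuard(lock);
    }

    @Override
    public void close() {
        lock.unlock();
    }

    public static void main(String[] args) {
        ReentrantLock mutex = new ReentrantLock();
        try (LockGuard ignored = acquire(mutex)) {
            System.out.println(mutex.isHeldByCurrentThread()); // prints true
        }
        System.out.println(mutex.isHeldByCurrentThread());     // prints false
    }
}
```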

@bleskes
Contributor

bleskes commented Mar 28, 2018

I don't particularly like having the call to loadLatestState within IndexShard

Why don't you like it? IndexShard is already the one that writes it. Alternatively we can keep an in-memory copy of it, though I personally don't feel it's needed.

@DaveCTurner DaveCTurner added >bug and removed >test-failure Triaged test failures from CI labels Mar 28, 2018
@DaveCTurner
Contributor Author

Really, just that it involved importing things that weren't already there, which hinted that something was wrong. If you're good with it then that's enough. Next up is to try and get a failing test for this.

@bleskes
Contributor

bleskes commented Mar 28, 2018

Really, just that it involved importing things that weren't already there, which hinted that something was wrong. If you're good with it then that's enough. Next up is to try and get a failing test for this.

I think I'm missing something - IndexShard already writes the file, so I would expect the impact of reading it to be minimal?

@DaveCTurner
Contributor Author

DaveCTurner commented Apr 4, 2018

I added a test that fails occasionally on master (i.e. typically within the first run using -Dtests.iters=1000) and which makes it through 1000 runs with the other changes. I think it might be possible to simplify it - I guessed at 4 threads, 100 iterations etc., and can put some effort into trimming that down if you'd like.
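The general shape of such a stress test (a hypothetical sketch, not the PR's actual test) might be:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative stress-test shape: hammer a check-then-create path from
// several threads while the main thread deletes the directory, to provoke
// the race a small fraction of the time. Thread/iteration counts mirror the
// guesses mentioned above.
public class RaceStressSketch {
    public static void main(String[] args) throws Exception {
        final int threads = 4, iterations = 100;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Path dir = Files.createTempDirectory("shard-race");
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    for (int i = 0; i < iterations; i++) {
                        if (Files.exists(dir)) {
                            try {
                                Files.createDirectories(dir); // may resurrect a deleted dir
                            } catch (IOException ignored) {
                            }
                        }
                    }
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            Files.deleteIfExists(dir);                        // concurrent "shard close"
        } catch (IOException ignored) {
        }
        done.await();
        pool.shutdown();
        System.out.println("finished");
    }
}
```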

@DaveCTurner DaveCTurner left a comment

This is ready for another look @ywelsch and @bleskes.

throw new AlreadyClosedException(shardId + " can't load shard state metadata - shard is closed");
}

return ShardStateMetaData.FORMAT.loadLatestState(logger, namedXContentRegistry, dataLocations);
Contributor Author

Very useful, thanks. This makes things much simpler. I pushed 3eff6c9.

public ShardStateMetaData loadShardStateMetaDataIfOpen(NamedXContentRegistry namedXContentRegistry, Path[] dataLocations)
throws IOException {
synchronized (mutex) {
if (state == IndexShardState.CLOSED) {
Contributor Author

This check is not needed if making our own ShardStateMetaData so I will remove it.

@@ -2059,6 +2061,17 @@ public void startRecovery(RecoveryState recoveryState, PeerRecoveryTargetService
}
}

public ShardStateMetaData loadShardStateMetaDataIfOpen(NamedXContentRegistry namedXContentRegistry, Path[] dataLocations)
Contributor Author

As per comment below this is not needed since we can make our own ShardStateMetaData.

@@ -2059,6 +2061,17 @@ public void startRecovery(RecoveryState recoveryState, PeerRecoveryTargetService
}
}

public ShardStateMetaData loadShardStateMetaDataIfOpen(NamedXContentRegistry namedXContentRegistry, Path[] dataLocations)
Contributor Author

It was, I think, because otherwise it was possible we'd get hold of an IndexShard while it was closing and then fail to load the metadata since it'd already been deleted. However, as per comment below we don't need to touch the disk here.

@@ -139,7 +140,10 @@ private StoreFilesMetaData listStoreMetaData(ShardId shardId) throws IOException
return new StoreFilesMetaData(shardId, Store.MetadataSnapshot.EMPTY);
}
final IndexSettings indexSettings = indexService != null ? indexService.getIndexSettings() : new IndexSettings(metaData, settings);
final ShardPath shardPath = ShardPath.loadShardPath(logger, nodeEnv, shardId, indexSettings);
final ShardPath shardPath;
try (ShardLock ignored = nodeEnv.shardLock(shardId, TimeUnit.SECONDS.toMillis(5))) {
Contributor Author

I looked at how we could be in a situation in which the shard lock is unavailable for a long time. This'd be the case if the shard was open, but that means there's an IndexShard so we don't get here. More precisely, there are some circumstances in which we could get here and then fail to get the shard lock because the shard is now open, but retrying is the thing to do here.

All the other usages of the shard lock seem short-lived. They protect some IO (e.g. deleting the shards, etc) so may take some time, but not infinitely long.

Also, we obtain the same shard lock a few lines down, in Store.readMetadataSnapshot, unless ShardPath.loadShardPath returns null.

Could you clarify, @ywelsch?

ShardStateMetaData shardStateMetaData = ShardStateMetaData.FORMAT.loadLatestState(logger, NamedXContentRegistry.EMPTY,
nodeEnv.availableShardPaths(request.shardId));

ShardStateMetaData shardStateMetaData = safelyLoadLatestState(shardId);
Contributor Author

Ok, I moved this code around in 7f835cc. I'm not 100% comfortable with the changes made since I'm unfamiliar with all the invariants that may or may not hold here - please tread carefully.

@@ -138,7 +159,9 @@ protected NodeGatewayStartedShards nodeOperation(NodeRequest request) {
ShardPath shardPath = null;
try {
IndexSettings indexSettings = new IndexSettings(metaData, settings);
shardPath = ShardPath.loadShardPath(logger, nodeEnv, shardId, indexSettings);
try (ShardLock ignored = nodeEnv.shardLock(shardId, TimeUnit.SECONDS.toMillis(5))) {
Contributor Author

We obtain the same shard lock a few lines down, in Store.tryOpenIndex(...), unless ShardPath.loadShardPath returns null in which case we throw a different exception.

listingThread.start();
}

// Deleting an index asserts that it really is gone from disk, so no other assertions are necessary here.
Contributor Author

Good point, I pushed 48f6d46

@bleskes bleskes left a comment

LGTM. I would like to wait for @ywelsch's blessing as well.

if (indexShard != null) {
final ShardStateMetaData shardStateMetaData = indexShard.getShardStateMetaData();
final String allocationId = shardStateMetaData.allocationId != null ?
shardStateMetaData.allocationId.getId() : null;
logger.debug("{} shard state info found: [{}]", shardId, shardStateMetaData);
Contributor

this can be chatty. Can we move back to trace?

Contributor Author

Ok I pushed 7e58bc6

final IndexShard indexShard = indicesService.getShardOrNull(shardId);
if (indexShard != null) {
final ShardStateMetaData shardStateMetaData = indexShard.getShardStateMetaData();
final String allocationId = shardStateMetaData.allocationId != null ?
Contributor

allocationIds have been around since I don't know when. When can this be null?

Contributor Author

Its declaration says this:

@Nullable
public final AllocationId allocationId; // can be null if we read from legacy format (see fromXContent and MultiDataPathUpgrader)

There are lots of other null checks too. Maybe worth addressing separately?

@bleskes
Contributor

bleskes commented May 24, 2018 via email

@ywelsch ywelsch left a comment

I've left a few more asks and comments.

@@ -2065,6 +2065,12 @@ public void startRecovery(RecoveryState recoveryState, PeerRecoveryTargetService
}
}

public ShardStateMetaData getShardStateMetaData() {
synchronized (mutex) {
Contributor

we can avoid the mutex here. Just do a one-time volatile read of shardRouting (which is an immutable object); indexSettings is a final field and its UUID is immutable.
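The suggestion amounts to this pattern (illustrative names, not the actual IndexShard fields):

```java
// Sketch of the reviewer's suggestion: a single volatile read of an
// immutable snapshot object needs no lock, because the reference is
// published atomically and the object itself never changes.
public class VolatileSnapshot {
    static final class Routing {           // stands in for an immutable ShardRouting
        final boolean primary;
        Routing(boolean primary) { this.primary = primary; }
    }

    private volatile Routing shardRouting = new Routing(true);

    Routing getShardStateSnapshot() {
        return shardRouting;               // one volatile read; no mutex needed
    }

    public static void main(String[] args) {
        VolatileSnapshot s = new VolatileSnapshot();
        System.out.println(s.getShardStateSnapshot().primary); // prints true
    }
}
```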

Contributor Author

Good point, I pushed 1d4e044

@@ -139,7 +140,10 @@ private StoreFilesMetaData listStoreMetaData(ShardId shardId) throws IOException
return new StoreFilesMetaData(shardId, Store.MetadataSnapshot.EMPTY);
}
final IndexSettings indexSettings = indexService != null ? indexService.getIndexSettings() : new IndexSettings(metaData, settings);
final ShardPath shardPath = ShardPath.loadShardPath(logger, nodeEnv, shardId, indexSettings);
final ShardPath shardPath;
try (ShardLock ignored = nodeEnv.shardLock(shardId, TimeUnit.SECONDS.toMillis(5))) {
Contributor

In TransportNodesListGatewayStartedShards and in Store.readMetadataSnapshot, which we call below, we catch the ShardLockObtainFailedException and treat it either as an empty store (in case of TransportNodesListShardStoreMetaData) or as an ok target for primary allocation (see TransportNodesListGatewayStartedShards and PrimaryShardAllocator.buildNodeShardsResult), but we've made sure not to end up in a situation where the master goes into a potentially long retry loop (which causes a reroute storm on the master). I don't want to open this box of Pandora here, so my suggestion is to add

} catch (ShardLockObtainFailedException ex) {
    logger.info(() -> new ParameterizedMessage("{}: failed to obtain shard lock", shardId), ex);
    return new StoreFilesMetaData(shardId, Store.MetadataSnapshot.EMPTY);
}

here so as not to mess with existing behavior.

if (shardPath == null) {
throw new IllegalStateException(shardId + " no shard path found");
}
Store.tryOpenIndex(shardPath.resolveIndex(), shardId, nodeEnv::shardLock, logger);
Contributor

Instead of acquiring the shard lock for a second time, I would prefer if we would do it once, and move this call under that lock and just rename tryOpenIndex to tryOpenIndexUnderLock, removing the locking mechanism from it.

Same thing for TransportNodesListShardStoreMetaData. You can then also remove the ShardLocker interface, which irked me for a while.
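The proposed rename expresses a "caller holds the lock" precondition. A hypothetical sketch of the shape, with a ReentrantLock standing in for the ShardLock:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch (illustrative names): acquire the shard lock once at the call site
// and pass control into a method that assumes the lock is held, instead of
// having the callee re-acquire it.
public class UnderLockRefactor {
    private final ReentrantLock shardLock = new ReentrantLock();

    // Renamed to document the invariant rather than enforce it internally.
    boolean tryOpenIndexUnderLock() {
        assert shardLock.isHeldByCurrentThread() : "caller must hold the shard lock";
        return true; // stands in for actually opening the Lucene index
    }

    public static void main(String[] args) {
        UnderLockRefactor r = new UnderLockRefactor();
        r.shardLock.lock();
        try {
            System.out.println(r.tryOpenIndexUnderLock()); // prints true
        } finally {
            r.shardLock.unlock();
        }
    }
}
```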

Contributor Author

Ok, I pushed 61b4e4e and 8f1a5e2. Could you take another look, @ywelsch?

}

final ShardStateMetaData shardStateMetaData;
try (ShardLock ignored = nodeEnv.shardLock(shardId, TimeUnit.SECONDS.toMillis(5))) {
Contributor Author

Hmm, I just spotted this - there are still two calls to nodeEnv.shardLock here. TBH I don't know what we should be doing on failure of this one.

@DaveCTurner
Contributor Author

Thanks to @ywelsch for further guidance about failure cases. Thinking further about ShardLockObtainFailedException issues, I pushed 78c0526 but I see that this now makes the following code a bit pointless: failing to obtain a shard lock means the allocation ID will be null here.

final String finalAllocationId = allocationId;
if (nodeShardState.storeException() instanceof ShardLockObtainFailedException) {
logger.trace(() -> new ParameterizedMessage("[{}] on node [{}] has allocation id [{}] but the store can not be opened as it's locked, treating as valid shard", shard, nodeShardState.getNode(), finalAllocationId), nodeShardState.storeException());
} else {
logger.trace(() -> new ParameterizedMessage("[{}] on node [{}] has allocation id [{}] but the store can not be opened, treating as no allocation id", shard, nodeShardState.getNode(), finalAllocationId), nodeShardState.storeException());
allocationId = null;
}

@DaveCTurner
Contributor Author

This PR represents an actual issue, and all the other issues that point to it were closed in its favour, but the consequences of
#29140 (comment) make this whole idea start to unravel.

I would like to explore the idea of loading the metadata of every on-disk index much earlier in the lifecycle of a node, avoiding these concurrency issues (of course introducing different ones in their place, but perhaps the new ones will be less tricky).

@ywelsch
Contributor

ywelsch commented Mar 11, 2019

I think it makes sense to explore alternative ways of coordinating the loading of shard state metadata. We have fixed the current test failures by weakening the assertions on the existence of a shard folder after clean-up. As there is no immediate plan to work on this, I'm closing this one out.

@ywelsch ywelsch closed this Mar 11, 2019
@DaveCTurner DaveCTurner deleted the 2018-03-19-load-latest-shard-state-under-lock branch July 23, 2022 10:44
Labels
>bug :Distributed Indexing/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. >test Issues or PRs that are addressing/adding tests v6.4.1 v7.0.0-rc1 WIP