
Avoid loading shard metadata while closing #29140

Conversation

DaveCTurner
Contributor

If `ShardStateMetaData.FORMAT.loadLatestState` is called while a shard is
closing, the shard metadata directory may be deleted after its existence has
been checked but before the Lucene `Directory` has been created. When the
`Directory` is created, the just-deleted directory is brought back into
existence.

There are three places where `loadLatestState` is called in a manner that
leaves it open to this race. This change ensures that these calls occur either
under a `ShardLock` or else while holding a reference to the existing `Store`.
In either case, this protects the shard metadata directory from concurrent
deletion.

Cf #19338, #21463, #25335 and https://issues.apache.org/jira/browse/LUCENE-7375
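A minimal sketch of the check-then-act race described above (names and layout are illustrative, not the actual Elasticsearch code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of the race: the existence check and the
// directory-opening step are not atomic, so a concurrent shard close can
// delete the directory in between, and opening it brings it back.
public class CheckThenCreateRace {

    static boolean loadStateFrom(Path shardDir) throws IOException {
        if (Files.exists(shardDir)) {              // existence check
            // ... a concurrent close may delete shardDir right here ...
            Files.createDirectories(shardDir);     // stands in for opening a
                                                   // Lucene Directory, which
                                                   // recreates the deleted dir
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("shard-meta");
        System.out.println(loadStateFrom(dir));    // prints true
    }
}
```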

@DaveCTurner DaveCTurner added >test-failure Triaged test failures from CI :Distributed Indexing/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. v7.0.0 v6.3.0 labels Mar 19, 2018
@DaveCTurner DaveCTurner requested review from bleskes and ywelsch March 19, 2018 15:24
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor Author

Note to reviewers: I have assumed a certain amount of consistency between IndicesService, IndexService, IndexShard and so on. I'm not sure how safe this is. Please tread carefully.

I also don't have a good plan for testing this. Pointers appreciated.

@DaveCTurner
Contributor Author

@bleskes, any thoughts here?

@bleskes
Contributor

bleskes commented Mar 26, 2018

Maybe it's a naive solution, but isn't it enough to just make sure all access in TransportNodesListGatewayStartedShards is done under the shard lock? I.e., first try to get an IndexShard and ask it to do what's needed; if there isn't one, acquire the shard lock and do what's needed. We already get the lock when validating the store.

@DaveCTurner
Contributor Author

We discussed this on Zoom, and decided that it'd be more appropriate to ask the IndexShard not to close the shard while we're calling ShardStateMetaData.FORMAT.loadLatestState(), instead of using the refcount on the Store directly. The reason is that we're not actually touching the Store, only the metadata folder, so locking on the Store is inappropriate.

NB the IndexService currently uses the Store#onClose event to trigger the deletion of the shard's directory. I was mistakenly interpreting this to mean that the Store owns the directory: in fact it's just the last thing to close.

Within IndexShard we could still use the Store's refcount to protect against concurrent deletion, but it'd be simpler just to use its mutex.

@DaveCTurner
Contributor Author

I tried this. I don't particularly like having the call to loadLatestState within IndexShard and would have preferred to pass in a lambda, but the IOException it throws makes that ugly. The other alternative I investigated was exposing the held mutex as an AutoCloseable so it could be used in a try-with-resources block at the caller, but this isn't obviously possible.
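For what it's worth, an AutoCloseable guard is straightforward with a ReentrantLock, though not with the plain synchronized monitor that IndexShard uses, which may be why the try-with-resources approach "isn't obviously possible" here. A hypothetical sketch:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch (not from the PR): an AutoCloseable lock guard works with a
// ReentrantLock, because lock() and unlock() need not be lexically scoped
// the way a synchronized block must be.
public class LockGuard implements AutoCloseable {
    private final ReentrantLock lock;

    private LockGuard(ReentrantLock lock) {
        this.lock = lock;
    }

    public static LockGuard acquire(ReentrantLock lock) {
        lock.lock();
        return new LockGuard(lock);
    }

    @Override
    public void close() {
        lock.unlock();
    }

    public static void main(String[] args) {
        ReentrantLock mutex = new ReentrantLock();
        try (LockGuard ignored = acquire(mutex)) {
            System.out.println(mutex.isHeldByCurrentThread()); // prints true
        }
        System.out.println(mutex.isHeldByCurrentThread());     // prints false
    }
}
```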

@bleskes
Contributor

bleskes commented Mar 28, 2018

I don't particularly like having the call to loadLatestState within IndexShard

Why don't you like it? IndexShard is already the one that writes it. Alternatively we can keep an in-memory copy of it, though I personally don't feel it's needed.

@DaveCTurner DaveCTurner added >bug and removed >test-failure Triaged test failures from CI labels Mar 28, 2018
@DaveCTurner
Contributor Author

Really, just that it involved importing things that weren't already there, which hinted that something was wrong. If you're good with it then that's enough. Next up is to try and get a failing test for this.

@bleskes
Contributor

bleskes commented Mar 28, 2018

Really, just that it involved importing things that weren't already there, which hinted that something was wrong. If you're good with it then that's enough. Next up is to try and get a failing test for this.

I think I'm missing something - IndexShard already writes the file, so I would expect the impact of reading it to be minimal?

@DaveCTurner
Contributor Author

DaveCTurner commented Apr 4, 2018

I added a test that fails occasionally on master (i.e. typically within the first run using -Dtests.iters=1000) and which makes it through 1000 runs with the other changes. I think it might be possible to simplify it - I guessed at 4 threads, 100 iterations etc., and can put some effort into trimming that down if you'd like.
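The general shape of such a stress test (a hypothetical sketch, not the PR's actual test) might be:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative stress-test shape: hammer a check-then-create path from
// several threads while the main thread deletes the directory, to provoke
// the race a small fraction of the time. Thread/iteration counts mirror the
// guesses mentioned above.
public class RaceStressSketch {
    public static void main(String[] args) throws Exception {
        final int threads = 4, iterations = 100;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Path dir = Files.createTempDirectory("shard-race");
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    for (int i = 0; i < iterations; i++) {
                        if (Files.exists(dir)) {
                            try {
                                Files.createDirectories(dir); // may resurrect a deleted dir
                            } catch (IOException ignored) {
                            }
                        }
                    }
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            Files.deleteIfExists(dir);                        // concurrent "shard close"
        } catch (IOException ignored) {
        }
        done.await();
        pool.shutdown();
        System.out.println("finished");
    }
}
```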

@DaveCTurner DaveCTurner left a comment

This is ready for another look @ywelsch and @bleskes.

throw new AlreadyClosedException(shardId + " can't load shard state metadata - shard is closed");
}

return ShardStateMetaData.FORMAT.loadLatestState(logger, namedXContentRegistry, dataLocations);
Contributor Author

Very useful, thanks. This makes things much simpler. I pushed 3eff6c9.

public ShardStateMetaData loadShardStateMetaDataIfOpen(NamedXContentRegistry namedXContentRegistry, Path[] dataLocations)
throws IOException {
synchronized (mutex) {
if (state == IndexShardState.CLOSED) {
Contributor Author

This check is not needed if making our own ShardStateMetaData so I will remove it.

@@ -2059,6 +2061,17 @@ public void startRecovery(RecoveryState recoveryState, PeerRecoveryTargetService
}
}

public ShardStateMetaData loadShardStateMetaDataIfOpen(NamedXContentRegistry namedXContentRegistry, Path[] dataLocations)
Contributor Author

As per comment below this is not needed since we can make our own ShardStateMetaData.

@@ -2059,6 +2061,17 @@ public void startRecovery(RecoveryState recoveryState, PeerRecoveryTargetService
}
}

public ShardStateMetaData loadShardStateMetaDataIfOpen(NamedXContentRegistry namedXContentRegistry, Path[] dataLocations)
Contributor Author

It was, I think, because otherwise it was possible we'd get hold of an IndexShard while it was closing and then fail to load the metadata since it'd already been deleted. However, as per comment below we don't need to touch the disk here.

@@ -139,7 +140,10 @@ private StoreFilesMetaData listStoreMetaData(ShardId shardId) throws IOException
return new StoreFilesMetaData(shardId, Store.MetadataSnapshot.EMPTY);
}
final IndexSettings indexSettings = indexService != null ? indexService.getIndexSettings() : new IndexSettings(metaData, settings);
final ShardPath shardPath = ShardPath.loadShardPath(logger, nodeEnv, shardId, indexSettings);
final ShardPath shardPath;
try (ShardLock ignored = nodeEnv.shardLock(shardId, TimeUnit.SECONDS.toMillis(5))) {
Contributor Author

I looked at how we could be in a situation in which the shard lock is unavailable for a long time. This'd be the case if the shard was open, but that means there's an IndexShard so we don't get here. More precisely, there are some circumstances in which we could get here and then fail to get the shard lock because the shard is now open, but retrying is the thing to do here.

All the other usages of the shard lock seem short-lived. They protect some IO (e.g. deleting the shards, etc) so may take some time, but not infinitely long.

Also, we obtain the same shard lock a few lines down, in Store.readMetadataSnapshot, unless ShardPath.loadShardPath returns null.

Could you clarify, @ywelsch?

ShardStateMetaData shardStateMetaData = ShardStateMetaData.FORMAT.loadLatestState(logger, NamedXContentRegistry.EMPTY,
nodeEnv.availableShardPaths(request.shardId));

ShardStateMetaData shardStateMetaData = safelyLoadLatestState(shardId);
Contributor Author

Ok, I moved this code around in 7f835cc. I'm not 100% comfortable with the changes made since I'm unfamiliar with all the invariants that may or may not hold here - please tread carefully.

@@ -138,7 +159,9 @@ protected NodeGatewayStartedShards nodeOperation(NodeRequest request) {
ShardPath shardPath = null;
try {
IndexSettings indexSettings = new IndexSettings(metaData, settings);
shardPath = ShardPath.loadShardPath(logger, nodeEnv, shardId, indexSettings);
try (ShardLock ignored = nodeEnv.shardLock(shardId, TimeUnit.SECONDS.toMillis(5))) {
Contributor Author

We obtain the same shard lock a few lines down, in Store.tryOpenIndex(...), unless ShardPath.loadShardPath returns null in which case we throw a different exception.

listingThread.start();
}

// Deleting an index asserts that it really is gone from disk, so no other assertions are necessary here.
Contributor Author

Good point, I pushed 48f6d46

@bleskes bleskes left a comment

LGTM. I would like to wait for @ywelsch's blessing as well.

if (indexShard != null) {
final ShardStateMetaData shardStateMetaData = indexShard.getShardStateMetaData();
final String allocationId = shardStateMetaData.allocationId != null ?
shardStateMetaData.allocationId.getId() : null;
logger.debug("{} shard state info found: [{}]", shardId, shardStateMetaData);
Contributor

this can be chatty. Can we move back to trace?

Contributor Author

Ok I pushed 7e58bc6

final IndexShard indexShard = indicesService.getShardOrNull(shardId);
if (indexShard != null) {
final ShardStateMetaData shardStateMetaData = indexShard.getShardStateMetaData();
final String allocationId = shardStateMetaData.allocationId != null ?
Contributor

allocationIds have been around since I don't know when. When can this be null?

Contributor Author

Its declaration says this:

@Nullable
public final AllocationId allocationId; // can be null if we read from legacy format (see fromXContent and MultiDataPathUpgrader)

There are lots of other null checks too. Maybe worth addressing separately?

@bleskes
Contributor

bleskes commented May 24, 2018 via email

@ywelsch ywelsch left a comment

I've left a few more asks and comments.

@@ -2065,6 +2065,12 @@ public void startRecovery(RecoveryState recoveryState, PeerRecoveryTargetService
}
}

public ShardStateMetaData getShardStateMetaData() {
synchronized (mutex) {
Contributor

we can avoid the mutex here. Just do a one-time volatile read of shardRouting (which is an immutable object); indexSettings is a final field and its UUID is immutable.
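The suggestion amounts to this pattern (illustrative names, not the actual IndexShard fields):

```java
// Sketch of the reviewer's suggestion: a single volatile read of an
// immutable snapshot object needs no lock, because the reference is
// published atomically and the object itself never changes.
public class VolatileSnapshot {
    static final class Routing {           // stands in for an immutable ShardRouting
        final boolean primary;
        Routing(boolean primary) { this.primary = primary; }
    }

    private volatile Routing shardRouting = new Routing(true);

    Routing getShardStateSnapshot() {
        return shardRouting;               // one volatile read; no mutex needed
    }

    public static void main(String[] args) {
        VolatileSnapshot s = new VolatileSnapshot();
        System.out.println(s.getShardStateSnapshot().primary); // prints true
    }
}
```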

Contributor Author

Good point, I pushed 1d4e044

@@ -139,7 +140,10 @@ private StoreFilesMetaData listStoreMetaData(ShardId shardId) throws IOException
return new StoreFilesMetaData(shardId, Store.MetadataSnapshot.EMPTY);
}
final IndexSettings indexSettings = indexService != null ? indexService.getIndexSettings() : new IndexSettings(metaData, settings);
final ShardPath shardPath = ShardPath.loadShardPath(logger, nodeEnv, shardId, indexSettings);
final ShardPath shardPath;
try (ShardLock ignored = nodeEnv.shardLock(shardId, TimeUnit.SECONDS.toMillis(5))) {
Contributor

In TransportNodesListGatewayStartedShards and in Store.readMetadataSnapshot, which we call below, we catch the ShardLockObtainFailedException and treat it either as an empty store (in case of TransportNodesListShardStoreMetaData) or as an ok target for primary allocation (see TransportNodesListGatewayStartedShards and PrimaryShardAllocator.buildNodeShardsResult), but we've made sure not to end up in a situation where the master goes into a potentially long retry loop (which causes a reroute storm on the master). I don't want to open this box of Pandora here, so my suggestion is to add

} catch (ShardLockObtainFailedException ex) {
    logger.info(() -> new ParameterizedMessage("{}: failed to obtain shard lock", shardId), ex);
    return new StoreFilesMetaData(shardId, Store.MetadataSnapshot.EMPTY);
}

here so as not to mess with existing behavior.

if (shardPath == null) {
throw new IllegalStateException(shardId + " no shard path found");
}
Store.tryOpenIndex(shardPath.resolveIndex(), shardId, nodeEnv::shardLock, logger);
Contributor

Instead of acquiring the shard lock for a second time, I would prefer if we would do it once, and move this call under that lock and just rename tryOpenIndex to tryOpenIndexUnderLock, removing the locking mechanism from it.

Same thing for TransportNodesListShardStoreMetaData. You can then also remove the ShardLocker interface, which irked me for a while.
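The proposed rename expresses a "caller holds the lock" precondition. A hypothetical sketch of the shape, with a ReentrantLock standing in for the ShardLock:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch (illustrative names): acquire the shard lock once at the call site
// and pass control into a method that assumes the lock is held, instead of
// having the callee re-acquire it.
public class UnderLockRefactor {
    private final ReentrantLock shardLock = new ReentrantLock();

    // Renamed to document the invariant rather than enforce it internally.
    boolean tryOpenIndexUnderLock() {
        assert shardLock.isHeldByCurrentThread() : "caller must hold the shard lock";
        return true; // stands in for actually opening the Lucene index
    }

    public static void main(String[] args) {
        UnderLockRefactor r = new UnderLockRefactor();
        r.shardLock.lock();
        try {
            System.out.println(r.tryOpenIndexUnderLock()); // prints true
        } finally {
            r.shardLock.unlock();
        }
    }
}
```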

Contributor Author

Ok, I pushed 61b4e4e and 8f1a5e2. Could you take another look, @ywelsch?

}

final ShardStateMetaData shardStateMetaData;
try (ShardLock ignored = nodeEnv.shardLock(shardId, TimeUnit.SECONDS.toMillis(5))) {
Contributor Author

Hmm, I just spotted this - there are still two calls to nodeEnv.shardLock here. TBH I don't know what we should be doing on failure of this one.

@DaveCTurner
Contributor Author

Thanks to @ywelsch for further guidance about failure cases. Thinking further about ShardLockObtainFailedException issues, I pushed 78c0526 but I see that this now makes the following code a bit pointless: failing to obtain a shard lock means the allocation ID will be null here.

final String finalAllocationId = allocationId;
if (nodeShardState.storeException() instanceof ShardLockObtainFailedException) {
logger.trace(() -> new ParameterizedMessage("[{}] on node [{}] has allocation id [{}] but the store can not be opened as it's locked, treating as valid shard", shard, nodeShardState.getNode(), finalAllocationId), nodeShardState.storeException());
} else {
logger.trace(() -> new ParameterizedMessage("[{}] on node [{}] has allocation id [{}] but the store can not be opened, treating as no allocation id", shard, nodeShardState.getNode(), finalAllocationId), nodeShardState.storeException());
allocationId = null;
}

@DaveCTurner
Contributor Author

This PR represents an actual issue, and all the other issues that point to it were closed in its favour, but the consequences of
#29140 (comment) make this whole idea start to unravel.

I would like to explore the idea of loading the metadata of every on-disk index much earlier in the lifecycle of a node, avoiding these concurrency issues (of course introducing different ones in their place, but perhaps the new ones will be less tricky).

@ywelsch
Contributor

ywelsch commented Mar 11, 2019

I think it makes sense to explore alternative ways of coordinating the loading of shard state metadata. We have fixed the current test failures by weakening the assertions on the existence of a shard folder after clean-up. As there is no immediate plan to work on this, I'm closing this one out.

@ywelsch ywelsch closed this Mar 11, 2019
@DaveCTurner DaveCTurner deleted the 2018-03-19-load-latest-shard-state-under-lock branch July 23, 2022 10:44
Labels
>bug :Distributed Indexing/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. >test Issues or PRs that are addressing/adding tests v6.4.1 v7.0.0-rc1 WIP