Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Bulk Fetch SnapshotInfo API to Repository #73570

Conversation

original-brownbear
Copy link
Member

@original-brownbear original-brownbear commented May 31, 2021

This PR refactors the Repository API for fetching SnapshotInfo to enabled implementations to optimize for bulk fetching multiple SnapshotInfo at once. This is a requirement for making use of a more efficient repository format that does not require loading individual blobs per snapshot to fetch a snapshot listing. Also, by enabling consuming SnapshotInfo as they are fetched on the snapshot meta thread this allows for some more memory efficient usage of snapshot listing.
Also, this commit makes use of the new API to make the snapshot status API run a little more parallel if fetching multiple snapshots (though there's additional improvements possible+useful here as far as fetching shard level metadata in parallel).

@original-brownbear original-brownbear added WIP :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >refactoring labels May 31, 2021
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 31, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@original-brownbear original-brownbear marked this pull request as draft May 31, 2021 16:29
@original-brownbear original-brownbear changed the title [WIP] Add Bulk Fetch SnapshotInfo API to Repository Add Bulk Fetch SnapshotInfo API to Repository Jun 2, 2021
@original-brownbear original-brownbear marked this pull request as ready for review June 2, 2021 12:54
@original-brownbear original-brownbear marked this pull request as draft June 2, 2021 13:57
@original-brownbear original-brownbear marked this pull request as ready for review June 7, 2021 08:17
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jun 13, 2021
With work to make repo APIs more async incoming in elastic#73570
we need a non-blocking way to run this check. This adds that async
check and removes the need to manually pass executors around as well.
original-brownbear added a commit that referenced this pull request Jun 13, 2021
With work to make repo APIs more async incoming in #73570
we need a non-blocking way to run this check. This adds that async
check and removes the need to manually pass executors around as well.
(endTime == 0 ? threadPool.absoluteTimeInMillis() : endTime) - startTime
)
);
if (snapshotsInProgress.snapshot(new Snapshot(repositoryName, snapshotId)) == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure to understand this: we found a snapshot with a matching name in the repository data but we did not found it before in the in progress snapshots, how could it be in snapshotsInProgress?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well spotted :) I think this is an impossible race to run into these days. I'll leave it as is here for now and will open a PR to clean this up from master if possible separately today :)

public void getSnapshotInfo(GetSnapshotInfoContext context) {
// put snapshot info downloads into a task queue instead of pushing them all into the queue to not completely monopolize the
// snapshot meta pool for a single request
final int workers = Math.min(threadPool.info(ThreadPool.Names.SNAPSHOT_META).getMax(), context.snapshotIds().size());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe check context.done() and context.isCancelled() first here too and return directly? This would avoid to queue runnables in the snapshot meta thread pool for nothing (yet it is unlikely to happen)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured that wasn't really worth it since we haven't forked yet. The chance of a concurrent cancellation is almost 0 and the cannot have been failed yet concurrently ever I think.

} else {
final List<SnapshotStatus> threadSafeBuilder = Collections.synchronizedList(builder);
repositoriesService.repository(repositoryName)
.getSnapshotInfo(new GetSnapshotInfoContext(snapshotIdsToLoad, true, () -> false, (context, snapshotInfo) -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe pass task::isCancelled instead of () -> false ? The task cancellation is verified also in snapshotShards() few lines after but maybe we could be more explicit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ right, this PR was opened before the status action became cancellable, will fix :)

shardStatuses = snapshotShards(repositoryName, repositoryData, task, snapshotInfo);
} catch (Exception e) {
// stops all further fetches of snapshotInfo since context is fail-fast
context.onFailure(e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could turn the GetSnapshotInfoContext's biconsumer into a checked consumer and let `GetSnapshotInfoContext catch exceptions and fail the context? (see other comment below)

* {@link BiConsumer} invoked for each {@link SnapshotInfo} that is fetched with this instance and the {@code SnapshotInfo} as
* arguments.
*/
private final BiConsumer<GetSnapshotInfoContext, SnapshotInfo> onSnapshotInfo;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this could be a CheckedConsumer<SnapshotInfo, IOException> onSnapshotInfo and handle failure in case of exception directly in this class rather than the consumer holds a reference to the context in order to be able to fail it.

(Unless this is required by an upcoming change?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Unless this is required by an upcoming change?)

Yes I think I'd like the flexibility for upcoming changes. The fact that we currently fetch the shard level metadata on the thread that resolves the snapshot info listener is not supposed to stick around as a solution so we need this mechanism to fail a forking callback here.

} catch (NoSuchFileException ex) {
failure = new SnapshotMissingException(metadata.name(), snapshotId, ex);
} catch (IOException | NotXContentException ex) {
failure = new SnapshotException(metadata.name(), snapshotId, "failed to get snapshots", ex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
failure = new SnapshotException(metadata.name(), snapshotId, "failed to get snapshots", ex);
failure = new SnapshotException(metadata.name(), snapshotId, "failed to get snapshot info for " + snapshotId, ex);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to add the snapshot id to the message IMO, we already have it as a field on the snapshot exception and that will be serialized. But it should be singular here in the message I think :)

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not much to add over Tanguy's thorough review, just a few comments on habitability.

BiConsumer<GetSnapshotInfoContext, SnapshotInfo> onSnapshotInfo,
ActionListener<Void> listener
) {
assert snapshotIds.isEmpty() == false : "no snapshot ids to fetch given";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should either forbid this in production too (e.g. throw an IllegalArgumentException) or else just allow it and deal with the n=0 case ourselves. I'd rather the latter I think but either way is ok.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with throwing for now. I think we shouldn't create the case where we pass an empty list to this API, just like we don't allow a 0 sized GroupedActionListener, otherwise we'd have to resolve the listener in the constructor right away I guess, which I find confusing.

@original-brownbear
Copy link
Member Author

Thanks Tanguy + David! I think I addressed all comments now and this should be good for another round :)

@@ -182,8 +185,15 @@ public SnapshotInfo getSnapshotInfo(SnapshotId snapshotId) {
ArrayList<String> indices = new ArrayList<>(indicesMap.size());
indicesMap.keysIt().forEachRemaining(indices::add);

return new SnapshotInfo(snapshotId, indices, new ArrayList<>(metadata.dataStreams().keySet()), Collections.emptyList(),
response.getState().getNodes().getMaxNodeVersion(), SnapshotState.SUCCESS
context.onResponse(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh oops I just saw this. Think we need to take steps to put this on the SNAPSHOT_META thread now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timing #73570 (comment) :D on it shortly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 94f44ed now

@original-brownbear
Copy link
Member Author

Ah damn the new assertion breaks the CCR repo which resolves the listener straight away, will fix in ~30 minutes

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; I haven't reviewed the changes in TransportSnapshotsStatusAction terribly deeply, they look roughly ok but I'd have to spend more time on them to be sure and it looks like Tanguy has already done that.

Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the extra iterations

@original-brownbear
Copy link
Member Author

Thanks Tanguy + David!

@original-brownbear original-brownbear merged commit dbb626a into elastic:master Jun 14, 2021
@original-brownbear original-brownbear deleted the more-efficient-snapshot-info-api branch June 14, 2021 17:17
original-brownbear added a commit that referenced this pull request Jun 17, 2021
Pagination and snapshots for get snapshots API, build on top of the current implementation to enable work that needs this API for testing. A follow-up will leverage the changes to make things more efficient via pagination.

Relates #73570 which does part of the under-the-hood changes required to efficiently implement this API on the repository layer.
original-brownbear added a commit that referenced this pull request Jun 29, 2021
)

Backport of the recently introduced snapshot pagination and scalability improvements listed below.
Merged as a single backport because the `7.x` and master snapshot status API logic had massively diverged between master and 7.x. With the work in the below PRs, the logic in master and 7.x once again has been aligned very closely again.

#72842
#73172
#73199
#73570 
#73952
#74236 
#74451 (this one is only partly applicable as it was mainly a change to master to align `master` and `7.x` branches)
@original-brownbear original-brownbear restored the more-efficient-snapshot-info-api branch April 18, 2023 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >refactoring Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v7.14.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants