Implement prototype remote store directory/index input for search #7417
Gradle Check (Jenkins) Run Completed with:
return NoLockFactory.INSTANCE.obtainLock(null, null);
}

static class NoopIndexOutput extends IndexOutput {
Let's pull the common functionality, including this class, out to a new class RemoteDirectory, which can be extended by RemoteSearchDirectory and RemoteSnapshotDirectory.
public abstract class RemoteDirectory extends Directory {
    // ... all empty and no-op based methods ...

    static class NoopIndexOutput extends IndexOutput {
    }
}

public final class RemoteSnapshotDirectory extends RemoteDirectory {}
public final class RemoteSearchDirectory extends RemoteDirectory {}
That should prevent a bunch of NoOp duplication across classes, and can be overridden by specific implementations in future phases, if needed.
RemoteDirectory is already the name of an existing class (ref), but I see your point; we can pull the common part out into another class.
protected IndexInput fetchBlock(int blockId) throws IOException {
    final String blockFileName = uploadedSegmentMetadata.getUploadedFilename() + "." + blockId;

    final long blockStart = getBlockStart(blockId);
I think you are missing some of the block calculation here:
final long blockStart = getBlockStart(blockId);
final long blockEnd = blockStart + getActualBlockSize(blockId);
final int part = (int) (blockStart / partSize);
final long partStart = part * partSize;
final long position = blockStart - partStart;
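For reference, the calculation above can be exercised in isolation. The following is a minimal sketch, not the code from this PR: the class BlockPartMath, its Location holder, and the explicit blockSize, partSize, and fileLength parameters are hypothetical stand-ins for state that the real IndexInput would hold. It assumes that every block except the last has exactly blockSize bytes and that partSize is a multiple of blockSize, so a block never spans two parts.

```java
/** Illustrative only: maps a block of a file to the part that holds it. */
final class BlockPartMath {

    /** Where a block lives: the part index, the offset inside that part, and the bytes to read. */
    static final class Location {
        final int part;
        final long position;
        final long length;

        Location(int part, long position, long length) {
            this.part = part;
            this.position = position;
            this.length = length;
        }
    }

    /** Assumption: every block except the last has exactly blockSize bytes. */
    static long getBlockStart(int blockId, long blockSize) {
        return (long) blockId * blockSize;
    }

    /** The last block may be shorter than blockSize. */
    static long getActualBlockSize(int blockId, long blockSize, long fileLength) {
        return Math.min(blockSize, fileLength - getBlockStart(blockId, blockSize));
    }

    /** Assumption: partSize is a multiple of blockSize, so a block never spans two parts. */
    static Location locate(int blockId, long blockSize, long partSize, long fileLength) {
        final long blockStart = getBlockStart(blockId, blockSize);
        final long blockEnd = blockStart + getActualBlockSize(blockId, blockSize, fileLength);
        final int part = (int) (blockStart / partSize);
        final long partStart = part * partSize; // partSize is a long, so this multiplies as long
        final long position = blockStart - partStart;
        return new Location(part, position, blockEnd - blockStart);
    }

    public static void main(String[] args) {
        // 21-byte file, 8-byte blocks, 16-byte parts: block 2 is the short last block.
        Location loc = locate(2, 8, 16, 21);
        System.out.println(loc.part + " " + loc.position + " " + loc.length); // prints "1 0 5"
    }
}
```

With these toy sizes, block 2 starts at byte 16, which is the first byte of part 1, and only 5 bytes remain in the file.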
No part-size information or metadata is stored in the remote store right now. Either we can add a placeholder for it, or this can be a TODO for when part-size metadata is added to the remote store.
Let's close this for now. We'll use this as a part of the design @neetikasinghal is currently working on, but this won't be shipped as-is.
Description
Searchable snapshots implemented a RemoteSnapshotDirectory that provides access to files that are physically represented as a snapshot in a repository (the specific repository implementation is provided via a storage plugin). This task is to create a similar remote search-focused Directory implementation for searching remote-backed indexes stored in a repository. The bulk of the logic for the Lucene abstraction that implements the on-demand fetching and file caching is implemented in the class OnDemandBlockIndexInput and will be reused here.
The prototype uses the remote-backed storage metadata, rather than snapshot metadata, to identify segment file locations.
High-level Design
In the case of Searchable Snapshots, the FileInfo class (part of the BlobStoreIndexShardSnapshot class) contains the metadata for each segment/data file, such as file name and part size. The FileInfo object is used to build a BlobFetchRequest, which is used to fetch blocks of data from the remote store on the read path.
In the case of Remote Search, we can use the UploadedSegmentMetadata class to build the BlobFetchRequest in the same way as Searchable Snapshots.
In the current remote store upload flow, a listener registered for each refresh uploads or updates the metadata of each shard in the remote store. The metadata is currently stored under a RemoteDirectory:
<IndexUUID>/<Shard ID>/segments/metadata/
metadata__<Primary Term>__<Commit Generation>__<UUID>
<OriginalSegmentFilename>::<UploadedSegmentFilename>::<Checksum>
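The layout above can be read back with plain string splitting. The following is a rough sketch under stated assumptions, not the OpenSearch implementation: the class RemoteMetadataParser, its SegmentEntry holder, and the sample file names are hypothetical, and the primary term and commit generation are assumed to be written as decimal numbers.

```java
/** Illustrative parser for the metadata layout shown above; not the OpenSearch implementation. */
final class RemoteMetadataParser {

    /** One line of the metadata file. */
    static final class SegmentEntry {
        final String originalFilename;
        final String uploadedFilename;
        final String checksum;

        SegmentEntry(String originalFilename, String uploadedFilename, String checksum) {
            this.originalFilename = originalFilename;
            this.uploadedFilename = uploadedFilename;
            this.checksum = checksum;
        }
    }

    /** Parses "<OriginalSegmentFilename>::<UploadedSegmentFilename>::<Checksum>". */
    static SegmentEntry parseEntry(String line) {
        String[] fields = line.split("::");
        if (fields.length != 3) {
            throw new IllegalArgumentException("Malformed metadata entry: " + line);
        }
        return new SegmentEntry(fields[0], fields[1], fields[2]);
    }

    /**
     * Parses "metadata__<Primary Term>__<Commit Generation>__<UUID>" and returns
     * {primaryTerm, commitGeneration}. Assumption: both values are decimal numbers.
     */
    static long[] parseTermAndGeneration(String metadataFilename) {
        // Limit of 4 keeps any "__" inside the UUID in the last token.
        String[] tokens = metadataFilename.split("__", 4);
        if (tokens.length != 4 || !tokens[0].equals("metadata")) {
            throw new IllegalArgumentException("Malformed metadata file name: " + metadataFilename);
        }
        return new long[] { Long.parseLong(tokens[1]), Long.parseLong(tokens[2]) };
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        SegmentEntry entry = parseEntry("_0.si::_0.si__abc123::1x2y3z");
        long[] termAndGeneration = parseTermAndGeneration("metadata__3__7__abc123");
        System.out.println(entry.uploadedFilename + " term=" + termAndGeneration[0]
            + " generation=" + termAndGeneration[1]);
    }
}
```

The uploaded file name is the field a remote search directory would need to locate the segment in the repository; the original name is what Lucene asks for.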
Create a new Directory, RemoteSearchDirectory, and a new IndexInput subclass of OnDemandBlockIndexInput, OnDemandBlockSearchIndexInput.
Initialization of the directory
The search-based directory can be initialized in RemoteSearchDirectoryFactory, similar to the initialization of RemoteDirectory.
Upload/Update/Delete of metadata to the directory
Metadata is uploaded to the remote store after every commit and can be updated after each refresh.
Stale commit metadata files are deleted.
Content of metadata file
The metadata file contains the UploadedSegmentMetadata of each uploaded segment file.
Read/Loading the metadata file in the directory
A new OnDemandBlockSearchIndexInput class will be created to read the files in the newly created directory.
End to end flow for testing
create an index → remote backup → close the index → apply the remote search setting → open the index back → perform search on the index
Note: This testing setup works only for immutable indexes.
Related Issues
Remote Search RFC (#6528)
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.