Implement prototype remote store directory/index input for search #7417

Closed
wants to merge 3 commits into from


neetikasinghal
Contributor

@neetikasinghal neetikasinghal commented May 4, 2023

Description

Searchable snapshots implemented a RemoteSnapshotDirectory that provides access to files that are physically represented as a snapshot in a repository (the specific repository implementation is provided via a storage plugin). This task is to create a similar remote search-focused Directory implementation for searching remote-backed indexes stored in a repository. The bulk of the logic for the Lucene abstraction that implements the on-demand fetching and file caching is implemented in the class OnDemandBlockIndexInput and will be reused here.

The prototype uses the remote-backed storage metadata, rather than snapshot metadata, to identify segment file locations.

High-level Design

In the case of Searchable Snapshots, the FileInfo class (part of the BlobStoreIndexShardSnapshot class) contains the metadata for each segment/data file, such as the file name and part size. The FileInfo object is used to build a BlobFetchRequest, which is used to fetch/download blocks of data from the remote store on the read path.
In the case of Remote Search, we can use the UploadedSegmentMetadata class to build the BlobFetchRequest, similar to Searchable Snapshots.

In the current remote store upload flow, a listener is registered for each refresh and takes care of uploading/updating the metadata of each shard to the remote store. The metadata is currently stored under a RemoteDirectory:

  • Location - <IndexUUID>/<Shard ID>/segments/metadata/
  • File Name - metadata__<Primary Term>__<Commit Generation>__<UUID>
  • Content in the file for every segment file uploaded at each commit: <OriginalSegmentFilename>::<UploadedSegmentFilename>::<Checksum>
  • Length of the file name
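
As a rough illustration of the layout listed above, the following sketch parses a metadata file name and the per-segment `::`-separated entries. `SegmentMetadataSketch` and its nested classes are hypothetical names made up for this example and are not classes from this PR or from OpenSearch. It also assumes the primary term and commit generation are written as decimal numbers, and the sample file names and checksums are invented.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not a class from the PR or from OpenSearch.
final class SegmentMetadataSketch {

    static final String FILENAME_SEPARATOR = "__";
    static final String ENTRY_SEPARATOR = "::";

    // Parsed form of metadata__<Primary Term>__<Commit Generation>__<UUID>.
    static final class MetadataFileName {
        final long primaryTerm;
        final long generation;
        final String uuid;

        MetadataFileName(long primaryTerm, long generation, String uuid) {
            this.primaryTerm = primaryTerm;
            this.generation = generation;
            this.uuid = uuid;
        }
    }

    // Parsed form of <OriginalSegmentFilename>::<UploadedSegmentFilename>::<Checksum>.
    static final class UploadedSegment {
        final String originalFilename;
        final String uploadedFilename;
        final String checksum;

        UploadedSegment(String originalFilename, String uploadedFilename, String checksum) {
            this.originalFilename = originalFilename;
            this.uploadedFilename = uploadedFilename;
            this.checksum = checksum;
        }
    }

    static MetadataFileName parseFileName(String name) {
        // A limit of 4 keeps any later "__" inside the UUID part.
        String[] parts = name.split(FILENAME_SEPARATOR, 4);
        if (parts.length != 4 || !"metadata".equals(parts[0])) {
            throw new IllegalArgumentException("Unexpected metadata file name: " + name);
        }
        // Assumption: term and generation are decimal numbers.
        return new MetadataFileName(Long.parseLong(parts[1]), Long.parseLong(parts[2]), parts[3]);
    }

    static UploadedSegment parseEntry(String entry) {
        String[] parts = entry.split(ENTRY_SEPARATOR);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Unexpected metadata entry: " + entry);
        }
        return new UploadedSegment(parts[0], parts[1], parts[2]);
    }

    // Index the entries by original segment file name, keeping file order.
    static Map<String, UploadedSegment> parseEntries(List<String> entries) {
        Map<String, UploadedSegment> byOriginalName = new LinkedHashMap<>();
        for (String entry : entries) {
            UploadedSegment segment = parseEntry(entry);
            byOriginalName.put(segment.originalFilename, segment);
        }
        return byOriginalName;
    }

    public static void main(String[] args) {
        // All values below are invented sample data.
        MetadataFileName name = parseFileName("metadata__1__5__example-uuid");
        System.out.println(name.primaryTerm + " " + name.generation + " " + name.uuid);

        Map<String, UploadedSegment> segments = parseEntries(List.of(
            "_0.si::_0.si__example-suffix::1111",
            "_0.cfs::_0.cfs__example-suffix::2222"));
        System.out.println(segments.get("_0.cfs").uploadedFilename);
    }
}
```

A map keyed by the original segment file name is used here because a Directory is asked for files by their Lucene name, while the remote store holds them under the uploaded name.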

Create a new Directory, RemoteSearchDirectory, and a new IndexInput that inherits from OnDemandBlockIndexInput, OnDemandBlockSearchIndexInput.

Initialization of the directory
The search-based directory can be initialized in RemoteSearchDirectoryFactory, similar to the initialization of RemoteDirectory.

Upload/Update/Delete of metadata to the directory
In the remote store, metadata is uploaded after every commit and can be updated after each refresh.
Stale commit metadata files are deleted.

Content of metadata file
This contains the UploadedSegmentMetadata of each uploaded segment file.

Read/Loading the metadata file in the directory
The OnDemandBlockSearchIndexInput class will be created to read the files in the newly created directory.

End to end flow for testing
Create an index → remote backup → close the index → apply the remote search setting → reopen the index → perform a search on the index
Note: This testing setup works only for immutable indexes.

Related Issues

Remote Search RFC (#6528)

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

github-actions bot commented May 4, 2023

Gradle Check (Jenkins) Run Completed with:

return NoLockFactory.INSTANCE.obtainLock(null, null);
}

static class NoopIndexOutput extends IndexOutput {
Member

Let's pull the common functionalities including this class out to a new class RemoteDirectory which can be extended by RemoteSearchDirectory and RemoteSnapshotDirectory.

public abstract class RemoteDirectory extends Directory {
    ...
    // <All empty and Noop based methods>
    ...
    ...
    static class NoopIndexOutput extends IndexOutput {
    }
}
public final class RemoteSnapshotDirectory extends RemoteDirectory {}
public final class RemoteSearchDirectory extends RemoteDirectory {}

That should prevent a bunch of NoOp duplication across classes, and can be overridden by specific implementations in future phases, if needed.

Contributor Author

RemoteDirectory is already the name of another existing class (ref), but I see your point: we can pull the common part out into another class.

protected IndexInput fetchBlock(int blockId) throws IOException {
final String blockFileName = uploadedSegmentMetadata.getUploadedFilename() + "." + blockId;

final long blockStart = getBlockStart(blockId);
Member

I think you are missing some block calculation here -

        final long blockStart = getBlockStart(blockId);
        final long blockEnd = blockStart + getActualBlockSize(blockId);
        final int part = (int) (blockStart / partSize);
        final long partStart = part * partSize;

        final long position = blockStart - partStart;
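
To make the arithmetic in the snippet above concrete, here is a minimal standalone sketch with small numbers. `BlockMathSketch` is a hypothetical class for illustration and does not extend the real OnDemandBlockIndexInput. It assumes power-of-two block sizes expressed as a shift, a part size that is a multiple of the block size (so a block never spans two parts), and a part size supplied by the caller.

```java
// Hypothetical, self-contained sketch of the block/part arithmetic; not the OpenSearch class.
final class BlockMathSketch {
    private final int blockSizeShift; // assumption: block size is 2^blockSizeShift bytes
    private final long fileLength;    // total length of the segment file in bytes
    private final long partSize;      // assumption: supplied by the caller

    BlockMathSketch(int blockSizeShift, long fileLength, long partSize) {
        this.blockSizeShift = blockSizeShift;
        this.fileLength = fileLength;
        this.partSize = partSize;
    }

    // Offset of the first byte of the block within the whole file.
    long getBlockStart(int blockId) {
        return (long) blockId << blockSizeShift;
    }

    // The last block may be shorter than the nominal block size.
    long getActualBlockSize(int blockId) {
        long blockSize = 1L << blockSizeShift;
        return Math.min(blockSize, fileLength - getBlockStart(blockId));
    }

    // Exclusive end offset of the block within the whole file.
    long getBlockEnd(int blockId) {
        return getBlockStart(blockId) + getActualBlockSize(blockId);
    }

    // Index of the uploaded part that contains the start of the block.
    int getPart(int blockId) {
        return (int) (getBlockStart(blockId) / partSize);
    }

    // Offset of the block relative to the start of its part.
    long getPositionInPart(int blockId) {
        long partStart = (long) getPart(blockId) * partSize;
        return getBlockStart(blockId) - partStart;
    }

    public static void main(String[] args) {
        // 8-byte blocks, a 50-byte file, 32-byte parts (toy values).
        BlockMathSketch math = new BlockMathSketch(3, 50L, 32L);
        for (int blockId : new int[] { 5, 6 }) {
            System.out.println("block " + blockId
                + " start=" + math.getBlockStart(blockId)
                + " end=" + math.getBlockEnd(blockId)
                + " part=" + math.getPart(blockId)
                + " position=" + math.getPositionInPart(blockId));
        }
    }
}
```

With these toy values, block 5 starts at byte 40 of the file, falls in part 1 (which starts at byte 32), and is therefore read from position 8 within that part.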

Contributor Author

There is no part size information or metadata stored in the remote store right now. Either we can have a placeholder for it, or this can be a TODO for when the part size metadata is added to the remote store.

@github-actions
Contributor

github-actions bot commented Jun 2, 2023

Gradle Check (Jenkins) Run Completed with:

@andrross
Member

Let's close this for now. We'll use this as a part of the design @neetikasinghal is currently working on, but this won't be shipped as-is.

@andrross andrross closed this Jun 15, 2023