Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Remote Translog] Introduce remote translog with upload functionality #5392

Merged
merged 11 commits into from
Dec 27, 2022

Conversation

gbbafna
Copy link
Collaborator

@gbbafna gbbafna commented Nov 28, 2022

This is the first PR for remote translog support in OpenSearch .

The scope of the PR is just uploading translog files - tlog and ckp .

Signed-off-by: Gaurav Bafna [email protected]

Description

[Describe what this change achieves]

Issues Resolved

#5477

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@gbbafna gbbafna requested review from a team and reta as code owners November 28, 2022 15:31
@gbbafna gbbafna marked this pull request as draft November 28, 2022 15:31
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

public class LocalTranslog extends Translog {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the implementation details are not changed and these methods are just moved out of Translog.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. There are no changes to LocalTranslog .

@@ -29,7 +29,7 @@ public Translog newTranslog(
LongConsumer persistedSequenceNumberConsumer
) throws IOException {

return new Translog(
return new LocalTranslog(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Should the class name be also changed to LocalTranslogFactory?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Sachin .I will make this change in follow up PR along with introduction of RemoteTranslogFactory.

public void sync() throws IOException {
try {
if (syncToDisk()) {
prepareAndUpload(primaryTermSupplier.getAsLong(), null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if upload fails, how does it impact the operations already synced to local translog?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would not rollback those changes. Local translog can contain unacked changes for a while and be ahead of Remote Translog . However on next upload, these operations would also get uploaded to remote. So local and remote would be in sync again . This is the current behavior of translog as well, where it can persist unacknowledged writes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If upload fails we should ensure we are failing the request acknowledgement

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Previously we were not doing that. Have added that handling and added UTs as well.

fileTransferTracker::exclusionFilter
);
try {
final Checkpoint checkpoint = readCheckpoint(location);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this method is not overridden, does it mean it will always read checkpoint from local?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need to change this once we have the readCheckpoint from remote . Ideally on restart of the process , we want to read from local to minimize the bootup time. In rest of the cases, we want to download from remote to local and then do readCheckpoint . I will put up a ToDo for same.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we try to ensure that files are always in sync with the remote store at all times, we can avoid downloading if checksums match to boost recovery times.

prepareAndUpload(primaryTermSupplier.getAsLong(), null);
}

private boolean prepareAndUpload(Long primaryTerm, Long generation) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does prepareAndUpload upload only the latest sync'd tarnslog and checkpoint files? Does it assume existence of previous files in remote translog?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it doesn't . It uploads the difference b/w local and remote stores. The transferSnapshotProvider factors in all the local translog and checkpoint files and uses FileTransferTracker to know the already uploaded files.

@Override
public void sync() throws IOException {
try {
if (syncToDisk()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As syncToDisk() is called before uploading translog files to remote, does it mean RemoteFsTranslog has dependency over LocalTranslog? Is this a decorator pattern, where we can combine 2 Translog implementations?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the common dependencies are still in Translog. Hence RemoteFsTranslog doesn't depend upon LocalTranslog . You can argue that RemoteFsTranslog can extend LocalTranslog for now . But we have kept it separate for now as we might plan to not do a local sync at all and directly upload to remote store . And the evolution of RemoteFsTranslog will make it quite different from LocalTranslog.

@Bukhtawar Bukhtawar changed the title [Remote Translog] [Draft] - Changes to upload remote translog [Remote Translog] - Changes to upload remote translog Dec 7, 2022
@@ -120,7 +120,7 @@ public void reset() throws IOException {
public int read() throws IOException {
try {
return readByte() & 0xFF;
} catch (EOFException e) {
} catch (EOFException | ArrayIndexOutOfBoundsException e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would ArrayIndexOutOfBoundsException occur?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the read of the byte array is completed, AIOB is thrown and not EOF .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which byte array? Can you provide more detail about what is happening here? This doesn't strike me as the right place to fix the problem because it looks like this is covering up a behavior of a specific StreamInput, unless I'm misunderstanding something here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant BytesStreamInput . Yes, you are right . BytesStreamInput can throw AIOOBE not EOF . Hence , I added the handling for same .

// NOTE: AIOOBE not EOF if you read too much
@Override
public byte readByte() {
return bytes[pos++];
}

I agree to your concerns. Ideally BytesStreamInput should not throw AIOOBE, but return EOF . However modifying that can cause regressions in other places. Hence I am inclined to keep it as is for that reason. Let me know if you have any concerns regarding same.

/**
* Reads and returns a single byte.
*/
public abstract byte readByte() throws IOException;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that any StreamInput can be given to BufferedChecksumStreamInput. In any other case array index out of bounds could very well be a bug and this will just interpret it as end of file. I don't understand the reason for not throwing an EOFException, other than this is a copy from the corresponding Lucene class.

I'm inclined to fix the problem at the source rather than doing this for the reason cited above. Also, this is not a comprehensive fix...readBytes() and read() will continue to throw ArrayIndexOutOfBoundsException.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it . I have fixed BytesStreamInput now to return EOF instead of AIOOBE just like other implementations of StreamInput .

) throws IOException {
super(config, translogUUID, deletionPolicy, globalCheckpointSupplier, primaryTermSupplier, persistedSequenceNumberConsumer);
try {
final Checkpoint checkpoint = readCheckpoint(location);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we abstract readCheckpoint as well as this would be different for different Translog Implementation and might not be required for the other implementations?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we are reading only from local , hence keeping it as of now. We can make this change later, when we introduce a different Translog Implementation which needs a different readCheckpoint

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gbbafna this might be needed for replicas while doing replica-primary promotion - fyi.

* that a new generation is rolled when the term is increased. This guarantee allows to us to validate
* and reject operation whose term is higher than the primary term stored in the translog header.
* @param persistedSequenceNumberConsumer a callback that's called whenever an operation with a given sequence number is successfully
* persisted.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add to the class java doc @opensearch.internal

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -112,7 +111,7 @@
*
* @opensearch.internal
*/
public class Translog extends AbstractIndexShardComponent implements IndexShardComponent, Closeable {
public abstract class Translog extends AbstractIndexShardComponent implements IndexShardComponent, Closeable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going through the Translog class, it still is heavily tied to local filesystem. Is it possible to have abstractions such that we could have a Translog class which is agnostic to Local/Remote store and then a LocalTranslog and RemoteLocalTranslog so that we could have right abstractions when we want to become remote completely?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a big change and not needed urgently. Hence. we want to defer it as of now as we still write to local and then upload to remote . Downloading puts in local and then reads from local .

try (ReleasableLock ignored = readLock.acquire()) {
if (location.generation == current.getGeneration()) { // if we have a new one it's already synced
ensureOpen();
return current.syncUpTo(location.translogLocation + location.size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TranslogWrite also seems to be tied heavily with local disk. Is it possible to have an abstraction (TranslogWriter) here and an LocalTranslogWriter implementation that does required job. For RemoteLocalTranslog, the implementation could be again and extension of LocalTranslogWriter or some something similar?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed to your concerns . Currently we are doing this way as this is faster and safer way to release a working Remote Translog. Post our performance runs, if we see syncing to local disk is a significant performance hit, we will take that abstraction work , which is quite a significant work in itself.

Copy link
Member

@ashking94 ashking94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could do more on getting the right abstractions here.

@gbbafna gbbafna marked this pull request as ready for review December 9, 2022 04:12
@sachinpkale
Copy link
Member

Fixing Gradle Precommit failure.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 9, 2022

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Dec 9, 2022

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Dec 9, 2022

Gradle Check (Jenkins) Run Completed with:

@gbbafna gbbafna force-pushed the remote-txlog-upload-core branch from aa7e4f2 to ea33f76 Compare December 22, 2022 10:48
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@gbbafna
Copy link
Collaborator Author

gbbafna commented Dec 22, 2022

Overall changes look good to me. Few things:

1. Fix the build

2. Write PR title using imperative mood (we use squash and merge policy while merging. This makes PR title as commit message of merged PR)

3. Complete the checklist in the PR description

Thanks Sachin. Completed all of the above.

Build got fixed by the rebase.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Copy link
Member

@ashking94 ashking94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

if (current.totalOperations() == 0 && primaryTermSupplier.getAsLong() == current.getPrimaryTerm()) {
return;
}
prepareAndUpload(primaryTermSupplier.getAsLong(), null);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While ensureSynced uploads within a writeLock is this true even for rollGeneration. To ensure exactly one upload can happen at any point in time we need to ensure that the write lock is held by the current thread. Can we add those checks

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prepareAndUpload already does a writeLock acquire, which will guarantee that exactly one thread will be uploading the txlog files. Since Translog has a ReentrantReadWriteLock , it will ensure that only thread will be able to take the write log, and other threads will lie dormant, till the write lock is released .

Comment on lines +141 to +145
// Do we need remote writes in sync fashion ?
// If we don't , we should swallow FileAlreadyExistsException while writing to remote store
// and also verify for same during primary-primary relocation
// Writing remote in sync fashion doesn't hurt as global ckp update
// is not updated in remote translog except in primary to primary recovery.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Do we need a ToDo here, the comments aren't clear. We do need remote writes in sync, while the same file should be prevented from getting uploaded in the first place and if it cannot then we need to check if that got overridden on the remote store

}
}

private boolean upload(Long primaryTerm, Long generation) throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe check if lock is held by current thread. Please also add multi-threaded tests to ensure that replication requests and flush request don't get into concurrency issues

Comment on lines +184 to +188
if (ex instanceof IOException) {
throw (IOException) ex;
} else {
throw (RuntimeException) ex;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just throw the underlying exception?

Copy link
Collaborator

@Bukhtawar Bukhtawar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gbbafna LGTM

@Bukhtawar Bukhtawar merged commit 5cfae6e into opensearch-project:main Dec 27, 2022
@gbbafna gbbafna deleted the remote-txlog-upload-core branch December 28, 2022 12:30
sachinpkale pushed a commit to sachinpkale/OpenSearch that referenced this pull request Jan 9, 2023
…opensearch-project#5392)

* Introduce remote translog with upload functionality 

Signed-off-by: Gaurav Bafna <[email protected]>
Co-authored-by: Bukhtawar Khan <[email protected]>
@gbbafna gbbafna added the backport 2.x Backport to 2.x branch label Jan 9, 2023
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-5392-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 5cfae6e05ad3a686d32e0769a2b4c2749ed24e89
# Push it to GitHub
git push --set-upstream origin backport/backport-5392-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-5392-to-2.x.

gbbafna added a commit to gbbafna/OpenSearch that referenced this pull request Jan 9, 2023
…opensearch-project#5392)

* Introduce remote translog with upload functionality 

Signed-off-by: Gaurav Bafna <[email protected]>
Co-authored-by: Bukhtawar Khan <[email protected]>
gbbafna added a commit that referenced this pull request Jan 9, 2023
…hanges (#5757)

* Introduce TranslogFactory for Local/Remote Translog support (#4172)

* Introduce TranslogFactory for Local/Remote Translog support

Signed-off-by: Bukhtawar Khan <[email protected]>

* [Remote Translog] Introduce remote translog with upload functionality (#5392)

* Introduce remote translog with upload functionality 

Signed-off-by: Gaurav Bafna <[email protected]>
Co-authored-by: Bukhtawar Khan <[email protected]>

* Enable creation of indices using Remote Translog    (#5638)

* Enable creation of indices using Remote Translog behind a setting and feature flag
Signed-off-by: Gaurav Bafna <[email protected]>

* [Remote Translog] Add support for downloading files from remote translog (#5649)

* Add support to download translog from remote store during recovery

Signed-off-by: Sachin Kale <[email protected]>

* Integrate remote translog download on failover (#5699)

* Integrate remote translog download on failover

Signed-off-by: Ashish Singh <[email protected]>

Signed-off-by: Bukhtawar Khan <[email protected]>
Signed-off-by: Gaurav Bafna <[email protected]>
Signed-off-by: Sachin Kale <[email protected]>
Signed-off-by: Ashish Singh <[email protected]>
sachinpkale pushed a commit to sachinpkale/OpenSearch that referenced this pull request Jan 10, 2023
…hanges (opensearch-project#5757)

* Introduce TranslogFactory for Local/Remote Translog support (opensearch-project#4172)

* Introduce TranslogFactory for Local/Remote Translog support

Signed-off-by: Bukhtawar Khan <[email protected]>

* [Remote Translog] Introduce remote translog with upload functionality (opensearch-project#5392)

* Introduce remote translog with upload functionality 

Signed-off-by: Gaurav Bafna <[email protected]>
Co-authored-by: Bukhtawar Khan <[email protected]>

* Enable creation of indices using Remote Translog    (opensearch-project#5638)

* Enable creation of indices using Remote Translog behind a setting and feature flag
Signed-off-by: Gaurav Bafna <[email protected]>

* [Remote Translog] Add support for downloading files from remote translog (opensearch-project#5649)

* Add support to download translog from remote store during recovery

Signed-off-by: Sachin Kale <[email protected]>

* Integrate remote translog download on failover (opensearch-project#5699)

* Integrate remote translog download on failover

Signed-off-by: Ashish Singh <[email protected]>

Signed-off-by: Bukhtawar Khan <[email protected]>
Signed-off-by: Gaurav Bafna <[email protected]>
Signed-off-by: Sachin Kale <[email protected]>
Signed-off-by: Ashish Singh <[email protected]>
gbbafna added a commit that referenced this pull request Jan 10, 2023
* Introduce TranslogFactory for Local/Remote Translog support (#4172)

* Introduce TranslogFactory for Local/Remote Translog support

Signed-off-by: Bukhtawar Khan <[email protected]>

* [Remote Translog] Introduce remote translog with upload functionality (#5392)

* Introduce remote translog with upload functionality 

Signed-off-by: Gaurav Bafna <[email protected]>
Co-authored-by: Bukhtawar Khan <[email protected]>

* Enable creation of indices using Remote Translog    (#5638)

* Enable creation of indices using Remote Translog behind a setting and feature flag
Signed-off-by: Gaurav Bafna <[email protected]>

* [Remote Translog] Add support for downloading files from remote translog (#5649)

* Add support to download translog from remote store during recovery

Signed-off-by: Sachin Kale <[email protected]>

* Integrate remote translog download on failover (#5699)

* Integrate remote translog download on failover

Signed-off-by: Ashish Singh <[email protected]>

Signed-off-by: Bukhtawar Khan <[email protected]>
Signed-off-by: Gaurav Bafna <[email protected]>
Signed-off-by: Sachin Kale <[email protected]>
Signed-off-by: Ashish Singh <[email protected]>

Signed-off-by: Bukhtawar Khan <[email protected]>
Signed-off-by: Gaurav Bafna <[email protected]>
Signed-off-by: Sachin Kale <[email protected]>
Signed-off-by: Ashish Singh <[email protected]>
Co-authored-by: Gaurav Bafna <[email protected]>
kotwanikunal pushed a commit that referenced this pull request Jan 25, 2023
…hanges (#5757)

* Introduce TranslogFactory for Local/Remote Translog support (#4172)

* Introduce TranslogFactory for Local/Remote Translog support

Signed-off-by: Bukhtawar Khan <[email protected]>

* [Remote Translog] Introduce remote translog with upload functionality (#5392)

* Introduce remote translog with upload functionality 

Signed-off-by: Gaurav Bafna <[email protected]>
Co-authored-by: Bukhtawar Khan <[email protected]>

* Enable creation of indices using Remote Translog    (#5638)

* Enable creation of indices using Remote Translog behind a setting and feature flag
Signed-off-by: Gaurav Bafna <[email protected]>

* [Remote Translog] Add support for downloading files from remote translog (#5649)

* Add support to download translog from remote store during recovery

Signed-off-by: Sachin Kale <[email protected]>

* Integrate remote translog download on failover (#5699)

* Integrate remote translog download on failover

Signed-off-by: Ashish Singh <[email protected]>

Signed-off-by: Bukhtawar Khan <[email protected]>
Signed-off-by: Gaurav Bafna <[email protected]>
Signed-off-by: Sachin Kale <[email protected]>
Signed-off-by: Ashish Singh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants