-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Remote Translog] Introduce remote translog with upload functionality #5392
[Remote Translog] Introduce remote translog with upload functionality #5392
Conversation
Gradle Check (Jenkins) Run Completed with:
|
import java.util.function.LongConsumer; | ||
import java.util.function.LongSupplier; | ||
|
||
public class LocalTranslog extends Translog { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming the implementation details are not changed and these methods are just moved out of Translog
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. There are no changes to LocalTranslog
.
@@ -29,7 +29,7 @@ public Translog newTranslog( | |||
LongConsumer persistedSequenceNumberConsumer | |||
) throws IOException { | |||
|
|||
return new Translog( | |||
return new LocalTranslog( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Should the class name be also changed to LocalTranslogFactory
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Sachin .I will make this change in follow up PR along with introduction of RemoteTranslogFactory
.
public void sync() throws IOException { | ||
try { | ||
if (syncToDisk()) { | ||
prepareAndUpload(primaryTermSupplier.getAsLong(), null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if upload fails, how does it impact the operations already synced to local translog?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would not rollback those changes. Local translog can contain unacked changes for a while and be ahead of Remote Translog . However on next upload, these operations would also get uploaded to remote. So local and remote would be in sync again . This is the current behavior of translog as well, where it can persist unacknowledged writes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If upload fails we should ensure we are failing the request acknowledgement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. Previously we were not doing that. Have added that handling and added UTs as well.
fileTransferTracker::exclusionFilter | ||
); | ||
try { | ||
final Checkpoint checkpoint = readCheckpoint(location); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this method is not overridden, does it mean it will always read checkpoint from local?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will need to change this once we have the readCheckpoint from remote . Ideally on restart of the process , we want to read from local to minimize the bootup time. In rest of the cases, we want to download from remote to local and then do readCheckpoint
. I will put up a ToDo for same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we try to ensure that files are always in sync with the remote store at all times, we can avoid downloading if checksums match to boost recovery times.
prepareAndUpload(primaryTermSupplier.getAsLong(), null); | ||
} | ||
|
||
private boolean prepareAndUpload(Long primaryTerm, Long generation) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does prepareAndUpload
upload only the latest sync'd tarnslog and checkpoint files? Does it assume existence of previous files in remote translog?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it doesn't . It uploads the difference b/w local and remote stores. The transferSnapshotProvider
factors in all the local translog and checkpoint files and uses FileTransferTracker
to know the already uploaded files.
@Override | ||
public void sync() throws IOException { | ||
try { | ||
if (syncToDisk()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As syncToDisk()
is called before uploading translog files to remote, does it mean RemoteFsTranslog
has dependency over LocalTranslog
? Is this a decorator pattern, where we can combine 2 Translog implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the common dependencies are still in Translog
. Hence RemoteFsTranslog
doesn't depend upon LocalTranslog
. You can argue that RemoteFsTranslog
can extend LocalTranslog
for now . But we have kept it separate for now as we might plan to not do a local sync at all and directly upload to remote store . And the evolution of RemoteFsTranslog
will make it quite different from LocalTranslog
.
@@ -120,7 +120,7 @@ public void reset() throws IOException { | |||
public int read() throws IOException { | |||
try { | |||
return readByte() & 0xFF; | |||
} catch (EOFException e) { | |||
} catch (EOFException | ArrayIndexOutOfBoundsException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would ArrayIndexOutOfBoundsException occur?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the read of the byte array is completed, AIOB is thrown and not EOF .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which byte array? Can you provide more detail about what is happening here? This doesn't strike me as the right place to fix the problem because it looks like this is covering up a behavior of a specific StreamInput, unless I'm misunderstanding something here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant BytesStreamInput
. Yes, you are right . BytesStreamInput
can throw AIOOBE
not EOF
. Hence , I added the handling for same .
OpenSearch/server/src/main/java/org/opensearch/common/io/stream/BytesStreamInput.java
Lines 83 to 87 in 306b199
// NOTE: AIOOBE not EOF if you read too much | |
@Override | |
public byte readByte() { | |
return bytes[pos++]; | |
} |
I agree to your concerns. Ideally BytesStreamInput
should not throw AIOOBE
, but return EOF
. However modifying that can cause regressions in other places. Hence I am inclined to keep it as is for that reason. Let me know if you have any concerns regarding same.
OpenSearch/server/src/main/java/org/opensearch/common/io/stream/StreamInput.java
Lines 129 to 132 in 306b199
/** | |
* Reads and returns a single byte. | |
*/ | |
public abstract byte readByte() throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem here is that any StreamInput can be given to BufferedChecksumStreamInput. In any other case array index out of bounds could very well be a bug and this will just interpret it as end of file. I don't understand the reason for not throwing an EOFException, other than this is a copy from the corresponding Lucene class.
I'm inclined to fix the problem at the source rather than doing this for the reason cited above. Also, this is not a comprehensive fix...readBytes()
and read()
will continue to throw ArrayIndexOutOfBoundsException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it . I have fixed BytesStreamInput
now to return EOF
instead of AIOOBE
just like other implementations of StreamInput
.
) throws IOException { | ||
super(config, translogUUID, deletionPolicy, globalCheckpointSupplier, primaryTermSupplier, persistedSequenceNumberConsumer); | ||
try { | ||
final Checkpoint checkpoint = readCheckpoint(location); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we abstract readCheckpoint
as well as this would be different for different Translog Implementation and might not be required for the other implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we are reading only from local , hence keeping it as of now. We can make this change later, when we introduce a different Translog Implementation which needs a different readCheckpoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gbbafna this might be needed for replicas while doing replica-primary promotion - fyi.
* that a new generation is rolled when the term is increased. This guarantee allows to us to validate | ||
* and reject operation whose term is higher than the primary term stored in the translog header. | ||
* @param persistedSequenceNumberConsumer a callback that's called whenever an operation with a given sequence number is successfully | ||
* persisted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add to the class java doc @opensearch.internal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@@ -112,7 +111,7 @@ | |||
* | |||
* @opensearch.internal | |||
*/ | |||
public class Translog extends AbstractIndexShardComponent implements IndexShardComponent, Closeable { | |||
public abstract class Translog extends AbstractIndexShardComponent implements IndexShardComponent, Closeable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going through the Translog class, it still is heavily tied to local filesystem. Is it possible to have abstractions such that we could have a Translog class which is agnostic to Local/Remote store and then a LocalTranslog and RemoteLocalTranslog so that we could have right abstractions when we want to become remote completely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be a big change and not needed urgently. Hence. we want to defer it as of now as we still write to local and then upload to remote . Downloading puts in local and then reads from local .
try (ReleasableLock ignored = readLock.acquire()) { | ||
if (location.generation == current.getGeneration()) { // if we have a new one it's already synced | ||
ensureOpen(); | ||
return current.syncUpTo(location.translogLocation + location.size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TranslogWrite also seems to be tied heavily with local disk. Is it possible to have an abstraction (TranslogWriter) here and an LocalTranslogWriter implementation that does required job. For RemoteLocalTranslog, the implementation could be again and extension of LocalTranslogWriter or some something similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed to your concerns . Currently we are doing this way as this is faster and safer way to release a working Remote Translog. Post our performance runs, if we see syncing to local disk is a significant performance hit, we will take that abstraction work , which is quite a significant work in itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could do more on getting the right abstractions here.
Fixing |
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
593f3fb
to
33104a1
Compare
Signed-off-by: Gaurav Bafna <[email protected]>
Signed-off-by: Gaurav Bafna <[email protected]>
Signed-off-by: Gaurav Bafna <[email protected]>
Signed-off-by: Gaurav Bafna <[email protected]>
Signed-off-by: Gaurav Bafna <[email protected]>
aa7e4f2
to
ea33f76
Compare
Gradle Check (Jenkins) Run Completed with:
|
Thanks Sachin. Completed all of the above. Build got fixed by the rebase. |
Signed-off-by: Gaurav Bafna <[email protected]>
Gradle Check (Jenkins) Run Completed with:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
if (current.totalOperations() == 0 && primaryTermSupplier.getAsLong() == current.getPrimaryTerm()) { | ||
return; | ||
} | ||
prepareAndUpload(primaryTermSupplier.getAsLong(), null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While ensureSynced
uploads within a writeLock is this true even for rollGeneration
. To ensure exactly one upload can happen at any point in time we need to ensure that the write lock is held by the current thread. Can we add those checks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prepareAndUpload
already does a writeLock
acquire, which will guarantee that exactly one thread will be uploading the txlog files. Since Translog
has a ReentrantReadWriteLock
, it will ensure that only thread will be able to take the write log, and other threads will lie dormant, till the write lock is released .
// Do we need remote writes in sync fashion ? | ||
// If we don't , we should swallow FileAlreadyExistsException while writing to remote store | ||
// and also verify for same during primary-primary relocation | ||
// Writing remote in sync fashion doesn't hurt as global ckp update | ||
// is not updated in remote translog except in primary to primary recovery. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Do we need a ToDo here, the comments aren't clear. We do need remote writes in sync, while the same file should be prevented from getting uploaded in the first place and if it cannot then we need to check if that got overridden on the remote store
} | ||
} | ||
|
||
private boolean upload(Long primaryTerm, Long generation) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe check if lock is held by current thread. Please also add multi-threaded tests to ensure that replication requests and flush request don't get into concurrency issues
if (ex instanceof IOException) { | ||
throw (IOException) ex; | ||
} else { | ||
throw (RuntimeException) ex; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just throw the underlying exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gbbafna LGTM
…opensearch-project#5392) * Introduce remote translog with upload functionality Signed-off-by: Gaurav Bafna <[email protected]> Co-authored-by: Bukhtawar Khan <[email protected]>
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-5392-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 5cfae6e05ad3a686d32e0769a2b4c2749ed24e89
# Push it to GitHub
git push --set-upstream origin backport/backport-5392-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x Then, create a pull request where the |
…opensearch-project#5392) * Introduce remote translog with upload functionality Signed-off-by: Gaurav Bafna <[email protected]> Co-authored-by: Bukhtawar Khan <[email protected]>
…hanges (#5757) * Introduce TranslogFactory for Local/Remote Translog support (#4172) * Introduce TranslogFactory for Local/Remote Translog support Signed-off-by: Bukhtawar Khan <[email protected]> * [Remote Translog] Introduce remote translog with upload functionality (#5392) * Introduce remote translog with upload functionality Signed-off-by: Gaurav Bafna <[email protected]> Co-authored-by: Bukhtawar Khan <[email protected]> * Enable creation of indices using Remote Translog (#5638) * Enable creation of indices using Remote Translog behind a setting and feature flag Signed-off-by: Gaurav Bafna <[email protected]> * [Remote Translog] Add support for downloading files from remote translog (#5649) * Add support to download translog from remote store during recovery Signed-off-by: Sachin Kale <[email protected]> * Integrate remote translog download on failover (#5699) * Integrate remote translog download on failover Signed-off-by: Ashish Singh <[email protected]> Signed-off-by: Bukhtawar Khan <[email protected]> Signed-off-by: Gaurav Bafna <[email protected]> Signed-off-by: Sachin Kale <[email protected]> Signed-off-by: Ashish Singh <[email protected]>
…hanges (opensearch-project#5757) * Introduce TranslogFactory for Local/Remote Translog support (opensearch-project#4172) * Introduce TranslogFactory for Local/Remote Translog support Signed-off-by: Bukhtawar Khan <[email protected]> * [Remote Translog] Introduce remote translog with upload functionality (opensearch-project#5392) * Introduce remote translog with upload functionality Signed-off-by: Gaurav Bafna <[email protected]> Co-authored-by: Bukhtawar Khan <[email protected]> * Enable creation of indices using Remote Translog (opensearch-project#5638) * Enable creation of indices using Remote Translog behind a setting and feature flag Signed-off-by: Gaurav Bafna <[email protected]> * [Remote Translog] Add support for downloading files from remote translog (opensearch-project#5649) * Add support to download translog from remote store during recovery Signed-off-by: Sachin Kale <[email protected]> * Integrate remote translog download on failover (opensearch-project#5699) * Integrate remote translog download on failover Signed-off-by: Ashish Singh <[email protected]> Signed-off-by: Bukhtawar Khan <[email protected]> Signed-off-by: Gaurav Bafna <[email protected]> Signed-off-by: Sachin Kale <[email protected]> Signed-off-by: Ashish Singh <[email protected]>
* Introduce TranslogFactory for Local/Remote Translog support (#4172) * Introduce TranslogFactory for Local/Remote Translog support Signed-off-by: Bukhtawar Khan <[email protected]> * [Remote Translog] Introduce remote translog with upload functionality (#5392) * Introduce remote translog with upload functionality Signed-off-by: Gaurav Bafna <[email protected]> Co-authored-by: Bukhtawar Khan <[email protected]> * Enable creation of indices using Remote Translog (#5638) * Enable creation of indices using Remote Translog behind a setting and feature flag Signed-off-by: Gaurav Bafna <[email protected]> * [Remote Translog] Add support for downloading files from remote translog (#5649) * Add support to download translog from remote store during recovery Signed-off-by: Sachin Kale <[email protected]> * Integrate remote translog download on failover (#5699) * Integrate remote translog download on failover Signed-off-by: Ashish Singh <[email protected]> Signed-off-by: Bukhtawar Khan <[email protected]> Signed-off-by: Gaurav Bafna <[email protected]> Signed-off-by: Sachin Kale <[email protected]> Signed-off-by: Ashish Singh <[email protected]> Signed-off-by: Bukhtawar Khan <[email protected]> Signed-off-by: Gaurav Bafna <[email protected]> Signed-off-by: Sachin Kale <[email protected]> Signed-off-by: Ashish Singh <[email protected]> Co-authored-by: Gaurav Bafna <[email protected]>
…hanges (#5757) * Introduce TranslogFactory for Local/Remote Translog support (#4172) * Introduce TranslogFactory for Local/Remote Translog support Signed-off-by: Bukhtawar Khan <[email protected]> * [Remote Translog] Introduce remote translog with upload functionality (#5392) * Introduce remote translog with upload functionality Signed-off-by: Gaurav Bafna <[email protected]> Co-authored-by: Bukhtawar Khan <[email protected]> * Enable creation of indices using Remote Translog (#5638) * Enable creation of indices using Remote Translog behind a setting and feature flag Signed-off-by: Gaurav Bafna <[email protected]> * [Remote Translog] Add support for downloading files from remote translog (#5649) * Add support to download translog from remote store during recovery Signed-off-by: Sachin Kale <[email protected]> * Integrate remote translog download on failover (#5699) * Integrate remote translog download on failover Signed-off-by: Ashish Singh <[email protected]> Signed-off-by: Bukhtawar Khan <[email protected]> Signed-off-by: Gaurav Bafna <[email protected]> Signed-off-by: Sachin Kale <[email protected]> Signed-off-by: Ashish Singh <[email protected]>
This is the first PR for remote translog support in OpenSearch .
The scope of the PR is just uploading translog files - tlog and ckp .
Signed-off-by: Gaurav Bafna [email protected]
Description
[Describe what this change achieves]
Issues Resolved
#5477
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.