-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add global checkpoint to translog checkpoints #21254
Add global checkpoint to translog checkpoints #21254
Conversation
This commit adds the sequence number global checkpoint to translog checkpoints, and removes them from Lucene commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this. Left some comments where I think we need some more thinking (as we discussed). Thanks!
throw new IndexFormatTooOldException("trasnlog", "translog has no generation nor a UUID - this might be an index from a previous version consider upgrading to N-1 first"); | ||
} | ||
} | ||
final Translog translog = new Translog(translogConfig, generation); | ||
final Translog translog = new Translog(translogConfig, generation, seqNoService::getGlobalCheckpoint); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - shall we make this a parameter to the method, like the other things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 6cbed98.
* {@code globalCheckpoint} should be set to the last known global | ||
* checkpoint for this shard, or | ||
* {@link SequenceNumbersService#UNASSIGNED_SEQ_NO}. | ||
* Initialize the sequence number service. The {@code maxSeqNo} should be set to the last sequence number assigned by this shard, or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OMG what happened here :)
@@ -155,6 +149,10 @@ public void updateAllocationIdsFromMaster(Set<String> activeAllocationIds, Set<S | |||
* of one of the active allocations is not known. | |||
*/ | |||
public boolean updateGlobalCheckpointOnPrimary() { | |||
return globalCheckpointService.updateCheckpointOnPrimary(); | |||
final boolean maybeUpdateGlobalCheckpoint = globalCheckpointService.updateCheckpointOnPrimary(); | |||
if (maybeUpdateGlobalCheckpoint) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is what we want? maybeUpdateGlobalCheckpoint may be true because the global checkpoint isn't updated but we miss some local checkpoint. I think it's still ok to call this runnable, but we should document it if we keep it like this. Alternatively maybe we should split this method in two - updateGlobalCheckpointOnPrimary
and needsLocalCheckpointSync
and call those from IndexShard
's updateGlobalCheckpointOnPrimary
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reverted the changes to this method, they are not needed after 86ec4d0.
+ Long.BYTES // generation | ||
+ CodecUtil.footerLength(); | ||
|
||
// TODO: remove legacy support, not needed in 6.0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a nocommit so we'll clean up later on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 37a50bb.
void onGlobalCheckpointUpdate() { | ||
try (ReleasableLock ignored = readLock.acquire()) { | ||
ensureOpen(); | ||
translog.sync(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this all runs on an indexing thread, i'm concerned it will make us fsync twice. How about not fsync here but rather trigger a sync from the global checkpoint sync action?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 86ec4d0.
* @return the sequence number stats | ||
* @throws IOException if an I/O exception occurred reading the Lucene commit point or the translog checkpoint | ||
*/ | ||
private static SeqNoStats loadSeqNoStats(final EngineConfig engineConfig, final IndexWriter writer) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a little funky as we read some stuff from given parameters and some stuff from disk but then only if we're not supposed to ignore the translog. How about never reading the global checkpoint from the translog on opening and make it part of the recovery from translog? also it means peer recovery will need to also pass the global checkpoint as part the engine opening (but we can do this a follow up - add a no commit please if so)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 19d4db0.
This commit replaces a TODO on removing legacy support for non-checksummed translog checkpoints in Checkpoint.java with a nocommit.
This commit adds a nocommit to convert the way the global checkpoint is acquired when starting the engine to make it part of the recovery from the translog.
This commit causes the translog to be synced after every global checkpoint sync and removes syncing of the global checkpoint from the indexing path.
This commit modifies the signature of InternalEngine#openTranslog to feed through the global checkpoint supplier down to the opening of the translog.
Thanks @bleskes. I've responded to your feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pending one testing question.
return new ReplicaResult(); | ||
} | ||
|
||
private void syncTranslog(final IndexShard indexShard) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need the conversion to an unchecked exceptions? shardOperationOnX Allows throwing them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In master both shardOperationOnPrimary
and shardOperationOnReplica
declare the top-level checked exception. We can pick this up and remove the wrapping after integrating master into feature/seq_no. I pushed 6c1338c.
@@ -136,7 +137,7 @@ public void tearDown() throws Exception { | |||
} | |||
|
|||
private Translog create(Path path) throws IOException { | |||
return new Translog(getTranslogConfig(path), null); | |||
return new Translog(getTranslogConfig(path), null, () -> SequenceNumbersService.UNASSIGNED_SEQ_NO); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have a test that checks that the global checkpoint given by this supplier is properly persisted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 37af724.
This commit adds a missing checked exception declared on TransportReplicationAction#shardOperationOnPrimary but not declared on the overridden method GlobalCheckpointSyncAction#shardOperationOnPrimary.
This commit adds a nocommit to GlobalCheckpointSyncAction to remove some exception wrapping after integrating master into feature/seq_no. This wrapping will not be needed after integration as TransportReplicationAction#shardOperationOnPrimary and TransportReplicationAction#shardOperationOnReplica will both declare a top-level checked exception.
This commit adds a low-level test that the supplied global checkpoint is persisted in translog checkpoints when the translog is synced.
@bleskes I pushed commits responding to your last round of feedback. Do you want to take another look? |
This commit simplifies the advancement of the global checkpoint in TranslogTests#testBasicCheckpoint by moving the advancement into the test code.
still LGTM. thx. |
Thanks @bleskes. |
This commit adds the sequence number global checkpoint to translog
checkpoints, and removes them from Lucene commits.