-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introducing a translog deletion policy #24950
Conversation
…_policy # Conflicts: # core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java # core/src/main/java/org/elasticsearch/index/shard/IndexShard.java # core/src/test/java/org/elasticsearch/index/engine/InternalEngineTests.java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some minors really this looks awesome!
this.uidField = engineConfig.getIndexSettings().isSingleType() ? IdFieldMapper.NAME : UidFieldMapper.NAME; | ||
this.versionMap = new LiveVersionMap(); | ||
final TranslogDeletionPolicy translogDeletionPolicy = new TranslogDeletionPolicy(); | ||
this.deletionPolicy = new CombinedDeletionPolicy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: does this need to break into 3 lines or is this maybe a leftover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sadly it doesn't fit in one line, I'll make it like this - it seems you prefer it:
this.deletionPolicy = new CombinedDeletionPolicy(
new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy()), translogDeletionPolicy, openMode);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k
@@ -250,6 +250,10 @@ public int totalOperations() { | |||
return operationCounter; | |||
} | |||
|
|||
public long lastSyncedGlobalCheckpoint() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
javadocs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or maybe pkg private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, this can be completely removed - it's a leftover from the "somewhat into the future" POC. good catch.
|
||
/** Records how many views are held against each | ||
* translog generation */ | ||
protected final Map<Long,Integer> translogRefCounts = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this is a good place for LongIntMap
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option would be Map<Long,Counter>
then you can do this:
translogRefCounts.computeIfAbsent(translogGen, Counter.newCounter(false)).addAndGet(1);
//....
value = translogRefCounts.computeIfAbsent(translogGen, Counter.newCounter(false)).addAndGet(-1);
It would make things easier to read IMO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
org.apache.lucene.util.Counter
that is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at LongIntMap
but decided not to have a 3rd party dependency for this low performance, rarely used map. I like the Counter class usage. It simplifies things. Thanks!
import java.util.List; | ||
import java.util.Map; | ||
|
||
public class TranslogDeletionPolicy { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can make this final right away?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. I thought I'd need to subclass in tests but it turned out not be necessary.
try (ReleasableLock lock = readLock.acquire()) { | ||
ensureOpen(); | ||
View view = new View(lastCommittedTranslogFileGeneration); | ||
outstandingViews.add(view); | ||
viewGenToClean = deletionPolicy.acquireTranslogGenForView(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it be simpler if you remove the viewGenToClean
and just do this return new View(deletionPolicy.acquireTranslogGenForView());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be but I'm paranoid about an exception in the View constructor. This way it's clearly safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no exception possibility here? I think this is overparanoia
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The View constructor does not even do anything, it just sets a field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really do not like the use of setting viewGenToClean
to -1
to indicate not to release the view. Is there a reason that you do not make viewGenToClean
final and local to the try block and set a boolean flag to indicate success or not?
* returns the minimum translog generation that is still required by the system. Any generation below | ||
* the returned value may be safely deleted | ||
*/ | ||
public synchronized long minTranslogGenRequired(List<TranslogReader> readers, TranslogWriter currentWriter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
readers is unused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
readers is still unused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I missed your comment from the last time. I will remove the params. They are a leftover from the POC (to show how we would do size based deletion).
} | ||
|
||
private void setLastCommittedTranslogGeneration(List<? extends IndexCommit> commits) throws IOException { | ||
final IndexCommit indexCommit = commits.get(commits.size() - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you leave a comment why we only use the last one? It would help others to reason about this code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
* An {@link IndexDeletionPolicy} that coordinates between Lucene's commits and the retention of translog generation files, | ||
* making sure that all translog files that are need to recover from the lucene commit are not deleted. | ||
*/ | ||
public class CombinedDeletionPolicy extends IndexDeletionPolicy { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this final and maybe pkg private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep. And it exposed a leftover wrong Java Doc reference.
Thx @s1monw . I addressed all your feedback. I will wait for @jasontedor to have a look as well. |
…_policy # Conflicts: # core/src/test/java/org/elasticsearch/index/translog/TranslogTests.java
Thx @jasontedor . I addressed your comments. Can you take another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some lingering nits that do not require another look from me but otherwise LGTM. Thanks @bleskes.
* written, the current translogs file generation and it's fsynced offset in bytes. | ||
* Each Translog has only one translog file open for writes at any time referenced by a translog generation ID. This ID is written to a | ||
* <tt>translog.ckp</tt> file that is designed to fit in a single disk block such that a write of the file is atomic. The checkpoint file | ||
* is written on each fsync operation of the translog and records the number of operations written, the current translogs file generation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
translogs
-> translog's
* Each Translog has only one translog file open for writes at any time referenced by a translog generation ID. This ID is written to a | ||
* <tt>translog.ckp</tt> file that is designed to fit in a single disk block such that a write of the file is atomic. The checkpoint file | ||
* is written on each fsync operation of the translog and records the number of operations written, the current translogs file generation | ||
* , it's fsynced offset in bytes and other important statistics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove comma to start line, place at end of previous line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's
-> its
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bytes and
-> bytes, and
Thx @s1monw , @jasontedor |
When we open a translog, we rely on the `translog.ckp` file to tell us what the maximum generation file should be and on the information stored in the last lucene commit to know the first file we need to recover. This requires coordination and is currently subject to a race condition: if a node dies after a lucene commit is made but before we remove the translog generations that were unneeded by it, the next we open the translog we will ignore those files and never delete them (I have added tests for this). This PR changes the approach to have the translog store both of those numbers in the `translog.ckp`. This means it's more self contained and easier to control. This change also decouples the translog recovery logic from the specific commit we're opening. This prepares the ground to fully utilize the deletion policy introduce elastic#24950 and store more translog data that's needed for Lucene, keep multiple lucene commits around, and be free to recover from any of them.
When we open a translog, we rely on the `translog.ckp` file to tell us what the maximum generation file should be and on the information stored in the last lucene commit to know the first file we need to recover. This requires coordination and is currently subject to a race condition: if a node dies after a lucene commit is made but before we remove the translog generations that were unneeded by it, the next time we open the translog we will ignore those files and never delete them (I have added tests for this). This PR changes the approach to have the translog store both of those numbers in the `translog.ckp`. This means it's more self contained and easier to control. This change also decouples the translog recovery logic from the specific commit we're opening. This prepares the ground to fully utilize the deletion policy introduced in #24950 and store more translog data that's needed for Lucene, keep multiple lucene commits around and be free to recover from any of them.
@ArielCoralogix previously we'd throw away the translog files immediately after flush. There was no hard coded time based interval. That said - I think it will be very rare this will cause much more files to be open than before. The translog still stays under 512MB as before. This changes doesn't mean we check every 12 hours, we check after each indexing request. The changes means that we keep at most 512MB for at most 12 hours. Effectively - an active translog will always be 512MB rathen then shrinking to 0 and growing again to 512MB. After 12hrs it will be cleaned away. I don't believe you have so many active shards within 12 hrs for this to seriously influence the number of open files. Or do you? |
Hey @bleskes, I'm Ariel's colleague. from running: sudo lsof -p |
Hey @bleskes adding to @farin99's comment. We currently have 150,000 open file descriptors on some of our servers (and slowly rising). This ticket has some more info: #29097 |
Currently, the decisions regarding which translog generation files to delete are hard coded in the interaction between the
InternalEngine
and theTranslog
classes. This PR extracts it to a dedicated class calledTranslogDeletionPolicy
, for two main reasons: