Introducing a translog deletion policy #24950

Merged
merged 30 commits into from
Jun 1, 2017

@bleskes bleskes commented May 30, 2017

Currently, the decisions regarding which translog generation files to delete are hard-coded in the interaction between the InternalEngine and the Translog classes. This PR extracts them to a dedicated class called TranslogDeletionPolicy, for two main reasons:

  1. Simplicity - the code is easier to read and understand (no more two-phase commit on the translog; the Engine can just commit and the translog will respond)
  2. Preparing for future plans to extend the logic we need, i.e., retaining multiple Lucene commits and introducing size-based retention logic, so that people can always keep a certain amount of translog files around. The latter is useful to increase the chance of an ops-based recovery.
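
The division of labor described above can be sketched roughly as follows. This is a simplified illustration and not the code from this PR: `SketchDeletionPolicy`, `SketchTranslog`, `pin`/`unpin` and all method bodies are assumptions made for the example. Only the idea comes from the description: the engine reports the generation its commit needs, and the translog asks the policy what it may delete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical, simplified stand-in for the policy described above.
final class SketchDeletionPolicy {
    // Oldest generation needed to recover from the last Lucene commit.
    private long minGenForRecovery = 1;
    // Generations pinned by open views (assumption: one pin per generation).
    private final TreeSet<Long> pinnedByViews = new TreeSet<>();

    // Called by the engine after a commit; no two-phase handshake involved.
    synchronized void setMinTranslogGenerationForRecovery(long gen) {
        minGenForRecovery = gen;
    }

    synchronized void pin(long gen) {
        pinnedByViews.add(gen);
    }

    synchronized void unpin(long gen) {
        pinnedByViews.remove(gen);
    }

    // Any generation below the returned value may be deleted.
    synchronized long minTranslogGenRequired() {
        long minPinned = pinnedByViews.isEmpty() ? Long.MAX_VALUE : pinnedByViews.first();
        return Math.min(minGenForRecovery, minPinned);
    }
}

// Hypothetical translog side: it only consults the policy.
final class SketchTranslog {
    static List<Long> deletableGenerations(List<Long> onDisk, SketchDeletionPolicy policy) {
        long min = policy.minTranslogGenRequired();
        List<Long> deletable = new ArrayList<>();
        for (long gen : onDisk) {
            if (gen < min) {
                deletable.add(gen);
            }
        }
        return deletable;
    }
}
```

With this shape, adding a new retention rule would only touch the policy class, which is the extensibility the second point is after.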

@s1monw s1monw left a comment

I left some minor comments, but really this looks awesome!

this.uidField = engineConfig.getIndexSettings().isSingleType() ? IdFieldMapper.NAME : UidFieldMapper.NAME;
this.versionMap = new LiveVersionMap();
final TranslogDeletionPolicy translogDeletionPolicy = new TranslogDeletionPolicy();
this.deletionPolicy = new CombinedDeletionPolicy(
Contributor

nit: does this need to break into 3 lines or is this maybe a leftover?

Contributor Author

@bleskes bleskes May 30, 2017

sadly it doesn't fit in one line, I'll make it like this - it seems you prefer it:

        this.deletionPolicy = new CombinedDeletionPolicy(
            new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy()), translogDeletionPolicy, openMode);

Contributor

k

@@ -250,6 +250,10 @@ public int totalOperations() {
return operationCounter;
}

public long lastSyncedGlobalCheckpoint() {
Contributor

javadocs?

Contributor

or maybe pkg private?

Contributor Author

actually, this can be completely removed - it's a leftover from the "somewhat into the future" POC. good catch.


/** Records how many views are held against each
* translog generation */
protected final Map<Long,Integer> translogRefCounts = new HashMap<>();
Contributor

maybe this is a good place for LongIntMap?

Contributor

Another option would be Map<Long,Counter> then you can do this:

translogRefCounts.computeIfAbsent(translogGen, gen -> Counter.newCounter(false)).addAndGet(1);
//....

value = translogRefCounts.computeIfAbsent(translogGen, gen -> Counter.newCounter(false)).addAndGet(-1);

It would make things easier to read IMO?

Contributor

org.apache.lucene.util.Counter that is

Contributor Author

I looked at LongIntMap but decided not to have a 3rd party dependency for this low performance, rarely used map. I like the Counter class usage. It simplifies things. Thanks!
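
For readers following along, the counter-per-generation idea might look roughly like the sketch below. It is an illustration under assumptions, not the code that was merged: `SimpleCounter` is a hypothetical stand-in for `org.apache.lucene.util.Counter` so the example has no Lucene dependency, and `ViewRefCounts` is an invented wrapper name. Note that `Map.computeIfAbsent` takes a mapping function, so the counter is created inside a lambda.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal counter standing in for org.apache.lucene.util.Counter.
final class SimpleCounter {
    private long count;

    long addAndGet(long delta) {
        count += delta;
        return count;
    }
}

// Hypothetical holder for the per-generation reference counts.
final class ViewRefCounts {
    private final Map<Long, SimpleCounter> translogRefCounts = new HashMap<>();

    // Registers one more view against the given generation.
    synchronized void acquire(long translogGen) {
        translogRefCounts.computeIfAbsent(translogGen, gen -> new SimpleCounter()).addAndGet(1);
    }

    // Drops one view; the entry is removed once nothing references the generation.
    synchronized void release(long translogGen) {
        SimpleCounter counter = translogRefCounts.get(translogGen);
        if (counter == null) {
            throw new IllegalStateException("no view held against generation " + translogGen);
        }
        if (counter.addAndGet(-1) == 0) {
            translogRefCounts.remove(translogGen);
        }
    }

    // Smallest generation still referenced by a view, or Long.MAX_VALUE if none.
    synchronized long minPinnedGeneration() {
        return translogRefCounts.keySet().stream().mapToLong(Long::longValue).min().orElse(Long.MAX_VALUE);
    }
}
```

Removing the entry when the count reaches zero keeps the map small, which fits the "low performance, rarely used map" framing above.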

import java.util.List;
import java.util.Map;

public class TranslogDeletionPolicy {
Contributor

maybe we can make this final right away?

Contributor Author

sure. I thought I'd need to subclass in tests but it turned out not to be necessary.

try (ReleasableLock lock = readLock.acquire()) {
ensureOpen();
View view = new View(lastCommittedTranslogFileGeneration);
outstandingViews.add(view);
viewGenToClean = deletionPolicy.acquireTranslogGenForView();
Contributor

wouldn't it be simpler to remove viewGenToClean and just do this: return new View(deletionPolicy.acquireTranslogGenForView());

Contributor Author

It would be but I'm paranoid about an exception in the View constructor. This way it's clearly safe.

Contributor

there is no exception possibility here? I think this is overparanoia

Member

The View constructor does not even do anything, it just sets a field?

Member

I really do not like the use of setting viewGenToClean to -1 to indicate not to release the view. Is there a reason that you do not make viewGenToClean final and local to the try block and set a boolean flag to indicate success or not?
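
One way to read this suggestion is the success-flag pattern sketched below. The names `PinTracker` and `ViewFactory` are hypothetical and exist only for this illustration; the real View and deletion policy classes are not reproduced here. The acquired generation is a final local, and a boolean records whether construction succeeded, so no sentinel value such as -1 is needed.

```java
import java.util.function.LongFunction;

// Hypothetical stand-in for the part of the policy that hands out view generations.
final class PinTracker {
    private final long generation;
    private int pins;

    PinTracker(long generation) {
        this.generation = generation;
    }

    long acquire() {
        pins++;
        return generation;
    }

    void release(long gen) {
        pins--;
    }

    int pins() {
        return pins;
    }
}

final class ViewFactory {
    // Builds a view for the acquired generation and releases the pin if construction fails.
    static <V> V newView(PinTracker policy, LongFunction<V> viewConstructor) {
        final long viewGen = policy.acquire();
        boolean success = false;
        try {
            V view = viewConstructor.apply(viewGen);
            success = true;
            return view;
        } finally {
            if (success == false) {
                policy.release(viewGen);
            }
        }
    }
}
```

If the constructor can never throw, the simpler one-liner suggested earlier in the thread is equivalent; the flag only matters when construction might fail.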

* returns the minimum translog generation that is still required by the system. Any generation below
* the returned value may be safely deleted
*/
public synchronized long minTranslogGenRequired(List<TranslogReader> readers, TranslogWriter currentWriter) {
Contributor

readers is unused?

Contributor

readers is still unused?

Contributor Author

yeah, I missed your comment from the last time. I will remove the params. They are a leftover from the POC (to show how we would do size based deletion).

}

private void setLastCommittedTranslogGeneration(List<? extends IndexCommit> commits) throws IOException {
final IndexCommit indexCommit = commits.get(commits.size() - 1);
Contributor

can you leave a comment why we only use the last one? It would help others to reason about this code

Contributor Author

sure

* An {@link IndexDeletionPolicy} that coordinates between Lucene's commits and the retention of translog generation files,
* making sure that all translog files that are needed to recover from the lucene commit are not deleted.
*/
public class CombinedDeletionPolicy extends IndexDeletionPolicy {
Contributor

make this final and maybe pkg private?

Contributor Author

yep. And it exposed a leftover wrong Javadoc reference.

bleskes commented May 30, 2017

Thx @s1monw . I addressed all your feedback. I will wait for @jasontedor to have a look as well.

bleskes added 3 commits May 30, 2017 13:53
…_policy

# Conflicts:
#	core/src/test/java/org/elasticsearch/index/translog/TranslogTests.java

bleskes commented Jun 1, 2017

Thx @jasontedor . I addressed your comments. Can you take another look?

@jasontedor jasontedor left a comment

Some lingering nits that do not require another look from me but otherwise LGTM. Thanks @bleskes.

* written, the current translogs file generation and it's fsynced offset in bytes.
* Each Translog has only one translog file open for writes at any time referenced by a translog generation ID. This ID is written to a
* <tt>translog.ckp</tt> file that is designed to fit in a single disk block such that a write of the file is atomic. The checkpoint file
* is written on each fsync operation of the translog and records the number of operations written, the current translogs file generation
Member

translogs -> translog's

* Each Translog has only one translog file open for writes at any time referenced by a translog generation ID. This ID is written to a
* <tt>translog.ckp</tt> file that is designed to fit in a single disk block such that a write of the file is atomic. The checkpoint file
* is written on each fsync operation of the translog and records the number of operations written, the current translogs file generation
* , it's fsynced offset in bytes and other important statistics.
Member

Remove comma to start line, place at end of previous line.

Member

it's -> its

Member

bytes and -> bytes, and

@bleskes bleskes merged commit 1775e42 into elastic:master Jun 1, 2017
@bleskes bleskes deleted the translog_deletion_policy branch June 1, 2017 12:04

bleskes commented Jun 1, 2017

Thx @s1monw , @jasontedor

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jun 1, 2017
When we open a translog, we rely on the `translog.ckp` file to tell us what the maximum generation file should be and on the information stored in the last Lucene commit to know the first file we need to recover. This requires coordination and is currently subject to a race condition: if a node dies after a Lucene commit is made but before we remove the translog generations that were unneeded by it, the next time we open the translog we will ignore those files and never delete them (I have added tests for this).

This PR changes the approach to have the translog store both of those numbers in the `translog.ckp`. This means it's more self-contained and easier to control.

This change also decouples the translog recovery logic from the specific commit we're opening. This prepares the ground to fully utilize the deletion policy introduced in elastic#24950 and store more translog data than what's needed for Lucene, keep multiple Lucene commits around, and be free to recover from any of them.
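
The idea of keeping both numbers in `translog.ckp` can be illustrated with a small sketch. The class and member names below (`CheckpointSketch`, `minTranslogGeneration`, `TranslogOpenSketch`) are assumptions chosen for illustration and are not taken from the change itself. The point is that, with both bounds in one file, opening the translog no longer needs the Lucene commit to decide which files to read and which leftovers to delete.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory view of the two numbers stored in the checkpoint file.
final class CheckpointSketch {
    final long generation;            // current (maximum) generation
    final long minTranslogGeneration; // oldest generation still required

    CheckpointSketch(long generation, long minTranslogGeneration) {
        if (minTranslogGeneration > generation) {
            throw new IllegalArgumentException("min generation cannot exceed current generation");
        }
        this.generation = generation;
        this.minTranslogGeneration = minTranslogGeneration;
    }
}

final class TranslogOpenSketch {
    // Generations that must be opened when the translog is recovered.
    static List<Long> generationsToRecover(CheckpointSketch ckp) {
        List<Long> gens = new ArrayList<>();
        for (long gen = ckp.minTranslogGeneration; gen <= ckp.generation; gen++) {
            gens.add(gen);
        }
        return gens;
    }

    // Files left behind by a crash between commit and cleanup; safe to delete on open.
    static List<Long> orphanedGenerations(CheckpointSketch ckp, List<Long> onDisk) {
        List<Long> orphans = new ArrayList<>();
        for (long gen : onDisk) {
            if (gen < ckp.minTranslogGeneration) {
                orphans.add(gen);
            }
        }
        return orphans;
    }
}
```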
bleskes added a commit that referenced this pull request Jun 8, 2017
@clintongormley clintongormley added the :Distributed Indexing/Distributed and :Distributed Indexing/Engine labels and removed the :Engine and :Distributed Indexing/Distributed labels Feb 13, 2018
@ArielCoralogix

@bleskes Can you please share the previously hard-coded time interval for translog deletion? The default now (6.2.2) is 12H.
Asking because of this: #29097

bleskes commented Mar 16, 2018

@ArielCoralogix previously we'd throw away the translog files immediately after flush. There was no hard-coded time-based interval. That said - I think it will be very rare that this causes many more files to be open than before. The translog still stays under 512MB as before. This change doesn't mean we check every 12 hours; we check after each indexing request. The change means that we keep at most 512MB for at most 12 hours. Effectively - an active translog will always be 512MB rather than shrinking to 0 and growing again to 512MB. After 12hrs it will be cleaned away. I don't believe you have so many active shards within 12 hrs for this to seriously influence the number of open files. Or do you?
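
To make the "at most 512MB for at most 12 hours" rule concrete, here is a rough sketch of how such a check could be written. It is an assumption-laden illustration, not the Elasticsearch implementation: `TranslogFileInfo`, `RetentionSketch` and `minGenerationToKeep` are invented names, and the exact tie-breaking of the real policy may differ. Files needed for recovery are always kept; older files are kept only while both the size budget and the age limit hold.

```java
import java.util.List;

// Hypothetical description of one translog generation file.
final class TranslogFileInfo {
    final long generation;
    final long sizeInBytes;
    final long createdAtMillis;

    TranslogFileInfo(long generation, long sizeInBytes, long createdAtMillis) {
        this.generation = generation;
        this.sizeInBytes = sizeInBytes;
        this.createdAtMillis = createdAtMillis;
    }
}

final class RetentionSketch {
    // files must be ordered oldest generation first.
    // Returns the smallest generation to keep; anything below it may be deleted.
    static long minGenerationToKeep(List<TranslogFileInfo> files, long maxBytes, long maxAgeMillis,
                                    long nowMillis, long minGenForRecovery) {
        long keepFrom = minGenForRecovery;
        long totalBytes = 0;
        // Walk from the newest file backwards, extending retention while both limits hold.
        for (int i = files.size() - 1; i >= 0; i--) {
            TranslogFileInfo file = files.get(i);
            totalBytes += file.sizeInBytes;
            boolean withinSize = totalBytes <= maxBytes;
            boolean withinAge = nowMillis - file.createdAtMillis <= maxAgeMillis;
            if (withinSize == false || withinAge == false) {
                break;
            }
            keepFrom = Math.min(keepFrom, file.generation);
        }
        return keepFrom;
    }
}
```

Because the check is a pure function of the file list and the clock, it can run after every indexing request, as described above, without any timer.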

farin99 commented Mar 16, 2018

Hey @bleskes, I'm Ariel's colleague. From running: sudo lsof -p
90% of the file descriptors are from translog-xxxxx.tlog files. So if I understand correctly, elastic by default won't delete the translog files for 12H or until it reaches 512MB?
We have hundreds of active shards.

@ArielCoralogix

Hey @bleskes adding to @farin99's comment. We currently have 150,000 open file descriptors on some of our servers (and slowly rising). This ticket has some more info: #29097
Is there any way for us to change the settings so the behavior will be similar to version 5.4? (deleting files immediately after committing)

Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement v6.0.0-beta1

7 participants