
[Remote Translog] Deleting remote translog considering latest remote metadata #5869

Closed

Conversation

@gbbafna (Collaborator) commented Jan 16, 2023

Signed-off-by: Gaurav Bafna [email protected]

Description

We should delete remote translog files considering the latest metadata file uploaded. The tlog/ckp files referenced by that metadata file must not be deleted. By doing this, we will still be able to restore the translog from that metadata.
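For illustration, here is a minimal sketch of the deletion rule described above (hypothetical names, not the actual OpenSearch API): only generations strictly below the minimum generation referenced by the latest uploaded metadata are eligible for deletion, so every file the metadata references survives for restore.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: generations at or above the minimum generation
// referenced by the latest uploaded metadata must be kept so the
// translog can still be restored from that metadata.
public class RemoteTranslogTrimSketch {
    static List<Long> deletableGenerations(long oldestRemoteGen, long minReferencedGen) {
        List<Long> deletable = new ArrayList<>();
        for (long gen = oldestRemoteGen; gen < minReferencedGen; gen++) {
            deletable.add(gen); // not referenced by the latest metadata
        }
        return deletable;
    }

    public static void main(String[] args) {
        // Latest metadata references generations 5..9, so 1..4 are deletable.
        System.out.println(deletableGenerations(1, 5)); // prints [1, 2, 3, 4]
    }
}
```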

Issues Resolved

#5845

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions (Contributor)

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationRelocationIT.testRelocateWhileContinuouslyIndexingAndWaitingForRefresh

@codecov-commenter commented Jan 16, 2023

Codecov Report

Merging #5869 (246924d) into main (32dd9e2) will increase coverage by 0.02%.
The diff coverage is 88.23%.

@@             Coverage Diff              @@
##               main    #5869      +/-   ##
============================================
+ Coverage     70.88%   70.91%   +0.02%     
- Complexity    58720    58776      +56     
============================================
  Files          4768     4768              
  Lines        280575   280584       +9     
  Branches      40514    40516       +2     
============================================
+ Hits         198881   198971      +90     
+ Misses        65334    65327       -7     
+ Partials      16360    16286      -74     
Impacted Files Coverage Δ
...rg/opensearch/index/translog/RemoteFsTranslog.java 68.53% <81.81%> (+0.88%) ⬆️
...h/index/translog/transfer/FileTransferTracker.java 94.59% <100.00%> (+0.47%) ⬆️
...dex/translog/transfer/TranslogTransferManager.java 82.60% <100.00%> (-0.19%) ⬇️
...port/ResponseHandlerFailureTransportException.java 0.00% <0.00%> (-60.00%) ⬇️
...a/org/opensearch/client/cluster/ProxyModeInfo.java 0.00% <0.00%> (-55.00%) ⬇️
.../java/org/opensearch/node/NodeClosedException.java 50.00% <0.00%> (-50.00%) ⬇️
...rch/client/transport/NoNodeAvailableException.java 28.57% <0.00%> (-42.86%) ⬇️
...cluster/coordination/PublishClusterStateStats.java 33.33% <0.00%> (-37.51%) ⬇️
...n/admin/indices/readonly/AddIndexBlockRequest.java 35.71% <0.00%> (-35.72%) ⬇️
...search/search/aggregations/pipeline/EwmaModel.java 24.44% <0.00%> (-33.34%) ⬇️
... and 449 more


}
}

public void trimUnreferencedReaders() throws IOException {
Member

Add @Override annotation?

logger.trace("delete remote translog generation file [{}], not referenced by metadata anymore", generation);
deleteRemoteGeneration(generation);
} else {
break;
Member

Why are we breaking out of the for loop?

I can think of that we may want to retry the uploads - in which case should we start the for loop from minimum generation number that exists?

Collaborator Author

This would increase the flush times to a very high number. The cleanup can be taken up as an async job on its own.
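To make the trade-off in this exchange concrete, here is a hedged sketch (the tracker here is hypothetical, not the real FileTransferTracker) of the break-early pattern: the pass deletes generations only while the tracker confirms they were uploaded, and the first unknown file ends the pass, so a flush never pays for retries or exhaustive scans, and any stragglers are left to a later cleanup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the break-early deletion loop discussed above.
// Walking generations oldest-first, we delete only while the tracker
// confirms the file was uploaded; the first unknown file ends the pass.
public class BreakEarlyDeleteSketch {
    static List<Long> deletePass(List<Long> generations, Set<Long> uploaded) {
        List<Long> deleted = new ArrayList<>();
        for (long gen : generations) {
            if (uploaded.contains(gen)) {
                deleted.add(gen); // would call deleteRemoteGeneration(gen) here
            } else {
                break; // leave the rest to an async cleanup job
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<Long> gens = List.of(1L, 2L, 3L, 4L);
        // Generation 3 was never tracked as uploaded: 1 and 2 are deleted, 4 is skipped.
        System.out.println(deletePass(gens, Set.of(1L, 2L, 4L))); // prints [1, 2]
    }
}
```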

String translogFilename = Translog.getFilename(generation);
if (fileTransferTracker.uploaded(translogFilename)) {
logger.trace("delete remote translog generation file [{}], not referenced by metadata anymore", generation);
deleteRemoteGeneration(generation);
Member

The deleteRemoteGeneration method uses the current primaryTerm. This probably would not be handled when there is a failover and the primary term has increased. Could we rely on metadata to fetch the right generation-to-primary-term mapping for cleaning up the files?

Collaborator Author

That's a great catch @ashking94. The issue here is that the latest metadata doesn't know the right primary term for this generation. I will create a backlog item for cleaning up translog files from older primary terms. This can be done on every failover or as an async job. I don't want to complicate the usual deletion flow due to this corner case.
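The corner case above can be illustrated with a small sketch (the path layout here is an assumption for illustration, not the actual remote store layout): if remote translog files are addressed by primary term as well as generation, a delete issued with the current term cannot reach a file uploaded under an older term after failover.

```java
// Hypothetical sketch of the failover corner case: files are addressed by
// (primaryTerm, generation), so a delete issued with the current term does
// not match a file that was uploaded under an earlier term.
public class PrimaryTermMismatchSketch {
    static String remotePath(long primaryTerm, long generation) {
        return primaryTerm + "/translog-" + generation + ".tlog"; // assumed layout
    }

    public static void main(String[] args) {
        String uploadedUnderOldTerm = remotePath(1, 7); // written before failover
        String deleteTargetNewTerm = remotePath(2, 7);  // delete issued after failover
        // The paths differ, so the old file is silently left behind,
        // motivating a separate per-primary-term cleanup job.
        System.out.println(uploadedUnderOldTerm.equals(deleteTargetNewTerm)); // prints false
    }
}
```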

@ashking94 (Member) left a comment

Please address the comments.

@gbbafna gbbafna changed the title Deleting remote translog considering latest remote metadata [Remote Translog ]Deleting remote translog considering latest remote metadata Jan 17, 2023
@gbbafna gbbafna changed the title [Remote Translog ]Deleting remote translog considering latest remote metadata [Remote Translog] Deleting remote translog considering latest remote metadata Jan 17, 2023
@gbbafna gbbafna force-pushed the tx-safe-remote-delete branch from 9be02e5 to 246924d Compare January 17, 2023 04:14
@github-actions (Contributor)

Gradle Check (Jenkins) Run Completed with:

@github-actions (Contributor)

Gradle Check (Jenkins) Run Completed with:

@gbbafna gbbafna marked this pull request as draft January 18, 2023 05:27
@github-actions (Contributor)

Gradle Check (Jenkins) Run Completed with:

@github-actions (Contributor)

Gradle Check (Jenkins) Run Completed with:

@gbbafna gbbafna force-pushed the tx-safe-remote-delete branch from 2422c10 to 950c2bc Compare January 30, 2023 11:42
@github-actions (Contributor)

Gradle Check (Jenkins) Run Completed with:

@ashking94 ashking94 added Storage:Durability Issues and PRs related to the durability framework v2.6.0 'Issues and PRs related to version v2.6.0' labels Jan 30, 2023
@ashking94 ashking94 force-pushed the tx-safe-remote-delete branch from 950c2bc to 61344d4 Compare January 30, 2023 17:33
@github-actions (Contributor)

Gradle Check (Jenkins) Run Completed with:

@ashking94 (Member)

Created PR #6086 for the rest of the development due to an access issue.

@ashking94 ashking94 closed this Jan 30, 2023
@Poojita-Raj Poojita-Raj removed the v2.6.0 'Issues and PRs related to version v2.6.0' label Feb 21, 2023
Labels
backport 2.x Backport to 2.x branch skip-changelog Storage:Durability Issues and PRs related to the durability framework
4 participants