Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integ test : Stop replication when remote cluster is unreachable #69

Merged
merged 1 commit into from
Aug 3, 2021

Conversation

soosinha
Copy link
Member

Signed-off-by: Sooraj Sinha [email protected]

Description

Stop replication should work even when remote cluster is unavailable. Currently, stop replication errors out if remote cluster is unavailable because remove retention lease method throws an exception.
This change catches the error error thrown by remove retention lease method so the stop replication API can go through. Also, added an integ test to verify the same

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Comment on lines 91 to 100
} catch(e: RetentionLeaseNotFoundException) {
// log error and bail
log.error("${e.message}")
} catch (e: Exception) {
log.error("Exception in removing retention lease", e)
}
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename the "removeRetentionLease" method to also indicate that it doesn't propagate exception?

For example - either removeRetentionLeaseNoThrow or attemptRetentionLeaseRemoval etc just so we are clear that it is only best attempt.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking out loud : how would a cluster admin know that this has happened ? I see that logs are the only current way of this . Concerned as leaking retention leases on leader side can have side effects like no truncation of translog .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some message in the stop replication response to convey this?

The other option I can think of is let the stop replication fail by default but allow the customer to pass a flag to indicate that remote call should not be made in the stop replication call in which case the stop replication should complete without removing the retention lease.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, this method should not throw exception as stop replication should be successful even if remote is unreachable. But we can build on this in the future to have an async mechanism to remove the retention lease.

Renamed the method as suggested for now.

@soosinha soosinha force-pushed the remote_down_stop_repl branch from e72aee2 to 9530b38 Compare July 30, 2021 12:05
@soosinha soosinha merged commit f5371a4 into opensearch-project:main Aug 3, 2021
@soosinha soosinha deleted the remote_down_stop_repl branch August 3, 2021 07:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants