[Remote Translog] High failover durations #7381
Comments
@gbbafna here is a sample exception. Ideally, we should find the root cause of why so many translog files were piling up for deletion. This indicates that flush succeeded only after a while.
@linuxpi: Yes, this is because we defer translog deletes until flush. With the proposed change, we will clean up the translog on every refresh, which avoids the sudden burst of calls to the remote store.
Is your feature request related to a problem? Please describe.
To trim the remote translog, we rely on flush as well as remote segment store upload. Since flush happens rarely compared to remote segment store upload (which happens on refresh), we always end up depending on flush to trim the translog.
This increases failover time, as the newly promoted primary needs to download a lot of translog files.
Describe the solution you'd like
Don't rely on flush to trim the remote translog. Instead, use the sequence number safely persisted in the remote segment store.
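A rough sketch of the idea, not the actual OpenSearch code: every name here (`TranslogGeneration`, `trim_on_refresh`, `persisted_seq_no`) is hypothetical. It assumes each remote translog file records the highest sequence number it contains, and that after each refresh we know the highest sequence number already covered by segments uploaded to the remote segment store. Any file whose operations are all at or below that number could then be deleted without waiting for a flush.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TranslogGeneration:
    """Hypothetical stand-in for one translog file in the remote store."""
    generation: int
    max_seq_no: int  # highest sequence number of any operation in this file


def trim_on_refresh(
    generations: List[TranslogGeneration],
    persisted_seq_no: int,
) -> Tuple[List[TranslogGeneration], List[TranslogGeneration]]:
    """Split generations into (kept, deletable).

    A generation is deletable when every operation in it is already covered
    by segments persisted in the remote segment store, i.e. its max_seq_no
    is at or below persisted_seq_no (an assumed input, produced on refresh).
    """
    kept, deletable = [], []
    for gen in generations:
        if gen.max_seq_no <= persisted_seq_no:
            deletable.append(gen)
        else:
            kept.append(gen)
    return kept, deletable


# Three translog files; segments up to seq no 19 are already uploaded.
gens = [
    TranslogGeneration(generation=1, max_seq_no=9),
    TranslogGeneration(generation=2, max_seq_no=19),
    TranslogGeneration(generation=3, max_seq_no=29),
]
kept, deletable = trim_on_refresh(gens, persisted_seq_no=19)
```

With trimming tied to refresh in this way, a newly promoted primary would only need to download the generations in `kept`, rather than everything accumulated since the last flush.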
Alternate
Change the behavior of refresh: call flush on every refresh.