[Segment Replication] Synchronize checkpoint updates with failover. #4136

Closed
Tracked by #2212 ...
mch2 opened this issue Aug 4, 2022 · 2 comments
Labels
distributed framework enhancement Enhancement or improvement to existing feature or request v2.5.0 'Issues and PRs related to version v2.5.0'

Comments

@mch2
Member

mch2 commented Aug 4, 2022

With #4135 and #3989, basic failover support is added for shards with segment replication enabled.

However, these changes do not consider what happens to ongoing or incoming copy events during failover.

Replicas should remain swappable backups that recover quickly, so I do not think we should wait for file copy to complete for an ongoing replication. The replica should cancel the event and begin its failover steps (commit and rewire its engine). However, if a replica has an ongoing copy event that is in the finalize step, meaning all segments for a new checkpoint have arrived and the only remaining step is to wire them into its directory reader, I think we can let it complete and then continue?
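
To make the proposal concrete, here is a rough sketch of the decision a replica would make at promotion time. It is illustrative only and is not the OpenSearch implementation: `Stage`, `CopyEvent`, `Replica`, and `promote` are hypothetical names, and the commit and engine rewire are represented by log entries rather than real engine calls.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class Stage(Enum):
    # Hypothetical stages of a single replication (copy) event.
    FILE_COPY = auto()   # segment files are still being transferred
    FINALIZE = auto()    # all segments arrived; only the reader wiring remains
    DONE = auto()
    CANCELLED = auto()


@dataclass
class CopyEvent:
    checkpoint: int      # checkpoint this event is copying toward
    stage: Stage


@dataclass
class Replica:
    visible_checkpoint: int
    ongoing: Optional[CopyEvent] = None
    log: List[str] = field(default_factory=list)

    def promote(self) -> None:
        """Handle any in-flight copy event, then run the failover steps."""
        event = self.ongoing
        if event is not None:
            if event.stage is Stage.FINALIZE:
                # Nothing left to copy, so let the event complete first.
                event.stage = Stage.DONE
                self.visible_checkpoint = event.checkpoint
                self.log.append("completed")
            else:
                # Do not wait on file copy; drop the event immediately.
                event.stage = Stage.CANCELLED
                self.log.append("cancelled")
            self.ongoing = None
        # Failover steps from the proposal: commit, then rewire the engine.
        self.log.append("commit")
        self.log.append("rewire_engine")


mid_copy = Replica(visible_checkpoint=3, ongoing=CopyEvent(4, Stage.FILE_COPY))
mid_copy.promote()

finalizing = Replica(visible_checkpoint=3, ongoing=CopyEvent(4, Stage.FINALIZE))
finalizing.promote()
```

In this sketch `mid_copy` stays at checkpoint 3 and cancels its event, while `finalizing` advances to checkpoint 4 before committing. Either way the failover steps run only after the in-flight event has been resolved.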

@mch2 mch2 added enhancement Enhancement or improvement to existing feature or request distributed framework labels Aug 4, 2022
@dreamer-89 dreamer-89 added the v2.4.0 'Issues and PRs related to version v2.4.0' label Sep 9, 2022
@anasalkouz anasalkouz added v2.5.0 'Issues and PRs related to version v2.5.0' and removed v2.4.0 'Issues and PRs related to version v2.4.0' labels Oct 31, 2022
@saratvemulapalli
Member

@mch2 @anasalkouz this issue is tagged for 2.5. Our freeze is 1/10. Can we make it?

@mch2
Member Author

mch2 commented Jan 6, 2023

> @mch2 @anasalkouz this issue is tagged for 2.5. Our freeze is 1/10. Can we make it?

Thanks for calling this out @saratvemulapalli. This issue has actually already been resolved with the introduction of replication cancellation and commits on replicas during engine close.

Projects
Status: Done