-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VReplication: Address SwitchTraffic bugs around replication lag and cancel on error #17616
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
59340f2
to
4800b81
Compare
Signed-off-by: Matt Lord <[email protected]>
4800b81
to
4e8e109
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #17616 +/- ##
==========================================
- Coverage 67.69% 67.63% -0.06%
==========================================
Files 1586 1586
Lines 255403 255666 +263
==========================================
+ Hits 172883 172911 +28
- Misses 82520 82755 +235 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
rmsource, rmtarget := false, true | ||
if backward { | ||
rmsource, rmtarget = true, false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was another bug in our cancel path that was highlighted by the new test (specifically by trying to query the table after the failed switch).
Signed-off-by: Matt Lord <[email protected]>
e02aacb
to
bb1207a
Compare
Signed-off-by: Matt Lord <[email protected]>
Signed-off-by: Matt Lord <[email protected]>
6cbcc8a
to
2083f25
Compare
Signed-off-by: Matt Lord <[email protected]>
2083f25
to
2253a88
Compare
Description
This PR improves the error handling for cancel failures (see issue). When the context changes are commented out in the PR branch we can see the new error handling in action here when using the manual test in the issue:
And with the context changes in place:
This PR also addresses the inaccuracy of the
max_v_replication_transaction_lag
value as noted in the issue. After adjusting this calculation, the manual test now fails in thecanSwitch()
pre-check as expected:And when monitoring the lag as the test runs you can see it correctly measured when there is little to no lag, you can see it increase as the actual lag increases, and then see it go back down (it is NOT included in the output when the value is 0):
Lastly, for
MoveTables
switch writes specifically, we were not properly undoing or canceling the switching of the tablet_controls->denied_tables list.Note
I think that we should backport this all the way to v19 as each of the issues fixed here — 1) the context usage 2) the vreplication transaction lag calculation 3) the MoveTables switch writes cancel denied tables changes — impact v19+.
Related Issue(s)
Checklist