-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Have a common ERS for both VtOrc and Vtctl #8492
Merged
Merged
Changes from all commits
Commits
Show all changes
189 commits
Select commit
Hold shift + click to select a range
986a198
remove unused constants
GuptaManan100 0beabaa
moved reparent operations to vtorc
GuptaManan100 53d1829
start of adding common emergencyReparenter
GuptaManan100 73a8698
check if fixed and event dispatch for the shard info
GuptaManan100 ee3ffb0
PreRecoveryProcesses
GuptaManan100 9f1d06b
stop replication and check shard locked
GuptaManan100 40e9f96
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 ba99549
get primary recovery type
GuptaManan100 aa4a261
find primary candidates
GuptaManan100 74cbabc
check if need to override primary candidate
GuptaManan100 5f7c088
start replication
GuptaManan100 7bfc937
make vtorc use the new reparenter
GuptaManan100 73d609c
moved the package back
GuptaManan100 08103c3
bug fix
GuptaManan100 dfd4f8d
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 281455f
call the cancel function
GuptaManan100 9ada8ad
remove dead code
GuptaManan100 96b9fb5
added creator functions for the reparenters and started switching tests
GuptaManan100 8e5e41f
moved remaining unit tests to newer ers
GuptaManan100 eb3a127
remove old ers
GuptaManan100 873ca33
rename the new ERS
GuptaManan100 266aba2
added a failing test for ERS which promotes rdonly
GuptaManan100 e8cc7d9
added a failing test for vtorc which promotes a crossCellReplica
GuptaManan100 383d78f
stop replication via common code
GuptaManan100 19d3544
use the candidate list generated from the topo server instead of loca…
GuptaManan100 5ef4ab6
add function to restrict the candidates based on the type
GuptaManan100 49cf202
added logger to vtorc
GuptaManan100 558d7b3
fix test name and comments
GuptaManan100 073df19
fixed the test so that it does not introduce errant GTIDs
GuptaManan100 058901c
bug fix
GuptaManan100 ae49168
handle a TODO
GuptaManan100 7572ba9
create a new function for promoting primary
GuptaManan100 96e7244
used the newly created function
GuptaManan100 b91f34c
added the function to check whether the selected primary candidate is…
GuptaManan100 e46a4d0
fixed unit tests
GuptaManan100 4ee197d
create a function to get a better candidate
GuptaManan100 cd4f548
choose candidate using validCandidate list
GuptaManan100 2b7b142
refactor code into a new file
GuptaManan100 a4881e6
added function to replace primary with a better candidate
GuptaManan100 d7d52f0
fix test timeouts
GuptaManan100 eecc87e
potential bug fix
GuptaManan100 cc99d26
refactor override promotion method
GuptaManan100 186857f
add undo demotion code
GuptaManan100 ae47acd
promotePrimary to fix replication and semi-sync
GuptaManan100 1f74e46
fix tests to make the logs less polluted
GuptaManan100 be0180d
improve test to check replication statuses after failover
GuptaManan100 984e0dd
improved tests and a bug fix
GuptaManan100 8a803ae
handle error and waiting time for relay logs differently in vtorc
GuptaManan100 cf49355
fix lost rdonly test and the bug found
GuptaManan100 239e05e
fix promotion lag success test
GuptaManan100 dcd2208
fix promotion lag failure test and uncomment it
GuptaManan100 899fd0e
improve down primary promotion rule test
GuptaManan100 f825979
improved down primary promotion rule with lag test
GuptaManan100 ad31a29
improved down primary promotion rule with lag cross center test
GuptaManan100 f8fcd81
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 6f3c125
bug fixes introduced via merging
GuptaManan100 c43eda7
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 78c59c2
update promotePrimary rpc
GuptaManan100 2ccf828
use shardInfo instead of primaryStatusMap and bug fixes
GuptaManan100 826781c
remove uncalled GetNewPrimary function
GuptaManan100 5d147cd
remove winningPrimaryTabletAliasString from the vtctlreparentFunction…
GuptaManan100 56cf2d6
remove winningPosition from the vtctlreparentFunctions struct
GuptaManan100 ab25e75
remove validCandidates from the vtctlreparentFunctions struct
GuptaManan100 ab90712
remove statusMaps from the vtctlreparentFunctions struct
GuptaManan100 ae49119
remove tabletMap from the vtctlreparentFunctions struct
GuptaManan100 f2bc87d
removed setMaps entirely
GuptaManan100 c3eab1b
remove lockString from the vtctlreparentFunctions struct
GuptaManan100 b14ac05
unexport fields from the vtctlreparentFunctions struct
GuptaManan100 f333a78
remove postponeAll from the vtorcreparentFunctions struct
GuptaManan100 d8dd849
rename function to PostERSCompletionHook and call it after ERS completes
GuptaManan100 db58700
handle lint errors
GuptaManan100 d86f7f3
add topoServer, keyspace and shard to common code
GuptaManan100 47cae9d
removed keyspace, shard and ts from vtctlReparentFunctions struct
GuptaManan100 d5010ce
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 c347dc9
bug fix in test
GuptaManan100 ed441e4
added inline comments for starting functions of ERS
GuptaManan100 de81505
remove unused function
GuptaManan100 3fb0e60
add change type before calling promote replica in prs
GuptaManan100 d2fcb06
added inline comments upto check for ideal candidate
GuptaManan100 463277f
handle todo for lockAction and rp
GuptaManan100 30065e4
add comments to remaining part of ERS
GuptaManan100 74423b2
revert change to promoteReplica
GuptaManan100 cf12b41
fixed emergency-reparent unit tests
GuptaManan100 08d0e54
remove unused code
GuptaManan100 faf94b8
fixed planned_reparenter unit tests
GuptaManan100 34f4ac7
fixed reparent shard test
GuptaManan100 a5b777a
bug fix in finding errant GTIDs
GuptaManan100 07d2e88
fix slow server tests
GuptaManan100 0d4e526
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 536b933
added test in ERS for testing preference of a candidate in the same cell
GuptaManan100 9a654d5
fix server unit test for vtctldserver
GuptaManan100 c68f864
found data race
GuptaManan100 644e0d6
added a todo
GuptaManan100 115e57b
handle data race
GuptaManan100 1ccb8ef
fix vtorc tests
GuptaManan100 488ae35
remove lock shard from interface
GuptaManan100 6d65dfd
add atomic counter so that only 1 ers is issued at a time
GuptaManan100 cf2c99e
remove the pre-recovery processes from ers
GuptaManan100 8100ffd
remove the check for primary recovery type from ers
GuptaManan100 cabc4a7
removed check fixed function from the interface
GuptaManan100 7c19858
bug fix for shard counter
GuptaManan100 136f825
move vtorc to use vtctl ers
GuptaManan100 aaee0ae
also audit the steps in the callback logger
GuptaManan100 ca7c79a
removed handle relay log failure from the interface
GuptaManan100 9b2e13a
remove the ReparentFunctions interface
GuptaManan100 6009739
renamed reparent functions to EmergencyReparentOptions
GuptaManan100 570866b
moved lockAction and restrictValidCandidates from the emergencyRepare…
GuptaManan100 6ae51e7
moved durability policy to reparentUtil
GuptaManan100 d31fd1b
added sorter for ERS
GuptaManan100 d199764
use sorter in findPrimaryCandidate
GuptaManan100 1e02ff3
moved vtorc functionality to promotedReplicaIsIdeal
GuptaManan100 79c3c88
fixed remaining ers to match newer implementation
GuptaManan100 d2e2033
also override promotion for mustNotPromotionRules
GuptaManan100 e9c047b
fixed bug in analysis occuring due to max connection limit to 1
GuptaManan100 f3c54f2
added prevention for cross cell promotion as an argument
GuptaManan100 c89a8e2
skip tests that aren't supported yet
GuptaManan100 afc1787
set durability policy from the vtctld and vtctl binaries
GuptaManan100 6442969
remove unused code
GuptaManan100 686b76f
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 f1932de
remove duplicate function
GuptaManan100 cba5ac9
revert changes to prs
GuptaManan100 58b6f37
added flags for passing bool arguement preventing cross cell promotion
GuptaManan100 55493b2
refactor and addition of comments
GuptaManan100 1f39f98
removed reparent_functions file in reparentutil
GuptaManan100 f97051e
fix grpc tests
GuptaManan100 5c2fa2b
fix emergency_reparent tests
GuptaManan100 fce72c8
added comments to util file
GuptaManan100 a5b106f
added logging to ers functions
GuptaManan100 ea56879
added post ers code to vtorc
GuptaManan100 bd484e9
remove reparent_functions file from login in vtorc
GuptaManan100 9e6cd96
add test for ers counters
GuptaManan100 9416f39
add test for checking constraint satisfaction
GuptaManan100 5da378b
fix wrangler test
GuptaManan100 6209781
add test for reparenting replicas
GuptaManan100 0b57ed5
add test for promoting intermediate primary
GuptaManan100 dd93330
add test for getting better candidate
GuptaManan100 d730b90
add test for getting valid candidates and position list
GuptaManan100 9a6773e
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 1d4b981
fix error in fakeMySQlDaemon
GuptaManan100 ea1874b
use vitess log instead of golang log in durability
GuptaManan100 8a2257a
added test for waiting for catching up
GuptaManan100 5efd4c8
handle data race in test
GuptaManan100 5102ae7
added test for restricting valid candidate list
GuptaManan100 cf7e9b5
added tests for finding intermediate primary and added split brain de…
GuptaManan100 3f74bca
remove a todo
GuptaManan100 11d2e0d
move ers tests to their own package
GuptaManan100 ee1d628
update config file for tests
GuptaManan100 cb0f05c
remove unused code
GuptaManan100 786217f
force tablet refresh in vtorc
GuptaManan100 485b85d
updated functions and added comments
GuptaManan100 cc8ad44
add few more TODOs
GuptaManan100 a02b173
add timeout to catchup phase of ers
GuptaManan100 23cfb1c
handle todo in setReplicationSource
GuptaManan100 9e5bacd
add configuration to set wait time and use mutex to lock the ERS
GuptaManan100 31351e3
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 2751921
bug fix in vtorc
GuptaManan100 eac76e6
added for all the durability policies
GuptaManan100 f5d14e0
added unit tests for contraint failure to ers tests
GuptaManan100 aa09afd
do not cancel replication context until all the replicas are done
GuptaManan100 3bf3479
rename test package from prs to plannedreparent
GuptaManan100 ce8dd2f
use assert in tests where we do not want to abort in case of failure
GuptaManan100 9821156
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 890349e
ran make vtctldclient vtadmin_web_proto_types
GuptaManan100 0d4a6bf
fix naming for ERSSorter
GuptaManan100 6b00bcc
fix config test file to reflect change to package name
GuptaManan100 d1201f2
fix change in vtctl file
GuptaManan100 3948d7e
fix import lines
GuptaManan100 be76f8c
move functions from util to ers which can be used only in EmergencyRe…
GuptaManan100 3559ce1
removed arguments available in the method
GuptaManan100 533bc11
fix more imports
GuptaManan100 243673a
removed the NewEmergencyReparentOptions function
GuptaManan100 54bbcb9
collect all variables into a single place
GuptaManan100 18ab55b
moved promotionRules to its own package
GuptaManan100 0570edf
rename constants and functions in promotionrule package
GuptaManan100 2276ae3
remove dead code from main_test.go file in vtorc
GuptaManan100 beae839
renamed function in vtorc tests
GuptaManan100 a8f4dcb
added a end to end test verifying that the new primary can pull trans…
GuptaManan100 ed78b1b
refactor and add comments
GuptaManan100 cb76b33
keep pb imports separate
GuptaManan100 c4f2b6b
refactor according to review comments
GuptaManan100 39ca79f
random selection of tablets with respect to cells if preventCrossCell…
GuptaManan100 9d8456c
add new flag to the usage as well
GuptaManan100 653efb3
add comments to other boolean literals
GuptaManan100 1338df4
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 1f74111
rename some structs and functions for better readability
GuptaManan100 da23be5
added mutex for protecting concurrent access to the durability policies
GuptaManan100 e2c1938
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 1332dc6
rename 2 functions
GuptaManan100 90401da
Merge branch 'main' into prs-reconcile
GuptaManan100 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
204 changes: 204 additions & 0 deletions
204
go/test/endtoend/reparent/emergencyreparent/ers_test.go
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,204 @@ | ||
/* | ||
Copyright 2019 The Vitess Authors. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ | ||
|
||
package emergencyreparent | ||
|
||
import ( | ||
"context" | ||
"testing" | ||
"time" | ||
|
||
"github.com/stretchr/testify/require" | ||
|
||
"vitess.io/vitess/go/test/endtoend/cluster" | ||
"vitess.io/vitess/go/vt/log" | ||
) | ||
|
||
func TestTrivialERS(t *testing.T) { | ||
defer cluster.PanicHandler(t) | ||
setupReparentCluster(t) | ||
defer teardownCluster() | ||
|
||
confirmReplication(t, tab1, []*cluster.Vttablet{tab2, tab3, tab4}) | ||
|
||
// We should be able to do a series of ERS-es, even if nothing | ||
// is down, without issue | ||
for i := 1; i <= 4; i++ { | ||
out, err := ers(nil, "60s", "30s") | ||
log.Infof("ERS loop %d. EmergencyReparentShard Output: %v", i, out) | ||
require.NoError(t, err) | ||
time.Sleep(5 * time.Second) | ||
} | ||
// We should do the same for vtctl binary | ||
for i := 1; i <= 4; i++ { | ||
out, err := ersWithVtctl() | ||
log.Infof("ERS-vtctl loop %d. EmergencyReparentShard Output: %v", i, out) | ||
require.NoError(t, err) | ||
time.Sleep(5 * time.Second) | ||
} | ||
} | ||
|
||
func TestReparentIgnoreReplicas(t *testing.T) { | ||
defer cluster.PanicHandler(t) | ||
setupReparentCluster(t) | ||
defer teardownCluster() | ||
var err error | ||
|
||
ctx := context.Background() | ||
|
||
confirmReplication(t, tab1, []*cluster.Vttablet{tab2, tab3, tab4}) | ||
|
||
// Make the current primary agent and database unavailable. | ||
stopTablet(t, tab1, true) | ||
|
||
// Take down a replica - this should cause the emergency reparent to fail. | ||
stopTablet(t, tab3, true) | ||
|
||
// We expect this one to fail because we have an unreachable replica | ||
out, err := ers(nil, "60s", "30s") | ||
require.NotNil(t, err, out) | ||
|
||
// Now let's run it again, but set the command to ignore the unreachable replica. | ||
out, err = ersIgnoreTablet(nil, "60s", "30s", []*cluster.Vttablet{tab3}, false) | ||
require.Nil(t, err, out) | ||
|
||
// We'll bring back the replica we took down. | ||
restartTablet(t, tab3) | ||
|
||
// Check that old primary tablet is left around for human intervention. | ||
confirmOldPrimaryIsHangingAround(t) | ||
deleteTablet(t, tab1) | ||
validateTopology(t, false) | ||
|
||
newPrimary := getNewPrimary(t) | ||
// Check new primary has latest transaction. | ||
err = checkInsertedValues(ctx, t, newPrimary, insertVal) | ||
require.Nil(t, err) | ||
|
||
// bring back the old primary as a replica, check that it catches up | ||
resurrectTablet(ctx, t, tab1) | ||
} | ||
|
||
// TestERSPromoteRdonly tests that we never end up promoting a rdonly instance as the primary | ||
func TestERSPromoteRdonly(t *testing.T) { | ||
defer cluster.PanicHandler(t) | ||
setupReparentCluster(t) | ||
defer teardownCluster() | ||
var err error | ||
|
||
err = clusterInstance.VtctlclientProcess.ExecuteCommand("ChangeTabletType", tab2.Alias, "rdonly") | ||
require.NoError(t, err) | ||
|
||
err = clusterInstance.VtctlclientProcess.ExecuteCommand("ChangeTabletType", tab3.Alias, "rdonly") | ||
require.NoError(t, err) | ||
|
||
confirmReplication(t, tab1, []*cluster.Vttablet{tab2, tab3, tab4}) | ||
|
||
// Make the current primary agent and database unavailable. | ||
stopTablet(t, tab1, true) | ||
|
||
// We expect this one to fail because we have ignored all the replicas and have only the rdonly's which should not be promoted | ||
out, err := ersIgnoreTablet(nil, "30s", "30s", []*cluster.Vttablet{tab4}, false) | ||
require.NotNil(t, err, out) | ||
|
||
out, err = clusterInstance.VtctlclientProcess.ExecuteCommandWithOutput("GetShard", keyspaceShard) | ||
require.NoError(t, err) | ||
require.Contains(t, out, `"uid": 101`, "the primary should still be 101 in the shard info") | ||
} | ||
|
||
// TestERSPreventCrossCellPromotion tests that we promote a replica in the same cell as the previous primary if prevent cross cell promotion flag is set | ||
func TestERSPreventCrossCellPromotion(t *testing.T) { | ||
defer cluster.PanicHandler(t) | ||
setupReparentCluster(t) | ||
defer teardownCluster() | ||
var err error | ||
|
||
// confirm that replication is going smoothly | ||
confirmReplication(t, tab1, []*cluster.Vttablet{tab2, tab3, tab4}) | ||
|
||
// Make the current primary agent and database unavailable. | ||
stopTablet(t, tab1, true) | ||
|
||
// We expect that tab3 will be promoted since it is in the same cell as the previous primary | ||
out, err := ersIgnoreTablet(nil, "60s", "30s", []*cluster.Vttablet{tab2}, true) | ||
require.NoError(t, err, out) | ||
|
||
newPrimary := getNewPrimary(t) | ||
require.Equal(t, newPrimary.Alias, tab3.Alias, "tab3 should be the promoted primary") | ||
} | ||
|
||
// TestPullFromRdonly tests that if a rdonly tablet is the most advanced, then our promoted primary should have | ||
// caught up to it by pulling transactions from it | ||
func TestPullFromRdonly(t *testing.T) { | ||
defer cluster.PanicHandler(t) | ||
setupReparentCluster(t) | ||
defer teardownCluster() | ||
var err error | ||
|
||
ctx := context.Background() | ||
// make tab2 a rdonly tablet. | ||
// rename tablet so that the test is not confusing | ||
rdonly := tab2 | ||
err = clusterInstance.VtctlclientProcess.ExecuteCommand("ChangeTabletType", rdonly.Alias, "rdonly") | ||
require.NoError(t, err) | ||
|
||
// confirm that all the tablets can replicate successfully right now | ||
confirmReplication(t, tab1, []*cluster.Vttablet{rdonly, tab3, tab4}) | ||
|
||
// stop replication on the other two tablets | ||
err = clusterInstance.VtctlclientProcess.ExecuteCommand("StopReplication", tab3.Alias) | ||
require.NoError(t, err) | ||
err = clusterInstance.VtctlclientProcess.ExecuteCommand("StopReplication", tab4.Alias) | ||
require.NoError(t, err) | ||
|
||
// stop semi-sync on the primary so that any transaction now added does not require an ack | ||
runSQL(ctx, t, "SET GLOBAL rpl_semi_sync_master_enabled = false", tab1) | ||
|
||
// confirm that rdonly is able to replicate from our primary | ||
// This will also introduce a new transaction into the rdonly tablet which the other 2 replicas don't have | ||
confirmReplication(t, tab1, []*cluster.Vttablet{rdonly}) | ||
|
||
// Make the current primary agent and database unavailable. | ||
stopTablet(t, tab1, true) | ||
|
||
// start the replication back on the two tablets | ||
err = clusterInstance.VtctlclientProcess.ExecuteCommand("StartReplication", tab3.Alias) | ||
require.NoError(t, err) | ||
err = clusterInstance.VtctlclientProcess.ExecuteCommand("StartReplication", tab4.Alias) | ||
require.NoError(t, err) | ||
|
||
// check that tab3 and tab4 still only has 1 value | ||
err = checkCountOfInsertedValues(ctx, t, tab3, 1) | ||
require.NoError(t, err) | ||
err = checkCountOfInsertedValues(ctx, t, tab4, 1) | ||
require.NoError(t, err) | ||
|
||
// At this point we have successfully made our rdonly tablet more advanced than tab3 and tab4 without introducing errant GTIDs | ||
// We have simulated a network partition in which the primary and rdonly got isolated and then the primary went down leaving the rdonly most advanced | ||
|
||
// We expect that tab3 will be promoted since it is in the same cell as the previous primary | ||
// since we are preventing cross cell promotions | ||
// Also it must be fully caught up | ||
out, err := ersIgnoreTablet(nil, "60s", "30s", nil, true) | ||
require.NoError(t, err, out) | ||
|
||
newPrimary := getNewPrimary(t) | ||
require.Equal(t, newPrimary.Alias, tab3.Alias, "tab3 should be the promoted primary") | ||
|
||
// check that the new primary has the last transaction that only the rdonly had | ||
err = checkInsertedValues(ctx, t, newPrimary, insertVal) | ||
require.NoError(t, err) | ||
} |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(leaving placeholder for myself to understand why we need this change)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change was required because I saw that we had a bug in this code, were even if two servers had the exact same GTID set but had different sources set, we were flagging them both as having errant GTIDs. I have added a test for that in
TestFindErrantGTIDs
. We do not need to remove the GTIDs originating from the source for the servers that we are checking against. We should only remove them for the server that is being tested for errant GTIDs.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This confuses me a bit. You only look for errant GTIDs in a descendant of a server; comparing siblings or cousins doesn't have the same strong check guarantees, the way I understand it.
But I'm unfamiliar with this code, I'm unsure how it's being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now we do not flag a GTID as errant when -
For the second condition check we should not remove the GTIDs that originated from the source for the servers that we are comparing it against.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid I don't understand the answer. What is the "source" server? Is this the primary?
Could you please explain the context for this check? I'm unfamiliar with this part of the code because it's not originated in
orchestrator
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "source" server is the server which the tablet that we are checking errant GTIDs for. That would be the primary in most cases.
Yes, you are right. This errant GTID check is far from complete. What it does guarantee however is that anything marked as errant from this function will in all cases be errant. It does not guarantee that we catch all errant GTIDs. I have updated the PR description as well stating that this codepath needs change and should also use the durability policies for a more comprehensive errant GTID check.
We only mark a GTID as errant if it is only present in this server and does not originate from the previous primary. This is what the code does now.
Earlier, we used to mark a GTID as errant even when the same GTID was present in some other server but it was originating from that server's source. This was wrong and also how I stumbled onto this bug. The additional check restricting the GTIDs in the servers we checked against didn't make any sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As to how two tablets had different sources, that can happen because a failed ERS, or some network partition due to which a server's source wasn't fixed correctly. Anyways these GTIDs should not be marked as errant.