Have a common ERS for both VtOrc and Vtctl #8492

Merged
189 commits merged into main from prs-reconcile
Oct 18, 2021
Commits
986a198
remove unused constants
GuptaManan100 Jul 16, 2021
0beabaa
moved reparent operations to vtorc
GuptaManan100 Jul 19, 2021
53d1829
start of adding common emergencyReparenter
GuptaManan100 Jul 19, 2021
73a8698
check if fixed and event dispatch for the shard info
GuptaManan100 Jul 21, 2021
ee3ffb0
PreRecoveryProcesses
GuptaManan100 Jul 22, 2021
9f1d06b
stop replication and check shard locked
GuptaManan100 Jul 22, 2021
40e9f96
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 Jul 22, 2021
ba99549
get primary recovery type
GuptaManan100 Jul 22, 2021
aa4a261
find primary candidates
GuptaManan100 Jul 22, 2021
74cbabc
check if need to override primary candidate
GuptaManan100 Jul 22, 2021
5f7c088
start replication
GuptaManan100 Jul 22, 2021
7bfc937
make vtorc use the new reparenter
GuptaManan100 Jul 22, 2021
73d609c
moved the package back
GuptaManan100 Jul 22, 2021
08103c3
bug fix
GuptaManan100 Jul 22, 2021
dfd4f8d
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 Aug 16, 2021
281455f
call the cancel function
GuptaManan100 Aug 16, 2021
9ada8ad
remove dead code
GuptaManan100 Aug 16, 2021
96b9fb5
added creator functions for the reparenters and started switching tests
GuptaManan100 Aug 16, 2021
8e5e41f
moved remaining unit tests to newer ers
GuptaManan100 Aug 16, 2021
eb3a127
remove old ers
GuptaManan100 Aug 16, 2021
873ca33
rename the new ERS
GuptaManan100 Aug 16, 2021
266aba2
added a failing test for ERS which promotes rdonly
GuptaManan100 Aug 17, 2021
e8cc7d9
added a failing test for vtorc which promotes a crossCellReplica
GuptaManan100 Aug 17, 2021
383d78f
stop replication via common code
GuptaManan100 Aug 18, 2021
19d3544
use the candidate list generated from the topo server instead of loca…
GuptaManan100 Aug 18, 2021
5ef4ab6
add function to restrict the candidates based on the type
GuptaManan100 Aug 19, 2021
49cf202
added logger to vtorc
GuptaManan100 Aug 19, 2021
558d7b3
fix test name and comments
GuptaManan100 Aug 19, 2021
073df19
fixed the test so that it does not introduce errant GTIDs
GuptaManan100 Aug 19, 2021
058901c
bug fix
GuptaManan100 Aug 19, 2021
ae49168
handle a TODO
GuptaManan100 Aug 23, 2021
7572ba9
create a new function for promoting primary
GuptaManan100 Aug 23, 2021
96e7244
used the newly created function
GuptaManan100 Aug 23, 2021
b91f34c
added the function to check whether the selected primary candidate is…
GuptaManan100 Aug 24, 2021
e46a4d0
fixed unit tests
GuptaManan100 Aug 24, 2021
4ee197d
create a function to get a better candidate
GuptaManan100 Aug 25, 2021
cd4f548
choose candidate using validCandidate list
GuptaManan100 Aug 25, 2021
2b7b142
refactor code into a new file
GuptaManan100 Aug 25, 2021
a4881e6
added function to replace primary with a better candidate
GuptaManan100 Aug 26, 2021
d7d52f0
fix test timeouts
GuptaManan100 Aug 26, 2021
eecc87e
potential bug fix
GuptaManan100 Aug 26, 2021
cc99d26
refactor override promotion method
GuptaManan100 Aug 26, 2021
186857f
add undo demotion code
GuptaManan100 Aug 26, 2021
ae47acd
promotePrimary to fix replication and semi-sync
GuptaManan100 Aug 27, 2021
1f74e46
fix tests to make the logs less polluted
GuptaManan100 Aug 27, 2021
be0180d
improve test to check replication statuses after failover
GuptaManan100 Aug 27, 2021
984e0dd
improved tests and a bug fix
GuptaManan100 Aug 27, 2021
8a803ae
handle error and waiting time for relay logs differently in vtorc
GuptaManan100 Aug 27, 2021
cf49355
fix lost rdonly test and the bug found
GuptaManan100 Aug 27, 2021
239e05e
fix promotion lag success test
GuptaManan100 Aug 27, 2021
dcd2208
fix promotion lag failure test and uncomment it
GuptaManan100 Aug 27, 2021
899fd0e
improve down primary promotion rule test
GuptaManan100 Aug 27, 2021
f825979
improved down primary promotion rule with lag test
GuptaManan100 Aug 27, 2021
ad31a29
improved down primary promotion rule with lag cross center test
GuptaManan100 Aug 27, 2021
f8fcd81
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 Aug 27, 2021
6f3c125
bug fixes introduced via merging
GuptaManan100 Aug 27, 2021
c43eda7
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 Sep 1, 2021
78c59c2
update promotePrimary rpc
GuptaManan100 Sep 1, 2021
2ccf828
use shardInfo instead of primaryStatusMap and bug fixes
GuptaManan100 Sep 4, 2021
826781c
remove uncalled GetNewPrimary function
GuptaManan100 Sep 6, 2021
5d147cd
remove winningPrimaryTabletAliasString from the vtctlreparentFunction…
GuptaManan100 Sep 6, 2021
56cf2d6
remove winningPosition from the vtctlreparentFunctions struct
GuptaManan100 Sep 6, 2021
ab25e75
remove validCandidates from the vtctlreparentFunctions struct
GuptaManan100 Sep 6, 2021
ab90712
remove statusMaps from the vtctlreparentFunctions struct
GuptaManan100 Sep 6, 2021
ae49119
remove tabletMap from the vtctlreparentFunctions struct
GuptaManan100 Sep 6, 2021
f2bc87d
removed setMaps entirely
GuptaManan100 Sep 6, 2021
c3eab1b
remove lockString from the vtctlreparentFunctions struct
GuptaManan100 Sep 6, 2021
b14ac05
unexport fields from the vtctlreparentFunctions struct
GuptaManan100 Sep 6, 2021
f333a78
remove postponeAll from the vtorcreparentFunctions struct
GuptaManan100 Sep 6, 2021
d8dd849
rename function to PostERSCompletionHook and call it after ERS completes
GuptaManan100 Sep 6, 2021
db58700
handle lint errors
GuptaManan100 Sep 6, 2021
d86f7f3
add topoServer, keyspace and shard to common code
GuptaManan100 Sep 6, 2021
47cae9d
removed keyspace, shard and ts from vtctlReparentFunctions struct
GuptaManan100 Sep 7, 2021
d5010ce
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 Sep 7, 2021
c347dc9
bug fix in test
GuptaManan100 Sep 7, 2021
ed441e4
added inline comments for starting functions of ERS
GuptaManan100 Sep 7, 2021
de81505
remove unused function
GuptaManan100 Sep 7, 2021
3fb0e60
add change type before calling promote replica in prs
GuptaManan100 Sep 7, 2021
d2fcb06
added inline comments upto check for ideal candidate
GuptaManan100 Sep 7, 2021
463277f
handle todo for lockAction and rp
GuptaManan100 Sep 7, 2021
30065e4
add comments to remaining part of ERS
GuptaManan100 Sep 7, 2021
74423b2
revert change to promoteReplica
GuptaManan100 Sep 8, 2021
cf12b41
fixed emergency-reparent unit tests
GuptaManan100 Sep 8, 2021
08d0e54
remove unused code
GuptaManan100 Sep 8, 2021
faf94b8
fixed planned_reparenter unit tests
GuptaManan100 Sep 8, 2021
34f4ac7
fixed reparent shard test
GuptaManan100 Sep 8, 2021
a5b777a
bug fix in finding errant GTIDs
GuptaManan100 Sep 8, 2021
07d2e88
fix slow server tests
GuptaManan100 Sep 9, 2021
0d4e526
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 Sep 9, 2021
536b933
added test in ERS for testing preference of a candidate in the same cell
GuptaManan100 Sep 9, 2021
9a654d5
fix server unit test for vtctldserver
GuptaManan100 Sep 9, 2021
c68f864
found data race
GuptaManan100 Sep 9, 2021
644e0d6
added a todo
GuptaManan100 Sep 9, 2021
115e57b
handle data race
GuptaManan100 Sep 10, 2021
1ccb8ef
fix vtorc tests
GuptaManan100 Sep 10, 2021
488ae35
remove lock shard from interface
GuptaManan100 Sep 13, 2021
6d65dfd
add atomic counter so that only 1 ers is issued at a time
GuptaManan100 Sep 13, 2021
cf2c99e
remove the pre-recovery processes from ers
GuptaManan100 Sep 13, 2021
8100ffd
remove the check for primary recovery type from ers
GuptaManan100 Sep 13, 2021
cabc4a7
removed check fixed function from the interface
GuptaManan100 Sep 15, 2021
7c19858
bug fix for shard counter
GuptaManan100 Sep 15, 2021
136f825
move vtorc to use vtctl ers
GuptaManan100 Sep 15, 2021
aaee0ae
also audit the steps in the callback logger
GuptaManan100 Sep 15, 2021
ca7c79a
removed handle relay log failure from the interface
GuptaManan100 Sep 15, 2021
9b2e13a
remove the ReparentFunctions interface
GuptaManan100 Sep 15, 2021
6009739
renamed reparent functions to EmergencyReparentOptions
GuptaManan100 Sep 15, 2021
570866b
moved lockAction and restrictValidCandidates from the emergencyRepare…
GuptaManan100 Sep 15, 2021
6ae51e7
moved durability policy to reparentUtil
GuptaManan100 Sep 19, 2021
d31fd1b
added sorter for ERS
GuptaManan100 Sep 20, 2021
d199764
use sorter in findPrimaryCandidate
GuptaManan100 Sep 20, 2021
1e02ff3
moved vtorc functionality to promotedReplicaIsIdeal
GuptaManan100 Sep 20, 2021
79c3c88
fixed remaining ers to match newer implementation
GuptaManan100 Sep 20, 2021
d2e2033
also override promotion for mustNotPromotionRules
GuptaManan100 Sep 20, 2021
e9c047b
fixed bug in analysis occuring due to max connection limit to 1
GuptaManan100 Sep 22, 2021
f3c54f2
added prevention for cross cell promotion as an argument
GuptaManan100 Sep 22, 2021
c89a8e2
skip tests that aren't supported yet
GuptaManan100 Sep 22, 2021
afc1787
set durability policy from the vtctld and vtctl binaries
GuptaManan100 Sep 22, 2021
6442969
remove unused code
GuptaManan100 Sep 22, 2021
686b76f
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 Sep 22, 2021
f1932de
remove duplicate function
GuptaManan100 Sep 22, 2021
cba5ac9
revert changes to prs
GuptaManan100 Sep 23, 2021
58b6f37
added flags for passing bool arguement preventing cross cell promotion
GuptaManan100 Sep 23, 2021
55493b2
refactor and addition of comments
GuptaManan100 Sep 23, 2021
1f39f98
removed reparent_functions file in reparentutil
GuptaManan100 Sep 23, 2021
f97051e
fix grpc tests
GuptaManan100 Sep 23, 2021
5c2fa2b
fix emergency_reparent tests
GuptaManan100 Sep 23, 2021
fce72c8
added comments to util file
GuptaManan100 Sep 23, 2021
a5b106f
added logging to ers functions
GuptaManan100 Sep 23, 2021
ea56879
added post ers code to vtorc
GuptaManan100 Sep 23, 2021
bd484e9
remove reparent_functions file from login in vtorc
GuptaManan100 Sep 23, 2021
9e6cd96
add test for ers counters
GuptaManan100 Sep 23, 2021
9416f39
add test for checking constraint satisfaction
GuptaManan100 Sep 23, 2021
5da378b
fix wrangler test
GuptaManan100 Sep 23, 2021
6209781
add test for reparenting replicas
GuptaManan100 Sep 23, 2021
0b57ed5
add test for promoting intermediate primary
GuptaManan100 Sep 23, 2021
dd93330
add test for getting better candidate
GuptaManan100 Sep 23, 2021
d730b90
add test for getting valid candidates and position list
GuptaManan100 Sep 23, 2021
9a6773e
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 Sep 24, 2021
1d4b981
fix error in fakeMySQlDaemon
GuptaManan100 Sep 24, 2021
ea1874b
use vitess log instead of golang log in durability
GuptaManan100 Sep 24, 2021
8a2257a
added test for waiting for catching up
GuptaManan100 Sep 24, 2021
5efd4c8
handle data race in test
GuptaManan100 Sep 24, 2021
5102ae7
added test for restricting valid candidate list
GuptaManan100 Sep 24, 2021
cf7e9b5
added tests for finding intermediate primary and added split brain de…
GuptaManan100 Sep 24, 2021
3f74bca
remove a todo
GuptaManan100 Sep 24, 2021
11d2e0d
move ers tests to their own package
GuptaManan100 Sep 24, 2021
ee1d628
update config file for tests
GuptaManan100 Sep 24, 2021
cb0f05c
remove unused code
GuptaManan100 Sep 24, 2021
786217f
force tablet refresh in vtorc
GuptaManan100 Sep 24, 2021
485b85d
updated functions and added comments
GuptaManan100 Sep 24, 2021
cc8ad44
add few more TODOs
GuptaManan100 Sep 25, 2021
a02b173
add timeout to catchup phase of ers
GuptaManan100 Sep 27, 2021
23cfb1c
handle todo in setReplicationSource
GuptaManan100 Sep 27, 2021
9e5bacd
add configuration to set wait time and use mutex to lock the ERS
GuptaManan100 Sep 27, 2021
31351e3
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 Sep 27, 2021
2751921
bug fix in vtorc
GuptaManan100 Sep 27, 2021
eac76e6
added for all the durability policies
GuptaManan100 Sep 27, 2021
f5d14e0
added unit tests for contraint failure to ers tests
GuptaManan100 Sep 27, 2021
aa09afd
do not cancel replication context until all the replicas are done
GuptaManan100 Sep 27, 2021
3bf3479
rename test package from prs to plannedreparent
GuptaManan100 Sep 28, 2021
ce8dd2f
use assert in tests where we do not want to abort in case of failure
GuptaManan100 Sep 28, 2021
9821156
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 Sep 28, 2021
890349e
ran make vtctldclient vtadmin_web_proto_types
GuptaManan100 Sep 28, 2021
0d4a6bf
fix naming for ERSSorter
GuptaManan100 Sep 28, 2021
6b00bcc
fix config test file to reflect change to package name
GuptaManan100 Sep 28, 2021
d1201f2
fix change in vtctl file
GuptaManan100 Sep 28, 2021
3948d7e
fix import lines
GuptaManan100 Sep 28, 2021
be76f8c
move functions from util to ers which can be used only in EmergencyRe…
GuptaManan100 Sep 28, 2021
3559ce1
removed arguments available in the method
GuptaManan100 Sep 28, 2021
533bc11
fix more imports
GuptaManan100 Sep 28, 2021
243673a
removed the NewEmergencyReparentOptions function
GuptaManan100 Sep 28, 2021
54bbcb9
collect all variables into a single place
GuptaManan100 Sep 28, 2021
18ab55b
moved promotionRules to its own package
GuptaManan100 Sep 28, 2021
0570edf
rename constants and functions in promotionrule package
GuptaManan100 Sep 28, 2021
2276ae3
remove dead code from main_test.go file in vtorc
GuptaManan100 Sep 28, 2021
beae839
renamed function in vtorc tests
GuptaManan100 Sep 28, 2021
a8f4dcb
added a end to end test verifying that the new primary can pull trans…
GuptaManan100 Sep 28, 2021
ed78b1b
refactor and add comments
GuptaManan100 Sep 28, 2021
cb76b33
keep pb imports separate
GuptaManan100 Sep 28, 2021
c4f2b6b
refactor according to review comments
GuptaManan100 Sep 29, 2021
39ca79f
random selection of tablets with respect to cells if preventCrossCell…
GuptaManan100 Sep 29, 2021
9d8456c
add new flag to the usage as well
GuptaManan100 Sep 29, 2021
653efb3
add comments to other boolean literals
GuptaManan100 Sep 29, 2021
1338df4
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 Sep 29, 2021
1f74111
rename some structs and functions for better readability
GuptaManan100 Oct 8, 2021
da23be5
added mutex for protecting concurrent access to the durability policies
GuptaManan100 Oct 11, 2021
e2c1938
Merge remote-tracking branch 'upstream/main' into prs-reconcile
GuptaManan100 Oct 11, 2021
1332dc6
rename 2 functions
GuptaManan100 Oct 11, 2021
90401da
Merge branch 'main' into prs-reconcile
GuptaManan100 Oct 18, 2021
5 changes: 4 additions & 1 deletion go/cmd/vtcombo/main.go
@@ -217,7 +217,10 @@ func main() {
vtg := vtgate.Init(context.Background(), resilientServer, tpb.Cells[0], tabletTypesToWait)

// vtctld configuration and init
vtctld.InitVtctld(ts)
err = vtctld.InitVtctld(ts)
if err != nil {
exit.Return(1)
}

servenv.OnRun(func() {
addStatusParts(vtg)
7 changes: 7 additions & 0 deletions go/cmd/vtctl/vtctl.go
@@ -38,6 +38,7 @@ import (
"vitess.io/vitess/go/vt/vtctl"
"vitess.io/vitess/go/vt/vtctl/grpcvtctldserver"
"vitess.io/vitess/go/vt/vtctl/localvtctldclient"
"vitess.io/vitess/go/vt/vtctl/reparentutil"
"vitess.io/vitess/go/vt/vttablet/tmclient"
"vitess.io/vitess/go/vt/workflow"
"vitess.io/vitess/go/vt/wrangler"
@@ -46,6 +47,7 @@ import (
var (
waitTime = flag.Duration("wait-time", 24*time.Hour, "time to wait on an action")
detachedMode = flag.Bool("detach", false, "detached mode - run vtcl detached from the terminal")
durability = flag.String("durability", "none", "type of durability to enforce. Default is none. Other values are dictated by registered plugins")
)

func init() {
@@ -91,6 +93,11 @@ func main() {
log.Warningf("cannot connect to syslog: %v", err)
}

if err := reparentutil.SetDurabilityPolicy(*durability, nil); err != nil {
log.Errorf("error in setting durability policy: %v", err)
exit.Return(1)
}

closer := trace.StartTracing("vtctl")
defer trace.LogErrorsWhenClosing(closer)

6 changes: 5 additions & 1 deletion go/cmd/vtctld/main.go
@@ -17,6 +17,7 @@ limitations under the License.
package main

import (
"vitess.io/vitess/go/exit"
"vitess.io/vitess/go/vt/servenv"
"vitess.io/vitess/go/vt/topo"
"vitess.io/vitess/go/vt/vtctld"
@@ -40,7 +41,10 @@ func main() {
defer ts.Close()

// Init the vtctld core
vtctld.InitVtctld(ts)
err := vtctld.InitVtctld(ts)
if err != nil {
exit.Return(1)
}

// Register http debug/health
vtctld.RegisterDebugHealthHandler(ts)
13 changes: 8 additions & 5 deletions go/cmd/vtctldclient/command/reparents.go
@@ -76,6 +76,7 @@ var emergencyReparentShardOptions = struct {
WaitReplicasTimeout time.Duration
NewPrimaryAliasStr string
IgnoreReplicaAliasStrList []string
PreventCrossCellPromotion bool
}{}

func commandEmergencyReparentShard(cmd *cobra.Command, args []string) error {
@@ -108,11 +109,12 @@ func commandEmergencyReparentShard(cmd *cobra.Command, args []string) error {
cli.FinishedParsing(cmd)

resp, err := client.EmergencyReparentShard(commandCtx, &vtctldatapb.EmergencyReparentShardRequest{
Keyspace: keyspace,
Shard: shard,
NewPrimary: newPrimaryAlias,
IgnoreReplicas: ignoreReplicaAliases,
WaitReplicasTimeout: protoutil.DurationToProto(emergencyReparentShardOptions.WaitReplicasTimeout),
Keyspace: keyspace,
Shard: shard,
NewPrimary: newPrimaryAlias,
IgnoreReplicas: ignoreReplicaAliases,
WaitReplicasTimeout: protoutil.DurationToProto(emergencyReparentShardOptions.WaitReplicasTimeout),
PreventCrossCellPromotion: emergencyReparentShardOptions.PreventCrossCellPromotion,
})
if err != nil {
return err
@@ -261,6 +263,7 @@ func commandTabletExternallyReparented(cmd *cobra.Command, args []string) error
func init() {
EmergencyReparentShard.Flags().DurationVar(&emergencyReparentShardOptions.WaitReplicasTimeout, "wait-replicas-timeout", *topo.RemoteOperationTimeout, "Time to wait for replicas to catch up in reparenting.")
EmergencyReparentShard.Flags().StringVar(&emergencyReparentShardOptions.NewPrimaryAliasStr, "new-primary", "", "Alias of a tablet that should be the new primary. If not specified, the vtctld will select the best candidate to promote.")
EmergencyReparentShard.Flags().BoolVar(&emergencyReparentShardOptions.PreventCrossCellPromotion, "prevent-cross-cell-promotion", false, "Only promotes a new primary from the same cell as the previous primary")
EmergencyReparentShard.Flags().StringSliceVarP(&emergencyReparentShardOptions.IgnoreReplicaAliasStrList, "ignore-replicas", "i", nil, "Comma-separated, repeated list of replica tablet aliases to ignore during the emergency reparent.")
Root.AddCommand(EmergencyReparentShard)

11 changes: 1 addition & 10 deletions go/mysql/replication_status.go
@@ -123,16 +123,7 @@ func (s *ReplicationStatus) FindErrantGTIDs(otherReplicaStatuses []*ReplicationS
if !ok {
panic("The receiver ReplicationStatus contained a Mysql56GTIDSet in its relay log, but a replica's ReplicationStatus is of another flavor. This should never happen.")
}
// Copy and throw out primary SID from consideration, so we don't mutate input.
otherSetNoPrimarySID := make(Mysql56GTIDSet, len(otherSet))
for sid, intervals := range otherSet {
if sid == status.SourceUUID {
continue
}
otherSetNoPrimarySID[sid] = intervals
}

otherSets = append(otherSets, otherSetNoPrimarySID)
otherSets = append(otherSets, otherSet)
Contributor:

(leaving placeholder for myself to understand why we need this change)

Member Author:

This change was required because I saw that we had a bug in this code, where even if two servers had the exact same GTID set but had different sources set, we were flagging them both as having errant GTIDs. I have added a test for that in TestFindErrantGTIDs. We do not need to remove the GTIDs originating from the source for the servers that we are checking against. We should only remove them for the server that is being tested for errant GTIDs.

Contributor:

> where even if two servers had the exact same GTID set but had different sources set, we were flagging them both as having errant GTIDs.

This confuses me a bit. You only look for errant GTIDs in a descendant of a server; comparing siblings or cousins doesn't have the same strong check guarantees, the way I understand it.

But I'm unfamiliar with this code; I'm unsure how it's being used.

Member Author:

Right now we do not flag a GTID as errant when:

1. It originates from the source server. (The reason is that the failed primary might be running with a semi-sync setup, and this server might be the only one to have acked the transaction, so we can't mark it as errant.)
2. At least two servers have that GTID.

For the second condition, we should not remove the GTIDs that originated from the source for the servers that we are comparing against.

Contributor:

I'm afraid I don't understand the answer. What is the "source" server? Is this the primary?

1. If two servers have that GTID, it could still be errant, assuming these two servers were ever (even briefly) in a parent-child relationship.

Could you please explain the context for this check? I'm unfamiliar with this part of the code because it did not originate in orchestrator.

Member Author:

The "source" server is the server that the tablet we are checking errant GTIDs for replicates from. That would be the primary in most cases.

> if two servers have that GTID, it could still be errant, assuming these two servers were ever (even briefly) in a parent-child relationship.

Yes, you are right. This errant GTID check is far from complete. What it does guarantee, however, is that anything marked as errant by this function will in all cases be errant. It does not guarantee that we catch all errant GTIDs. I have updated the PR description as well, stating that this codepath needs change and should also use the durability policies for a more comprehensive errant GTID check.

We only mark a GTID as errant if it is present only in this server and does not originate from the previous primary. This is what the code does now.
Earlier, we used to mark a GTID as errant even when the same GTID was present in some other server but originated from that server's source. This was wrong, and it is also how I stumbled onto this bug. The additional check restricting the GTIDs in the servers we compared against didn't make any sense.

Member Author:

As to how two tablets had different sources, that can happen because of a failed ERS, or a network partition due to which a server's source wasn't fixed correctly. In any case, these GTIDs should not be marked as errant.
}

// Copy set for final diffSet so we don't mutate receiver.
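The errant-GTID rules discussed in the review thread above can be sketched with a simplified, hypothetical model. The real implementation operates on Mysql56GTIDSet interval sets in go/mysql/replication_status.go; here GTID sets are just maps of server UUID to transaction sequence numbers, to show only the comparison logic the fix changes: source-originated GTIDs are excluded from the tested replica's set, but the other replicas' sets are used in full.

```go
package main

import "fmt"

// gtidSet is a hypothetical, simplified GTID set:
// server UUID -> set of transaction sequence numbers.
type gtidSet map[string]map[int]bool

// findErrantGTIDs mirrors the fixed behavior: a GTID is errant only if it
// (1) does not originate from the tested replica's source (it may have been
// semi-sync acked only here) and (2) is present on no other server. The
// other servers' sets are NOT stripped of source-originated GTIDs.
func findErrantGTIDs(tested gtidSet, sourceUUID string, others []gtidSet) gtidSet {
	errant := gtidSet{}
	for sid, txns := range tested {
		if sid == sourceUUID {
			continue // condition 1: originates from the source server
		}
		for txn := range txns {
			seen := false
			for _, other := range others {
				if other[sid][txn] {
					seen = true // condition 2: another server has it too
					break
				}
			}
			if !seen {
				if errant[sid] == nil {
					errant[sid] = map[int]bool{}
				}
				errant[sid][txn] = true
			}
		}
	}
	return errant
}

func main() {
	source := "src-uuid"
	// Two replicas with identical GTID sets: with the fixed logic,
	// neither is flagged, regardless of their configured sources.
	a := gtidSet{source: {1: true, 2: true}, "sid1": {5: true}}
	b := gtidSet{source: {1: true, 2: true}, "sid1": {5: true}}
	fmt.Println(len(findErrantGTIDs(a, source, []gtidSet{b}))) // 0
}
```

As the thread notes, this check is conservative: everything it flags is errant, but it does not catch all errant GTIDs.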
47 changes: 30 additions & 17 deletions go/mysql/replication_status_test.go
@@ -18,6 +18,8 @@ package mysql

import (
"testing"

"github.com/stretchr/testify/require"
)

func TestStatusReplicationRunning(t *testing.T) {
@@ -81,22 +83,33 @@ func TestFindErrantGTIDs(t *testing.T) {
sourceSID: []interval{{2, 6}, {15, 45}},
}

status1 := ReplicationStatus{SourceUUID: sourceSID, RelayLogPosition: Position{GTIDSet: set1}}
status2 := ReplicationStatus{SourceUUID: sourceSID, RelayLogPosition: Position{GTIDSet: set2}}
status3 := ReplicationStatus{SourceUUID: sourceSID, RelayLogPosition: Position{GTIDSet: set3}}

got, err := status1.FindErrantGTIDs([]*ReplicationStatus{&status2, &status3})
if err != nil {
t.Errorf("%v", err)
}

want := Mysql56GTIDSet{
sid1: []interval{{39, 39}, {40, 49}, {71, 75}},
sid2: []interval{{1, 2}, {6, 7}, {20, 21}, {26, 31}, {38, 50}, {60, 66}},
sid4: []interval{{1, 30}},
}

if !got.Equal(want) {
t.Errorf("got %#v; want %#v", got, want)
testcases := []struct {
mainRepStatus *ReplicationStatus
otherRepStatuses []*ReplicationStatus
want Mysql56GTIDSet
}{{
mainRepStatus: &ReplicationStatus{SourceUUID: sourceSID, RelayLogPosition: Position{GTIDSet: set1}},
otherRepStatuses: []*ReplicationStatus{
{SourceUUID: sourceSID, RelayLogPosition: Position{GTIDSet: set2}},
{SourceUUID: sourceSID, RelayLogPosition: Position{GTIDSet: set3}},
},
want: Mysql56GTIDSet{
sid1: []interval{{39, 39}, {40, 49}, {71, 75}},
sid2: []interval{{1, 2}, {6, 7}, {20, 21}, {26, 31}, {38, 50}, {60, 66}},
sid4: []interval{{1, 30}},
},
}, {
mainRepStatus: &ReplicationStatus{SourceUUID: sourceSID, RelayLogPosition: Position{GTIDSet: set1}},
otherRepStatuses: []*ReplicationStatus{{SourceUUID: sid1, RelayLogPosition: Position{GTIDSet: set1}}},
// servers with the same GTID sets should not be diagnosed with errant GTIDs
want: nil,
}}

for _, testcase := range testcases {
t.Run("", func(t *testing.T) {
got, err := testcase.mainRepStatus.FindErrantGTIDs(testcase.otherRepStatuses)
require.NoError(t, err)
require.Equal(t, testcase.want, got)
})
}
}
204 changes: 204 additions & 0 deletions go/test/endtoend/reparent/emergencyreparent/ers_test.go
@@ -0,0 +1,204 @@
/*
Copyright 2019 The Vitess Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package emergencyreparent

import (
"context"
"testing"
"time"

"github.com/stretchr/testify/require"

"vitess.io/vitess/go/test/endtoend/cluster"
"vitess.io/vitess/go/vt/log"
)

func TestTrivialERS(t *testing.T) {
defer cluster.PanicHandler(t)
setupReparentCluster(t)
defer teardownCluster()

confirmReplication(t, tab1, []*cluster.Vttablet{tab2, tab3, tab4})

// We should be able to do a series of ERS-es, even if nothing
// is down, without issue
for i := 1; i <= 4; i++ {
out, err := ers(nil, "60s", "30s")
log.Infof("ERS loop %d. EmergencyReparentShard Output: %v", i, out)
require.NoError(t, err)
time.Sleep(5 * time.Second)
}
// We should do the same for vtctl binary
for i := 1; i <= 4; i++ {
out, err := ersWithVtctl()
log.Infof("ERS-vtctl loop %d. EmergencyReparentShard Output: %v", i, out)
require.NoError(t, err)
time.Sleep(5 * time.Second)
}
}

func TestReparentIgnoreReplicas(t *testing.T) {
defer cluster.PanicHandler(t)
setupReparentCluster(t)
defer teardownCluster()
var err error

ctx := context.Background()

confirmReplication(t, tab1, []*cluster.Vttablet{tab2, tab3, tab4})

// Make the current primary agent and database unavailable.
stopTablet(t, tab1, true)

// Take down a replica - this should cause the emergency reparent to fail.
stopTablet(t, tab3, true)

// We expect this one to fail because we have an unreachable replica
out, err := ers(nil, "60s", "30s")
require.NotNil(t, err, out)

// Now let's run it again, but set the command to ignore the unreachable replica.
out, err = ersIgnoreTablet(nil, "60s", "30s", []*cluster.Vttablet{tab3}, false)
require.Nil(t, err, out)

// We'll bring back the replica we took down.
restartTablet(t, tab3)

// Check that old primary tablet is left around for human intervention.
confirmOldPrimaryIsHangingAround(t)
deleteTablet(t, tab1)
validateTopology(t, false)

newPrimary := getNewPrimary(t)
// Check new primary has latest transaction.
err = checkInsertedValues(ctx, t, newPrimary, insertVal)
require.Nil(t, err)

// bring back the old primary as a replica, check that it catches up
resurrectTablet(ctx, t, tab1)
}

// TestERSPromoteRdonly tests that we never end up promoting a rdonly instance as the primary
func TestERSPromoteRdonly(t *testing.T) {
defer cluster.PanicHandler(t)
setupReparentCluster(t)
defer teardownCluster()
var err error

err = clusterInstance.VtctlclientProcess.ExecuteCommand("ChangeTabletType", tab2.Alias, "rdonly")
require.NoError(t, err)

err = clusterInstance.VtctlclientProcess.ExecuteCommand("ChangeTabletType", tab3.Alias, "rdonly")
require.NoError(t, err)

confirmReplication(t, tab1, []*cluster.Vttablet{tab2, tab3, tab4})

// Make the current primary agent and database unavailable.
stopTablet(t, tab1, true)

// We expect this one to fail because we have ignored all the replicas and have only the rdonly's which should not be promoted
out, err := ersIgnoreTablet(nil, "30s", "30s", []*cluster.Vttablet{tab4}, false)
require.NotNil(t, err, out)

out, err = clusterInstance.VtctlclientProcess.ExecuteCommandWithOutput("GetShard", keyspaceShard)
require.NoError(t, err)
require.Contains(t, out, `"uid": 101`, "the primary should still be 101 in the shard info")
}

// TestERSPreventCrossCellPromotion tests that we promote a replica in the same cell as the previous primary if prevent cross cell promotion flag is set
func TestERSPreventCrossCellPromotion(t *testing.T) {
defer cluster.PanicHandler(t)
setupReparentCluster(t)
defer teardownCluster()
var err error

// confirm that replication is going smoothly
confirmReplication(t, tab1, []*cluster.Vttablet{tab2, tab3, tab4})

// Make the current primary agent and database unavailable.
stopTablet(t, tab1, true)

// We expect that tab3 will be promoted since it is in the same cell as the previous primary
out, err := ersIgnoreTablet(nil, "60s", "30s", []*cluster.Vttablet{tab2}, true)
require.NoError(t, err, out)

newPrimary := getNewPrimary(t)
require.Equal(t, newPrimary.Alias, tab3.Alias, "tab3 should be the promoted primary")
}

// TestPullFromRdonly tests that if a rdonly tablet is the most advanced, then our promoted primary should have
// caught up to it by pulling transactions from it
func TestPullFromRdonly(t *testing.T) {
defer cluster.PanicHandler(t)
setupReparentCluster(t)
defer teardownCluster()
var err error

ctx := context.Background()
// make tab2 a rdonly tablet.
// rename tablet so that the test is not confusing
rdonly := tab2
err = clusterInstance.VtctlclientProcess.ExecuteCommand("ChangeTabletType", rdonly.Alias, "rdonly")
require.NoError(t, err)

// confirm that all the tablets can replicate successfully right now
confirmReplication(t, tab1, []*cluster.Vttablet{rdonly, tab3, tab4})

// stop replication on the other two tablets
err = clusterInstance.VtctlclientProcess.ExecuteCommand("StopReplication", tab3.Alias)
require.NoError(t, err)
err = clusterInstance.VtctlclientProcess.ExecuteCommand("StopReplication", tab4.Alias)
require.NoError(t, err)

// stop semi-sync on the primary so that any transaction now added does not require an ack
runSQL(ctx, t, "SET GLOBAL rpl_semi_sync_master_enabled = false", tab1)

// confirm that rdonly is able to replicate from our primary
// This will also introduce a new transaction into the rdonly tablet which the other 2 replicas don't have
confirmReplication(t, tab1, []*cluster.Vttablet{rdonly})

// Make the current primary agent and database unavailable.
stopTablet(t, tab1, true)

// start the replication back on the two tablets
err = clusterInstance.VtctlclientProcess.ExecuteCommand("StartReplication", tab3.Alias)
require.NoError(t, err)
err = clusterInstance.VtctlclientProcess.ExecuteCommand("StartReplication", tab4.Alias)
require.NoError(t, err)

// check that tab3 and tab4 still only has 1 value
err = checkCountOfInsertedValues(ctx, t, tab3, 1)
require.NoError(t, err)
err = checkCountOfInsertedValues(ctx, t, tab4, 1)
require.NoError(t, err)

// At this point we have successfully made our rdonly tablet more advanced than tab3 and tab4 without introducing errant GTIDs
// We have simulated a network partition in which the primary and rdonly got isolated and then the primary went down leaving the rdonly most advanced

// We expect that tab3 will be promoted since it is in the same cell as the previous primary
// since we are preventing cross cell promotions
// Also it must be fully caught up
out, err := ersIgnoreTablet(nil, "60s", "30s", nil, true)
require.NoError(t, err, out)

newPrimary := getNewPrimary(t)
require.Equal(t, newPrimary.Alias, tab3.Alias, "tab3 should be the promoted primary")

// check that the new primary has the last transaction that only the rdonly had
err = checkInsertedValues(ctx, t, newPrimary, insertVal)
require.NoError(t, err)
}