
Have a common ERS for both VtOrc and Vtctl #8492

Merged Oct 18, 2021 (189 commits)

Conversation

Member

@GuptaManan100 commented Jul 19, 2021

Description

Overview

VtOrc did not use the Emergency Reparent Shard functionality in case of primary failure. This PR changes that behaviour so that all the code paths use a common Emergency Reparent Shard.

VtOrc had a durability policy, along with extra configuration, that it used to make decisions while running reparent operations. Since the new ERS also needs access to this information, the durability policies have been moved to the reparentutil package, and durability can now be specified on the vtctl and vtctld binaries when they are brought up. An additional flag has also been added to EmergencyReparentShard that lets users prevent any cross-cell promotions. As part of reconciling the differences between the two code paths, we have kept the best of both worlds: VtOrc gains the additional capability to detect errant GTIDs, and vtctld gains VtOrc's two-step process of first promoting the most advanced primary candidate and then checking whether that promotion can be improved.
Additional metrics have been added to keep track of the number of ERS runs and the number of successful and failed ERS runs (a sketch follows).
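A minimal sketch of how such counters can be registered with the Vitess stats package; the metric names below are illustrative assumptions, not necessarily the exact ones added by this PR:

import "vitess.io/vitess/go/stats"

var (
	ersCounter        = stats.NewCounter("ers_counter", "Number of times Emergency Reparent Shard has been run")
	ersSuccessCounter = stats.NewCounter("ers_success_counter", "Number of successful ERS runs")
	ersFailureCounter = stats.NewCounter("ers_failure_counter", "Number of failed ERS runs")
)

// recordERSOutcome bumps the counters once an ERS attempt finishes.
func recordERSOutcome(err error) {
	ersCounter.Add(1)
	if err != nil {
		ersFailureCounter.Add(1)
		return
	}
	ersSuccessCounter.Add(1)
}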

New Emergency Reparent Shard Steps

  1. First step is to lock the shard for the given operation.
  2. Read all the tablets and their information.
  3. Stop replication on all the tablets and build the status map which contains their replication status.
  4. Find the valid candidates for becoming the primary. This is where we check for errant GTIDs and remove the tablets that have them from consideration.
  5. Restrict the valid candidates list. We remove any tablet which is of the type DRAINED, RESTORE or BACKUP.
  6. Wait for all candidates to apply relay logs
  7. Find the intermediate source for replication that we want the other tablets to replicate from. This step chooses the most advanced tablet; further ties are broken using the promotion rule. If the user has explicitly specified a tablet, it is selected, as long as it is the most advanced. Here we also check for split-brain scenarios by verifying that the selected replica is more advanced than all the other valid candidates, and we fail if a split brain is detected.
  8. Reparent all the other tablets to start replicating from the intermediate source. We do not promote this tablet to a primary instance; we only let the other replicas start replicating from it.
  9. Try to find the candidate for promotion if we do not already have the ideal one. We prefer a candidate with the best possible durability rule. If the user has disabled cross-cell promotions, we only look for candidates in the same cell as the previous primary. However, if there is an explicit request from the user to promote a specific tablet, we choose that tablet.
  10. If our candidate is different from the intermediate source, we wait for it to catch up to the intermediate source.
  11. Check that the new primary we will promote satisfies all the promotion constraints. In particular, we do not promote a tablet whose promotion rule is MustNotPromoteRule. This additional step also fixes issue #7441 (EmergencyReparentShard promotes rdonly tablet to primary), since all rdonly tablets are assigned this promotion rule.
  12. At this point, we promote our primary candidate. We do this in the final step by calling PromoteReplica, which fixes the semi-sync setup, changes the tablet type, sets the primary to read-write, and flushes the binlogs. We also populate the reparent journal and set up replication on all the replicas. As of now, the PromoteReplica RPC uses the semi-sync information from the tablet flags and not the durability policy. A condensed outline of the whole flow follows this list.
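To make the flow above easier to follow, here is a condensed outline in Go. The helper names are placeholders (only FindValidEmergencyReparentCandidates is a real function, quoted later in this conversation), so treat this as a sketch of the steps rather than the actual emergency_reparenter.go code:

// Illustrative outline of the reparent flow (placeholder helpers).
// The shard is assumed to already be locked by the caller (step 1).
func emergencyReparentOutline(ctx context.Context, opts EmergencyReparentOptions) error {
	tabletMap := readShardTablets(ctx, opts)                                         // step 2
	statusMap, primaryStatusMap := stopReplicationAndBuildStatusMaps(ctx, tabletMap) // step 3

	// Steps 4-5: drop tablets with errant GTIDs, then DRAINED/RESTORE/BACKUP types.
	validCandidates, err := FindValidEmergencyReparentCandidates(statusMap, primaryStatusMap)
	if err != nil {
		return err
	}
	validCandidates = restrictValidCandidates(validCandidates, tabletMap)

	waitForRelayLogsToApply(ctx, validCandidates) // step 6

	// Step 7: the most advanced tablet wins, promotion rules break ties,
	// and a detected split brain aborts the recovery.
	intermediateSource, err := findIntermediateSource(validCandidates, opts)
	if err != nil {
		return err
	}

	reparentReplicas(ctx, intermediateSource, tabletMap) // step 8: no promotion yet

	// Steps 9-10: improve the candidate (durability rules, cell, user request)
	// and let it catch up to the intermediate source if needed.
	newPrimary := identifyPrimaryCandidate(intermediateSource, validCandidates, opts)
	waitForCatchUp(ctx, newPrimary, intermediateSource)

	// Step 11: never promote a MustNotPromoteRule tablet (e.g. rdonly).
	if err := checkPromotionConstraints(newPrimary); err != nil {
		return err
	}

	// Step 12: PromoteReplica fixes semi-sync, changes the tablet type,
	// sets read-write and flushes binlogs; then the reparent journal is
	// populated and replication is reconnected on the replicas.
	return promoteNewPrimary(ctx, newPrimary, tabletMap)
}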

Changes to vtctld Emergency Reparent Shard -

  1. Bug fix so that we do not promote rdonly tablets to primary
  2. Two-step primary selection, which allows us to improve our candidate to one in the same cell if possible. Also, if the user-requested tablet is not the most advanced, we now wait for it to catch up instead of failing.
  3. An additional flag that allows users to prevent cross-cell promotion. It defaults to false, to mimic the old behaviour.
  4. Better detection and reporting of split-brain scenarios: the case where two servers pass the errant-GTID test but each has transactions the other doesn't. Before these changes the tie would have been broken arbitrarily and either server could have been promoted. Now we fail and let the user re-run ERS specifying which tablet they want us to ignore.
  5. Better candidate promotions using the durability policies. These durability policies have been ported over from the vtorc codebase and can now be specified during vtctl and vtctld server startup. A sketch of what such a policy can look like follows this list.
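To make the durability-policy plumbing concrete, here is a hypothetical sketch of what a pluggable policy could look like. The interface, the constant names, and the default policy below are illustrative assumptions, not the exact code in go/vt/vtctl/reparentutil/durability.go:

package reparentutil

import topodatapb "vitess.io/vitess/go/vt/proto/topodata"

// CandidatePromotionRule ranks how eligible a tablet is for promotion; lower is better.
type CandidatePromotionRule int

const (
	MustPromoteRule CandidatePromotionRule = iota
	PreferPromoteRule
	NeutralPromoteRule
	PreferNotPromoteRule
	MustNotPromoteRule
)

// durabilityPolicy is a hypothetical shape for a pluggable policy: it decides a
// tablet's promotion rule and how many semi-sync ACKs its primary must wait for.
type durabilityPolicy interface {
	PromotionRule(tablet *topodatapb.Tablet) CandidatePromotionRule
	SemiSyncAckers(tablet *topodatapb.Tablet) int
}

// durabilityNone sketches a default policy: no semi-sync, and only primary or
// replica tablets may be promoted, so rdonly tablets get MustNotPromoteRule
// (which is what closes issue #7441).
type durabilityNone struct{}

func (durabilityNone) PromotionRule(tablet *topodatapb.Tablet) CandidatePromotionRule {
	switch tablet.Type {
	case topodatapb.TabletType_PRIMARY, topodatapb.TabletType_REPLICA:
		return NeutralPromoteRule
	default:
		return MustNotPromoteRule
	}
}

func (durabilityNone) SemiSyncAckers(tablet *topodatapb.Tablet) int { return 0 }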

Changes to VtOrc -

  1. In case we need to override a promotion, we now do it properly instead of failing while the earlier promotion remains in effect.
  2. Check whether any tablet has errant GTIDs before promotion. This check was not present in vtorc earlier, and it prevents us from promoting a server that has errant GTIDs.
  3. Reconciled both code paths: vtorc now uses Emergency Reparent Shard while keeping the same behaviour and checks.
  4. Removed lost replicas check and hooks. ERS does not check for replicas that can no longer replicate and does not run lost replica hooks on them.
  5. Removed pre-recovery hooks.
  6. ERS no longer has a MySQL version or binlog-format check. Vtorc earlier checked that the promoted primary ran the same major version as the majority of the other servers. Also, when choosing the final candidate, we used to check whether it could replicate from the intermediate source; that check has also been removed.
  7. Code related to binlog server has been removed since it is not relevant to Vitess.

Changes not in this PR but will be addressed later -

  1. LockShard usage to prevent conflicting recoveries from vtorc and refreshing ephemeral information after shard locking.
  2. Errant GTID detection will eventually become redundant once the full durability policy code is in.
  3. Durability policies are not yet used everywhere, but eventually they will be. The places that should use them but don't yet are -
    a. Fixing semi-sync on the primary still uses the vttablet flag. This flag should be deprecated and the durability policies should be used instead.
    b. The ERS code exits if more than one tablet is unreachable. We should instead use the durability policies to decide whether we can still run a successful promotion even with multiple failures (see the sketch after this list).
    c. The errant-GTID detection code does not flag a GTID as errant as long as it is present on two servers. We should instead use the durability policies to decide whether it is errant.
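As a concrete illustration of direction (b), reusing the hypothetical durabilityPolicy interface sketched earlier in this description (illustrative only, not planned code):

// Rather than bailing out when more than one tablet is unreachable, ask the
// durability policy whether the reachable replicas can still satisfy the
// candidate primary's semi-sync requirement.
func canProceedWithPromotion(policy durabilityPolicy, candidate *topodatapb.Tablet, reachableReplicas []*topodatapb.Tablet) bool {
	return len(reachableReplicas) >= policy.SemiSyncAckers(candidate)
}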

Other changes in this PR -

  1. Bug fix in the code used to find errant GTIDs. Previously there were cases where, even though two tablets had the same GTID set, both were marked as having errant GTIDs (a toy illustration follows).
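A toy illustration of the invariant the fix restores, using plain Go maps rather than the actual Mysql56GTIDSet code:

package main

import "fmt"

// errantAgainst returns the transactions that own has but peer does not
// (toy representation: map from server UUID to highest sequence number).
func errantAgainst(own, peer map[string]int64) map[string]int64 {
	errant := map[string]int64{}
	for sid, seq := range own {
		if peer[sid] < seq {
			errant[sid] = seq
		}
	}
	return errant
}

func main() {
	a := map[string]int64{"uuid-1": 100, "uuid-2": 5}
	b := map[string]int64{"uuid-1": 100, "uuid-2": 5}
	// Identical GTID sets: neither tablet should be flagged as having errant
	// GTIDs, no matter which source each of them was last replicating from.
	fmt.Println(errantAgainst(a, b)) // map[]
	fmt.Println(errantAgainst(b, a)) // map[]
}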

Related Issue(s)

Checklist

  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Needs to be specified in the release notes

Member

@rafael left a comment


One more pass on my end. One thing that I'm struggling with a bit is convincing myself that we didn't miss anything important that was happening in the VOrchestrator logic and that is not ported to the new ERS.

I haven't looked closely at the tests, so I might be able to get more clarity on this as I take a closer look at them.

@@ -360,7 +360,9 @@ func StopReplicas(replicas [](*Instance), stopReplicationMethod StopReplicationM

// StopReplicasNicely will attemt to stop all given replicas nicely, up to timeout
func StopReplicasNicely(replicas [](*Instance), timeout time.Duration) [](*Instance) {
return StopReplicas(replicas, StopReplicationNice, timeout)
stoppedReplicas := StopReplicas(replicas, StopReplicationNice, timeout)
stoppedReplicas = RemoveNilInstances(stoppedReplicas)
Member


I see. This is great context. I think we should document this in the comments for RemoveNilInstances so future readers know why we need to remove the nil instances.

@@ -815,113 +613,120 @@ func checkAndRecoverDeadPrimary(analysisEntry inst.ReplicationAnalysis, candidat
if !(forceInstanceRecovery || analysisEntry.ClusterDetails.HasAutomatedPrimaryRecovery) {
return false, nil, nil
}
tablet, err := TabletRefresh(analysisEntry.AnalyzedInstanceKey)
Member


Correct. I don't think they can be entirely removed (that was not my suggestion). But we should strive to keep them to a minimum. Here we are adding more calls than we used to have.

In my opinion this should be revisited before GA.

go/vt/orchestrator/logic/topology_recovery.go (resolved review thread)
return false, topologyRecovery, err

// check if we have received an ERS in progress, if we do, we should not continue with the recovery
if checkAndSetIfERSInProgress() {
Member


I see. Let's also talk about this in our sync. Curious if this is needed for GA.


// find the valid candidates for becoming the primary
// this is where we check for errant GTIDs and remove the tablets that have them from consideration
validCandidates, err = FindValidEmergencyReparentCandidates(statusMap, primaryStatusMap)
Member


In orchestrator, as part of finding replica candidates (chooseCandidateReplica), it looks like some replicas get removed by various checks. The ones in particular that I was trying to find here are the checks in CanReplicateFrom. Were those removed? Or am I missing them?

Member Author


Those checks were MySQL version and binlog format checks. They have been removed for now.

Member

rafael commented Oct 5, 2021

Bringing back from a conversation in Slack with Deepthi:

Some logic in the original VOrchestrator hasn't been ported over. This seems to be because:

  • It was decided it's no longer relevant.
  • It is relevant, but it was out of scope for this initial iteration.

Some examples of these changes are (or I couldn't find the equivalent in the new code):

  • Recovery from binlog server logic (PrimaryRecoveryBinlogServer).
  • Some sanity checks around IsSmallerMajorVersion.
  • Sanity checks defined in replica.CanReplicateFrom.

We should have a detailed accounting of how this new implementation deviates from the original one. @GuptaManan100, I think you did a great job documenting the new flow, but calling out explicitly what was left out will be super helpful. That way, people can have context on what to expect from VOrchestrator when comparing it to Orchestrator.

I think that it would be great to document:

  • The most relevant pieces of logic that were removed and why?
  • The most relevant pieces of logic removed that will be added in future versions.

@shlomi-noach
Contributor

> irrespective of whether the flag PreventCrossCellPromotion is set to true or false, the code would still prefer promoting a tablet from the same cell, so if there is a REPLICA tablet in the same cell then that is the one that will get elected. This logic lives in the identifyPrimaryCandidate function.
> After talking with @sougou, I have changed the behaviour of ERS to not prefer candidates from the same cell if preventCrossCellPromotion is set false. This will be much closer to the previous version of ERS where we choose the most advanced primary candidate without looking at the cell information at all.

The orchestrator way of dealing with this: use promotion rules. Yes, orchestrator generally prefers same-DC/zone servers, but then the user can choose to assign a neutral promotion rule for local servers, and a prefer promotion rule for remote servers, and that's really the way to tell orchestrator how to proceed. The flag PreventCrossCellPromotion is more of a kill-switch (I think I added it after the infamous 2019 outage).

In my opinion, setting PreventCrossCellPromotion = false should not mean "prefer cross cell promotion". It doesn't read like that, English-wise. I think this will confuse many people. If anything, add a flag named EnforceCrossCellPromotion. But again, I think the right way to go is promotion rules.
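For readers unfamiliar with orchestrator's promotion rules, a minimal illustration of rule-based tie-breaking between equally advanced candidates, reusing the hypothetical types sketched in the PR description above (not the actual ers_sorter.go logic):

// Lower CandidatePromotionRule values win the tie. Assigning PreferPromoteRule
// to remote servers and NeutralPromoteRule to local ones, as in the example
// above, steers promotion without needing a cross-cell kill-switch.
func betterCandidate(a, b *topodatapb.Tablet, policy durabilityPolicy) *topodatapb.Tablet {
	if policy.PromotionRule(a) <= policy.PromotionRule(b) {
		return a
	}
	return b
}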

@shlomi-noach
Contributor

@rafael PrimaryRecoveryBinlogServer is something I developed at, and for, Booking.com, and was literally never in actual use. Over the years there's been many dreams about new/different binlog server implementations users would use. These never came true, and the code is anyway irrelevant to vitess. This removal makes perfect sense.

Contributor

@shlomi-noach left a comment


I reviewed about half the changes, before my brain shut down 😛
I still need to review durability.go, emergency_reparenter.go, ers_sorter.go etc.; these actually seem to be the more critical changes 🤕

// Copy and throw out primary SID from consideration, so we don't mutate input.
otherSetNoPrimarySID := make(Mysql56GTIDSet, len(otherSet))
for sid, intervals := range otherSet {
if sid == status.SourceUUID {
continue
}
otherSetNoPrimarySID[sid] = intervals
}

otherSets = append(otherSets, otherSetNoPrimarySID)
otherSets = append(otherSets, otherSet)
Contributor


> where even if two servers had the exact same GTID set but had different sources set, we were flagging them both as having errant GTIDs.

This confuses me a bit. You only look for errant GTIDs in a descendant of a server; comparing siblings or cousins doesn't have the same strong check guarantees, the way I understand it.

But I'm unfamiliar with this code, I'm unsure how it's being used.

// That's it! We must do recovery!
// TODO(sougou): This function gets called by GracefulPrimaryTakeover which may
// need to obtain shard lock before getting here.
unlock, err := LockShard(analysisEntry.AnalyzedInstanceKey)
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is LockShard removed?

Member Author


LockShard is called inside the ERS code, so we should not call it outside.

Contributor


Cool, follow-up question, and I'm asking because I'm not sure how the flow works; basically we shouldn't even begin the operation if LockShard fails. My question is: since LockShard is called inside the ERS code, are there any steps taken before that point that we should avoid?

Member Author


Nope, there are no operations that require the shard to be locked before ERS is called.

Member Author


All those operations are for counters and for deciding whether we should run an operation at all in the first place. But yes, we should be doing a tablet refresh after the shard is locked. That problem is out of scope for this PR, though, and will be addressed before the GA release of vtorc.

go/vt/orchestrator/logic/topology_recovery.go (resolved review thread)
return fmt.Errorf("durability policy %v not found", name)
}
log.Infof("Durability setting: %v", name)
curDurabilityPolicy = newDurabilityCreationFunc(durabilityParams)
Contributor


Let's add a mutex now, so that if/when we eventually support dynamic durability changes, we are protected.
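A minimal sketch of the suggested mutex, assuming sync, fmt, and the Vitess log package are imported. curDurabilityPolicy, newDurabilityCreationFunc, and durabilityParams come from the quoted snippet; the durabilityPolicies registry map and the exact function signature are assumptions:

var (
	curDurabilityPolicyMu sync.Mutex
	curDurabilityPolicy   durabilityPolicy // whichever interface the package's policies implement
)

// SetDurabilityPolicy looks up a registered policy by name and installs it
// under the mutex, so a future dynamic-reload path stays race-free.
func SetDurabilityPolicy(name string, durabilityParams map[string]string) error {
	newDurabilityCreationFunc, found := durabilityPolicies[name]
	if !found {
		return fmt.Errorf("durability policy %v not found", name)
	}
	log.Infof("Durability setting: %v", name)

	curDurabilityPolicyMu.Lock()
	defer curDurabilityPolicyMu.Unlock()
	curDurabilityPolicy = newDurabilityCreationFunc(durabilityParams)
	return nil
}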

Member

rafael commented Oct 6, 2021

> @rafael PrimaryRecoveryBinlogServer is something I developed at, and for, Booking.com, and was literally never in actual use. Over the years there's been many dreams about new/different binlog server implementations users would use. These never came true, and the code is anyway irrelevant to vitess. This removal makes perfect sense.

This makes sense and I assumed it was something along those lines. We never used it either, so I have no worries about this being removed :D

I'm highlighting that there seem to be decisions to remove certain things (with good reasons), but they're not obvious to readers of the PR. Being very explicit about those seems like good context for folks trying to get involved in this work.

Member

rafael commented Oct 10, 2021

I took one more pass and I didn't find anything new that we haven't discussed already. It seems that there are still a few outstanding comments from Shlomi. From my perspective, once those are resolved, this is good to merge and keep iterating.

As discussed, if we can be super explicit on this it will be good context for other folks following this work.

From my notes on the last session, for the next iteration we will need to expand further on:

  • What to do about LockShard and decide if the current approach is good enough for GA?
  • It seems that some pieces of existing logic for errant GTIDs might be redundant (this is something @sougou called out).
  • The areas where the pluggable durability is not being used yet and some assumptions are being made.

@GuptaManan100
Member Author

@ajm188 could you take a look now that I have addressed all of your review comments?

@ajm188 dismissed their stale review October 11, 2021 16:45

all my blocking comments have been addressed

Contributor

@ajm188 left a comment


one more rename we should do before merging, but i've gotten myself out of the blocking path

go/vt/vtctl/reparentutil/durability.go (outdated; resolved review thread)
Signed-off-by: Manan Gupta <[email protected]>
Contributor

@shlomi-noach left a comment


I did not cover the entire set of changes. Even though I'm naturally very familiar with orchestrator code, I found this PR to be a bit overwhelming. Most of my review comments were based on the logic in some functions, but I lack the understanding of how everything is combined. I know there are recorded meetings I can watch later.
Notably, I didn't do a good review job on go/vt/vtctl/reparentutil/...

I do appreciate that there seems to be good testing added. The tests look legit and are very well commented.

return fmt.Errorf("durability policy %v not found", name)
}
log.Infof("Durability setting: %v", name)
curDurabilityPolicy = newDurabilityCreationFunc(durabilityParams)
Contributor


Throughout this file I only see use of Neutral and MustNot. I'd rather see Prefer and MustNot.
Logically, the two options are the same. Both Neutral and Prefer are good as candidates, and obviously "better" than MustNot.
However, Prefer is more indicative that "yes, this server is really a good one". In orchestrator code, when orchestrator sees a prefer server, it is able to cut short further investigation; whereas when orchestrator promotes a Neutral server, it proceeds to check "is there any server better than this?".

Non-blocking comment for your consideration.

Member

@rafael left a comment


Catching up with the latest updates. From my perspective, good to merge and keep iterating.

I think we should be super loud with the community when this makes it into the next release, so that they fully test ERS in their environments.

@GuptaManan100
Member Author

Yes, you are right @shlomi-noach, we are going to make a lot more changes to the durability policies and will also use prefer promote rules.

Labels
  • Component: VTorc (Vitess Orchestrator integration)
  • release notes (needs details): This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...)
  • Type: Enhancement (Logical improvement, somewhere between a bug and feature)
Development

Successfully merging this pull request may close these issues.

EmergencyReparentShard promotes rdonly tablet to primary
6 participants