
Have a common ERS for both VtOrc and Vtctl #8492

Merged Oct 18, 2021 (189 commits)

Conversation

Member

@GuptaManan100 commented Jul 19, 2021

Description

Overview

VtOrc did not use the Emergency Reparent Shard functionality in case of primary failure. This PR changes that behaviour so that all the code paths use a common Emergency Reparent Shard.

VtOrc had a durability policy, along with extra configuration, that it used to make decisions while running reparent operations. Since the new ERS also needs access to this information, the durability policies have been moved to the reparentutil package, and durability can now be specified on the vtctl and vtctld binaries when they are brought up. An additional flag has also been added to EmergencyReparentShard that lets users prevent any cross-cell promotions. As part of reconciling the differences between the two code paths, we have kept the best of both worlds: VtOrc gains the additional capability to detect errant GTIDs, and vtctld gains VtOrc's two-step process of first promoting the most advanced primary candidate and then checking whether that promotion can be improved.
Additional metrics have been added to keep track of the number of ERS runs and the number of successful and failed ERS runs (a sketch follows).
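A minimal sketch of how such counters can be registered with the Vitess stats package; the metric names below are illustrative assumptions, not necessarily the exact ones added by this PR:

import "vitess.io/vitess/go/stats"

var (
	ersCounter        = stats.NewCounter("ers_counter", "Number of times Emergency Reparent Shard has been run")
	ersSuccessCounter = stats.NewCounter("ers_success_counter", "Number of successful ERS runs")
	ersFailureCounter = stats.NewCounter("ers_failure_counter", "Number of failed ERS runs")
)

// recordERSOutcome bumps the counters once an ERS attempt finishes.
func recordERSOutcome(err error) {
	ersCounter.Add(1)
	if err != nil {
		ersFailureCounter.Add(1)
		return
	}
	ersSuccessCounter.Add(1)
}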

New Emergency Reparent Shard Steps

  1. First step is to lock the shard for the given operation.
  2. Read all the tablets and their information.
  3. Stop replication on all the tablets and build the status map which contains their replication status.
  4. Find the valid candidates for becoming the primary. This is where we check for errant GTIDs and remove the tablets that have them from consideration.
  5. Restrict the valid candidates list. We remove any tablet which is of the type DRAINED, RESTORE or BACKUP.
  6. Wait for all candidates to apply relay logs
  7. Find the intermediate source for replication that we want the other tablets to replicate from. This step chooses the most advanced tablet; further ties are broken using the promotion rule. If the user has explicitly specified a tablet, it is selected, as long as it is the most advanced. Here we also check for split-brain scenarios by verifying that the selected replica is more advanced than all the other valid candidates, and we fail if a split brain is detected.
  8. Reparent all the other tablets to start replicating from the intermediate source. We do not promote this tablet to a primary instance; we only let the other replicas start replicating from it.
  9. Try to find the candidate for promotion if we do not already have the ideal one. We prefer a candidate with the best possible durability rule. If the user has disabled cross-cell promotions, we only look for candidates in the same cell as the previous primary. However, if there is an explicit request from the user to promote a specific tablet, we choose that tablet.
  10. If our candidate is different from the intermediate source, we wait for it to catch up to the intermediate source.
  11. Check that the new primary we will promote satisfies all the promotion constraints. In particular, we do not promote a tablet whose promotion rule is MustNotPromoteRule. This additional step also fixes issue #7441 (EmergencyReparentShard promotes rdonly tablet to primary), since all rdonly tablets are assigned this promotion rule.
  12. At this point, we promote our primary candidate. We do this in the final step by calling PromoteReplica, which fixes the semi-sync setup, changes the tablet type, sets the primary to read-write, and flushes the binlogs. We also populate the reparent journal and set up replication on all the replicas. As of now, the PromoteReplica RPC uses the semi-sync information from the tablet flags and not the durability policy. A condensed outline of the whole flow follows this list.
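To make the flow above easier to follow, here is a condensed outline in Go. The helper names are placeholders (only FindValidEmergencyReparentCandidates is a real function, quoted later in this conversation), so treat this as a sketch of the steps rather than the actual emergency_reparenter.go code:

// Illustrative outline of the reparent flow (placeholder helpers).
// The shard is assumed to already be locked by the caller (step 1).
func emergencyReparentOutline(ctx context.Context, opts EmergencyReparentOptions) error {
	tabletMap := readShardTablets(ctx, opts)                                         // step 2
	statusMap, primaryStatusMap := stopReplicationAndBuildStatusMaps(ctx, tabletMap) // step 3

	// Steps 4-5: drop tablets with errant GTIDs, then DRAINED/RESTORE/BACKUP types.
	validCandidates, err := FindValidEmergencyReparentCandidates(statusMap, primaryStatusMap)
	if err != nil {
		return err
	}
	validCandidates = restrictValidCandidates(validCandidates, tabletMap)

	waitForRelayLogsToApply(ctx, validCandidates) // step 6

	// Step 7: the most advanced tablet wins, promotion rules break ties,
	// and a detected split brain aborts the recovery.
	intermediateSource, err := findIntermediateSource(validCandidates, opts)
	if err != nil {
		return err
	}

	reparentReplicas(ctx, intermediateSource, tabletMap) // step 8: no promotion yet

	// Steps 9-10: improve the candidate (durability rules, cell, user request)
	// and let it catch up to the intermediate source if needed.
	newPrimary := identifyPrimaryCandidate(intermediateSource, validCandidates, opts)
	waitForCatchUp(ctx, newPrimary, intermediateSource)

	// Step 11: never promote a MustNotPromoteRule tablet (e.g. rdonly).
	if err := checkPromotionConstraints(newPrimary); err != nil {
		return err
	}

	// Step 12: PromoteReplica fixes semi-sync, changes the tablet type,
	// sets read-write and flushes binlogs; then the reparent journal is
	// populated and replication is reconnected on the replicas.
	return promoteNewPrimary(ctx, newPrimary, tabletMap)
}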

Changes to vtctld Emergency Reparent Shard -

  1. Bug fix so that we do not promote rdonly tablets to primary
  2. Two-step primary selection, which allows us to improve our candidate to one in the same cell if possible. Also, if the user-requested tablet is not the most advanced, we now wait for it to catch up instead of failing.
  3. An additional flag that allows users to prevent cross-cell promotion. It defaults to false, to mimic the old behaviour.
  4. Better detection and reporting of split-brain scenarios: the case where two servers pass the errant-GTID test but each has transactions the other doesn't. Before these changes the tie would have been broken arbitrarily and either server could have been promoted. Now we fail and let the user re-run ERS specifying which tablet they want us to ignore.
  5. Better candidate promotions using the durability policies. These durability policies have been ported over from the vtorc codebase and can now be specified during vtctl and vtctld server startup. A sketch of what such a policy can look like follows this list.
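To make the durability-policy plumbing concrete, here is a hypothetical sketch of what a pluggable policy could look like. The interface, the constant names, and the default policy below are illustrative assumptions, not the exact code in go/vt/vtctl/reparentutil/durability.go:

package reparentutil

import topodatapb "vitess.io/vitess/go/vt/proto/topodata"

// CandidatePromotionRule ranks how eligible a tablet is for promotion; lower is better.
type CandidatePromotionRule int

const (
	MustPromoteRule CandidatePromotionRule = iota
	PreferPromoteRule
	NeutralPromoteRule
	PreferNotPromoteRule
	MustNotPromoteRule
)

// durabilityPolicy is a hypothetical shape for a pluggable policy: it decides a
// tablet's promotion rule and how many semi-sync ACKs its primary must wait for.
type durabilityPolicy interface {
	PromotionRule(tablet *topodatapb.Tablet) CandidatePromotionRule
	SemiSyncAckers(tablet *topodatapb.Tablet) int
}

// durabilityNone sketches a default policy: no semi-sync, and only primary or
// replica tablets may be promoted, so rdonly tablets get MustNotPromoteRule
// (which is what closes issue #7441).
type durabilityNone struct{}

func (durabilityNone) PromotionRule(tablet *topodatapb.Tablet) CandidatePromotionRule {
	switch tablet.Type {
	case topodatapb.TabletType_PRIMARY, topodatapb.TabletType_REPLICA:
		return NeutralPromoteRule
	default:
		return MustNotPromoteRule
	}
}

func (durabilityNone) SemiSyncAckers(tablet *topodatapb.Tablet) int { return 0 }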

Changes to VtOrc -

  1. In case we need to override a promotion, we now do it properly instead of failing while the earlier promotion remains in effect.
  2. Check whether any tablet has errant GTIDs before promotion. This check was not present in vtorc earlier, and it prevents us from promoting a server that has errant GTIDs.
  3. Reconciled both code paths: vtorc now uses Emergency Reparent Shard while keeping the same behaviour and checks.
  4. Removed lost replicas check and hooks. ERS does not check for replicas that can no longer replicate and does not run lost replica hooks on them.
  5. Removed pre-recovery hooks.
  6. ERS no longer has a MySQL version or binlog-format check. Vtorc earlier checked that the promoted primary ran the same major version as the majority of the other servers. Also, when choosing the final candidate, we used to check whether it could replicate from the intermediate source; that check has also been removed.
  7. Code related to binlog server has been removed since it is not relevant to Vitess.

Changes not in this PR but will be addressed later -

  1. LockShard usage to prevent conflicting recoveries from vtorc and refreshing ephemeral information after shard locking.
  2. Errant GTID detection will eventually become redundant once the full durability policy code is in.
  3. Durability policies are not yet used everywhere, but eventually they will be. The places that should use them but don't yet are -
    a. Fixing semi-sync on the primary still uses the vttablet flag. This flag should be deprecated and the durability policies should be used instead.
    b. The ERS code exits if more than one tablet is unreachable. We should instead use the durability policies to decide whether we can still run a successful promotion even with multiple failures (see the sketch after this list).
    c. The errant-GTID detection code does not flag a GTID as errant as long as it is present on two servers. We should instead use the durability policies to decide whether it is errant.
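As a concrete illustration of direction (b), reusing the hypothetical durabilityPolicy interface sketched earlier in this description (illustrative only, not planned code):

// Rather than bailing out when more than one tablet is unreachable, ask the
// durability policy whether the reachable replicas can still satisfy the
// candidate primary's semi-sync requirement.
func canProceedWithPromotion(policy durabilityPolicy, candidate *topodatapb.Tablet, reachableReplicas []*topodatapb.Tablet) bool {
	return len(reachableReplicas) >= policy.SemiSyncAckers(candidate)
}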

Other changes in this PR -

  1. Bug fix in the code used to find errant GTIDs. Previously there were cases where, even though two tablets had the same GTID set, both were marked as having errant GTIDs (a toy illustration follows).
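A toy illustration of the invariant the fix restores, using plain Go maps rather than the actual Mysql56GTIDSet code:

package main

import "fmt"

// errantAgainst returns the transactions that own has but peer does not
// (toy representation: map from server UUID to highest sequence number).
func errantAgainst(own, peer map[string]int64) map[string]int64 {
	errant := map[string]int64{}
	for sid, seq := range own {
		if peer[sid] < seq {
			errant[sid] = seq
		}
	}
	return errant
}

func main() {
	a := map[string]int64{"uuid-1": 100, "uuid-2": 5}
	b := map[string]int64{"uuid-1": 100, "uuid-2": 5}
	// Identical GTID sets: neither tablet should be flagged as having errant
	// GTIDs, no matter which source each of them was last replicating from.
	fmt.Println(errantAgainst(a, b)) // map[]
	fmt.Println(errantAgainst(b, a)) // map[]
}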

Related Issue(s)

Checklist

  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Needs to be specified in the release notes

Member

@rafael left a comment


One more pass on my end. One thing that I'm struggling with a bit is convincing myself that we didn't miss anything important that was happening in the VOrchestrator logic and that is not ported to the new ERS.

I haven't looked closely at the tests, so I might be able to get more clarity on this as I take a closer look at them.

@@ -360,7 +360,9 @@ func StopReplicas(replicas [](*Instance), stopReplicationMethod StopReplicationM

// StopReplicasNicely will attemt to stop all given replicas nicely, up to timeout
func StopReplicasNicely(replicas [](*Instance), timeout time.Duration) [](*Instance) {
return StopReplicas(replicas, StopReplicationNice, timeout)
stoppedReplicas := StopReplicas(replicas, StopReplicationNice, timeout)
stoppedReplicas = RemoveNilInstances(stoppedReplicas)
Member


I see. This is great context. I think we should document this in the comments for RemoveNilInstances so future readers know why we need to remove the nil instances.

@@ -815,113 +613,120 @@ func checkAndRecoverDeadPrimary(analysisEntry inst.ReplicationAnalysis, candidat
if !(forceInstanceRecovery || analysisEntry.ClusterDetails.HasAutomatedPrimaryRecovery) {
return false, nil, nil
}
tablet, err := TabletRefresh(analysisEntry.AnalyzedInstanceKey)
Member


Correct. I don't think they can be entirely removed (that was not my suggestion). But we should strive to keep them to a minimum. Here we are adding more calls than we used to have.

In my opinion this should be revisited before GA.

go/vt/orchestrator/logic/topology_recovery.go (resolved review thread)
return false, topologyRecovery, err

// check if we have received an ERS in progress, if we do, we should not continue with the recovery
if checkAndSetIfERSInProgress() {
Member


I see. Let's also talk about this in our sync. Curious if this is needed for GA.


// find the valid candidates for becoming the primary
// this is where we check for errant GTIDs and remove the tablets that have them from consideration
validCandidates, err = FindValidEmergencyReparentCandidates(statusMap, primaryStatusMap)
Member


In orchestrator, as part of finding replica candidates (chooseCandidateReplica), it looks like some replicas get removed by various checks. The ones in particular that I was trying to find here are the checks in CanReplicateFrom. Were those removed? Or am I missing them?

Member Author


Those checks were MySQL version and binlog format checks. They have been removed for now.

Member

rafael commented Oct 5, 2021

Bringing back from a conversation in Slack with Deepthi:

Some logic in the original VOrchestrator hasn't been ported over. This seems to be because:

  • It was decided it's no longer relevant.
  • It is relevant, but it was out of scope for this initial iteration.

Some examples of these changes are (or I couldn't find the equivalent in the new code):

  • Recovery from binlog server logic (PrimaryRecoveryBinlogServer).
  • Some sanity checks around IsSmallerMajorVersion.
  • Sanity checks defined in replica.CanReplicateFrom.

We should have a detailed accounting of how this new implementation deviates from the original one. @GuptaManan100, I think you did a great job documenting the new flow, but calling out explicitly what was left out will be super helpful. That way, people can have context on what to expect from VOrchestrator when comparing it to Orchestrator.

I think that it would be great to document:

  • The most relevant pieces of logic that were removed and why?
  • The most relevant pieces of logic removed that will be added in future versions.

@shlomi-noach
Contributor

> irrespective of whether the flag PreventCrossCellPromotion is set to true or false, the code would still prefer promoting a tablet from the same cell, so if there is a REPLICA tablet in the same cell then that is the one that will get elected. This logic lives in the identifyPrimaryCandidate function.
> After talking with @sougou, I have changed the behaviour of ERS to not prefer candidates from the same cell if preventCrossCellPromotion is set false. This will be much closer to the previous version of ERS where we choose the most advanced primary candidate without looking at the cell information at all.

The orchestrator way of dealing with this: use promotion rules. Yes, orchestrator generally prefers same-DC/zone servers, but then the user can choose to assign a neutral promotion rule for local servers, and a prefer promotion rule for remote servers, and that's really the way to tell orchestrator how to proceed. The flag PreventCrossCellPromotion is more of a kill-switch (I think I added it after the infamous 2019 outage).

In my opinion, setting PreventCrossCellPromotion = false should not mean "prefer cross cell promotion". It doesn't read like that, English-wise. I think this will confuse many people. If anything, add a flag named EnforceCrossCellPromotion. But again, I think the right way to go is promotion rules.
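For readers unfamiliar with orchestrator's promotion rules, a minimal illustration of rule-based tie-breaking between equally advanced candidates, reusing the hypothetical types sketched in the PR description above (not the actual ers_sorter.go logic):

// Lower CandidatePromotionRule values win the tie. Assigning PreferPromoteRule
// to remote servers and NeutralPromoteRule to local ones, as in the example
// above, steers promotion without needing a cross-cell kill-switch.
func betterCandidate(a, b *topodatapb.Tablet, policy durabilityPolicy) *topodatapb.Tablet {
	if policy.PromotionRule(a) <= policy.PromotionRule(b) {
		return a
	}
	return b
}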

@shlomi-noach
Contributor

@rafael PrimaryRecoveryBinlogServer is something I developed at, and for, Booking.com, and was literally never in actual use. Over the years there's been many dreams about new/different binlog server implementations users would use. These never came true, and the code is anyway irrelevant to vitess. This removal makes perfect sense.

Contributor

@shlomi-noach left a comment


I reviewed about half the changes, before my brain shut down 😛
I still need to review durability.go, emergency_reparenter.go, ers_sorter.go etc.; these actually seem to be the more critical changes 🤕

// Copy and throw out primary SID from consideration, so we don't mutate input.
otherSetNoPrimarySID := make(Mysql56GTIDSet, len(otherSet))
for sid, intervals := range otherSet {
if sid == status.SourceUUID {
continue
}
otherSetNoPrimarySID[sid] = intervals
}

otherSets = append(otherSets, otherSetNoPrimarySID)
otherSets = append(otherSets, otherSet)
Contributor


> where even if two servers had the exact same GTID set but had different sources set, we were flagging them both as having errant GTIDs.

This confuses me a bit. You only look for errant GTIDs in a descendant of a server; comparing siblings or cousins doesn't have the same strong check guarantees, the way I understand it.

But I'm unfamiliar with this code, I'm unsure how it's being used.

// That's it! We must do recovery!
// TODO(sougou): This function gets called by GracefulPrimaryTakeover which may
// need to obtain shard lock before getting here.
unlock, err := LockShard(analysisEntry.AnalyzedInstanceKey)
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is LockShard removed?

Member Author


LockShard is called inside the ERS code, so we should not call it outside.

Contributor


Cool, follow-up question, and I'm asking because I'm not sure how the flow works; basically we shouldn't even begin the operation if LockShard fails. My question is: since LockShard is called inside the ERS code, are there any steps taken before that point that we should avoid?

Member Author


Nope, there are no operations that require the shard to be locked before ERS is called.

Member Author


All those operations are for counters and for deciding whether we should run an operation at all in the first place. But yes, we should be doing a tablet refresh after the shard is locked. That problem is out of scope for this PR, though, and will be addressed before the GA release of vtorc.

go/vt/orchestrator/logic/topology_recovery.go (resolved review thread)
return fmt.Errorf("durability policy %v not found", name)
}
log.Infof("Durability setting: %v", name)
curDurabilityPolicy = newDurabilityCreationFunc(durabilityParams)
Contributor


Let's add a mutex now, so that if/when we eventually support dynamic durability changes, we are protected.
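A minimal sketch of the suggested mutex, assuming sync, fmt, and the Vitess log package are imported. curDurabilityPolicy, newDurabilityCreationFunc, and durabilityParams come from the quoted snippet; the durabilityPolicies registry map and the exact function signature are assumptions:

var (
	curDurabilityPolicyMu sync.Mutex
	curDurabilityPolicy   durabilityPolicy // whichever interface the package's policies implement
)

// SetDurabilityPolicy looks up a registered policy by name and installs it
// under the mutex, so a future dynamic-reload path stays race-free.
func SetDurabilityPolicy(name string, durabilityParams map[string]string) error {
	newDurabilityCreationFunc, found := durabilityPolicies[name]
	if !found {
		return fmt.Errorf("durability policy %v not found", name)
	}
	log.Infof("Durability setting: %v", name)

	curDurabilityPolicyMu.Lock()
	defer curDurabilityPolicyMu.Unlock()
	curDurabilityPolicy = newDurabilityCreationFunc(durabilityParams)
	return nil
}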

Member

rafael commented Oct 6, 2021

> @rafael PrimaryRecoveryBinlogServer is something I developed at, and for, Booking.com, and was literally never in actual use. Over the years there's been many dreams about new/different binlog server implementations users would use. These never came true, and the code is anyway irrelevant to vitess. This removal makes perfect sense.

This makes sense and I assumed it was something along those lines. We never used it either, so I have no worries about this being removed :D

I'm highlighting that there seem to be decisions to remove certain things (with good reasons), but they're not obvious to readers of the PR. Being very explicit about those seems like good context for folks trying to get involved in this work.

Member

rafael commented Oct 10, 2021

I took one more pass and I didn't find anything new that we haven't discussed already. It seems that there are still a few outstanding comments from Shlomi. From my perspective, once those are resolved, this is good to merge and keep iterating.

As discussed, if we can be super explicit on this it will be good context for other folks following this work.

From my notes on the last session, for the next iteration we will need to expand further on:

  • What to do about LockShard and decide if the current approach is good enough for GA?
  • It seems that some pieces of existing logic for errant GTIDs might be redundant (this is something @sougou called out).
  • The areas where the pluggable durability is not being used yet and some assumptions are being made.

@GuptaManan100
Member Author

@ajm188 could you take a look now that I have addressed all of your review comments?

@ajm188 dismissed their stale review October 11, 2021 16:45

all my blocking comments have been addressed

Contributor

@ajm188 left a comment


one more rename we should do before merging, but i've gotten myself out of the blocking path

go/vt/vtctl/reparentutil/durability.go (outdated; resolved review thread)
Signed-off-by: Manan Gupta <[email protected]>
Contributor

@shlomi-noach left a comment


I did not cover the entire set of changes. Even though I'm naturally very familiar with orchestrator code, I found this PR to be a bit overwhelming. Most of my review comments were based on the logic in some functions, but I lack the understanding of how everything is combined. I know there are recorded meetings I can watch later.
Notably, I didn't do a good review job on go/vt/vtctl/reparentutil/...

I do appreciate that there seems to be good testing added. The tests look legit and are very well commented.

return fmt.Errorf("durability policy %v not found", name)
}
log.Infof("Durability setting: %v", name)
curDurabilityPolicy = newDurabilityCreationFunc(durabilityParams)
Contributor


Throughout this file I only see use of Neutral and MustNot. I'd rather see Prefer and MustNot.
Logically, the two options are the same. Both Neutral and Prefer are good as candidates, and obviously "better" than MustNot.
However, Prefer is more indicative that "yes, this server is really a good one". In orchestrator code, when orchestrator sees a prefer server, it is able to cut short further investigation; whereas when orchestrator promotes a Neutral server, it proceeds to check "is there any server better than this?".

Non-blocking comment for your consideration.

Member

@rafael left a comment


Catching up with the latest updates. From my perspective, good to merge and keep iterating.

I think we should be super loud with the community when this makes it into the next release, so that they fully test ERS in their environments.

@GuptaManan100
Member Author

Yes, you are right @shlomi-noach, we are going to make a lot more changes to the durability policies and will also use prefer promote rules.

Labels
  • Component: VTorc (Vitess Orchestrator integration)
  • release notes (needs details): This PR needs to be listed in the release notes in a dedicated section (deprecation notice, etc...)
  • Type: Enhancement (Logical improvement, somewhere between a bug and feature)
Development

Successfully merging this pull request may close these issues.

EmergencyReparentShard promotes rdonly tablet to primary
6 participants