-
Notifications
You must be signed in to change notification settings - Fork 502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: can't recover after a upgrade failed #120
Conversation
…en/tidb-operator into tidb-graceful-upgrade
/run-e2e-tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is no need to forceUpgrade
:
The gracefulUpgrade
can cover this all the situations.
pkg/manager/member/pd_upgrader.go
Outdated
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s pd upgraded pod: [%s] are not ready", ns, tcName, podName) | ||
} | ||
continue | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove else
if tc.Status.PD.StatefulSet.CurrentReplicas == 0 { | ||
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s pd doesn't have old version pod to upgrade", ns, tcName) | ||
if !templateEqual(newSet.Spec.Template, oldSet.Spec.Template) { | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set Partition
to the max
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And add a comment to indicate the reason why return nil
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the Partition default is max
if !tc.PDAllPodsStarted() { | ||
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s pd pods are not all created", ns, tcName) | ||
} | ||
setUpgradePartition(newSet, *oldSet.Spec.UpdateStrategy.RollingUpdate.Partition) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line should be added to the else
condition of:
https://github.com/pingcap/tidb-operator/pull/120/files#diff-1f146ae94e5d78830633639d134919cdR69
pkg/manager/member/pd_upgrader.go
Outdated
return err | ||
} | ||
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s pd member: [%s] is transferring leader to pd member: [%s]", ns, tcName, upgradePodName, targetName) | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove else
pkg/manager/member/pd_upgrader.go
Outdated
upgradePodName := pdPodName(tcName, ordinal) | ||
if tc.Status.PD.Leader.Name == upgradePodName { | ||
var targetName string | ||
if ordinal == *newSet.Spec.Replicas-1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
define *newSet.Spec.Replicas-1
as a var:
lastOrdinal = *newSet.Spec.Replicas-1
/run-e2e-tests |
c60c0d1
to
24524f2
Compare
/run-e2e-tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
podName := tikvPodName(tcName, i-1) | ||
setUpgradePartition(newSet, *oldSet.Spec.UpdateStrategy.RollingUpdate.Partition) | ||
for i := tc.Status.TiKV.StatefulSet.Replicas - 1; i >= 0; i-- { | ||
store := tku.getStoreByOrdinal(tc, i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The getStoreByOrdinal
method should not range over tc.Status.TiKV.TombstoneStores
. The reason is: #124
/run-e2e-tests |
pkg/manager/member/tikv_upgrader.go
Outdated
@@ -84,11 +84,16 @@ func (tku *tikvUpgrader) Upgrade(tc *v1alpha1.TidbCluster, oldSet *apps.Stateful | |||
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s tikv pod: [%s] have not label: %s", ns, tcName, podName, apps.ControllerRevisionHashLabelKey) | |||
} | |||
|
|||
if store == nil { | |||
setUpgradePartition(newSet, i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why set partition to i
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should return an error if store == nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/run-e2e-tests |
pkg/manager/member/pd_upgrader.go
Outdated
targetName = pdPodName(tcName, *newSet.Spec.Replicas-1) | ||
revision, exist := pod.Labels[apps.ControllerRevisionHashLabelKey] | ||
if !exist { | ||
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s pd pod: [%s] have not label: %s", ns, tcName, podName, apps.ControllerRevisionHashLabelKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/have not/has no/
pkg/manager/member/pd_upgrader.go
Outdated
return nil | ||
} | ||
|
||
func (pu *pdUpgrader) pdStatefulSetIsUpgrading(set *apps.StatefulSet, tc *v1alpha1.TidbCluster) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is unused and duplicated with the method in pd_member_manager.go
@@ -106,7 +129,7 @@ func TestPDUpgraderUpgrade(t *testing.T) { | |||
changePods: nil, | |||
transferLeaderErr: false, | |||
errExpectFn: func(g *GomegaWithT, err error) { | |||
g.Expect(err.Error()).To(Equal("tidbcluster: [default/upgrader]'s pd upgraded pods are not all ready")) | |||
g.Expect(err.Error()).To(Equal(fmt.Sprintf("tidbcluster: [default/upgrader]'s pd upgraded pod: [%s] are not ready", pdPodName(upgradeTcName, 2)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/are/is/
pkg/manager/member/tidb_upgrader.go
Outdated
} | ||
revision, exist := pod.Labels[apps.ControllerRevisionHashLabelKey] | ||
if !exist { | ||
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s tidb pod: [%s] have not label: %s", ns, tcName, podName, apps.ControllerRevisionHashLabelKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/have not/has no/
pkg/manager/member/tikv_upgrader.go
Outdated
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s upgraded tikv pods are not all running", ns, tcName) | ||
revision, exist := pod.Labels[apps.ControllerRevisionHashLabelKey] | ||
if !exist { | ||
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s tikv pod: [%s] have not label: %s", ns, tcName, podName, apps.ControllerRevisionHashLabelKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/have not/has no/
pkg/manager/member/tikv_upgrader.go
Outdated
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s tikv pod: [%s] have not label: %s", ns, tcName, podName, apps.ControllerRevisionHashLabelKey) | ||
} | ||
|
||
if store == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this to after line 76 to avoid retrieving pod when store is nil.
pkg/manager/member/tikv_upgrader.go
Outdated
} | ||
} | ||
|
||
return nil | ||
return controller.RequeueErrorf("tidbcluster: [%s/%s] have not find store status of tikv pod: [%s]", ns, tcName, upgradePodName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"no store status found for tikv pod"
c0c361a
to
8ea29e7
Compare
/run-e2e-tests |
1 similar comment
/run-e2e-tests |
/run-e2e-tests |
/run-e2e-tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/run-e2e-tests |
/run-e2e-tests |
* tidb graceful upgrade * fix upgrade state sync bug
Co-authored-by: Lilian Lee <[email protected]> Co-authored-by: pingcap-github-bot <[email protected]>
this pr fixes #119 issue.
In order to fix the issue and make codes more reasonable, this pr rewrited upgrade feature and changed some unit test case, so the code has changed a lot, but in principle it should only affect the upgrade.
btw the pr should be merge after pr #112
This PR also fixes #134
@weekface @tennix @onlymellb @gregwebs PTAL