Skip to content
This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

chore: limit number of upgrade retries if new CP nodes bootstrap fails #4068

Merged
merged 9 commits into from
Dec 8, 2020

Conversation

jadarsie
Copy link
Member

@jadarsie jadarsie commented Nov 30, 2020

Reason for Change:

Most of the failures that cause aks-engine upgrade to fail are retryable. However bad or incompatible properties.orchestratorProfile.kubernetesConfig property values in your API model will prevent kubelet (or control plane components) from starting successfully. In this case, aks-engine upgrade assumes that the node was upgraded successfully because it only checks for the orchestrator tag.

This PR limits the number of upgrade operations if two upgraded CP nodes are not ready.

Notes:

  • Updated client-go to min supported orchestrator version (v1.16)
  • Target tools-install installs gomock

Credit Where Due:

Does this change contain code from or inspired by another project?

  • No
  • Yes

Requirements:

@codecov
Copy link

codecov bot commented Nov 30, 2020

Codecov Report

Merging #4068 (e01b207) into master (48f6762) will increase coverage by 0.04%.
The diff coverage is 86.79%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4068      +/-   ##
==========================================
+ Coverage   73.18%   73.23%   +0.04%     
==========================================
  Files         135      135              
  Lines       20579    20625      +46     
==========================================
+ Hits        15061    15104      +43     
- Misses       4545     4547       +2     
- Partials      973      974       +1     
Impacted Files Coverage Δ
pkg/armhelpers/mockclients.go 24.16% <0.00%> (-0.06%) ⬇️
...g/operations/kubernetesupgrade/upgradeagentnode.go 52.45% <0.00%> (ø)
.../operations/kubernetesupgrade/upgrademasternode.go 42.55% <0.00%> (ø)
pkg/operations/kubernetesupgrade/upgrader.go 62.13% <ø> (-0.03%) ⬇️
pkg/operations/kubernetesupgrade/upgradecluster.go 81.11% <93.87%> (+2.73%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 48f6762...e01b207. Read the comment docs.

@jadarsie jadarsie requested a review from jackfrancis November 30, 2020 18:00
@jadarsie jadarsie requested a review from jackfrancis November 30, 2020 18:38
@jadarsie jadarsie requested a review from jackfrancis December 2, 2020 00:04
@@ -21,6 +22,8 @@ type Client interface {
ListAllPods() (*v1.PodList, error)
// ListNodes returns a list of Nodes registered in the api server.
ListNodes() (*v1.NodeList, error)
// ListNodesByOptions returns a list of Nodes registered in the api server.
ListNodesByOptions(opts metav1.ListOptions) (*v1.NodeList, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changing this publicly exportable interface definition is dangerous FYI

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding methods should be fine, right?

return uc.Translator.Errorf("Error while querying ARM for resources: %+v", err)
}

if kubeClient != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this nil check? We should be confident that a non-err response from L100 above will guarantee that k is not nil, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're not confident then I'd just add that nil check to the err condition:

if err != nil || k == nil {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only reason is to pass test Should not fail if a Kubernetes client cannot be created, do not know the history so I am just complying

@@ -1071,3 +1076,237 @@ var _ = Describe("Upgrade Kubernetes cluster tests", func() {
Expect(len(newNode.Spec.Taints)).To(Equal(2))
})
})

func TestCheckControlPlaneNodesStatus(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

}
// return error if more than 1 upgraded node is not ready
// do not bother if upgradedNotReadyCount == mastersCount, at that point upgrade cannot take down more nodes
if upgradedNotReadyCount > 1 && upgradedNotReadyCount < mastersCount {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the extra && condition here? Shouldn't we always return error if there are more than one not ready control plane VM?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, I thought I had a good reason but on a second thought it is actually not

@jadarsie jadarsie requested a review from jackfrancis December 3, 2020 17:10
Copy link
Member

@jackfrancis jackfrancis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@acs-bot
Copy link

acs-bot commented Dec 8, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, jadarsie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [jackfrancis,jadarsie]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jadarsie jadarsie merged commit 5ae841d into Azure:master Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants