
[Digital Ocean] Implement KOPS validate cluster #9476

Merged
merged 6 commits into from
Jul 15, 2020

Conversation

srikiz
Contributor

@srikiz srikiz commented Jul 2, 2020

Implementation of KOPS validate cluster for Digital Ocean KOPS provider.
Tested with ./kops validate cluster and it returns the state of your KOPS cluster as expected.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 2, 2020
@k8s-ci-robot
Contributor

Hi @srikiz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 2, 2020
@k8s-ci-robot k8s-ci-robot added the area/provider/digitalocean Issues or PRs related to digitalocean provider label Jul 2, 2020
@hakman
Member

hakman commented Jul 2, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 2, 2020
@hakman
Member

hakman commented Jul 2, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 2, 2020
@rifelpet
Member

rifelpet commented Jul 2, 2020

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rifelpet, srikiz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 2, 2020
for _, member := range g.Members {

// DO doesn't have a notion of Auto Scaling Group - use the same config Type for both current config and new config.
err := cg.NewCloudInstanceGroupMember(member, g.GroupType, g.GroupType, nodeMap)
Member

This means DO won't be able to know when to do a rolling update. All rolling updates would have to use --force.

@rifelpet (Member) commented Jul 3, 2020

Looking at the DO Cloud, it doesn't (yet) appear to support deleting instances, so I'm guessing rolling updates don't work at all. Perhaps @srikiz can confirm?

// DeleteInstance is not implemented yet; this func needs to delete a DO instance.
func (c *Cloud) DeleteInstance(i *cloudinstances.CloudInstanceGroupMember) error {
klog.V(8).Info("digitalocean cloud provider DeleteInstance not implemented yet")
return fmt.Errorf("digital ocean cloud provider does not support deleting cloud instances at this time")
}
// DetachInstance is not implemented yet. It needs to cause a cloud instance to no longer be counted against the group's size limits.
func (c *Cloud) DetachInstance(i *cloudinstances.CloudInstanceGroupMember) error {
klog.V(8).Info("digitalocean cloud provider DetachInstance not implemented yet")
return fmt.Errorf("digital ocean cloud provider does not support surging")
}

Contributor Author

yes @rifelpet - that is something I want to check next.

Member

Yeah, looks like DO doesn't support rolling updates. So that's not a blocker for this PR.

The comment is a bit off-point: the second and third parameters just need to be unequal whenever any field of the godo.DropletCreateRequest used to create the instance is different from what would be used to create a new instance. This can be implemented as either a generation id or a hash of the fields.
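The hash-of-the-fields approach suggested here can be sketched roughly as below. This is an illustrative toy, not the actual kops implementation: `dropletConfig` is a hypothetical stand-in for the subset of `godo.DropletCreateRequest` fields that matter, and the hash format is arbitrary.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// dropletConfig stands in for the godo.DropletCreateRequest fields
// that affect an instance (hypothetical subset for illustration).
type dropletConfig struct {
	Size     string
	Image    string
	Region   string
	UserData string
}

// configHash returns a short, stable hash of the config. Storing this
// hash on a droplet at creation time and comparing it against the hash
// of the current spec tells rolling update whether the instance is out
// of date: equal hashes mean up to date, unequal hashes mean replace.
func configHash(c dropletConfig) string {
	sum := sha256.Sum256([]byte(c.Size + "|" + c.Image + "|" + c.Region + "|" + c.UserData))
	return fmt.Sprintf("%x", sum[:8])
}

func main() {
	current := dropletConfig{Size: "s-2vcpu-4gb", Image: "ubuntu-20-04-x64", Region: "tor1"}
	proposed := current
	proposed.Size = "s-4vcpu-8gb" // admin changed the machine type

	// Same spec hashes equal; changed spec hashes differ.
	fmt.Println(configHash(current) == configHash(current))  // true
	fmt.Println(configHash(current) == configHash(proposed)) // false
}
```

Passing two such hashes as the second and third parameters of `NewCloudInstanceGroupMember` would let kops mark instances as needing update without DO having a real ASG concept.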

Contributor Author

sure, thanks @johngmyers - I'll keep a note of this when I implement rolling update feature.

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@@ -38,6 +39,8 @@ import (

const TagKubernetesClusterIndex = "k8s-index"
const TagKubernetesClusterNamePrefix = "KubernetesCluster"
const TagKubernetesInstanceGroup = "kops-instancegroup"
const DOInstanceGroupConfig = "do-instancegroup-config"
Member

Where is this referenced?

Contributor Author

thanks, removed DOInstanceGroupConfig

@hakman
Member

hakman commented Jul 3, 2020

Let's wait a bit before merging this.
/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 3, 2020
@@ -38,6 +39,8 @@ import (

const TagKubernetesClusterIndex = "k8s-index"
const TagKubernetesClusterNamePrefix = "KubernetesCluster"
const TagKubernetesInstanceGroup = "kops-instancegroup"
Member

Why have a duplicate of do.TagKubernetesClusterInstanceGroupPrefix?

Contributor Author

@johngmyers - I think this needs refactoring. I tried to use it and it looks like we have a cyclic dependency here. Ideally, I should have kept this logic (and a few other methods) in the do package, but that's a little more work than I thought.
Would it be okay if we get this merged and I take up the cleanup activity as a separate PR? Please suggest.

Member

Perhaps move TagKubernetesClusterInstanceGroupPrefix here?

Contributor Author

I removed TagKubernetesClusterInstanceGroupPrefix and instead used the above const.

Member

Resolved

@johngmyers
Member

The DO code is a bit oversimplified, apparently because DO doesn't support autoscaling groups. But kops doesn't care (much) about the autoscaling aspect of autoscaling groups.

The existing DropletBuilder tags all instances that don't have master role as being in the "nodes" instance group. This ignores the possibility of the admin creating additional instance groups and causes this code to always identify non-master nodes as being in the "nodes" instance group, regardless of which instance group actually created them.

For validation, GetCloudGroups needs to identify:

  • The instance groups deployed (from whatever ASG-equivalent concept the provider code maps instance groups to)
  • The target size of the ASG-equivalent
  • The extant instances, and for each:
    ** Which instance group caused the instance to be created
    ** Which k8s node the instance corresponds to, if the instance has joined the cluster

For rolling update, it needs to identify:
** Whether the instance was created with the version of the spec that is the most current for the ASG-equivalent
** Whether the instance is detached (if the provider supports surging)

For rolling update, it also needs DeleteInstance to terminate the instance and cause a replacement instance to be created. (Except, if it supports surging, to not replace the instance if it was previously detached.)
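The validation requirements above (identify deployed groups, their target sizes, and which group created each extant instance) can be sketched as a toy grouping step. Everything here is illustrative: `droplet` is a hypothetical struct, not the godo type, and the `kops-instancegroup:<name>` tag format follows this PR's discussion rather than a documented DO convention.

```go
package main

import (
	"fmt"
	"strings"
)

// droplet is a hypothetical stand-in for a DO droplet: a name plus
// the flat string tags DO attaches to instances.
type droplet struct {
	Name string
	Tags []string
}

// instanceGroupFromTags extracts the owning instance group name from a
// droplet's tags, or "" if the tag is absent.
func instanceGroupFromTags(tags []string) string {
	for _, t := range tags {
		if strings.HasPrefix(t, "kops-instancegroup:") {
			return strings.TrimPrefix(t, "kops-instancegroup:")
		}
	}
	return ""
}

// groupDroplets buckets extant droplets by instance group, so validation
// can compare each group's actual size against its target size and flag
// groups with missing instances or instances with no matching group.
func groupDroplets(droplets []droplet) map[string][]droplet {
	groups := map[string][]droplet{}
	for _, d := range droplets {
		if ig := instanceGroupFromTags(d.Tags); ig != "" {
			groups[ig] = append(groups[ig], d)
		}
	}
	return groups
}

func main() {
	ds := []droplet{
		{Name: "master-tor1-1", Tags: []string{"kops-instancegroup:master-tor1-1"}},
		{Name: "nodes-1", Tags: []string{"kops-instancegroup:nodes"}},
		{Name: "nodes-2", Tags: []string{"kops-instancegroup:nodes"}},
	}
	g := groupDroplets(ds)
	fmt.Println(len(g["master-tor1-1"]), len(g["nodes"])) // 1 2
}
```

A real GetCloudGroups would then match each bucketed instance to its k8s node and emit one CloudInstanceGroup per extant InstanceGroup.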

@johngmyers
Member

To further clarify what I wrote, validation will throw failures if there aren't enough instances in a group to cover what that group reports as its target size or if any instances don't have a corresponding node (haven't joined the cluster). Additionally it will throw a failure if GetCloudGroups fails to return a group for an extant InstanceGroup.

Rolling update needs GetCloudGroups to identify which instances are out of date, which NewCloudInstanceGroupMember determines by whether the second and third parameters compare equal. It also needs to be able to terminate instances. It needs validation to eventually succeed after termination, which means that after termination the group needs to eventually have enough instances to match its reported target size.

@srikiz
Contributor Author

srikiz commented Jul 3, 2020

To further clarify what I wrote, validation will throw failures if there aren't enough instances in a group to cover what that group reports as its target size or if any instances don't have a corresponding node (haven't joined the cluster). Additionally it will throw a failure if GetCloudGroups fails to return a group for an extant InstanceGroup.

Rolling update needs GetCloudGroups to identify which instances are out of date, which NewCloudInstanceGroupMember determines by whether the second and third parameters compare equal. It also needs to be able to terminate instances. It needs validation to eventually succeed after termination, which means that after termination the group needs to eventually have enough instances to match its reported target size.

  • Thanks, this helps. I need to discuss with @timoreimann whether we can support rolling update for DO. I'll be doing that next week. In the meantime, I would like to get this current PR going, if you think it's okay. Below is the validate cluster response -

Validating cluster dev5.k8s.local

INSTANCE GROUPS
NAME ROLE MACHINETYPE MIN MAX SUBNETS
master-tor1-1 Master s-2vcpu-4gb 1 1 tor1
master-tor1-2 Master s-2vcpu-4gb 1 1 tor1
master-tor1-3 Master s-2vcpu-4gb 1 1 tor1
nodes Node s-2vcpu-4gb 2 2 tor1

NODE STATUS
NAME ROLE READY
10.137.176.248 master True
10.137.240.102 master True
10.137.240.248 master True
10.137.240.3 node True
10.137.8.136 node True

Your cluster dev5.k8s.local is ready


for _, member := range g.Members {

// DO doesn't have a notion of Auto Scaling Group - use the same config Type for both current config and new config.
Member

Suggested change
// DO doesn't have a notion of Auto Scaling Group - use the same config Type for both current config and new config.
// TODO use a hash of the godo.DropletCreateRequest fields for second and third parameters.

Contributor Author

sure, updated.

@johngmyers
Member

@srikiz Rolling update support does not block this PR. In my mind the thing that blocks this PR is its identifying all non-master instances as being in the "nodes" instance group. The droplet creation code should tag instances with the name of the instance group that created them so this code can assign them to the right group.

The code for assigning the group of the masters seems unnecessarily complex. If there's a reason the masters need to have special case string manipulation, please explain.

@johngmyers
Member

I believe there is scope to share code between this and dotasks.Droplet.Find(). dotasks.Droplet is really a DropletGroup.

@srikiz
Contributor Author

srikiz commented Jul 7, 2020

@srikiz Rolling update support does not block this PR. In my mind the thing that blocks this PR is its identifying all non-master instances as being in the "nodes" instance group. The droplet creation code should tag instances with the name of the instance group that created them so this code can assign them to the right group.

The code for assigning the group of the masters seems unnecessarily complex. If there's a reason the masters need to have special case string manipulation, please explain.

Thanks for the inputs - I updated the code to also check for an instance group called "nodes" if it doesn't match the master. While creating the droplet, we currently tag the master with 3 different tags.

  • kubernetescluster: {clusterName}
  • kops-instancegroup: master-{region}
  • k8s-index: 1 (will always be 1 for a single-master cluster; for a multi-master cluster it can be k8s-index:2 or k8s-index:3)

The instance group itself is a combination of kops-instancegroup and the k8s-index. I will keep a note to find out if I could update kops-instancegroup itself to the correct value (instead of concatenating with k8s-index).

For the worker nodes we have the below tags.

  • kubernetescluster: {clusterName}
  • kops-instancegroup: nodes

I updated the code to also check whether the matching "kops-instancegroup: nodes" tag is available on the droplet.
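The master tagging scheme described above can be sketched as a small helper that rebuilds the group name from the tags. This is a toy under stated assumptions: `masterGroupName` is a hypothetical function, and the `"-"` separator between kops-instancegroup and k8s-index is inferred from the group names in the validate output (e.g. `master-tor1-1`), not from the actual kops code.

```go
package main

import "fmt"

// masterGroupName reconstructs the instance group name for a master
// droplet by concatenating the kops-instancegroup tag with the
// k8s-index tag, as described in the comment above.
// The "-" separator is an assumption for illustration.
func masterGroupName(tags map[string]string) string {
	return tags["kops-instancegroup"] + "-" + tags["k8s-index"]
}

func main() {
	// Tags for the second master in a multi-master cluster.
	tags := map[string]string{
		"kubernetescluster":  "dev5.k8s.local",
		"kops-instancegroup": "master-tor1",
		"k8s-index":          "2",
	}
	fmt.Println(masterGroupName(tags)) // master-tor1-2
}
```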

@johngmyers
Member

Why don't you put the instancegroup's ObjectMeta.Name in kops-instancegroup?

It looks like the code will only return at most one CloudInstanceGroup of non-master type. That means cluster validation will always fail if there are two or more non-master instance groups.

@srikiz
Contributor Author

srikiz commented Jul 7, 2020

Why don't you put the instancegroup's ObjectMeta.Name in kops-instancegroup?

It looks like the code will only return at most one CloudInstanceGroup of non-master type. That means cluster validation will always fail if there are two or more non-master instance groups.

I don't know if there is a way in DO to add more non-master instance groups. DO doesn't support any new instance group creation. I am currently managing with the help of tags. I tag droplets that act as worker nodes with the "kops-instancegroup:nodes" tag. If there is more than one DO KOPS cluster, I currently retrieve all droplets that match the KOPS cluster name here - https://github.com/kubernetes/kops/pull/9476/files#diff-09d5ce1959bcc4dffcb0d3d88a1ffb55R332

I believe DO currently doesn't have support for adding additional instance groups.

@srikiz
Contributor Author

srikiz commented Jul 8, 2020

I believe there is scope to share code between this and dotasks.Droplet.Find(). dotasks.Droplet is really a DropletGroup.

Yeah. I moved the consts from the current package (upup/pkg/fi/cloudup/do) into pkg/resources/digitalocean. I had to modify droplets.go, api_loadbalancer.go, master_volumes.go, etc. I thought it would be safer to make this change as a separate PR. Hope that's okay.

@johngmyers
Member

dotasks.Droplet is essentially a (non-automatic) scaling group. Its RenderDO() handles instance groups by creating enough droplets to satisfy the instancegroup's expected count. I don't see anything that would prevent Droplet from creating droplets for multiple worker instance groups.

Droplet puts an ASG name, which is derived from the instancegroup and cluster names, into the Name field of each droplet it creates. This can be used to determine which instancegroup a given droplet belongs to. Droplet.Find() appears to essentially do this, though it filters for the instancegroup it was instantiated for.
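The name-based lookup described here can be sketched as follows. The `<instancegroup>.<cluster>` layout is an assumption made for illustration; the actual name derivation in dotasks.Droplet may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// igFromDropletName recovers the instance group name from a droplet's
// Name field, assuming (hypothetically) that names have the form
// "<instancegroup>.<clusterName>" as implied by the discussion above.
// It reports ok=false when the name doesn't belong to this cluster.
func igFromDropletName(name, clusterName string) (ig string, ok bool) {
	suffix := "." + clusterName
	if !strings.HasSuffix(name, suffix) {
		return "", false
	}
	return strings.TrimSuffix(name, suffix), true
}

func main() {
	ig, ok := igFromDropletName("nodes.dev5.k8s.local", "dev5.k8s.local")
	fmt.Println(ig, ok) // nodes true

	// A droplet from another cluster is filtered out.
	_, ok = igFromDropletName("nodes.other.k8s.local", "dev5.k8s.local")
	fmt.Println(ok) // false
}
```

Unlike the tag-based approach, this reuses the name Droplet already writes, which is why it could be shared with Droplet.Find().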

@srikiz
Contributor Author

srikiz commented Jul 9, 2020

dotasks.Droplet is essentially a (non-automatic) scaling group. Its RenderDO() handles instance groups by creating enough droplets to satisfy the instancegroup's expected count. I don't see anything that would prevent Droplet from creating droplets for multiple worker instance groups.

Droplet puts an ASG name, which is derived from the instancegroup and cluster names, into the Name field of each droplet it creates. This can be used to determine which instancegroup a given droplet belongs to. Droplet.Find() appears to essentially do this, though it filters for the instancegroup it was instantiated for.

I see, got it. I will put Objectmeta.Name to the kops-instancegroup tag and verify it. Will update the PR shortly.

@srikiz
Contributor Author

srikiz commented Jul 14, 2020

@johngmyers - I have incorporated your comments - that really helped!
Please do have a look when you get a chance. Thanks!

@hakman
Member

hakman commented Jul 14, 2020

/retest

@srikiz
Contributor Author

srikiz commented Jul 14, 2020

/retest

@johngmyers (Member) left a comment

Much better. I'm putting on a hold in case you want to use the standard type for GroupType. Feel free to cancel the hold if you want it to go in as-is.
/hold
/lgtm

type DOInstanceGroup struct {
ClusterName string
InstanceGroupName string
GroupType string // will be either "master" or "worker"
Member

Use pkg/apis/kops/InstanceGroupRole as the type?

Contributor Author

I'll keep a note to update this when I touch the rolling update feature. I'll add a new tag when I create a droplet that holds this information, so I can extract it from the droplet tag.
Thanks for your help in reviewing this @johngmyers !

Member

You can get the group role from the instancegroup spec of the ig that was in the tag.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 14, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 14, 2020
@srikiz
Contributor Author

srikiz commented Jul 15, 2020

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 15, 2020
@k8s-ci-robot k8s-ci-robot merged commit 0658248 into kubernetes:master Jul 15, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jul 15, 2020
@srikiz srikiz deleted the DO-implement-validate-cluster branch October 25, 2021 17:38