
Add support to rke2 and k3s #737

Merged · 6 commits into rancher:master from the rke2v2 branch · Sep 17, 2021

Conversation

rawmind0
Contributor

@rawmind0 rawmind0 commented Sep 2, 2021

Requires #711

@rawmind0 rawmind0 mentioned this pull request Sep 2, 2021
@rawmind0 rawmind0 self-assigned this Sep 6, 2021
@rawmind0 rawmind0 force-pushed the rke2v2 branch 5 times, most recently from ffcae95 to f550cc1 on September 10, 2021 11:24
@rawmind0 rawmind0 force-pushed the rke2v2 branch 2 times, most recently from b79d74f to 7097e25 on September 14, 2021 12:07
@revog

revog commented Sep 14, 2021

@rawmind0
I've been following your development of this Terraform provider for a few days. Since a pre-built provider is not yet available for download, I built it on my own.
After creating the RKE2 cluster in Rancher with TF Provider I get the following error:

│ Error: Setting cluster V2 legacy data: Bad response statusCode [503]. Status [503 Service Unavailable]. Body: [baseType=error, code=ClusterUnavailable, message=ClusterUnavailable 503: cluster not found] from [https://rancher.css.ch/v3/clusters/c-m-b9sw227w?action=generateKubeconfig]
│
│   with module.cluster.rancher2_cluster_v2.create,
│   on modules/cluster/create.tf line 1, in resource "rancher2_cluster_v2" "create":
│    1: resource "rancher2_cluster_v2" "create" {

Same error occurs also when trying to generate the Kubeconfig at https://rancher/v3/clusters:

{
"baseType": "error",
"code": "ClusterUnavailable",
"message": "ClusterUnavailable 503: cluster not found",
"status": 503,
"type": "error"
}

This error makes it impossible to run the registration node_commands on the nodes with Terraform, as its execution gets aborted.
After registering one node by hand, the TF plan runs fine. My assumption is that the kubeconfig only becomes available for download after that.

@rawmind0
Contributor Author

@revog, thanks for the feedback; that's still WIP

@revog

revog commented Sep 14, 2021

@revog, thanks for the feedback; that's still WIP

I know ;-) - just wanted to let you know
btw - currently I'm testing a fix

@rawmind0
Contributor Author

rawmind0 commented Sep 14, 2021

Anyway, I didn't get this error at any point in my tests. Are you getting it consistently? It seems like something related to Rancher API availability (503 Service Unavailable). Also, cluster V2 has been added to the acceptance tests and it's working fine.

@rawmind0 rawmind0 changed the title from "[WIP] Add support to rke2 and k3s" to "Add support to rke2 and k3s" Sep 14, 2021
@rawmind0
Contributor Author

Tested k3s and rke2 cluster deployments for custom and amazonec2:

  • Custom:
# Create a new rancher v2 RKE2 custom Cluster v2
resource "rancher2_cluster_v2" "foo" {
  name = "foo"
  fleet_namespace = "fleet-ns"
  kubernetes_version = "v1.21.4+rke2r2"
  enable_network_policy = false
  default_cluster_role_for_project_members = "user"
}

# Create a new rancher v2 K3S custom Cluster v2
resource "rancher2_cluster_v2" "foo" {
  name = "foo"
  fleet_namespace = "fleet-ns"
  kubernetes_version = "v1.21.4+k3s1"
  enable_network_policy = false
  default_cluster_role_for_project_members = "user"
}
  • amazonec2, defining chart_values:
# Create amazonec2 cloud credential
resource "rancher2_cloud_credential" "foo" {
  name = "foo"
  amazonec2_credential_config {
    access_key = "<ACCESS_KEY>"
    secret_key = "<SECRET_KEY>"
  }
}

# Create amazonec2 machine config v2
resource "rancher2_machine_config_v2" "foo" {
  generate_name = "test-foo"
  amazonec2_config {
    ami = "<AMI_ID>"
    region = "<REGION>"
    security_group = ["<AWS_SG>"]
    subnet_id = "<SUBNET_ID>"
    vpc_id = "<VPC_ID>"
    zone = "<ZONE>"
  }
}

resource "rancher2_cluster_v2" "foo" {
  name = "foo"
  kubernetes_version = "v1.21.4+k3s1"
  enable_network_policy = false
  rke_config {
    machine_pools {
      name = "pool1"
      cloud_credential_secret_name = rancher2_cloud_credential.foo.id
      control_plane_role = true
      etcd_role = true
      worker_role = true
      quantity = 1
      machine_config {
        kind = rancher2_machine_config_v2.foo.kind
        name = rancher2_machine_config_v2.foo.name
      }
    }
    machine_global_config = {
      cni = "calico"
      disable-kube-proxy = false
      etcd-expose-metrics = false
    }
    upgrade_strategy {
      control_plane_concurrency = "10%"
      worker_concurrency = "10%"
    }
    etcd {
      snapshot_schedule_cron = "0 */5 * * *"
      snapshot_retention = 5
    }
    chart_values = <<EOF
rke2-calico:
  calicoctl:
    image: rancher/mirrored-calico-ctl
    tag: v3.19.2
  certs:
    node:
      cert: null
      commonName: null
      key: null
    typha:
      caBundle: null
      cert: null
      commonName: null
      key: null
  felixConfiguration:
    featureDetectOverride: ChecksumOffloadBroken=true
  global:
    systemDefaultRegistry: ""
  imagePullSecrets: {}
  installation:
    calicoNetwork:
      bgp: Disabled
      ipPools:
      - blockSize: 24
        cidr: 10.42.0.0/16
        encapsulation: VXLAN
        natOutgoing: Enabled
    controlPlaneTolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/control-plane
      operator: Exists
    - effect: NoExecute
      key: node-role.kubernetes.io/etcd
      operator: Exists
    enabled: true
    imagePath: rancher
    imagePrefix: mirrored-calico-
    kubernetesProvider: ""
  ipamConfig:
    autoAllocateBlocks: true
    strictAffinity: true
  tigeraOperator:
    image: rancher/mirrored-calico-operator
    registry: docker.io
    version: v1.17.6
EOF
  }
}

The amazonec2, azure, digitalocean, linode, openstack, and vsphere cloud providers are already supported for machine config and cluster V2.

@rawmind0 rawmind0 requested a review from a team September 14, 2021 18:02
@revog

revog commented Sep 15, 2021

Anyway, I didn't get this error at any point in my tests. Are you getting it consistently? It seems like something related to Rancher API availability (503 Service Unavailable). Also, cluster V2 has been added to the acceptance tests and it's working fine.

Yes, we get this error all the time. The main difference from your deployments is that we are running Rancher and RKE2 on-premise, NOT in the cloud. I could imagine the behavior differs between these two contexts?

We temporarily fixed it in file resource_rancher2_cluster_v2.go (line 288) to avoid any thrown errors. This way the kube_config attribute will be set to "" until a kubeconfig file is generated and downloadable (after bootstrapping of the first node).

if err != nil {
	// Workaround: if the kubeconfig cannot be generated yet (e.g. a 503
	// before the first node is bootstrapped), leave kube_config empty
	// instead of failing the whole read.
	//return fmt.Errorf("Setting cluster V2 legacy data: %v", err)
	d.Set("kube_config", "")
	return nil
}
d.Set("kube_config", kubeConfig.Config)

@rawmind0
Contributor Author

Yes, we get this error all the time. The main difference from your deployments is that we are running Rancher and RKE2 on-premise, NOT in the cloud. I could imagine the behavior differs between these two contexts?

I've tested both scenarios, on-premise and in the cloud, and never got this error. It's likely related to a race condition. The same issue has been reported in #742.

We temporarily fixed it in file resource_rancher2_cluster_v2.go (line 288) to avoid any thrown errors. This way the kube_config attribute will be set to "" until a kubeconfig file is generated and downloadable (after bootstrapping of the first node).

Thanks for the fix proposal, but I think a retry would fit better, as with this approach the kube_config may be left empty. Working on a fix along these lines.

@rawmind0
Contributor Author

@revog, updated the PR adding a fix for the kube-config issue: https://github.com/rancher/terraform-provider-rancher2/pull/737/files#diff-035d1bb527d3ce1fde79a8bf6a191e1c441f7e2074f17caa106ba35a66fefa0bR576

Could you please test it, as I'm not able to reproduce it? Thanks!

@revog

revog commented Sep 15, 2021

@revog, updated the PR adding a fix for the kube-config issue: https://github.com/rancher/terraform-provider-rancher2/pull/737/files#diff-035d1bb527d3ce1fde79a8bf6a191e1c441f7e2074f17caa106ba35a66fefa0bR576

Could you please test it, as I'm not able to reproduce it? Thanks!

Great, with the fix the error is gone. Thank you!

@revog

revog commented Sep 15, 2021

Hi @rawmind0,
I found another issue regarding elements in the machine_global_config attribute.

Error: Incorrect attribute value type
│
│   on modules/rke2/create.tf line 17, in resource "rancher2_cluster_v2" "create":
│   17:     machine_global_config = {
│   18:       cluster-cidr = "10.220.0.0/16"
│   19:       service-cidr = "10.221.0.0/16"
│   20:       cni = "cilium" # MAKE VAR
│   21:       profile = "cis-1.6" # MAKE VAR
│   22:       disable = "rke2-ingress-nginx"
│   23:       kube-apiserver-arg = [
│   24:         "anonymous-auth=true",
│   25:         "authentication-token-webhook-config-file=/var/lib/rancher/rke2/kube-api-authn-webhook.yaml"
│   26:       ]
│   27: #      kube-controller-manager-arg = ""
│   28: #      kube-scheduler-arg = ""
│   29: #      tls-san = "rke-${local.stage}-api.css.ch"
│   30:     }
│
│ Inappropriate value for attribute "machine_global_config": element "kube-apiserver-arg": string required.

What's the correct way to pass custom parameters to the K8s components like kube-apiserver, kube-controller-manager, or kube-scheduler?
The same error also applies to the disable and tls-san elements. When I enter the value (as desired) as a string, the string gets split in the Rancher web UI and every character of the string is put in its own "field".

[Screenshot: the tls-san string split into per-character entries in the Rancher web UI]

@rawmind0
Contributor Author

Hi @revog, yes, this is expected, as the tf map type is a map[string]string instead of a map[string]interface{}, so it doesn't support arrays, etc. What can be done here is to treat this as a YAML string instead of a tf map, as is done for the chart_values argument.

@rawmind0
Contributor Author

Updated the PR to treat the machine_global_config argument as a YAML string.

@revog

revog commented Sep 16, 2021

Hi @rawmind0
thanks for fixing the machine_global_config issue. I'm now able to predefine custom cluster settings. But unfortunately I again get the error when generating the kubeconfig:

│ Error: Setting cluster V2 legacy data: Timeout getting cluster Kubeconfig: Bad response statusCode [503]. Status [503 Service Unavailable]. Body: [baseType=error, code=ClusterUnavailable, message=ClusterUnavailable 503: cluster not found] from [https://rancher.xxx.tld/v3/clusters/c-m-zhwzxfz7?action=generateKubeconfig]
│
│   with module.rke2.rancher2_cluster_v2.create,
│   on modules/rke2/create.tf line 1, in resource "rancher2_cluster_v2" "create":
│    1: resource "rancher2_cluster_v2" "create" {
│

Were there changes to this function again?

@rawmind0
Contributor Author

Hi @revog, the function didn't change. You are getting a timeout; it seems to be something related to your Rancher installation. The timeout is configurable, have you tried increasing it? https://registry.terraform.io/providers/rancher/rancher2/latest/docs#timeout

@revog

revog commented Sep 16, 2021

You are getting a timeout; it seems to be something related to your Rancher installation.

I did some further investigation and found the cause. As long as I disable ACE, the cluster gets created in Rancher and the kubeconfig is available. But when I enable ACE (by setting local_auth_endpoint), the mentioned error occurs.

For testing I took the former build (without the machine_global_config fix) and, as you could guess, kubeconfig generation with active ACE is possible.

So my assumption is still that something changed in the code?

Contributor

@a-blender a-blender left a comment

Reviewed that the new attributes, resources, and data sources match the changelog. Are any of these files generated by Rancher? How would I know if they are?

@rawmind0
Contributor Author

Reviewed that the new attributes, resources, and data sources match the changelog. Are any of these files generated by Rancher? How would I know if they are?

Thanks for the review @a-blender. The provider uses the Rancher API to work, but none of these files are generated by Rancher. For generated files, we usually use a special prefix such as zz_generated.

@rawmind0
Contributor Author

For testing I took the former build (without the machine_global_config fix) and, as you could guess, kubeconfig generation with active ACE is possible.

So my assumption is still that something changed in the code?

As mentioned, the getClusterKubeconfig function code hasn't changed. The only thing that has changed is how the machine_global_config argument is treated: from a tf map to a string in YAML format. Maybe the configured machine_global_config values are generating something wrong? Does the same cluster config work if it's deployed from the UI?

@revog

revog commented Sep 17, 2021

For testing I took the former build (without the machine_global_config fix) and, as you could guess, kubeconfig generation with active ACE is possible.
So my assumption is still that something changed in the code?

As mentioned, the getClusterKubeconfig function code hasn't changed. The only thing that has changed is how the machine_global_config argument is treated: from a tf map to a string in YAML format. Maybe the configured machine_global_config values are generating something wrong? Does the same cluster config work if it's deployed from the UI?

Hmm, that's strange. When I take the latest build and disable ACE (enable = false in local_auth_endpoint), I am able to create the RKE2 cluster and get the kubeconfig without any problems. After the successful Terraform run I re-enabled ACE (set enable = true) and ran terraform plan and terraform apply - and this led to the same error:

¦ Error: Setting cluster V2 legacy data: Getting cluster Kubeconfig: Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [baseType=error, code=ServerError, message=the server could not find the requested resource (post clusterauthtokens.meta.k8s.io)] from [https://rancher.local.tld/v3/clusters/c-m-6dmfxltw?action=generateKubeconfig]

If I take the former build (without the change related to machine_global_config), remove the machine_global_config part (as it is not yet supported in that build), and enable ACE (set enable = true in local_auth_endpoint), the RKE2 cluster gets created successfully and the kubeconfig is accessible ...

Sorry, but I don't get why this behavior occurs?!

// UPDATE
I did some further investigation and found the specific root cause. As long as no value for tls-san (within machine_global_config) is set, creation works with ACE enabled. When you set a value for tls-san, this error occurs. So it is related to the machine_global_config part you mentioned :-).

@rawmind0
Copy link
Contributor Author

// UPDATE
I did some further investigation and found the specific root cause. As long as no value for tls-san (within machine_global_config) is set, creation works with ACE enabled. When you set a value for tls-san, this error occurs. So it is related to the machine_global_config part you mentioned :-).

@revog thanks for the further investigation on this.

I've done some testing using the Rancher API directly (outside the tfp), and I get the same 503 result when ACE is enabled.

{
    "baseType": "error",
    "code": "ClusterUnavailable",
    "message": "ClusterUnavailable 503: cluster not found",
    "status": 503,
    "type": "error"
}

I've added another fix to the tf provider: logging a warning instead of an error on getClusterKubeconfig 503 errors if ACE is enabled. Could you please test it?

@revog

revog commented Sep 17, 2021

I've added another fix to the tf provider: logging a warning instead of an error on getClusterKubeconfig 503 errors if ACE is enabled. Could you please test it?

Great, it works now as expected. Thank you.

@rawmind0 rawmind0 merged commit 6c5b292 into rancher:master Sep 17, 2021