
Add support to rke2 and k3s #737

Merged · 6 commits into rancher:master from the rke2v2 branch · Sep 17, 2021

Conversation

rawmind0
Contributor

@rawmind0 rawmind0 commented Sep 2, 2021

Requires #711

@rawmind0 rawmind0 mentioned this pull request Sep 2, 2021
@rawmind0 rawmind0 self-assigned this Sep 6, 2021
@rawmind0 rawmind0 force-pushed the rke2v2 branch 5 times, most recently from ffcae95 to f550cc1 on September 10, 2021 11:24
@rawmind0 rawmind0 force-pushed the rke2v2 branch 2 times, most recently from b79d74f to 7097e25 on September 14, 2021 12:07
@revog

revog commented Sep 14, 2021

@rawmind0
I've been following your development of this Terraform provider for a few days. Since a pre-built provider is not yet available for download, I built it on my own.
After creating the RKE2 cluster in Rancher with TF Provider I get the following error:

│ Error: Setting cluster V2 legacy data: Bad response statusCode [503]. Status [503 Service Unavailable]. Body: [baseType=error, code=ClusterUnavailable, message=ClusterUnavailable 503: cluster not found] from [https://rancher.css.ch/v3/clusters/c-m-b9sw227w?action=generateKubeconfig]
│
│   with module.cluster.rancher2_cluster_v2.create,
│   on modules/cluster/create.tf line 1, in resource "rancher2_cluster_v2" "create":
│    1: resource "rancher2_cluster_v2" "create" {

Same error occurs also when trying to generate the Kubeconfig at https://rancher/v3/clusters:

{
"baseType": "error",
"code": "ClusterUnavailable",
"message": "ClusterUnavailable 503: cluster not found",
"status": 503,
"type": "error"
}

This error makes it impossible to run the registration node_commands on the nodes with Terraform, as its execution gets aborted.
After registering one node by hand, the TF plan runs fine. My assumption is that the kubeconfig only becomes available for download after that.

@rawmind0
Contributor Author

@revog, thanks for the feedback; that's still WIP

@revog

revog commented Sep 14, 2021

@revog, thanks for the feedback; that's still WIP

I know ;-) - just wanted to let you know
btw - currently I'm testing a fix

@rawmind0
Contributor Author

rawmind0 commented Sep 14, 2021

Anyway, I didn't get this error at any point in my tests. Are you getting it consistently? It seems like something related to Rancher API availability (503 Service Unavailable). Also, cluster V2 has been added to the acceptance tests and it's working fine.

@rawmind0 rawmind0 changed the title from "[WIP] Add support to rke2 and k3s" to "Add support to rke2 and k3s" Sep 14, 2021
@rawmind0
Contributor Author

Tested k3s and rke2 cluster deployments for custom and amazonec2:

  • Custom:
# Create a new rancher v2 RKE2 custom Cluster v2
resource "rancher2_cluster_v2" "foo" {
  name = "foo"
  fleet_namespace = "fleet-ns"
  kubernetes_version = "v1.21.4+rke2r2"
  enable_network_policy = false
  default_cluster_role_for_project_members = "user"
}

# Create a new rancher v2 K3S custom Cluster v2
resource "rancher2_cluster_v2" "foo" {
  name = "foo"
  fleet_namespace = "fleet-ns"
  kubernetes_version = "v1.21.4+k3s1"
  enable_network_policy = false
  default_cluster_role_for_project_members = "user"
}
  • amazonec2, defining chart_values:
# Create amazonec2 cloud credential
resource "rancher2_cloud_credential" "foo" {
  name = "foo"
  amazonec2_credential_config {
    access_key = "<ACCESS_KEY>"
    secret_key = "<SECRET_KEY>"
  }
}

# Create amazonec2 machine config v2
resource "rancher2_machine_config_v2" "foo" {
  generate_name = "test-foo"
  amazonec2_config {
    ami = "<AMI_ID>"
    region = "<REGION>"
    security_group = ["<AWS_SG>"]
    subnet_id = "<SUBNET_ID>"
    vpc_id = "<VPC_ID>"
    zone = "<ZONE>"
  }
}

resource "rancher2_cluster_v2" "foo" {
  name = "foo"
  kubernetes_version = "v1.21.4+k3s1"
  enable_network_policy = false
  rke_config {
    machine_pools {
      name = "pool1"
      cloud_credential_secret_name = rancher2_cloud_credential.foo.id
      control_plane_role = true
      etcd_role = true
      worker_role = true
      quantity = 1
      machine_config {
        kind = rancher2_machine_config_v2.foo.kind
        name = rancher2_machine_config_v2.foo.name
      }
    }
    machine_global_config = {
      cni = "calico"
      disable-kube-proxy = false
      etcd-expose-metrics = false
    }
    upgrade_strategy {
      control_plane_concurrency = "10%"
      worker_concurrency = "10%"
    }
    etcd {
      snapshot_schedule_cron = "0 */5 * * *"
      snapshot_retention = 5
    }
    chart_values = <<EOF
rke2-calico:
  calicoctl:
    image: rancher/mirrored-calico-ctl
    tag: v3.19.2
  certs:
    node:
      cert: null
      commonName: null
      key: null
    typha:
      caBundle: null
      cert: null
      commonName: null
      key: null
  felixConfiguration:
    featureDetectOverride: ChecksumOffloadBroken=true
  global:
    systemDefaultRegistry: ""
  imagePullSecrets: {}
  installation:
    calicoNetwork:
      bgp: Disabled
      ipPools:
      - blockSize: 24
        cidr: 10.42.0.0/16
        encapsulation: VXLAN
        natOutgoing: Enabled
    controlPlaneTolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/control-plane
      operator: Exists
    - effect: NoExecute
      key: node-role.kubernetes.io/etcd
      operator: Exists
    enabled: true
    imagePath: rancher
    imagePrefix: mirrored-calico-
    kubernetesProvider: ""
  ipamConfig:
    autoAllocateBlocks: true
    strictAffinity: true
  tigeraOperator:
    image: rancher/mirrored-calico-operator
    registry: docker.io
    version: v1.17.6
EOF
  }
}

The amazonec2, azure, digitalocean, linode, openstack, and vsphere cloud providers are already supported for machine config and cluster V2.

@rawmind0 rawmind0 requested a review from a team September 14, 2021 18:02
@revog

revog commented Sep 15, 2021

Anyway, I didn't get this error at any point in my tests. Are you getting it consistently? It seems like something related to Rancher API availability (503 Service Unavailable). Also, cluster V2 has been added to the acceptance tests and it's working fine.

Yes, we get this error all the time. The main difference from your deployments is that we are running Rancher and RKE2 on-premise, NOT in the cloud. I could imagine the behavior differs between these two contexts?

We temporarily fixed it in file resource_rancher2_cluster_v2.go (line 288) to avoid any thrown errors. This way the kube_config attribute will be set to "" until a kubeconfig file is generated and downloadable (after bootstrapping of the first node).

if err != nil {
	// Workaround: if the kubeconfig cannot be generated yet (e.g. a 503
	// before the first node is bootstrapped), leave kube_config empty
	// instead of failing the whole read.
	//return fmt.Errorf("Setting cluster V2 legacy data: %v", err)
	d.Set("kube_config", "")
	return nil
}
d.Set("kube_config", kubeConfig.Config)

@rawmind0
Contributor Author

Yes, we get this error all the time. The main difference from your deployments is that we are running Rancher and RKE2 on-premise, NOT in the cloud. I could imagine the behavior differs between these two contexts?

I've tested both scenarios, on-premise and in the cloud, and never got this error. It's likely related to a race condition. The same issue has been reported in #742.

We temporarily fixed it in file resource_rancher2_cluster_v2.go (line 288) to avoid any thrown errors. This way the kube_config attribute will be set to "" until a kubeconfig file is generated and downloadable (after bootstrapping of the first node).

Thanks for the fix proposal, but I think a retry would fit better, as with this approach the kube_config may be left empty. Working on a fix along these lines.

@rawmind0
Contributor Author

@revog, updated the PR adding a fix for the kube-config issue: https://github.com/rancher/terraform-provider-rancher2/pull/737/files#diff-035d1bb527d3ce1fde79a8bf6a191e1c441f7e2074f17caa106ba35a66fefa0bR576

Could you please test it, as I'm not able to reproduce it? Thanks!

@revog

revog commented Sep 15, 2021

@revog, updated the PR adding a fix for the kube-config issue: https://github.com/rancher/terraform-provider-rancher2/pull/737/files#diff-035d1bb527d3ce1fde79a8bf6a191e1c441f7e2074f17caa106ba35a66fefa0bR576

Could you please test it, as I'm not able to reproduce it? Thanks!

Great, with the fix the error is gone. Thank you!

@revog

revog commented Sep 15, 2021

Hi @rawmind0,
I found another issue regarding elements in the machine_global_config attribute.

Error: Incorrect attribute value type
│
│   on modules/rke2/create.tf line 17, in resource "rancher2_cluster_v2" "create":
│   17:     machine_global_config = {
│   18:       cluster-cidr = "10.220.0.0/16"
│   19:       service-cidr = "10.221.0.0/16"
│   20:       cni = "cilium" # MAKE VAR
│   21:       profile = "cis-1.6" # MAKE VAR
│   22:       disable = "rke2-ingress-nginx"
│   23:       kube-apiserver-arg = [
│   24:         "anonymous-auth=true",
│   25:         "authentication-token-webhook-config-file=/var/lib/rancher/rke2/kube-api-authn-webhook.yaml"
│   26:       ]
│   27: #      kube-controller-manager-arg = ""
│   28: #      kube-scheduler-arg = ""
│   29: #      tls-san = "rke-${local.stage}-api.css.ch"
│   30:     }
│
│ Inappropriate value for attribute "machine_global_config": element "kube-apiserver-arg": string required.

What's the correct way to pass custom parameters to the K8s components like kube-apiserver, kube-controller-manager, or kube-scheduler?
The same error also applies to the disable and tls-san elements. When I enter the value (as desired) as a string, the string gets split in the Rancher web UI and every character of the string is put in its own "field".

[Screenshot: the tls-san string split into per-character entries in the Rancher web UI]

@rawmind0
Contributor Author

Hi @revog, yes, this is expected, as the tf map type is a map[string]string instead of a map[string]interface{}, so it doesn't support arrays, etc. What can be done here is to treat this as a YAML string instead of a tf map, as is done for the chart_values argument.

@rawmind0
Contributor Author

Updated the PR to treat the machine_global_config argument as a YAML string.

@revog

revog commented Sep 16, 2021

Hi @rawmind0
thanks for fixing the machine_global_config issue. I'm now able to predefine custom cluster settings. But unfortunately I again get the error when generating the kubeconfig:

│ Error: Setting cluster V2 legacy data: Timeout getting cluster Kubeconfig: Bad response statusCode [503]. Status [503 Service Unavailable]. Body: [baseType=error, code=ClusterUnavailable, message=ClusterUnavailable 503: cluster not found] from [https://rancher.xxx.tld/v3/clusters/c-m-zhwzxfz7?action=generateKubeconfig]
│
│   with module.rke2.rancher2_cluster_v2.create,
│   on modules/rke2/create.tf line 1, in resource "rancher2_cluster_v2" "create":
│    1: resource "rancher2_cluster_v2" "create" {
│

Were there changes to this function again?

@rawmind0
Contributor Author

Hi @revog, the function didn't change. You are getting a timeout; it seems to be something related to your Rancher installation. The timeout is configurable, have you tried increasing it? https://registry.terraform.io/providers/rancher/rancher2/latest/docs#timeout

@revog

revog commented Sep 16, 2021

You are getting a timeout; it seems to be something related to your Rancher installation.

I did some further investigation and found the cause. As long as I disable ACE, the cluster gets created in Rancher and the kubeconfig is available. But when I enable ACE (by setting local_auth_endpoint), the mentioned error occurs.

For testing I took the former build (without the machine_global_config fix) and, as you could guess, kubeconfig generation with active ACE is possible.

So my assumption is still that something changed in the code?

Contributor

@a-blender a-blender left a comment

Reviewed that the new attributes, resources, and data sources match the changelog. Are any of these files generated by Rancher? How would I know if they are?

@rawmind0
Contributor Author

Reviewed that the new attributes, resources, and data sources match the changelog. Are any of these files generated by Rancher? How would I know if they are?

Thanks for the review @a-blender. The provider uses the Rancher API to work, but none of these files are generated by Rancher. For generated files, we usually use a special prefix such as zz_generated.

@rawmind0
Contributor Author

For testing I took the former build (without the machine_global_config fix) and, as you could guess, kubeconfig generation with active ACE is possible.

So my assumption is still that something changed in the code?

As mentioned, the getClusterKubeconfig function code hasn't changed. The only thing that has changed is how the machine_global_config argument is treated: from a tf map to a string in YAML format. Maybe the configured machine_global_config values are generating something wrong? Does the same cluster config work if it's deployed from the UI?

@revog

revog commented Sep 17, 2021

For testing I took the former build (without the machine_global_config fix) and, as you could guess, kubeconfig generation with active ACE is possible.
So my assumption is still that something changed in the code?

As mentioned, the getClusterKubeconfig function code hasn't changed. The only thing that has changed is how the machine_global_config argument is treated: from a tf map to a string in YAML format. Maybe the configured machine_global_config values are generating something wrong? Does the same cluster config work if it's deployed from the UI?

Hmm, that's strange. When I take the latest build and disable ACE (enable = false in local_auth_endpoint), I am able to create the RKE2 cluster and get the kubeconfig without any problems. After the successful Terraform run I re-enabled ACE (set enable = true) and ran terraform plan and terraform apply - and this led to the same error:

¦ Error: Setting cluster V2 legacy data: Getting cluster Kubeconfig: Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [baseType=error, code=ServerError, message=the server could not find the requested resource (post clusterauthtokens.meta.k8s.io)] from [https://rancher.local.tld/v3/clusters/c-m-6dmfxltw?action=generateKubeconfig]

If I take the former build (without the change related to machine_global_config), remove the machine_global_config part (as it is not yet supported in that build), and enable ACE (set enable = true in local_auth_endpoint), the RKE2 cluster gets created successfully and the kubeconfig is accessible ...

Sorry, but I don't get why this behavior occurs?!

// UPDATE
I did some further investigation and found the specific root cause. As long as no value for tls-san (within machine_global_config) is set, creation works with ACE enabled. When you set a value for tls-san, this error occurs. So it is related to the machine_global_config part you mentioned :-).

@rawmind0
Copy link
Contributor Author

// UPDATE
I did some further investigation and found the specific root cause. As long as no value for tls-san (within machine_global_config) is set, creation works with ACE enabled. When you set a value for tls-san, this error occurs. So it is related to the machine_global_config part you mentioned :-).

@revog thanks for the further investigation on this.

I've done some testing using the Rancher API directly (outside the tfp), and I get the same 503 result when ACE is enabled.

{
    "baseType": "error",
    "code": "ClusterUnavailable",
    "message": "ClusterUnavailable 503: cluster not found",
    "status": 503,
    "type": "error"
}

I've added another fix to the tf provider: logging a warning instead of an error on getClusterKubeconfig 503 errors if ACE is enabled. Could you please test it?

@revog

revog commented Sep 17, 2021

I've added another fix to the tf provider: logging a warning instead of an error on getClusterKubeconfig 503 errors if ACE is enabled. Could you please test it?

Great, it works now as expected. Thank you.

@rawmind0 rawmind0 merged commit 6c5b292 into rancher:master Sep 17, 2021