Unable to build cluster with CIS profile (cis-1.5) enabled #851

Closed
grk-pancham opened this issue Apr 6, 2021 · 36 comments

@grk-pancham

We are unable to build a cluster with the CIS profile (cis-1.5) enabled. It appears to fail in the initial CIS benchmark checks and aborts with the errors below due to unmet requirements. We are using containerd for the container runtime. Where are the initial setup requirements documented so that the initial CIS checks pass and RKE2 can build the cluster successfully?

Error:
missing required user: unknow user etcd
invalid kernel parameter value vm.overcommit_memory=0 - expected 1
invalid kernel parameter value kernel.panic=0 - expected 10

Version: v1.19.7+rke2r1

Config:
write-kubeconfig-mode: "0600"
write-kubeconfig: /app/rke2/kube-config.yaml
data-dir: /app/rke2
cluster-cidr: "10.42.0.0/16"
service-cidr: "10.43.0.0/16"
disable:
  - rke2-canal
cloud-provider-name: "aws"
tls-san:
  - ""
node-name: ""
node-label:
  - "server=rke2-server-dev"
profile: "cis-1.5"
selinux: true
@brandond
Member

brandond commented Apr 6, 2021

Have you checked out the CIS hardening guide in the docs?
https://docs.rke2.io/security/hardening_guide/
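For reference, the host-level prerequisites behind the errors in the original report boil down to an etcd user plus a couple of sysctls. A rough sketch follows (the hardening guide is authoritative and lists further parameters; the sysctl file name here is arbitrary):

useradd -r -c "etcd user" -s /sbin/nologin -M etcd
cat > /etc/sysctl.d/60-rke2-cis.conf <<'EOF'
vm.overcommit_memory=1
kernel.panic=10
EOF
sysctl -p /etc/sysctl.d/60-rke2-cis.conf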

@grk-pancham
Author

Thanks Brad for pointing out the docs. I somehow missed that part.

I am able to get past the initial failure, but now the kubelet does not seem to start the other Kubernetes services like the API server, scheduler, etc. The kubelet is trying to reach the API server to register itself, but the API server is not running, so it is basically stuck in that process. Please advise.

@brandond
Member

brandond commented Apr 6, 2021

How long have you given it? Are you using a private registry or airgap image archive to mirror the images locally? It can take a bit to start up the first time as it pulls all the various images and it won't appear to be doing anything until the etcd and apiserver pods are running.
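A couple of ways to watch that first boot, as a sketch (paths assume a default install):

journalctl -u rke2-server.service -f                      # server log, including image pulls
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps -a                    # static pods as containerd brings them up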

@grk-pancham
Author

grk-pancham commented Apr 6, 2021

I have given it long enough, but it looks like it is stuck. Looking at the kubelet logs, it appears to be unable to start the containerd task to spin up the Kubernetes services. Any idea why it is failing to start containerd? Is it possible that the containerd configuration is missing something required for CIS to be met?

E0406 22:20:09.724172   17733 remote_runtime.go:113] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: failed to set /proc/self/attr/keycreate on procfs: write /proc/self/attr/keycreate: invalid argument: unknown
E0406 22:20:09.724213   17733 kuberuntime_sandbox.go:69] CreatePodSandbox for pod "internal_kube-system(1be4fc34bdb6056763aa9650087de0fb)" failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: failed to set /proc/self/attr/keycreate on procfs: write /proc/self/attr/keycreate: invalid argument: unknown
E0406 22:20:09.724228   17733 kuberuntime_manager.go:741] createPodSandbox for pod "internal_kube-system(1be4fc34bdb6056763aa9650087de0fb)" failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: failed to set /proc/self/attr/keycreate on procfs: write /proc/self/attr/keycreate: invalid argument: unknown
E0406 22:20:09.724311   17733 pod_workers.go:191] Error syncing pod 1be4fc34bdb6056763aa9650087de0fb ("internal_kube-system(1be4fc34bdb6056763aa9650087de0fb)"), skipping: failed to "CreatePodSandbox" for "etcd-ip-10-12-137-185.us-gov-west-1.compute.internal_kube-system(1be4fc34bdb6056763aa9650087de0fb)" with CreatePodSandboxError: "CreatePodSandbox for pod \"internal_kube-system(1be4fc34bdb6056763aa9650087de0fb)\" failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: failed to set /proc/self/attr/keycreate on procfs: write /proc/self/attr/keycreate: invalid argument: unknown"

@brandond
Member

brandond commented Apr 6, 2021

Is this on a selinux-enabled system? Did you install the correct selinux packages? What distro and kernel is this host running?

The best clue I have is at opencontainers/runc#2031 (comment) which suggests this is caused by older kernels + odd selinux configuration?
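A few quick checks that gather this information (standard tooling, nothing RKE2-specific):

sestatus                                             # selinux mode and loaded policy
uname -r                                             # kernel version
cat /etc/os-release                                  # distro and release
rpm -qa | grep -E 'rke2-selinux|container-selinux'   # selinux policy packages, if installed via RPM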

@grk-pancham
Author

Correct, sestatus shows enforcing.

We are using RHEL 7.9 and the kernel is 3.10.0-1160.15.2.el7.x86_64.

@brandond
Member

brandond commented Apr 7, 2021

We validate on RHEL7 and 8 and I haven't seen this. How did you install RKE2? Can you confirm that you installed from RPM, and have the required rke2-selinux packages (and their dependencies) installed?

Edit: I see that you have customized the data-dir: value. This makes your life MUCH harder on selinux-enabled systems, since all the selinux policies apply to files in their default path of /var/lib/rancher/rke2. Is there any reason in particular that you're changing that? Your life will be much easier if you can keep it in the default location - mount a different disk or partition there if necessary (and then reinstall rke2-selinux to trigger restorecon to label it properly), but don't change it.

We don't specifically call this out in the docs at the moment, but we do have other issues regarding it: #474 (comment)

@grk-pancham
Author

We installed RKE2 from the tar file. How do I verify that the rke2-selinux dependencies are installed? If you can point me to some docs, that would be great.

We changed the data-dir location since our root directory has only limited capacity and is not allowed to grow. Since the data dir will grow over time, I updated it to a new location on a separate EBS volume that we use. Should I remove data-dir and give it a try?

Also, how do I customize the location of the etcd database? I did not see any option in RKE2 to customize the location of the etcd database.

@brandond
Member

brandond commented Apr 7, 2021

We installed RKE2 from the tar file. How do I verify that the rke2-selinux dependencies are installed? If you can point me to some docs, that would be great.

It is recommended that RKE2 be installed from RPM on selinux-enabled systems, as this ensures that all the selinux dependencies are installed. The tarball install does not use the same paths for RKE2 binaries as the RPM, so even if you installed the rke2-selinux RPM alongside the tarball, it still would not fix your problem.
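A minimal sketch of the RPM-based install via the install script (INSTALL_RKE2_METHOD is honored by the script, and rpm is already the default on RPM-based distros):

curl -sfL https://get.rke2.io | INSTALL_RKE2_METHOD=rpm sh -
rpm -q rke2-server rke2-common rke2-selinux container-selinux   # verify the selinux policy packages landed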

We changed the data-dir location since our root directory has only limited capacity and is not allowed to grow. Since the data dir will grow over time, I updated it to a new location on a separate EBS volume that we use. Should I remove data-dir and give it a try?

I would recommend mounting the secondary EBS volume at /var/lib/rancher and then using the default data-dir value so that you don't have to try to build your own selinux policy. Ensure that this path is mounted when you install RKE2 so that the selinux labels are set properly.
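For example, something along these lines, assuming the EBS volume shows up as /dev/xvdb and is already formatted (the device name is illustrative):

mkdir -p /var/lib/rancher
echo '/dev/xvdb /var/lib/rancher xfs defaults 0 0' >> /etc/fstab
mount /var/lib/rancher
# after (re)installing the rke2-selinux RPM, re-apply the labels:
restorecon -R -v /var/lib/rancher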

Also, how do I customize the location of the etcd database? I did not see any option in RKE2 to customize the location of the etcd database.

The etcd database cannot currently be individually relocated; it will always be at $DATADIR/server/db.

@grk-pancham
Author

Hi Brad - I am unable to install RKE2 using RPM. How do I fix this issue? I did not provide any version or install type before running the install script. Are the RPM repos available now?

failure: repodata/repomd.xml.asc from rancher-rke2-1.20-stable: [Errno 256] No more mirrors to try.
https://rpm.rancher.io/rke2/stable/1.20/centos/7/x86_64/repodata/repomd.xml.asc: [Errno 14] HTTPS Error 404 - Not Found

@grk-pancham
Author

Hi Brad - on a side note, I removed the data-dir and selinux options from the config file, keeping just the profile: cis-1.5 option. I was able to get to a point where it starts all the services, but when I check the node status with kubectl it says the node is NotReady. I found that the Calico install never ran. I dropped the two tigera YAML files in the server/manifests folder, but it looks like RKE2 did not pick them up to install Calico. I ran the tigera YAML files manually, but I do not see the calico node created in the kube-system or tigera-operator namespace. How do I install Calico on RKE2? I followed these docs to install Calico.

@brandond
Member

brandond commented Apr 7, 2021

That's not a file that exists as part of our Yum repo; I'm not sure why your system is looking for it. Is this what originally led you to installing via tarball? Can you compare your repo file?

[root@centos01 ~]# cat /etc/yum.repos.d/rancher-rke2.repo
[rancher-rke2-common-stable]
name=Rancher RKE2 Common (stable)
baseurl=https://rpm.rancher.io/rke2/stable/common/centos/7/noarch
enabled=1
gpgcheck=1
gpgkey=https://rpm.rancher.io/public.key
[rancher-rke2-1.20-stable]
name=Rancher RKE2 1.20 (stable)
baseurl=https://rpm.rancher.io/rke2/stable/1.20/centos/7/x86_64
enabled=1
gpgcheck=1
gpgkey=https://rpm.rancher.io/public.key

For the CNI issue - have you disabled canal on all of your servers? Are there any errors in the rke2-server logs regarding deployment of those manifests?
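One way to check both, as a sketch (default paths assumed; the grep terms are just filters):

cat /etc/rancher/rke2/config.yaml                                        # confirm rke2-canal is disabled on every server
journalctl -u rke2-server.service | grep -iE 'manifest|tigera|calico'    # any errors applying the manifests
ls -l /var/lib/rancher/rke2/server/manifests/                            # manifests RKE2 will apply on startup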

@grk-pancham
Author

grk-pancham commented Apr 8, 2021

Hi Brad - I have the exact same yum repo file, except it references 1.19, but I'm still not sure why yum install is failing on that specific URL.

[rancher-rke2-common-stable]
name=Rancher RKE2 Common (stable)
baseurl=https://rpm.rancher.io/rke2/stable/common/centos/7/noarch
enabled=1
gpgcheck=1
gpgkey=https://rpm.rancher.io/public.key
[rancher-rke2-1.19-stable]
name=Rancher RKE2 1.19 (stable)
baseurl=https://rpm.rancher.io/rke2/stable/1.19/centos/7/x86_64
enabled=1
gpgcheck=1
gpgkey=https://rpm.rancher.io/public.key

@grk-pancham
Author

grk-pancham commented Apr 22, 2021

Hi Brad - sorry for the late reply. I was finally able to install the RKE2 RPMs with a local install. This time I see in the kubelet log file that the container runtime is not ready, even though I have copied the Calico manifest files to the "/var/lib/rancher/rke2/server/manifests" directory. Does RKE2 apply these manifests every time the rke2-server service starts, or only once? Any idea why the kubelet would say the container runtime is not ready?

RPMs installed:
container-selinux-2.119.2-1.911c772.el7_8.noarch.rpm
rke2-common-1.19.9~rke2r1-0.el8.x86_64.rpm
rke2-selinux-0.4-1.el8.noarch.rpm
rke2-server-1.19.9~rke2r1-0.el8.x86_64.rpm

Error in kubelet
E0422 03:54:16.745706 17705 kubelet.go:2134] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

Error in containerd
error="failed to create containerd task: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: failed to set /proc/self/attr/keycreate on procfs: write /proc/self/attr/keycreate: invalid argument: unknown"

@brandond
Member

brandond commented Apr 22, 2021

failed to set /proc/self/attr/keycreate on procfs: write /proc/self/attr/keycreate: invalid argument: unknown

opencontainers/runc#2030

This appears to be an error in runc, but should be fixed in the version included in RKE2. Do you only see this error when attempting to use your own CNI plugin, or do you get the same thing with Canal?

@grk-pancham
Author

grk-pancham commented Apr 22, 2021

Hi Brad,

These are the steps I performed before starting rke2-server via the service. It looks like I am missing some install step here. Please advise.

  1. Installed containerd via the tar file. This installs the ctr client, and when I run ctr version I get both the client and server version. Please note I have not installed the containerd.io package mentioned in the Kubernetes containerd install page, as quoted below. Is that required?
    "Install the containerd.io package from the official Docker repositories. Instructions for setting up the Docker repository for your respective Linux distribution and installing the containerd.io package can be found at Install Docker Engine."
  2. Installed the below RPMs:
    container-selinux-2.119.2-1.911c772.el7_8.noarch.rpm
    rke2-common-1.19.9~rke2r1-0.el8.x86_64.rpm
    rke2-selinux-0.4-1.el8.noarch.rpm
    rke2-server-1.19.9~rke2r1-0.el8.x86_64.rpm
  3. Copied the Calico manifests to the /var/lib/rancher/rke2/server/manifests directory.

After this I started the rke2 server. It starts the kubelet, which starts the pods for the API server, etc., but the pods fail to start due to the CNI not being initialized and the error I gave earlier. Does this mean that containerd is not installed properly, or that the CNI plugin is not installed? I thought Calico would install the CNI plugins. Do I need to follow the steps to install the CNI plugin mentioned on this page:
https://docs.projectcalico.org/getting-started/kubernetes/hardway/install-cni-plugin? I also see that the --network-plugin=cni parameter is missing from the kubelet process.

@brandond
Member

Wait, why are you running your own containerd? RKE2 includes its own containerd, and the selinux policies we install will only work for paths used by our containerd, not a user-provided containerd.

@grk-pancham
Author

That is good to know. I will skip the containerd install and try.

So you do not think I need to install the CNI plugin for Calico that I mentioned earlier?

@grk-pancham
Author

Hi Brad - I skipped the manual containerd install but installed the CNI plugin as mentioned on the Calico page, and I still get the same error. Am I missing any step here?

@brandond
Member

At this point I would probably just follow the quick-start instructions and get a basic installation working. Once that is done, try again replacing canal with calico.
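For reference, the quick-start boils down to roughly the following (a sketch; the quick-start docs are authoritative):

curl -sfL https://get.rke2.io | sh -
systemctl enable --now rke2-server.service
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes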

@grk-pancham
Author

Hi Brad - as suggested, I started from the quick start and was finally able to build a cluster with RKE2 and Calico. However, I see that the DNS service is not running in kube-system, and I see errors in the coredns pod. Any idea why this would happen?

.:53
[INFO] plugin/reload: Running configuration MD5 = 7da3877dbcacfd983f39051ecafd33bd
CoreDNS-1.6.9
linux/amd64, go1.15.8b5, 17665683
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:60721->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:56845->10.11.176.134:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:54198->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:54023->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:53421->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:50191->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:33279->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:39477->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:51302->10.11.176.235:53: i/o timeout
[ERROR] plugin/errors: 2 7533043099195048891.7991570519037476386. HINFO: read udp 192.168.88.130:58005->10.11.176.235:53: i/o timeout

@brandond
Member

brandond commented Apr 23, 2021

Wait, is it not running, or is it running with errors?

I don't recognize any of those IP addresses - they're not in any of the normal cluster CIDR ranges. Are you able to identify them within your environment?

@grk-pancham
Author

I think this CIDR (192.*) is from the Calico manifests. I am going to update the CALICO_IPV4POOL_CIDR value to the cluster CIDR in the manifests and reinstall Calico. Will update you soon.
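For reference, with the tigera-operator manifests the pod CIDR is set on the Installation resource; a minimal sketch matching the cluster-cidr from the config above (the file name is arbitrary, and with the plain calico.yaml manifest the equivalent knob is the CALICO_IPV4POOL_CIDR env var):

cat > /var/lib/rancher/rke2/server/manifests/calico-installation.yaml <<'EOF'
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
      - cidr: 10.42.0.0/16   # must match RKE2's cluster-cidr
EOF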

@grk-pancham
Author

I updated the CALICO_IPV4POOL_CIDR value to the cluster CIDR in the Calico manifests and reinstalled Calico, but I'm still getting the same error. The coredns pods are up but showing the errors I mentioned above. Any idea why coredns is failing?

@brandond
Member

Did you rebuild the cluster? It's pretty hard to change cidrs once the cluster is up.

@grk-pancham
Author

Yes, I had to rebuild the cluster. Do we need to install any DNS add-on? According to the Kubernetes docs, I could be missing an add-on:
https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
If you see that no CoreDNS Pod is running or that the Pod has failed/completed, the DNS add-on may not be deployed by default in your current environment and you will have to deploy it manually.

@brandond
Member

CoreDNS is your dns addon. Are you getting the exact same messages? Did you see the same messages when using the default CNI?

@grk-pancham
Author

Correct, CoreDNS is failing with timeout errors.

I did not look into CoreDNS when I tried the default Canal. What do you mean by default CNI?

@grk-pancham
Author

You are right, Brad - I do not see any issues if I use the default Canal. I see this coredns timeout only when I use Calico.

@grk-pancham
Author

Also, when I run the ctr client, I only get the client version; it times out and fails to return the server version. I can see the containerd process is running. Why does it fail to show the server version?

ctr version
Client:
  Version: v1.4.4-k3s1
  Revision: 70786f0464ebb57cc75df378049a52850d71cc66
  Go version: go1.15.8b5

ctr: failed to dial "/run/containerd/containerd.sock": context deadline exceeded
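RKE2's bundled containerd listens on its own socket rather than /run/containerd/containerd.sock, which is why the plain ctr call above times out. A sketch pointing the bundled clients at it (default install paths assumed):

/var/lib/rancher/rke2/bin/ctr --address /run/k3s/containerd/containerd.sock version
/var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps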

@grk-pancham
Author

Hi Brad - I just wanted to check: does Calico work on RKE2?

@grk-pancham
Author

Hi Brad - I was able to build the RKE2 Kubernetes cluster with Calico, but I am facing a weird issue now. After the cluster was successfully built and tested, I found that the RKE2 binaries, like rke2, have disappeared from our install dir. The install dir only has the containerd binaries. We are installing RKE2 in a custom folder and not in the /var/lib/rancher/rke2 folder. Because of this I could not restart the cluster, since the rke2 binary is missing. Please advise.

@brandond
Member

brandond commented Apr 26, 2021

Which binaries are you missing? The main RKE2 binary should install to /usr/local/bin/rke2 or /usr/bin/rke2, depending on whether you're using the tarball or RPM. Everything else gets extracted from the runtime image to $DATADIR/data/$RELEASE/bin/ during startup.
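A quick way to check, as a sketch (the wildcard stands in for the hashed release directory):

which rke2                               # /usr/local/bin/rke2 (tarball) or /usr/bin/rke2 (RPM)
ls /var/lib/rancher/rke2/data/*/bin/     # kubelet, containerd, ctr, crictl, kubectl, ...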

@grk-pancham
Author

I used INSTALL_RKE2_ARTIFACT_PATH to install the RKE2 binaries in the /app/rke2 folder. Now they are missing from there after 12 hours. I successfully tested the cluster after running the rke2 server and everything was running fine.
BTW, they are also not at /usr/local/bin/rke2 or /usr/bin/rke2.

@brandond
Member

INSTALL_RKE2_ARTIFACT_PATH is just the path where the tarballs or RPMs and checksums should be found when the install script is run; it is NOT the location that RKE2 is installed to. We don't delete them at the end of the install script, so I am guessing something or someone else is responsible for their removal. Did you perhaps put them in a temporary directory that is cleaned up nightly?
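For reference, a sketch of how the artifact path is used during an air-gapped install (directory and file names follow the air-gap docs; the install itself still lands in the standard locations):

mkdir -p /root/rke2-artifacts && cd /root/rke2-artifacts
# stage rke2.linux-amd64.tar.gz, sha256sum-amd64.txt and rke2-images.linux-amd64.tar.zst here
curl -sfL https://get.rke2.io --output install.sh
INSTALL_RKE2_ARTIFACT_PATH=/root/rke2-artifacts sh install.sh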

@stale

stale bot commented Oct 23, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Oct 23, 2021
@stale stale bot closed this as completed Nov 7, 2021