Unable to build cluster with CIS profile (cis-1.5) enabled #851
Have you checked out the CIS hardening guide in the docs? |
Thanks, Brad, for pointing me to the docs. I somehow missed that part. I am able to get past the initial failure, but now the kubelet does not seem to start the other Kubernetes services such as the API server, scheduler, etc. The kubelet is trying to reach the API server to register itself, but the API server is not running, so it is stuck at that step. Please advise. |
How long have you given it? Are you using a private registry or airgap image archive to mirror the images locally? It can take a bit to start up the first time as it pulls all the various images and it won't appear to be doing anything until the etcd and apiserver pods are running. |
I have given it plenty of time, but it looks like it is stuck. Looking at the kubelet logs, it appears to be unable to start the containerd process that spins up the Kubernetes services. Any idea why it is failing to start containerd? Is it possible that the containerd configuration is missing something that the CIS profile requires?
|
Is this on a selinux-enabled system? Did you install the correct selinux packages? What distro and kernel is this host running? The best clue I have is at opencontainers/runc#2031 (comment) which suggests this is caused by older kernels + odd selinux configuration? |
Correct. sestatus reports enforcing. We are using RHEL 7.9 and the kernel is 3.10.0-1160.15.2.el7.x86_64. |
We validate on RHEL7 and 8 and I haven't seen this. How did you install RKE2? Can you confirm that you installed from RPM, and have the required rke2-selinux packages (and their dependencies) installed? Edit: I see that you have customized the We don't specifically call this out in the docs at the moment, but we do have other issues regarding it: #474 (comment) |
We installed RKE2 from the tar file. How do I verify that the rke2-selinux dependencies are installed? If you can point me to some docs, that would be great. We changed the data-dir location because our root directory has only limited capacity and is not allowed to grow. Since the data dir will grow over time, I moved it to a new location on a separate EBS volume that we use. Should I remove data-dir and give it a try? Also, how do I customize the location of the etcd database? I did not see any option in RKE2 for that. |
It is recommended that RKE2 be installed from RPM on selinux-enabled systems, as this ensures that all the selinux dependencies are installed. The tarball install does not use the same paths for RKE2 binaries as the RPM, so even if you installed the rke2-selinux RPM alongside the tarball, it still would not fix your problem.
I would recommend mounting the secondary EBS volume at /var/lib/rancher and then using the default data-dir value so that you don't have to try to build your own selinux policy. Ensure that this path is mounted when you install RKE2 so that the selinux labels are set properly.
The etcd database cannot currently be individually relocated; it will always be at $DATADIR/server/db. |
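As a rough illustration of the mount advice above, the sketch below checks whether a path appears as a mount point before RKE2 is installed. The helper name `is_mount_point` is hypothetical, and the `/proc/mounts`-style input format (device, mount point, filesystem, options) is the only assumption it relies on.

```shell
# Succeed if PATH appears as a mount point in a /proc/mounts-style file.
# Usage: is_mount_point PATH [MOUNTS_FILE]   (MOUNTS_FILE defaults to /proc/mounts)
is_mount_point() {
    path=$1
    mounts=${2:-/proc/mounts}
    # Field 2 of each line is the mount point; exit 0 only if one matches exactly.
    awk -v p="$path" '$2 == p { found = 1 } END { exit !found }' "$mounts"
}
```

For example, `is_mount_point /var/lib/rancher || echo "mount the volume first"` could be run before the install script, so that the default data-dir lands on the secondary volume.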
Hi Brad - I am unable to install RKE2 using the RPM. How do I fix this issue? I did not provide any version or install type before running the install script. Are the RPM repos available now? failure: repodata/repomd.xml.asc from rancher-rke2-1.20-stable: [Errno 256] No more mirrors to try. |
Hi Brad - on a side note, I removed the data-dir and selinux options from the config file and kept only the profile cis-1.5 option. I got to a point where it starts all the services, but when I check the node status with kubectl it says the node is NotReady. I found that the Calico install is not being run. I dropped the two tigera YAML files in the server/manifests folder, but it looks like RKE2 did not pick them up to install Calico. I applied the tigera YAML files manually but do not see the calico node created in the kube-system or tigera-operator namespace. How do I install Calico on RKE2? I followed this docs to install calico. |
That's not a file that exists as part of our Yum repo; I'm not sure why your system is looking for it. Is this what originally led you to installing via tarball? Can you compare your repo file?
For the CNI issue - have you disabled canal on all of your servers? Are there any errors in the rke2-server logs regarding deployment of those manifests? |
Hi Brad - I have exactly the same yum repo file, except that it references 1.19, but I am still not sure why yum install is failing on that specific URL.
|
Hi Brad - Sorry for the late reply. I was finally able to install the RKE2 RPMs with a local install. This time I see in the kubelet log file that the container runtime is not ready, even though I have copied the Calico manifest files to the "/var/lib/rancher/rke2/server/manifests" directory. Does RKE2 apply these manifests every time we start the rke2 server service, or only once? Any idea why the kubelet would say the container runtime is not ready? RPMS installed: Error in kubelet Error in containerd |
This appears to be an error in runc, but should be fixed in the version included in RKE2. Do you only see this error when attempting to use your own CNI plugin, or do you get the same thing with Canal? |
Hi Brad, these are the steps I performed before starting rke2-server via the service. It looks like I am missing some install step here. Please advise.
After this I started the rke2 server. It starts the kubelet, which starts the pods for the API server, etc., but the pods fail to start because the CNI is not initialized, with the error that I gave earlier. Does this mean that containerd is not installed properly, or that the CNI plugin is not installed? I thought Calico would install the CNI plugins. Do I need to follow the steps to install CNI plugin mentioned on this page |
Wait, why are you running your own containerd? RKE2 includes its own containerd, and the selinux policies we install will only work for paths used by our containerd, not a user-provided containerd. |
That was good to know. I will skip the containerd install and try again. So you do not think I need to install the CNI plugin for Calico that I mentioned earlier? |
Hi Brad - I skipped the manual containerd install but installed the CNI plugin as described on the Calico page, and I still get the same error. Am I missing a step here? |
At this point I would probably just follow the quick-start instructions and get a basic installation working. Once that is done, try again replacing canal with calico. |
Hi Brad - As suggested, I started from the quick start and was finally able to build a cluster with RKE2 and Calico. However, I see that the DNS service is not running in kube-system, and I see errors in the coredns pod. Any idea why this would happen?
|
Wait, is it not running, or is it running with errors? I don't recognize any of those IP addresses - they're not in any of the normal cluster CIDR ranges. Are you able to identify them within your environment? |
I think this CIDR (192.*) is from the Calico manifests. I am going to update the CALICO_IPV4POOL_CIDR value to the cluster CIDR in the manifests and reinstall Calico. I will update you soon. |
I updated the CALICO_IPV4POOL_CIDR value to the cluster CIDR in the Calico manifests and reinstalled Calico, but I am still getting the same error. The CoreDNS pods are up but showing the errors that I mentioned above. Any idea why CoreDNS is failing? |
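To make the CIDR mismatch concrete, here is a minimal sketch that tests whether an IPv4 address falls inside a given range. The function names are hypothetical; 10.42.0.0/16 and 10.43.0.0/16 are the cluster and service CIDRs from the config in this issue, and 192.168.0.0/16 is assumed to be the pool the stock Calico manifests ship with.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if address $1 lies inside CIDR $2 (for example 10.42.0.0/16).
ip_in_cidr() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    # Build the netmask for the prefix length, clamped to 32 bits.
    mask=$(( (4294967295 << (32 - $bits)) & 4294967295 ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}
```

With these, `ip_in_cidr 192.168.5.9 10.42.0.0/16` fails while `ip_in_cidr 10.42.1.7 10.42.0.0/16` succeeds, which is the kind of check that flags pod addresses outside the configured cluster CIDR.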
Did you rebuild the cluster? It's pretty hard to change cidrs once the cluster is up. |
Yes, I had to rebuild the cluster. Do we need to install a DNS add-on? According to the Kubernetes docs, I could be missing an add-on. |
CoreDNS is your dns addon. Are you getting the exact same messages? Did you see the same messages when using the default CNI? |
Correct, CoreDNS is failing with timeout errors. I did not look into CoreDNS when I tried the default Canal. What do you mean by default CNI? |
You are right, Brad, I do not see any issues if I use the default Canal. I see this CoreDNS timeout only when I use Calico. |
Also, when I run the ctr client, I only get the client version; it then times out and fails to return the server version. I can see that the containerd process is running. Why does it fail to show the server version? ctr version ctr: failed to dial "/run/containerd/containerd.sock": context deadline exceeded |
Hi Brad - I just wanted to check whether Calico works on RKE2? |
Hi Brad - I was able to build an RKE2 Kubernetes cluster with Calico, but I am facing a weird issue now. After the cluster was successfully built and tested, I found that the RKE2 binaries, such as rke2, have disappeared from our install dir. The install dir only has the containerd binaries. We are installing RKE2 in a custom folder and not in the /var/lib/rancher/rke folder. Because of this I could not restart the cluster, since the rke2 binary is missing. Please advise. |
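One possible explanation, which this thread does not confirm: RKE2's bundled containerd is understood to listen on /run/k3s/containerd/containerd.sock rather than the stock /run/containerd/containerd.sock that a plain `ctr` dials. The sketch below, with hypothetical helper names, only reports which of the two candidate socket paths exists.

```shell
# Print the first path from the argument list that exists; fail if none do.
first_existing() {
    for p in "$@"; do
        if [ -e "$p" ]; then
            echo "$p"
            return 0
        fi
    done
    return 1
}

# Candidate sockets: the path RKE2's containerd is assumed to use, then the
# stock containerd default that appears in the error message above.
find_containerd_sock() {
    first_existing /run/k3s/containerd/containerd.sock /run/containerd/containerd.sock
}
```

If a socket is found, something like `ctr --address "$(find_containerd_sock)" version` could then be tried against it; `--address` is ctr's global flag for the socket path.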
Which binaries are you missing? The main RKE2 binary should install to /usr/local/bin/rke2 or /usr/bin/rke2, depending on whether you're using the tarball or RPM. Everything else gets extracted from the runtime image to $DATADIR/data/$RELEASE/bin/ during startup. |
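A small sketch of how the two locations described above could be inspected. The helper name `rke2_bin_dirs` is hypothetical, the default data dir is taken from the paths mentioned earlier in this thread, and the release directory name is matched with a glob rather than guessed.

```shell
# List every per-release bin directory under an RKE2 data dir.
# Usage: rke2_bin_dirs [DATADIR]   (DATADIR defaults to /var/lib/rancher/rke2)
rke2_bin_dirs() {
    datadir=${1:-/var/lib/rancher/rke2}
    for d in "$datadir"/data/*/bin; do
        # An unmatched glob stays literal, so keep only real directories.
        if [ -d "$d" ]; then
            echo "$d"
        fi
    done
    return 0
}
```

Running `command -v rke2` shows where the main binary is, and `rke2_bin_dirs /app/rke2` would list the extracted runtime binaries for a custom data-dir like the one used in this issue.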
I used INSTALL_RKE2_ARTIFACT_PATH to install the RKE2 binaries in the /app/rke2 folder. Now, 12 hours later, they are missing from there. I had successfully tested the cluster after starting the rke2 server, and everything was running fine. |
INSTALL_RKE2_ARTIFACT_PATH is just the path where the tarballs or RPMs and checksums should be found when the install script is run, it is NOT the location that RKE2 is installed to. We don't delete them at the end of the install script, so I am guessing something or someone else is responsible for their removal. Did you perhaps put them in a temporary directory that is cleaned up nightly? |
This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions. |
We are unable to build a cluster with the CIS profile (cis-1.5) enabled. I think it is failing the initial CIS benchmark checks and aborts with the error below due to unmet requirements. We are using containerd as the container runtime. Where are the initial setup steps or requirements documented, so that the initial CIS checks pass and RKE2 can build the cluster successfully?
Error:
missing required user: unknown user etcd
invalid kernel parameter value vm.overcommit_memory=0 - expected 1
invalid kernel parameter value kernel.panic=0 - expected 10
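The three messages above can be reproduced by hand before starting RKE2. This is a minimal sketch with hypothetical function names; the expected values (1 and 10) and the etcd user come straight from the error text, while the way RKE2 itself performs the check is not shown here.

```shell
# Compare a kernel parameter with the value the CIS profile expects.
# Usage: check_param NAME EXPECTED [ACTUAL]
# If ACTUAL is omitted, it is read from the running kernel with `sysctl -n`.
check_param() {
    name=$1
    expected=$2
    if [ $# -ge 3 ]; then
        actual=$3
    else
        actual=$(sysctl -n "$name" 2>/dev/null)
    fi
    if [ "$actual" = "$expected" ]; then
        echo "ok: $name=$actual"
    else
        echo "mismatch: $name=$actual (expected $expected)"
        return 1
    fi
}

# Report whether a local user account exists.
check_user() {
    if id "$1" >/dev/null 2>&1; then
        echo "ok: user $1 exists"
    else
        echo "missing: user $1"
        return 1
    fi
}
```

On the affected host, `check_param vm.overcommit_memory 1`, `check_param kernel.panic 10`, and `check_user etcd` would be expected to report the same three problems until the host is prepared as the hardening guide describes.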
Version: v1.19.7+rke2r1
Config:
write-kubeconfig-mode: "0600"
write-kubeconfig: /app/rke2/kube-config.yaml
data-dir: /app/rke2
cluster-cidr: "10.42.0.0/16"
service-cidr: "10.43.0.0/16"
disable:
cloud-provider-name: "aws"
tls-san:
node-name: ""
node-label:
profile: "cis-1.5"
selinux: true