RKE2 not starting up services on RHEL8 #1539

Closed
jcrosel opened this issue Aug 5, 2021 · 19 comments

@jcrosel

jcrosel commented Aug 5, 2021

Environmental Info:
RKE2 Version:
[myuser@vm1 ~]$ rke2 -v
rke2 version v1.21.3+rke2r1 (2ed0b0d)
go version go1.16.6b7

Node(s) CPU architecture, OS, and Version:
Linux vm1 4.18.0-147.51.2.el8_1.x86_64 #1 SMP Thu Jul 8 06:09:25 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
1 VM, Red Hat Enterprise Linux 8, latest patch level, 2 vCPU, 8 GB RAM

/etc/rancher/rke2/config.yaml:
debug: true
selinux: true

Describe the bug:
rke2-server does not seem to be able to start its components.

Steps To Reproduce:

Added some iptables entries
sudo iptables -A INPUT -p tcp --dport 6443 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 2379 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 2380 -j ACCEPT

Installed Docker
sudo dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
sudo dnf install docker-ce --nobest -y
sudo systemctl start docker
sudo systemctl enable docker
sudo usermod -aG docker $USER

Installed and started RKE2 server
Run the installer: curl -sfL https://get.rke2.io | sh -
Enable the rke2-server service: systemctl enable rke2-server.service
Start the service: systemctl start rke2-server.service

Expected behavior:
RKE2 server starts up its components, such as etcd and kube-apiserver.

Actual behavior:
Connection errors in the log (see attached log file) for ports 2379 (etcd) and 6443 (kube-apiserver).
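
A quick way to confirm whether anything is listening on those two ports is to filter the output of `ss -tln`. The helper below is only a sketch; the function name port_listening is made up for illustration, and it assumes the local address is the fourth column, as in the default `ss -tln` layout.

```shell
# port_listening PORT: read `ss -tln` output on stdin and succeed if some
# socket is listening on PORT. Hypothetical helper, not part of RKE2.
port_listening() {
    awk -v port="$1" '
        # Column 4 is the local address; the port is the text after the last ":".
        { n = split($4, parts, ":"); if (parts[n] == port) found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

For example, `ss -tln | port_listening 6443 && echo "6443 is listening"`, and the same for 2379.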

Additional context / logs:
log-rke2 - Copy.txt

@brandond
Member

brandond commented Aug 5, 2021

Why are you installing and starting Docker before RKE2? RKE2 uses its own embedded containerd; there is no need to install docker beforehand and in fact you are better off not.

It also looks like you've not installed the required selinux packages; normally the installer does this for you so I'm confused how this could happen:
Aug 05 13:15:38 vm1 rke2[15268]: time="2021-08-05T13:15:38Z" level=warning msg="SELinux is enabled for rke2 but process is not running in context 'container_runtime_t', rke2-selinux policy may need to be applied"

If removing docker and installing the selinux packages does not resolve the error, see if there's anything interesting in the containerd log at /var/lib/rancher/rke2/agent/containerd/containerd.log.
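
One way to check for missing packages is to compare the installed package names against a list. The sketch below assumes the packages of interest are rke2-selinux and container-selinux, the two named elsewhere in this thread; the function name missing_packages is hypothetical.

```shell
# missing_packages NAME...: read one installed package name per line on stdin
# (for example from `rpm -qa --qf '%{NAME}\n'`) and print every requested
# name that is not in that list. Hypothetical helper for illustration.
missing_packages() {
    installed=$(cat)
    for pkg in "$@"; do
        # -x matches the whole line, -F treats the name as a fixed string.
        printf '%s\n' "$installed" | grep -qxF -- "$pkg" || printf '%s\n' "$pkg"
    done
}
```

Usage: `rpm -qa --qf '%{NAME}\n' | missing_packages rke2-selinux container-selinux`. No output means both are installed. The containerd log mentioned above can then be read with `tail -n 100 /var/lib/rancher/rke2/agent/containerd/containerd.log`.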

@jcrosel
Author

jcrosel commented Aug 6, 2021

@brandond that's good to know. I had assumed Docker was needed because, as far as I know, it is a requirement for rke (the binary) and for rke in Rancher.
I have now started completely from scratch. These are the steps I took:
rke2-server.log

systemctl disable firewalld
vim /etc/NetworkManager/conf.d/rke2-canal.conf
systemctl reload NetworkManager
dnf upgrade -y
reboot
curl -sfL https://get.rke2.io | sh -
systemctl enable rke2-server.service
systemctl start rke2-server.service
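
The steps above do not show what was written to rke2-canal.conf. As a sketch, assuming the contents follow the NetworkManager section of the RKE2 known-issues page (worth verifying against the current docs), the file could be written non-interactively like this. The function name write_canal_conf is hypothetical.

```shell
# write_canal_conf PATH: write a NetworkManager drop-in that tells
# NetworkManager to leave the cali* and flannel* interfaces unmanaged.
# Contents assumed from the RKE2 known-issues documentation.
write_canal_conf() {
    cat > "$1" <<'EOF'
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:flannel*
EOF
}
```

Run as root: `write_canal_conf /etc/NetworkManager/conf.d/rke2-canal.conf`, followed by `systemctl reload NetworkManager`.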

It is still not working, and the behavior looks the same as before.
Log file attached:
rke2-server.log

Could the issue be the RHEL image in Azure? It seems to be locked to 8.1 and can't (or shouldn't) be updated.
That's the latest version Red Hat provides in the Azure Marketplace.

@brandond
Member

brandond commented Aug 6, 2021

etcd still isn't starting, can you check the containerd log file as requested above?

@jcrosel
Author

jcrosel commented Aug 9, 2021

Something seems to be off with CNI.
containerd.log

@brandond have you ever experienced something similar?

@brandond
Member

brandond commented Aug 9, 2021

time="2021-08-06T06:36:38.554910732Z" level=info msg="CreateContainer within sandbox \"214255c37689a276a424b37dffd2b03b9f7c641b68045f2180b2413dd5275d51\" for &ContainerMetadata{Name:kube-proxy,Attempt:0,} returns container id \"84b2f6932660193fa18d0a555b43e1b4444592ff7b6898aaf9e5c86f01cc05bd\""
time="2021-08-06T06:36:38.555421235Z" level=info msg="StartContainer for \"84b2f6932660193fa18d0a555b43e1b4444592ff7b6898aaf9e5c86f01cc05bd\""
time="2021-08-06T06:36:38.722580775Z" level=info msg="StartContainer for \"84b2f6932660193fa18d0a555b43e1b4444592ff7b6898aaf9e5c86f01cc05bd\" returns successfully"
time="2021-08-06T06:36:52.714149951Z" level=info msg="RunPodsandbox for &PodSandboxMetadata{Name:etcd-adbsg-fzag-k8s-vm1,Uid:985840a449fab27fbd1831f57843061a,Namespace:kube-system,Attempt:0,}"
time="2021-08-06T06:36:52.792435138Z" level=info msg="starting signal loop" namespace=k8s.io path=/run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/e26f8170d80ada8ab2531a87962fec1edbfd7b708c416ff914d2c5e1d6cf662e pid=6507
time="2021-08-06T06:36:52.908778362Z" level=info msg="shim disconnected" id=e26f8170d80ada8ab2531a87962fec1edbfd7b708c416ff914d2c5e1d6cf662e
time="2021-08-06T06:36:52.908851662Z" level=error msg="copy shim log" error="read /proc/self/fd/28: file already closed"
time="2021-08-06T06:36:52.943644879Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:etcd-adbsg-fzag-k8s-vm1,Uid:985840a449fab27fbd1831f57843061a,Namespace:kube-system,Attempt:0,} failed, error" error="failed to create containerd task: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: failed to set /proc/self/attr/keycreate on procfs: write /proc/self/attr/keycreate: invalid argument: unknown"

I've only seen this one other time, in #851 (comment) - and this user was setting a custom-data dir and also using a standalone containerd, both of which complicate things on selinux-enabled hosts. Can you start over on a clean host that doesn't have system docker or containerd installed, and ensure that you're using the RPM install with all the required selinux packages?
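
To see whether a given containerd log contains this failure, and how often, the message text can be counted directly. This is a minimal sketch; the function name count_keycreate_failures is made up, and it matches only the exact wording from the log excerpt above.

```shell
# count_keycreate_failures: read a containerd log on stdin and print the
# number of lines that report the keycreate write failure.
count_keycreate_failures() {
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
    # keeps the function's exit status clean in that case.
    grep -cF 'failed to set /proc/self/attr/keycreate' || true
}
```

Usage: `count_keycreate_failures < /var/lib/rancher/rke2/agent/containerd/containerd.log`.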

@jcrosel
Author

jcrosel commented Aug 12, 2021

I did start again from a clean RHEL 8.1 VM in Azure.
I used the script in the quick start guide to install it. That uses rpm as well, right?

@ansilh
Contributor

ansilh commented Oct 8, 2021

Hi @brandond, I'm able to consistently reproduce this issue on an air-gapped node that uses an HTTP proxy and an internal registry.

  • RHEL 8.2 - 4.18.0-193.el8.x86_64
  • SELinux - enforcing
  • RPMs
# rpm -qa |grep rke2
rke2-selinux-0.8-2.el8.noarch
rke2-server-1.20.10~rke2r1-0.el8.x86_64
rke2-common-1.20.10~rke2r1-0.el8.x86_64
  • YUM proxy config
# grep proxy /etc/yum.conf
proxy=http://squid.ansil.io:3128
  • Registry mirror config
# cat /etc/rancher/rke2/registries.yaml
mirrors:
  docker.io:
    endpoint:
      - "https://registry.ansil.io"
    rewrite:
      "^rancher/(.*)": "proxy/rancher/$1"
  • Installation step
export HTTP_PROXY=squid.ansil.io:3128
export HTTPS_PROXY=$HTTP_PROXY
INSTALL_RKE2_CHANNEL=v1.20 ./install.sh

One thing I noticed during installation: the error below appeared during the RPM post-installation step.

Failed to resolve typeattributeset statement at /var/lib/selinux/targeted/tmp/modules/400/rke2/cil:17
semodule:  Failed!
  • Error in kubelet

E1008 07:13:42.882484    1535 kuberuntime_sandbox.go:70] CreatePodSandbox for pod "etcd-rke2-rhel8.ansil.io_kube-system(ef11ca6fb492d20a062f350f941bb147)" failed: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: failed to set /proc/self/attr/keycreate on procfs: write /proc/self/attr/keycreate: invalid argument: unknown

@ansilh
Contributor

ansilh commented Oct 8, 2021

Something is not right in the SELinux policy file.

# rpm -q --scripts rke2-selinux-0.8-2.el8.noarch
postinstall scriptlet (using /bin/sh):
semodule -n -i /usr/share/selinux/packages/rke2.pp

When I tried to load the module manually, I get the same error

# semodule -n -i /usr/share/selinux/packages/rke2.pp
Failed to resolve typeattributeset statement at /var/lib/selinux/targeted/tmp/modules/400/rke2/cil:17
semodule:  Failed!

@ansilh
Contributor

ansilh commented Oct 8, 2021

Here is the line that is causing the issue:

# cat /usr/share/selinux/packages/rke2.pp   | /usr/libexec/selinux/hll/pp > rke2.cli
# sed -n 17p rke2.cli
(typeattributeset cil_gen_require container_kvm_var_run_t)  <<----

@ansilh
Contributor

ansilh commented Oct 8, 2021

Didn't see any such types in the loaded container module. :(

# semodule -c --extract=container
Module 'container' does not exist at the default priority '400'. Extracting at highest existing priority '200'.

# grep container_kvm_var_run_t container.cil
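
The manual grep above can be generalized to every `cil_gen_require` line in the converted rke2 policy. The sketch below is an approximation: it only checks whether each required name appears as a whole word somewhere in the second file, so it does not prove the name is declared there, and names supplied by other loaded modules would be reported as missing. Both function names are hypothetical.

```shell
# required_names CIL_FILE: print the name from each
# "(typeattributeset cil_gen_require NAME)" line.
required_names() {
    sed -n 's/^[[:space:]]*(typeattributeset cil_gen_require \([^ )]*\)).*$/\1/p' "$1"
}

# missing_names POLICY_CIL OTHER_CIL: print required names from POLICY_CIL
# that never appear as a whole word in OTHER_CIL.
missing_names() {
    required_names "$1" | while IFS= read -r name; do
        grep -qwF -- "$name" "$2" || printf '%s\n' "$name"
    done
}
```

With the files produced by the commands in this thread, that would be `missing_names rke2.cli container.cil`.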

@ansilh
Contributor

ansilh commented Oct 8, 2021

After upgrading the RPM to container-selinux-2.159.0-1.module_el8.5.0+733+9bb5dffa.noarch, I'm able to load the rke2 policy package.

# rpm -Uvh container-selinux-2.159.0-1.module_el8.5.0+733+9bb5dffa.noarch.rpm
warning: container-selinux-2.159.0-1.module_el8.5.0+733+9bb5dffa.noarch.rpm: Header V3 RSA/SHA256 Signature, key ID 8483c65d: NOKEY
Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:container-selinux-2:2.159.0-1.mod################################# [ 50%]
Cleaning up / removing...
   2:container-selinux-2:2.124.0-1.mod################################# [100%]
# semodule -n -i /usr/share/selinux/packages/rke2.pp

Rebooted the node and then all system containers came up.
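
To confirm that the rke2 module made it into the policy store after the manual `semodule -i`, the output of `semodule -l` can be checked. This is a sketch with a hypothetical function name, module_loaded; it compares only the first column because some releases print a version after the module name.

```shell
# module_loaded NAME: read `semodule -l` output on stdin and succeed if a
# module called exactly NAME is listed.
module_loaded() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit (found ? 0 : 1) }'
}
```

Usage: `semodule -l | module_loaded rke2 && echo "rke2 policy module is installed"`.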

Cc: @brandond

@brandond
Member

brandond commented Oct 8, 2021

Yes, upstream pulled in some... ill advised... updates to the container-selinux policy that we have had to work around:
containers/container-selinux#149 (comment)

cc @dweomer

@dweomer
Contributor

dweomer commented Oct 8, 2021

Hmm, should be compatible with 2.159.x

@ansilh
Contributor

ansilh commented Oct 11, 2021

Do we need to mention this in the support matrix or rke2 docs?
Looks like we need RHEL 8.5 to get SELinux working in this case.

@mayank-reynencourt

Hi all,

I'm also facing the same issue while deploying RKE2 with Ansible on RHEL 8.2.

Has anyone found a solution?

@belgaied2

I am having the same issue on RHEL 8.4 on AWS. The error appears to be:

sudo systemctl status rke2-server
● rke2-server.service - Rancher Kubernetes Engine v2 (server)
   Loaded: loaded (/usr/lib/systemd/system/rke2-server.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Thu 2022-03-03 15:36:37 UTC; 728ms ago
     Docs: https://github.com/rancher/rke2#readme
  Process: 17941 ExecStopPost=/bin/sh -c systemd-cgls /system.slice/rke2-server.service | grep -Eo '[0-9]+ (containerd|kubelet)' | awk '{print $1}' | xargs -r kill (code=exited, status=0/SUCCESS)
  Process: 17938 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=1/FAILURE)

It looks like what is failing is the command before the execution of rke2:

[ec2-user@ip-172-31-44-245 ~]$ /bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
[ec2-user@ip-172-31-44-245 ~]$ echo $?
1

Could this be what prevents the service's executable from running?

On the same node, just running rke2 server results in a successful bootstrapping.
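
The leading `!` in that ExecStartPre line inverts the exit status of the check: `systemctl is-enabled` exits 0 for an enabled unit, so the pre-start step fails exactly when nm-cloud-setup.service is enabled. The sketch below reproduces that logic with a hypothetical wrapper named precheck, using `true` and `false` as stand-ins for an enabled and a disabled unit.

```shell
# precheck CMD...: succeed only when CMD fails, mirroring the "! CMD" form
# used in the unit's ExecStartPre line.
precheck() {
    ! "$@"
}

precheck false && echo "unit not enabled: pre-start check passes"
precheck true  || echo "unit enabled: pre-start check fails"
```

On a real node the equivalent call would be `precheck systemctl is-enabled --quiet nm-cloud-setup.service`.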

Steps to reproduce:

  • AWS EC2 instance using the AMI ami-06ec8443c2a35b0ba on eu-central-1 , this instance is based on RHEL 8.4 with kernel 4.18.0-305.el8.x86_64 .
  • All settings default: selinux is enforcing, firewalld is not enabled out-of-the-box
  • Then try to install RKE2:
[ec2-user@ip-172-31-44-245 ~]$ curl -sfL https://get.rke2.io | sudo sh -
[ec2-user@ip-172-31-44-245 ~]$ sudo systemctl enable rke2-server
[ec2-user@ip-172-31-44-245 ~]$ sudo systemctl start rke2-server

Using this user-data should result in the same behavior:

#cloud-config
runcmd:
  - curl -sfL https://get.rke2.io | sudo sh -
  - sudo systemctl enable rke2-server
  - sudo systemctl start rke2-server

@ansilh
Contributor

ansilh commented Mar 3, 2022

@belgaied2 Not the same issue we discussed in the original thread.

Looks like you need to make sure the known issues are addressed as per the rke2 doc.
https://docs.rke2.io/known_issues/#networkmanager

In some operating systems like RHEL 8.4, NetworkManager includes two extra services 
called nm-cloud-setup.service and nm-cloud-setup.timer. 

These services add a routing table that interferes with the CNI plugin's configuration.
Unfortunately, there is no config that can avoid that as explained in the [issue](https://github.com/rancher/rke2/issues/1053). 
Therefore, if those services exist, they should be disabled and the node must be rebooted.
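
As a sketch of that advice, the two units named above can be disabled in one `systemctl disable` call. The wrapper function and the RUN variable are hypothetical, added only so the command can be previewed before it is run for real.

```shell
# disable_nm_cloud_setup: disable both nm-cloud-setup units.
# Set RUN=echo to print the command instead of executing it.
disable_nm_cloud_setup() {
    ${RUN:-} systemctl disable nm-cloud-setup.service nm-cloud-setup.timer
}
```

Preview with `RUN=echo disable_nm_cloud_setup`, then run `disable_nm_cloud_setup` as root and reboot the node.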

@belgaied2

Thanks for the clarification!

@stale

stale bot commented Sep 4, 2022

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.
