Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

entrypoint cleanup + non-systemd-host fix #2767

Merged
merged 6 commits into from
May 19, 2022
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion images/base/files/etc/systemd/system/kubelet.service
Original file line number Diff line number Diff line change
Expand Up @@ -15,9 +15,10 @@ Restart=always
StartLimitInterval=0
# NOTE: kind deviates from upstream here with a lower RestartSec
RestartSec=1s
# and here
# And by adding the [Service] lines below
CPUAccounting=true
MemoryAccounting=true
Slice=kubelet.slice

[Install]
WantedBy=multi-user.target
7 changes: 7 additions & 0 deletions images/base/files/etc/systemd/system/kubelet.slice
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
[Unit]
Description=slice used to run Kubernetes / Kubelet
Before=slices.target

[Slice]
MemoryAccounting=true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these parameters ***Accounting needed twice?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes these parameters are for the slice. group vs specific unit.

https://www.freedesktop.org/software/systemd/man/systemd.slice.html

CPUAccounting=true
94 changes: 50 additions & 44 deletions images/base/files/usr/local/bin/entrypoint
Original file line number Diff line number Diff line change
Expand Up @@ -55,43 +55,43 @@ validate_userns() {
}

overlayfs_preferrable() {
if [[ -z "$userns" ]]; then
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tabs => double space (like the rest of the file)

# If we are outside userns, we can always assume overlayfs is preferrable
return 0
fi

# Debian 10 and 11 supports overlayfs in userns with a "permit_mount_in_userns" kernel patch,
# but known to be unstable, so we avoid using it https://github.com/moby/moby/issues/42302
if [[ -e "/sys/module/overlay/parameters/permit_mounts_in_userns" ]]; then
echo "INFO: UserNS: kernel seems supporting overlayfs with permit_mounts_in_userns, but avoiding due to instability."
return 1
fi

# Check overlayfs availability, by attempting to mount it.
#
# Overlayfs inside userns is known to be available for the following environments:
# - Kernel >= 5.11 (but 5.11 and 5.12 have issues on SELinux hosts. Fixed in 5.13.)
# - Ubuntu kernel
# - Debian kernel (but avoided due to instability, see the /sys/module/overlay/... check above)
# - Sysbox
tmp=$(mktemp -d)
mkdir -p "${tmp}/l" "${tmp}/u" "${tmp}/w" "${tmp}/m"
if ! mount -t overlay -o lowerdir="${tmp}/l,upperdir=${tmp}/u,workdir=${tmp}/w" overlay "${tmp}/m"; then
echo "INFO: UserNS: kernel does not seem to support overlayfs."
rm -rf "${tmp}"
return 1
fi
umount "${tmp}/m"
rm -rf "${tmp}"

# Detect whether SELinux is Enforcing (or Permitted) by grepping /proc/self/attr/current .
# Note that we cannot use `getenforce` command here because /sys/fs/selinux is typically not mounted for containers.
if grep -q "_t:" "/proc/self/attr/current"; then
# When the kernel is before v5.13 and SELinux is enforced, fuse-overlayfs might be safer, so we print a warning (but not an error).
# https://github.com/torvalds/linux/commit/7fa2e79a6bb924fa4b2de5766dab31f0f47b5ab6
echo "WARN: UserNS: SELinux might be Enforcing. If you see an error related to overlayfs, try setting \`KIND_EXPERIMENTAL_CONTAINERD_SNAPSHOTTER=fuse-overlayfs\` ." >&2
fi
return 0
if [[ -z "$userns" ]]; then
# If we are outside userns, we can always assume overlayfs is preferrable
return 0
fi

# Debian 10 and 11 supports overlayfs in userns with a "permit_mount_in_userns" kernel patch,
# but known to be unstable, so we avoid using it https://github.com/moby/moby/issues/42302
if [[ -e "/sys/module/overlay/parameters/permit_mounts_in_userns" ]]; then
echo "INFO: UserNS: kernel seems supporting overlayfs with permit_mounts_in_userns, but avoiding due to instability."
return 1
fi

# Check overlayfs availability, by attempting to mount it.
#
# Overlayfs inside userns is known to be available for the following environments:
# - Kernel >= 5.11 (but 5.11 and 5.12 have issues on SELinux hosts. Fixed in 5.13.)
# - Ubuntu kernel
# - Debian kernel (but avoided due to instability, see the /sys/module/overlay/... check above)
# - Sysbox
tmp=$(mktemp -d)
mkdir -p "${tmp}/l" "${tmp}/u" "${tmp}/w" "${tmp}/m"
if ! mount -t overlay -o lowerdir="${tmp}/l,upperdir=${tmp}/u,workdir=${tmp}/w" overlay "${tmp}/m"; then
echo "INFO: UserNS: kernel does not seem to support overlayfs."
rm -rf "${tmp}"
return 1
fi
umount "${tmp}/m"
rm -rf "${tmp}"

# Detect whether SELinux is Enforcing (or Permitted) by grepping /proc/self/attr/current .
# Note that we cannot use `getenforce` command here because /sys/fs/selinux is typically not mounted for containers.
if grep -q "_t:" "/proc/self/attr/current"; then
# When the kernel is before v5.13 and SELinux is enforced, fuse-overlayfs might be safer, so we print a warning (but not an error).
# https://github.com/torvalds/linux/commit/7fa2e79a6bb924fa4b2de5766dab31f0f47b5ab6
echo "WARN: UserNS: SELinux might be Enforcing. If you see an error related to overlayfs, try setting \`KIND_EXPERIMENTAL_CONTAINERD_SNAPSHOTTER=fuse-overlayfs\` ." >&2
fi
return 0
}

configure_containerd() {
Expand Down Expand Up @@ -208,6 +208,8 @@ fix_cgroup() {
return
fi
echo 'INFO: detected cgroup v1'
# We're looking for the cgroup-path for the cpu controller for the
# current process. this tells us what cgroup-path the container is in.
local current_cgroup
current_cgroup=$(grep -E '^[^:]*:([^:]*,)?cpu(,[^,:]*)?:.*' /proc/self/cgroup | cut -d: -f3)
if [ "$current_cgroup" = "/" ]; then
Expand All @@ -225,16 +227,14 @@ fix_cgroup() {
# See: https://d2iq.com/blog/running-kind-inside-a-kubernetes-cluster-for-continuous-integration
# Capture initial state before modifying
#
# Basically we're looking for the cgroup-path for the cpu controller for the
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

things got re-arranged, and this comment got out of place.

# current process. this tells us what cgroup-path the container is in.
# Then we collect the subsystems that are active on this path.
# Then we collect the subsystems that are active on our current process.
# We assume the cpu controller is in use on all node containers,
# and other controllers use the same sub-path.
#
# See: https://man7.org/linux/man-pages/man7/cgroups.7.html
echo 'INFO: fix cgroup mounts for all subsystems'
local cgroup_subsystems
cgroup_subsystems=$(findmnt -lun -o source,target -t cgroup | grep "${current_cgroup}" | awk '{print $2}')
cgroup_subsystems=$(findmnt -lun -o source,target -t cgroup | grep -F "${current_cgroup}" | awk '{print $2}')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-F / --fixed-strings means match fixed strings, not patterns / regexes.
we're intending to match literal values

# Unmount the cgroup subsystems that are not known to runtime used to
# run the container we are in. Those subsystems are not properly scoped
# (i.e. the root cgroup is exposed, rather than something like docker/xxxx).
Expand All @@ -245,7 +245,7 @@ fix_cgroup() {
#
# See https://github.com/kubernetes/kubernetes/issues/109182
local unsupported_cgroups
unsupported_cgroups=$(findmnt -lun -o source,target -t cgroup | grep_allow_nomatch -v "${current_cgroup}" | awk '{print $2}')
unsupported_cgroups=$(findmnt -lun -o source,target -t cgroup | grep_allow_nomatch -v -F "${current_cgroup}" | awk '{print $2}')
if [ -n "$unsupported_cgroups" ]; then
local mnt
echo "$unsupported_cgroups" |
Expand Down Expand Up @@ -298,9 +298,15 @@ fix_cgroup() {
mount --make-rprivate /sys/fs/cgroup
echo "${cgroup_subsystems}" |
while IFS= read -r subsystem; do
mount_kubelet_cgroup_root "/kubelet" "${subsystem}"
mount_kubelet_cgroup_root "/kubelet.slice" "${subsystem}"
mount_kubelet_cgroup_root /kubelet "${subsystem}"
mount_kubelet_cgroup_root /kubelet.slice "${subsystem}"
done
# workaround for hosts not running systemd
# we only do this for kubelet.slice because it's not relevant when not using
# the systemd cgroup driver
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

functional change here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AkihiroSuda @kolyshkin can you please chime in here, I'm really in deep waters here and I think that a another pair of eyes will be nice for this change ... the cgroup glue keeps growing ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work, wishful thinking :-)

#2766 (comment)

I need to get a v1 host going again and take a peek at what exactly systemd is up to with the pseudo-subsystem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I setup colima on an old personal machine to debug, this and a real slice works 0ffcf8d.

if [[ ! "${cgroup_subsystems}" = */sys/fs/cgroup/systemd* ]]; then
mount_kubelet_cgroup_root /kubelet.slice /sys/fs/cgroup/systemd
fi
}

fix_machine_id() {
Expand Down