客户遇到一个很奇怪的问题,明明在sysctl.conf里面配置了net.netfilter.nf_conntrack_max参数,但是重启以后,这个参数还是没有生效,这个问题是什么原因呢?
答案在红帽的知识库里面
我们知道,sysctl的配置说是在系统启动的时候,由systemd-sysctl.service加载的。而客户环境里面,又有docker,docker因为会调用iptables的nat功能,从而隐形的加载内核模块nf_conntrack。我们猜测,应该是docker服务和systemd-sysctl服务,在系统启动的时候,相互配合有问题,才造成了我们内核模块的参数加载失败的问题。
接下来,我们做个实验来看看。
我们装一台centos7,关闭firewalld,再安装社区版本的docker-ce。注意,我们在这里,并没有激活docker服务的自动启动。
# disable firewalld
systemctl disable --now firewalld
# install docker ce
yum install -y yum-utils
yum-config-manager \
--add-repo \
https://download.docker.com/linux/centos/docker-ce.repo
yum install -y docker-ce docker-ce-cli containerd.io
# reboot is important
reboot
系统重启以后,我们看看是否加载了内核模块nf_conntrack,并且看看有没有对应的参数。
# check nf_conntrack module status
lsmod | grep nf_conntrack
# nothing
sysctl net.netfilter.nf_conntrack_max
# sysctl: cannot stat /proc/sys/net/netfilter/nf_conntrack_max: No such file or directory
可以看到,没有加载nf_conntrack模块,而且没有对应的参数。那么我们手动启动docker服务,然后再看看。
# enable docker service and check nf_conntrack again
systemctl start docker
lsmod | grep nf_conntrack
# nf_conntrack_netlink 36396 0
# nf_conntrack_ipv4 15053 2
# nf_defrag_ipv4 12729 1 nf_conntrack_ipv4
# nf_conntrack 139264 6 nf_nat,nf_nat_ipv4,xt_conntrack,nf_nat_masquerade_ipv4,nf_conntrack_netlink,nf_conntrack_ipv4
# libcrc32c 12644 2 nf_nat,nf_conntrack
sysctl net.netfilter.nf_conntrack_max
# net.netfilter.nf_conntrack_max = 65536
# check we didn't set the parameter for net.netfilter.nf_conntrack_max
find /etc -type f -exec grep -H nf_conntrack_max {} \;
# nothing
可以看到,加载了docker服务以后,内核模块nf_conntrack已经加载了,而且参数net.netfilter.nf_conntrack_max已经设置了。我们查找了一下/etc目录下面,发现没有配置net.netfilter.nf_conntrack_max的参数。
那么我们可以得出结论,不做任何配置,nf_conntrack内核模块,只有在docker服务启动的时候,才会加载到内核。如果systemd-sysctl这个服务,在系统启动的时候,早于docker服务,那么他就无法设置参数net.netfilter.nf_conntrack_max,就会造成我们所看到的故障。
接下来,我们确认一下我们的猜测,我们罗列一下,有哪些服务是在docker服务启动之前,必须要启动的。或者说,docker服务依赖哪些别的服务。
# check systemd init sequence
systemctl enable --now docker
# we can see, docker start after systemd-sysctl
systemctl list-dependencies docker
# docker.service
# ● ├─containerd.service
# ● ├─docker.socket
# ● ├─system.slice
# ● ├─basic.target
# ● │ ├─microcode.service
# ● │ ├─rhel-dmesg.service
# ● │ ├─[email protected]
# ● │ ├─paths.target
# ● │ ├─slices.target
# ● │ │ ├─-.slice
# ● │ │ └─system.slice
# ● │ ├─sockets.target
# ● │ │ ├─dbus.socket
# ● │ │ ├─systemd-initctl.socket
# ● │ │ ├─systemd-journald.socket
# ● │ │ ├─systemd-shutdownd.socket
# ● │ │ ├─systemd-udevd-control.socket
# ● │ │ └─systemd-udevd-kernel.socket
# ● │ ├─sysinit.target
# ● │ │ ├─dev-hugepages.mount
# ● │ │ ├─dev-mqueue.mount
# ● │ │ ├─kmod-static-nodes.service
# ● │ │ ├─plymouth-read-write.service
# ● │ │ ├─plymouth-start.service
# ● │ │ ├─proc-sys-fs-binfmt_misc.automount
# ● │ │ ├─rhel-autorelabel-mark.service
# ● │ │ ├─rhel-autorelabel.service
# ● │ │ ├─rhel-domainname.service
# ● │ │ ├─rhel-import-state.service
# ● │ │ ├─rhel-loadmodules.service
# ● │ │ ├─sys-fs-fuse-connections.mount
# ● │ │ ├─sys-kernel-config.mount
# ● │ │ ├─sys-kernel-debug.mount
# ● │ │ ├─systemd-ask-password-console.path
# ● │ │ ├─systemd-binfmt.service
# ● │ │ ├─systemd-firstboot.service
# ● │ │ ├─systemd-hwdb-update.service
# ● │ │ ├─systemd-journal-catalog-update.service
# ● │ │ ├─systemd-journal-flush.service
# ● │ │ ├─systemd-journald.service
# ● │ │ ├─systemd-machine-id-commit.service
# ● │ │ ├─systemd-modules-load.service
# ● │ │ ├─systemd-random-seed.service
# ● │ │ ├─systemd-sysctl.service
# ● │ │ ├─systemd-tmpfiles-setup-dev.service
# ● │ │ ├─systemd-tmpfiles-setup.service
# ● │ │ ├─systemd-udev-trigger.service
# ● │ │ ├─systemd-udevd.service
# ● │ │ ├─systemd-update-done.service
# ● │ │ ├─systemd-update-utmp.service
# ● │ │ ├─systemd-vconsole-setup.service
# ● │ │ ├─cryptsetup.target
# ● │ │ ├─local-fs.target
# ● │ │ │ ├─-.mount
# ● │ │ │ ├─rhel-readonly.service
# ● │ │ │ ├─systemd-fsck-root.service
# ● │ │ │ └─systemd-remount-fs.service
# ● │ │ └─swap.target
# ● │ └─timers.target
# ● │ └─systemd-tmpfiles-clean.timer
# ● └─network-online.target
我们可以很清晰的看到,systemd-sysctl.service服务是在docker服务启动之前,必须要启动的。
红帽知识库给出了解决办法,就是使用系统自带的rhel-loadmodules.service服务,这个服务,在systemd-sysctl.service启动之前启动,我们来看看他的内容。
systemctl cat rhel-loadmodules.service
# # /usr/lib/systemd/system/rhel-loadmodules.service
# [Unit]
# Description=Load legacy module configuration
# DefaultDependencies=no
# Conflicts=shutdown.target
# After=systemd-readahead-collect.service systemd-readahead-replay.service
# Before=sysinit.target shutdown.target
# ConditionPathExists=|/etc/rc.modules
# ConditionDirectoryNotEmpty=|/etc/sysconfig/modules/
# [Service]
# ExecStart=/usr/lib/systemd/rhel-loadmodules
# Type=oneshot
# TimeoutSec=0
# RemainAfterExit=yes
# [Install]
# WantedBy=sysinit.target
我们可以看到,rhel-loadmodules.service服务,检查/etc/sysconfig/modules/目录是否存在,如果存在,就运行程序/usr/lib/systemd/rhel-loadmodules。那我们看看/usr/lib/systemd/rhel-loadmodules的内容。
cat /usr/lib/systemd/rhel-loadmodules
# #!/bin/bash
# # Load other user-defined modules
# for file in /etc/sysconfig/modules/*.modules ; do
# [ -x $file ] && $file
# done
# # Load modules (for backward compatibility with VARs)
# if [ -f /etc/rc.modules ]; then
# /etc/rc.modules
# fi
/usr/lib/systemd/rhel-loadmodules的内容很简单,就是遍历/etc/sysconfig/modules/目录下的所有*.modules文件,如果文件存在,就运行它。
那么我们就有了解决办法,创建对应的module文件,让这些内核模块,在系统启动的早期就加载了,这样之后的systemd-sysctl.service服务就可以正常的去设置参数了。
echo "modprobe nf_conntrack" >> /etc/sysconfig/modules/nf_conntrack.modules && chmod 775 /etc/sysconfig/modules/nf_conntrack.modules
echo "net.netfilter.nf_conntrack_max = 2097152" >> /etc/sysctl.d/99-nf_conntrack.conf
reboot
重启以后,我们坚持一下系统状态,如我们所预期的,一切正常。
systemctl status rhel-loadmodules.service
# ● rhel-loadmodules.service - Load legacy module configuration
# Loaded: loaded (/usr/lib/systemd/system/rhel-loadmodules.service; enabled; vendor preset: enabled)
# Active: active (exited) since Sun 2022-01-02 08:08:22 UTC; 40s ago
# Process: 350 ExecStart=/usr/lib/systemd/rhel-loadmodules (code=exited, status=0/SUCCESS)
# Main PID: 350 (code=exited, status=0/SUCCESS)
# Tasks: 0
# Memory: 0B
# CGroup: /system.slice/rhel-loadmodules.service
# Jan 02 08:08:22 vultr.guest systemd[1]: Started Load legacy module configuration.
sysctl net.netfilter.nf_conntrack_max
# net.netfilter.nf_conntrack_max = 2097152
面对sysctl参数无法加载的情况,我们第一时间想到的,可能就是rc-local.service服务,这个服务读取/etc/rc.local文件,并执行这个文件中的内容。不过,我们发现在这里面的sysctl -w 命令,依然不起作用,我们猜测,这是由于rc-local.service没有在docker.service之前执行导致的。
那么我们就来确认一下。
systemctl cat rc-local
# # /usr/lib/systemd/system/rc-local.service
# # This file is part of systemd.
# #
# # systemd is free software; you can redistribute it and/or modify it
# # under the terms of the GNU Lesser General Public License as published by
# # the Free Software Foundation; either version 2.1 of the License, or
# # (at your option) any later version.
# # This unit gets pulled automatically into multi-user.target by
# # systemd-rc-local-generator if /etc/rc.d/rc.local is executable.
# [Unit]
# Description=/etc/rc.d/rc.local Compatibility
# ConditionFileIsExecutable=/etc/rc.d/rc.local
# After=network.target
# [Service]
# Type=forking
# ExecStart=/etc/rc.d/rc.local start
# TimeoutSec=0
# RemainAfterExit=yes
systemctl list-dependencies rc-local
# rc-local.service
# ● ├─system.slice
# ● └─basic.target
# ● ├─microcode.service
# ● ├─rhel-dmesg.service
# ● ├─[email protected]
# ● ├─paths.target
# ● ├─slices.target
# ● │ ├─-.slice
# ● │ └─system.slice
# ● ├─sockets.target
# ● │ ├─dbus.socket
# ● │ ├─systemd-initctl.socket
# ● │ ├─systemd-journald.socket
# ● │ ├─systemd-shutdownd.socket
# ● │ ├─systemd-udevd-control.socket
# ● │ └─systemd-udevd-kernel.socket
# ● ├─sysinit.target
# ● │ ├─dev-hugepages.mount
# ● │ ├─dev-mqueue.mount
# ● │ ├─kmod-static-nodes.service
# ● │ ├─plymouth-read-write.service
# ● │ ├─plymouth-start.service
# ● │ ├─proc-sys-fs-binfmt_misc.automount
# ● │ ├─rhel-autorelabel-mark.service
# ● │ ├─rhel-autorelabel.service
# ● │ ├─rhel-domainname.service
# ● │ ├─rhel-import-state.service
# ● │ ├─rhel-loadmodules.service
# ● │ ├─sys-fs-fuse-connections.mount
# ● │ ├─sys-kernel-config.mount
# ● │ ├─sys-kernel-debug.mount
# ● │ ├─systemd-ask-password-console.path
# ● │ ├─systemd-binfmt.service
# ● │ ├─systemd-firstboot.service
# ● │ ├─systemd-hwdb-update.service
# ● │ ├─systemd-journal-catalog-update.service
# ● │ ├─systemd-journal-flush.service
# ● │ ├─systemd-journald.service
# ● │ ├─systemd-machine-id-commit.service
# ● │ ├─systemd-modules-load.service
# ● │ ├─systemd-random-seed.service
# ● │ ├─systemd-sysctl.service
# ● │ ├─systemd-tmpfiles-setup-dev.service
# ● │ ├─systemd-tmpfiles-setup.service
# ● │ ├─systemd-udev-trigger.service
# ● │ ├─systemd-udevd.service
# ● │ ├─systemd-update-done.service
# ● │ ├─systemd-update-utmp.service
# ● │ ├─systemd-vconsole-setup.service
# ● │ ├─cryptsetup.target
# ● │ ├─local-fs.target
# ● │ │ ├─-.mount
# ● │ │ ├─rhel-readonly.service
# ● │ │ ├─systemd-fsck-root.service
# ● │ │ └─systemd-remount-fs.service
# ● │ └─swap.target
# ● └─timers.target
# ● └─systemd-tmpfiles-clean.timer
我们可以看到,rc-local服务,和docker服务,没有前后关系,这就导致了rc-local无法确保在docker.serivce之后运行,这样就会导致相关内核模块没有加载,进而导致sysctl命令无法加载。
systemctl list-unit-files | grep docker
# docker.service enabled
# docker.socket disabled
# https://stackoverflow.com/questions/29309717/is-there-any-way-to-list-systemd-services-in-linux-in-the-order-of-they-were-l#fromHistory
# systemd-analyze plot > startup_order.svg
yum install -y graphviz
systemd-analyze dot | dot -Tsvg > systemd.svg
# Color legend: black = Requires
# dark blue = Requisite
# dark grey = Wants
# red = Conflicts
# green = After
# CONNTRACK_MAX = 连接跟踪表大小 (HASHSIZE) * Bucket 大小 (bucket size)
modinfo nf_conntrack
# filename: /lib/modules/3.10.0-1160.49.1.el7.x86_64/kernel/net/netfilter/nf_conntrack.ko.xz
# license: GPL
# retpoline: Y
# rhelversion: 7.9
# srcversion: 358A2186187A7E81339334C
# depends: libcrc32c
# intree: Y
# vermagic: 3.10.0-1160.49.1.el7.x86_64 SMP mod_unload modversions
# signer: CentOS Linux kernel signing key
# sig_key: 77:15:99:7F:C4:81:91:84:C7:45:27:B6:08:4B:C7:F9:BB:15:62:7D
# sig_hashalgo: sha256
# parm: tstamp:Enable connection tracking flow timestamping. (bool)
# parm: acct:Enable connection tracking flow accounting. (bool)
# parm: nf_conntrack_helper:Enable automatic conntrack helper assignment (default 1) (bool)
# parm: expect_hashsize:uint
cat /proc/sys/net/nf_conntrack_max
# 65536
cat /proc/sys/net/netfilter/nf_conntrack_max
# 65536
cat /proc/sys/net/netfilter/nf_conntrack_buckets
# 16384
# so the bucket size = 4
echo "`cat /proc/sys/net/netfilter/nf_conntrack_max` / `cat /proc/sys/net/netfilter/nf_conntrack_buckets`" | bc
# 4