I have an EKS cluster on Kubernetes version 1.25, and I am testing an upgrade to version 1.28.
When I launch a GPU node using the aws-eks-gpu-node-1.28 AMI, the nvidia-driver is not installed properly.
With the aws-eks-gpu-node-1.25 AMI, the scripts in /etc/eks run normally and the nvidia-driver is installed.
On a node launched from the aws-eks-gpu-node-1.28 AMI, it looks like this:
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
$ journalctl -u configure-nvidia.service
~~
Apr 04 11:12:59 localhost systemd[1]: Starting Configure NVIDIA instance types...
Apr 04 11:12:59 localhost configure-nvidia.sh[2177]: + gpu-ami-util has-nvidia-devices
Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: true
Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: + /etc/eks/nvidia-kmod-load.sh
Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: true
Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0
Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0
Apr 04 11:13:00 localhost systemd[1]: configure-nvidia.service: main process exited, code=exited, status=1/FAILURE
Apr 04 11:13:00 localhost systemd[1]: Failed to start Configure NVIDIA instance types.
Apr 04 11:13:00 localhost systemd[1]: Unit configure-nvidia.service entered failed state.
Apr 04 11:13:00 localhost systemd[1]: configure-nvidia.service failed.
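For reference, the failing curl is a call to IMDS (169.254.169.254). Once the node has finished booting and the network is up, the endpoint should be reachable and can be checked manually; below is a minimal IMDSv2 check (the exact request configure-nvidia.sh makes is an assumption on my part):
# manual IMDS reachability check (IMDSv2); run on the node after boot
$ TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
$ curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/instance-type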
The service unit in the aws-eks-gpu-node-1.28 AMI is as follows.
The service unit in the aws-eks-gpu-node-1.25 AMI is as follows:
[Unit]
Description=Configure NVIDIA instance types
# the script needs to use IMDS, so wait for the network to be up
# to avoid any flakiness due to races
After=network-online.target
Wants=network-online.target
Before=docker.service containerd.service nvidia-fabricmanager.service nvidia-persistenced.service
[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/etc/eks/configure-nvidia.sh
[Install]
WantedBy=multi-user.target docker.service containerd.service
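To compare the two AMIs directly, the unit as actually installed can be dumped on a node from each AMI and diffed (suggested reproduction steps, not part of the original output above):
# run on a node from each AMI, then diff the two files
$ systemctl cat configure-nvidia.service > configure-nvidia-1.25.service   # on the 1.25 node
$ systemctl cat configure-nvidia.service > configure-nvidia-1.28.service   # on the 1.28 node
$ diff configure-nvidia-1.25.service configure-nvidia-1.28.service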
The difference between the two seems to be that the 1.28 unit no longer waits for network-online.target.
The nvidia-driver installation appears to fail because the query to 169.254.169.254 port 80 (IMDS) is made before the network is up.
I wonder if this removal was intentional.
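As a possible stop-gap until this is resolved (untested, and it assumes the missing ordering is the only relevant difference), a systemd drop-in could restore the dependency:
# /etc/systemd/system/configure-nvidia.service.d/10-network-online.conf
# Hypothetical override: re-add the ordering that the 1.25 unit has.
[Unit]
After=network-online.target
Wants=network-online.target
Since configure-nvidia.service is ordered before containerd, user data may run too late to apply this on first boot, so baking the drop-in into a custom AMI (or rebooting after applying it and running systemctl daemon-reload) may be necessary.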
Environment:
- EKS platform version (aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.11
- Kubernetes version (aws eks describe-cluster --name <name> --query cluster.version): v1.25 and v1.28
- Kernel (uname -a):
- Release information (cat /etc/eks/release on a node):
  BUILD_TIME="Fri Feb 17 21:58:10 UTC 2023"
  BUILD_KERNEL="5.10.165-143.735.amzn2.x86_64"
  ARCH="x86_64"