I have an EKS cluster on Kubernetes version 1.25, and I am testing an upgrade to version 1.28.
When I launch a GPU node using the aws-eks-gpu-node-1.28 AMI, the nvidia-driver is not installed properly.
With the aws-eks-gpu-node-1.25 AMI, the scripts in /etc/eks run normally and the nvidia-driver is installed.
On a node launched from the aws-eks-gpu-node-1.28 AMI, it looks like this:
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
$ journalctl -u configure-nvidia.service
~~
Apr 04 11:12:59 localhost systemd[1]: Starting Configure NVIDIA instance types...
Apr 04 11:12:59 localhost configure-nvidia.sh[2177]: + gpu-ami-util has-nvidia-devices
Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: true
Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: + /etc/eks/nvidia-kmod-load.sh
Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: true
Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0
Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0
Apr 04 11:13:00 localhost systemd[1]: configure-nvidia.service: main process exited, code=exited, status=1/FAILURE
Apr 04 11:13:00 localhost systemd[1]: Failed to start Configure NVIDIA instance types.
Apr 04 11:13:00 localhost systemd[1]: Unit configure-nvidia.service entered failed state.
Apr 04 11:13:00 localhost systemd[1]: configure-nvidia.service failed.
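For reference, the failing curl is a call to IMDS (169.254.169.254). Once the node has finished booting and the network is up, the endpoint should be reachable and can be checked manually; below is a minimal IMDSv2 check (the exact request configure-nvidia.sh makes is an assumption on my part):
# manual IMDS reachability check (IMDSv2); run on the node after boot
$ TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
$ curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/instance-type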
The service unit in the aws-eks-gpu-node-1.28 AMI is as follows.
The service unit in the aws-eks-gpu-node-1.25 AMI is as follows:
[Unit]
Description=Configure NVIDIA instance types
# the script needs to use IMDS, so wait for the network to be up
# to avoid any flakiness due to races
After=network-online.target
Wants=network-online.target
Before=docker.service containerd.service nvidia-fabricmanager.service nvidia-persistenced.service
[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/etc/eks/configure-nvidia.sh
[Install]
WantedBy=multi-user.target docker.service containerd.service
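To compare the two AMIs directly, the unit as actually installed can be dumped on a node from each AMI and diffed (suggested reproduction steps, not part of the original output above):
# run on a node from each AMI, then diff the two files
$ systemctl cat configure-nvidia.service > configure-nvidia-1.25.service   # on the 1.25 node
$ systemctl cat configure-nvidia.service > configure-nvidia-1.28.service   # on the 1.28 node
$ diff configure-nvidia-1.25.service configure-nvidia-1.28.service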
The difference between the two seems to be that the 1.28 unit no longer waits for network-online.target.
The nvidia-driver installation appears to fail because the query to 169.254.169.254 port 80 (IMDS) is made before the network is up.
I wonder if this removal was intentional.
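As a possible stop-gap until this is resolved (untested, and it assumes the missing ordering is the only relevant difference), a systemd drop-in could restore the dependency:
# /etc/systemd/system/configure-nvidia.service.d/10-network-online.conf
# Hypothetical override: re-add the ordering that the 1.25 unit has.
[Unit]
After=network-online.target
Wants=network-online.target
Since configure-nvidia.service is ordered before containerd, user data may run too late to apply this on first boot, so baking the drop-in into a custom AMI (or rebooting after applying it and running systemctl daemon-reload) may be necessary.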
Environment:
- EKS platform version (aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.11
- Kubernetes version (aws eks describe-cluster --name <name> --query cluster.version): v1.25 and v1.28
- Kernel (uname -a):
- Release information (cat /etc/eks/release on a node):
  BUILD_TIME="Fri Feb 17 21:58:10 UTC 2023"
  BUILD_KERNEL="5.10.165-143.735.amzn2.x86_64"
  ARCH="x86_64"