Differences in installation results depending on the version of the aws-eks-gpu-node AMI image #1748

Closed
loki-shin opened this issue Apr 4, 2024 · 1 comment

Comments

@loki-shin

I have an EKS cluster running Kubernetes 1.25, and I am testing an upgrade to 1.28.
When I launch a GPU node from the aws-eks-gpu-node-1.28 AMI, the NVIDIA driver is not installed properly.

With the aws-eks-gpu-node-1.25 AMI, the scripts in /etc/eks run normally and the NVIDIA driver is installed.

A node launched from the aws-eks-gpu-node-1.28 AMI looks like this:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
$ journalctl -u configure-nvidia.service
~~
 Apr 04 11:12:59 localhost systemd[1]: Starting Configure NVIDIA instance types...
 Apr 04 11:12:59 localhost configure-nvidia.sh[2177]: + gpu-ami-util has-nvidia-devices
 Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: true
 Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: + /etc/eks/nvidia-kmod-load.sh
 Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: true
 Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0
 Apr 04 11:13:00 localhost configure-nvidia.sh[2177]: curl: (7) Failed to connect to 169.254.169.254 port 80 after 0
 Apr 04 11:13:00 localhost systemd[1]: configure-nvidia.service: main process exited, code=exited, status=1/FAILURE
 Apr 04 11:13:00 localhost systemd[1]: Failed to start Configure NVIDIA instance types.
 Apr 04 11:13:00 localhost systemd[1]: Unit configure-nvidia.service entered failed state.
 Apr 04 11:13:00 localhost systemd[1]: configure-nvidia.service failed.

The configure-nvidia.service unit in the aws-eks-gpu-node-1.28 AMI is as follows.

$ cat /etc/systemd/system/configure-nvidia.service
[Unit]
Description=Configure NVIDIA instance types
Before=docker.service containerd.service nvidia-fabricmanager.service nvidia-persistenced.service

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/etc/eks/configure-nvidia.sh

[Install]
WantedBy=multi-user.target docker.service containerd.service


The configure-nvidia.service unit in the aws-eks-gpu-node-1.25 AMI is as follows.

[Unit]
Description=Configure NVIDIA instance types
# the script needs to use IMDS, so wait for the network to be up
# to avoid any flakiness due to races
After=network-online.target
Wants=network-online.target
Before=docker.service containerd.service nvidia-fabricmanager.service nvidia-persistenced.service

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/etc/eks/configure-nvidia.sh

[Install]
WantedBy=multi-user.target docker.service containerd.service
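
A quick way to compare the two unit files above is to check each one for ordering on network-online.target. The sketch below does that with grep; the function name has_network_ordering is only an illustration and is not part of the AMI.

```shell
#!/bin/sh
# Report whether a systemd unit file both orders itself after
# network-online.target and pulls that target in.
# has_network_ordering is a hypothetical helper name, not an AMI script.
# Example call (path taken from the listing above):
#   has_network_ordering /etc/systemd/system/configure-nvidia.service
has_network_ordering() {
    unit_file=$1
    # After= gives the ordering; Wants= makes sure the target is started.
    grep -Eq '^After=.*network-online\.target' "$unit_file" &&
        grep -Eq '^Wants=.*network-online\.target' "$unit_file"
}
```

The function returns 0 only when both directives are present, so it can be used directly in an if statement. On a running node, systemctl show configure-nvidia.service -p After -p Wants should give the same information from systemd's point of view.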

The only difference between the two is that the 1.28 unit no longer has the After= and Wants= dependencies on network-online.target.
It looks like the NVIDIA driver installation fails because the request to 169.254.169.254 port 80 (IMDS) is made before the network is up.
Was this removal intentional?
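
As a possible workaround until a fixed AMI is available, the ordering from the 1.25 unit could be restored with a systemd drop-in. This is an untested sketch, not something shipped with the AMI: the function name write_network_dropin and the file name 10-wait-for-network.conf are made up for illustration, and it assumes the drop-in is placed under the standard /etc/systemd/system/configure-nvidia.service.d/ directory.

```shell
#!/bin/sh
# Sketch of a workaround: restore the network ordering with a systemd drop-in.
# write_network_dropin and "10-wait-for-network.conf" are hypothetical names.
write_network_dropin() {
    # $1: systemd config root, normally /etc/systemd/system
    dropin_dir="$1/configure-nvidia.service.d"
    mkdir -p "$dropin_dir"
    # Same two directives that the 1.25 unit file has and the 1.28 one lacks.
    cat > "$dropin_dir/10-wait-for-network.conf" <<'EOF'
[Unit]
# the script needs to use IMDS, so wait for the network to be up
After=network-online.target
Wants=network-online.target
EOF
}

# Assumed usage on a node, as root:
#   write_network_dropin /etc/systemd/system
#   systemctl daemon-reload
#   systemctl restart configure-nvidia.service
```

Because the ordering only matters at boot, the drop-in would have to be in place before the node starts, for example in a custom AMI. On a node that is already up, restarting the service once the network is reachable may be enough, but I have not verified that.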

Environment:

  • AWS Region: ap-northeast-2
  • Instance Type(s): g4dn.xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.11
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): v1.25 and v1.28
  • AMI Version: aws-eks-gpu-node-1.25 / aws-eks-gpu-node-1.28
  • Kernel (e.g. uname -a):
  • Release information (run cat /etc/eks/release on a node):
    • BASE_AMI_ID="ami-09bffa74b1e396075"
      BUILD_TIME="Fri Feb 17 21:58:10 UTC 2023"
      BUILD_KERNEL="5.10.165-143.735.amzn2.x86_64"
      ARCH="x86_64"
@cartermckinnon
Member

This is fixed in the latest release 👍
