(3.11.x) Job submission failure with Amazon Linux 2023 #6571
Comments
Hi team, my team is blocked on this issue as well. Do you have an estimate for when the patch will be released?
Following up here as well. Does this issue manifest for the official ParallelCluster AMIs for Ubuntu 20.04 or 22.04? Thanks!
Hi @adebayoj, to the best of our knowledge the issue affects only Amazon Linux 2023.
@Bingjiling we are actively working on fixing this issue.
@gmarciani Thanks for the update! I will try to use the Amazon Linux 2 AMI instead.
AWS ParallelCluster 3.12.0 has been released.
The issue
We have discovered an issue in the official ParallelCluster AMI for Amazon Linux 2023 that consistently leads to job submission failure on p4 compute nodes.
If your cluster is affected by this issue, you will experience job submission failures caused by compute nodes failing to bootstrap. The bootstrap error is:
After consistent bootstrap errors, the cluster is eventually set to protected mode, where the partitions are deactivated. See here how to recover from protected mode.
We are investigating the root cause that prevents the nvidia-fabricmanager service from starting. The issue affects NVIDIA driver 550.90.07 on Amazon Linux 2023. This driver version is included in the ParallelCluster 3.11.0 and 3.11.1 AMIs.
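To make the conditions above easier to check against an existing cluster, here is a minimal Python sketch. It only encodes what this issue states (ParallelCluster 3.11.0 or 3.11.1, Amazon Linux 2023, p4 compute nodes). The function name, the constants, and the `alinux2023` OS identifier are assumptions made for illustration and are not part of any ParallelCluster API.

```python
# Conditions taken from the issue description. All names here are hypothetical.
AFFECTED_PCLUSTER_VERSIONS = {"3.11.0", "3.11.1"}
AFFECTED_OS = "alinux2023"  # assumed identifier for Amazon Linux 2023
AFFECTED_INSTANCE_FAMILY_PREFIX = "p4"


def is_likely_affected(pcluster_version: str, os_name: str, instance_type: str) -> bool:
    """Return True if the combination matches the conditions described in this issue."""
    # "p4d.24xlarge" -> "p4d"; only the family part before the dot is compared.
    family = instance_type.split(".", 1)[0]
    return (
        pcluster_version in AFFECTED_PCLUSTER_VERSIONS
        and os_name == AFFECTED_OS
        and family.startswith(AFFECTED_INSTANCE_FAMILY_PREFIX)
    )


if __name__ == "__main__":
    print(is_likely_affected("3.11.1", "alinux2023", "p4d.24xlarge"))
```

A match only means the cluster fits the reported conditions. Confirming the problem still requires looking at the bootstrap error on the compute node itself.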
Affected versions (OSes, schedulers)
Mitigation
You can find a detailed explanation and the mitigation of the problem here: (3.11.x) Job submission failure with Amazon Linux 2023