cleanup efa installer archive before install #6870
Sorry - why was this closed? |
Hi, could you please reopen this? We just spent a ton of money (and many hours) bringing up clusters with broken efa because a subset of nodes didn't have efa. Why? Because of this issue:
Error downloading packages:
gcc-7.3.1-17.amzn2.x86_64: Insufficient space in download directory /var/cache/yum/x86_64/2/amzn2-core/packages
* free 11 M
* needed 22 M
Error: Failed to install packages.
Error: failed to install EFA packages, exiting
/var/lib/cloud/instances/i-08e860474f7d2683a/boothooks/part-001: line 7: pop: command not found
/usr/bin/cloud-init-per: line 63: /opt/amazon/efa/bin/fi_info: No such file or directory
The archive needs to be cleaned up. This can't keep happening. I opened this over a year ago and I don't understand why it's been ignored and closed. What do you need from me? |
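The "Insufficient space" failure in the log above could be caught earlier with a free-space check before the package install. The following is only a sketch: the function name `has_free_space_mb` is hypothetical and is not part of eksctl or the EFA installer.

```shell
#!/bin/bash
# Hypothetical helper (not part of eksctl): succeed only if the filesystem
# holding directory "$1" has at least "$2" megabytes available.
has_free_space_mb() {
  local dir="$1"
  local needed_mb="$2"
  local avail_kb

  # df -Pk prints POSIX-format output in 1024-byte blocks;
  # column 4 of the second line is the available space.
  avail_kb=$(df -Pk "$dir" 2>/dev/null | awk 'NR==2 {print $4}')

  # An empty result means df could not inspect the directory.
  [ -n "$avail_kb" ] || return 1
  [ "$avail_kb" -ge $(( needed_mb * 1024 )) ]
}
```

For example, `has_free_space_mb /var/cache/yum 22 || echo "not enough space for yum downloads" >&2` would flag the situation shown in the log (11 M free, 22 M needed) before the install is attempted.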
@vsoch, please give us some time, we'll prioritize reviewing this PR. |
Thank you! Much appreciated. I looked at the code and I think the change needs to be added to two additional files - I'll do that shortly. |
747bf6d to d4eb786
Updated to include the same change in the 2023 files. I also tested this this evening and it fixed the issue I posted above - my cluster came up with all efa nodes. I'll need to rerun the experiments for the two clusters that failed tonight, but with less funds now, tomorrow. Thanks for the help and TBA speedy review! |
d4eb786 to a47a32a
Hi @cPu1 your bot closed the PR again! |
Sorry about that, I am not in favor of having this stale bot. We'll discuss this more. As for the PR, we will try to get this reviewed and released by next week. |
Thank you! And no worries about stale bot - it can be very helpful. I'm subscribed to the thread and am good to ping when it needs to be reopened. |
@vsoch, can you please rebase? This is good to merge. |
Currently, the UserData section that runs during cloud-init happens before any root volumes are expanded with growpart. Although the best solution would be to ensure the filesystem resize happens before these scripts are run, a quick fix for the current issue is simply to clean up the efa installer tar.gz, which is very large. I have tested this with hpc7g for a size 2 and a size 8 cluster (previously both not working) and can confirm the devices are functioning afterwards.
Signed-off-by: vsoch <[email protected]>
Problem: the node consistently runs out of disk space when adding efa, resulting in an unusable cluster with scattered nodes where the installer failed.
Solution: the installer archive itself is huge, and we can simply remove it and avoid this error.
Signed-off-by: vsoch <[email protected]>
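The idea in the commit messages above (unpack the installer, then delete the archive before anything is installed) can be sketched as follows. This is an illustration under assumptions, not the actual diff from this PR: the function name `extract_and_remove_archive` and the example paths are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not the actual change in this PR): unpack an installer
# archive, then delete the archive right away so it no longer occupies disk
# space while packages are downloaded and installed.
extract_and_remove_archive() {
  local archive="$1"   # path to the .tar.gz (assumed, e.g. somewhere under /tmp)
  local dest="$2"      # directory to extract into

  mkdir -p "$dest" || return 1

  # Only delete the archive if extraction succeeded, so that a failed
  # extraction can still be retried or inspected.
  tar -xf "$archive" -C "$dest" || return 1
  rm -f "$archive"
}
```

A boothook could call it as `extract_and_remove_archive /tmp/installer.tar.gz /tmp` before running the installer script from the extracted directory; both paths here are placeholders.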
a47a32a to 2079032
@cPu1 all set! |
Thanks, Vanessasaurus! |
Thank you @cPu1 , rawr! 🦖 |
Currently, the UserData section that runs during cloud-init happens before any root volumes are expanded with growpart. Although the best solution would be to ensure the filesystem resize happens before these scripts are run, a quick fix for the current issue is simply to clean up the efa installer tar.gz, which is very large. I have tested this with hpc7g for a size 2 and a size 8 cluster (previously both not working) and can confirm the devices are functioning afterwards.
And logs for a running node (what they should look like!)
This will close #6869
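One way to confirm on a node that the install worked is to query libfabric for the EFA provider, which is the step that failed in the error log earlier in this thread. A minimal sketch, assuming the `fi_info` path from that log; the function name `efa_provider_available` is hypothetical.

```shell
#!/bin/bash
# Hypothetical check: report whether libfabric's fi_info can see the EFA provider.
efa_provider_available() {
  # Default path taken from the error log in this thread; it can be
  # overridden by passing a different path as the first argument.
  local fi_info="${1:-/opt/amazon/efa/bin/fi_info}"

  if [ ! -x "$fi_info" ]; then
    echo "fi_info not found at $fi_info (EFA install likely failed)" >&2
    return 1
  fi

  # fi_info exits non-zero when no matching provider is found.
  "$fi_info" -p efa > /dev/null 2>&1
}
```

Running `efa_provider_available || echo "node is missing EFA" >&2` on each node would surface the scattered failures described above.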