feat: NGC+ Image Template #235
@@ -19,6 +19,6 @@ cd /tmp && \
./packages/build-deb-packages.sh -t -k -d && \
cd / && \
dpkg -i \
/tmp/gdrcopy-$GDR_VER/libgdrapi_$GDR_VER-1_amd64.Ubuntu20_04.deb && \
Is this necessary anymore? The base NGC images I used for the GH200 nodes already had libgdr* in /usr.
I think we need this for our other image offerings (i.e., PyTorch 2.0.1 and PyTorch 1.12), but it can be removed once we make these NGC images the default.
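One way to cover both cases raised in this thread would be to install the gdrcopy package only when the base image does not already provide the library. The following is a minimal sketch under that assumption, not what this PR does: `has_shared_lib` and `install_deb_if_missing` are hypothetical helper names, and the check relies on `ldconfig -p` listing the library, which may not hold for every base image.

```shell
# Sketch only: has_shared_lib and install_deb_if_missing are hypothetical
# helper names, not functions from this repository.

# Succeeds when the dynamic linker cache lists a library matching $1.
has_shared_lib() {
  ldconfig -p 2>/dev/null | grep -qF -- "$1"
}

# Installs the .deb at $2 only when library $1 is not already visible.
install_deb_if_missing() {
  if has_shared_lib "$1"; then
    echo "$1 already present; skipping $2"
  else
    dpkg -i "$2"
  fi
}

# Example call, mirroring the path in the hunk above (left commented out so
# the sketch has no side effects when sourced):
# install_deb_if_missing libgdrapi \
#   "/tmp/gdrcopy-$GDR_VER/libgdrapi_$GDR_VER-1_amd64.Ubuntu20_04.deb"
```

With a guard like this, the same script could serve the NGC images and the older PyTorch images without a separate code path.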
Sick. These NGC images exemplify how our Dockerfiles ought to look. So much easier for users and developers alike.
Nice work.
* fix: Tensorboard Profilers (#240)
* Add tensorboard profilers back into images
* we don't need 3.9 yet
* wrong tag
* build hpc/ngc together and update makefile
* version matrix and update comment
* profiler arg relocation
* address some duplicates
* formatting and libnss
* yaml formatting
* use actual yaml linter
* relocate again
* backport additional-requirements-torch and bump VERSION
* additional-requirements for tf
* bash syntax
* cleanup dockerfiles, remove duplicate publishing steps, correct a dockerfile
* try different syntax
* semicolons
* version pin and revert
* pip
* try python 3.10
* maybe it's a concurrency thing
* no more version pin
* ngc dockerfile cleanup
* bump version file, minor formatting, publish artifacts
* debian frontend google
* google_cloud_cli...
* cloud cli?
* minor cleanup
* version-matrix update and lots of formatting
* unparametrize deepspeed
* oops
* update nvidia drivers to 535.161.07 (#246) minor version upgrade
* version bump
* feat: NGC+ Image Template (#235)
* add templates for NGC+ images
* add image matrix
* backport lots of improvements to scripts
* remove tf2.8 images
* Removed a duplicate line
* Added WITH_NCCL option to the Dockerfile-ngc-tf
---------
Co-authored-by: Michael Kardash <[email protected]>
Co-authored-by: Hamid Zare <[email protected]>

* fix: Tensorboard Profilers (#240)
* Add tensorboard profilers back into images
* update nvidia drivers to 535.161.07 (#246) minor version upgrade
* feat: NGC+ Image Template (#235)
* add templates for NGC+ images
* add image matrix
* backport lots of improvements to scripts
* remove tf2.8 images
* fix: dependabot alert for `jupyterlab-3.6.7`. (#241)
* Add support to build the tf2-gpu image for Libfabric(OFI) (#251)
* Add support to build the tf2-gpu image for Libfabric(OFI), which incorporates the AWS libfabric plug-in for NCCL to use on Slingshot 11(SS11) networks.
* Increment the VERSION to 0.31.1
* feat: update ngc version (#253)
* feat: update ngc version
* feat: Update naming (#252)
* renaming a bunch of stuff, removing py3.8 and old cuda, changes to CI job names
---------
Co-authored-by: Michael Kardash <[email protected]>
Co-authored-by: Hamid Zare <[email protected]>
Co-authored-by: Ilia Glazkov <[email protected]>
Co-authored-by: Jerry G <[email protected]>
Description
version-matrix.yaml
tf28 and CUDA111 images
install_google_cloud.sdk
This PR also moves some Dockerfile steps into scripts and moves some framework-based pip installs into requirements files. These updates are backported to our existing images.
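As a rough illustration of moving framework-based pip installs out of the Dockerfile, a build script could pick a requirements file by framework and install from it. This is a sketch under assumptions: the function names are hypothetical, and the file names are guesses modeled on the `additional-requirements-torch` entry in the commit log (the `.txt` extension and the `tf` file name are not confirmed by this PR).

```shell
# Sketch only: function names and requirements file names are assumptions.

# Prints the requirements file for framework $1; fails on unknown frameworks.
requirements_file_for() {
  case "$1" in
    torch) echo "additional-requirements-torch.txt" ;;
    tf)    echo "additional-requirements-tf.txt" ;;
    *)     echo "unknown framework: $1" >&2; return 1 ;;
  esac
}

# Installs the extra packages for framework $1 from its requirements file.
install_framework_requirements() {
  req="$(requirements_file_for "$1")" || return 1
  python3 -m pip install --no-cache-dir -r "$req"
}

# Example call from a Dockerfile RUN step or build script (commented out so
# the sketch has no side effects when sourced):
# install_framework_requirements torch
```

Keeping the package lists in requirements files means the existing images and the NGC images can share one install step and differ only in the file they point at.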
Checklist
bumpenvs procedure in the determined repo. See README.