feat: NGC+ Image Template #235

Merged
MikhailKardash merged 31 commits into main from ngc_images
Mar 7, 2024

@MikhailKardash MikhailKardash commented Nov 18, 2023

Description

  1. Add experimental ngc+ images
  2. Add version-matrix.yaml
  3. Remove tf28 and CUDA111 images
  4. Update install_google_cloud.sdk

This PR also moves some Dockerfile steps into scripts and moves some framework-based pip installs into requirements files. These updates are backported to our existing images.
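As a rough illustration of moving a framework-specific pip install out of a Dockerfile and into a script driven by a requirements file, a minimal sketch might look like the following. The function name `install_requirements` and the file name in the usage comment are hypothetical and are not taken from this PR.

```shell
#!/bin/sh
# Hypothetical helper: install Python packages listed in a requirements file.
# Returns 1 without calling pip if the file does not exist.
install_requirements() {
    req_file="$1"
    if [ ! -f "$req_file" ]; then
        echo "requirements file not found: $req_file" >&2
        return 1
    fi
    # --no-cache-dir keeps pip's download cache out of the image layer.
    python3 -m pip install --no-cache-dir -r "$req_file"
}

# Example call from a Dockerfile RUN step (file name is an assumption):
#   install_requirements /tmp/additional-requirements-torch.txt
```

Keeping the package list in a file means the same script can serve several images, with only the requirements file differing per framework.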

Checklist

  • Bump VERSION to make sure the pushed images are tagged with the right version.
  • Licenses should be included for new code which was copied and/or modified from any external code.
  • Test the images by running the test bumpenvs procedure in the determined repo. See README.

@cla-bot cla-bot bot added the cla-signed label Nov 18, 2023
@MikhailKardash MikhailKardash changed the title from "Ngc images" to "feat: NGC+ Image Template" Nov 18, 2023
@MikhailKardash MikhailKardash marked this pull request as draft November 20, 2023 17:46
@MikhailKardash MikhailKardash marked this pull request as ready for review November 30, 2023 22:43
@@ -19,6 +19,6 @@ cd /tmp && \
./packages/build-deb-packages.sh -t -k -d && \
cd / && \
dpkg -i \
/tmp/gdrcopy-$GDR_VER/libgdrapi_$GDR_VER-1_amd64.Ubuntu20_04.deb && \
Contributor

Is this necessary anymore? The base NGC images I used for the GH200 nodes already had libgdr* in /usr.

Contributor Author

I think we need this for our other image offerings (i.e., PyTorch 2.0.1 and PyTorch 1.12), but this can be removed once we make these NGC images the default.
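
One way to make the gdrcopy step conditional, so that images whose base already ships libgdr* can skip it, is sketched below. This is only an assumption about how such a guard could look; the helper name `has_libgdrapi` and the file name pattern are hypothetical and are not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (returns 0) if a libgdrapi shared library
# exists anywhere under the given prefix (default: /usr).
has_libgdrapi() {
    prefix="${1:-/usr}"
    # find prints one path per match; a non-empty result means it was found.
    [ -n "$(find "$prefix" -name 'libgdrapi*.so*' 2>/dev/null)" ]
}

if has_libgdrapi /usr; then
    echo "libgdrapi already present, skipping gdrcopy package install"
else
    echo "libgdrapi not found, gdrcopy package install still needed"
fi
```

A guard like this would let the same build script serve both the NGC-based images and the older PyTorch images mentioned above.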

Contributor

@rb-determined-ai rb-determined-ai left a comment

Sick. These ngc images are exemplary of how our docker files ought to look. So much easier for users and developers alike.

Nice work.

@MikhailKardash MikhailKardash merged commit 03ae7d7 into main Mar 7, 2024
1 of 2 checks passed
@keita-determined keita-determined deleted the ngc_images branch March 8, 2024 23:05
soohoonchoi added a commit that referenced this pull request Mar 29, 2024
* fix: Tensorboard Profilers (#240)

* Add tensorboard profilers back into images

* we don't need 3.9 yet

* wrong tag

* build hpc/ngc together and update makefile

* version matrix and update comment

* profiler arg relocation

* address some duplicates

* formatting and libnss

* yaml formatting

* use actual yaml linter

* relocate again

* backport additional-requirements-torch and bump VERSION

* additional-requirements for tf

* bash syntax

* cleanup dockerfiles, remove duplicate publishing steps, correct a dockerfile

* try different syntax

* semicolons

* version pin and revert

* pip

* try python 3.10

* maybe it's a concurrency thing

* no more version pin

* ngc dockerfile cleanup

* bump version file, minor formatting, publish artifacts

* debian frontend google

* google_cloud_cli...

* cloud cli?

* minor cleanup

* version-matrix update and lots of formatting

* unparametrize deepspeed

* oops

* update nvidia drivers to 535.161.07 (#246)

minor version upgrade

* version bump

* feat: NGC+ Image Template (#235)

* add templates for NGC+ images

* add image matrix

* backport lots of improvements to scripts

* remove tf2.8 images

* Removed a duplicate line

* Added WITH_NCCL option to the Dockerfile-ngc-tf

---------

Co-authored-by: Michael Kardash
Co-authored-by: Hamid Zare
soohoonchoi added a commit that referenced this pull request Apr 30, 2024
* fix: Tensorboard Profilers (#240)

* Add tensorboard profilers back into images

* update nvidia drivers to 535.161.07 (#246)

minor version upgrade

* feat: NGC+ Image Template (#235)

* add templates for NGC+ images

* add image matrix

* backport lots of improvements to scripts

* remove tf2.8 images

* fix: dependabot alert for `jupyterlab-3.6.7`. (#241)

* Add support to build the tf2-gpu image for Libfabric(OFI) (#251)

* Add support to build the tf2-gpu image for Libfabric(OFI), which incorporates the AWS libfabric plug-in for NCCL to use on Slingshot 11(SS11) networks.

* Increment the VERSION to 0.31.1

* feat: update ngc version (#253)

* feat: update ngc version

* feat: Update naming (#252)

* renaming a bunch of stuff, removing py3.8 and old cuda, changes to CI job names

---------

Co-authored-by: Michael Kardash
Co-authored-by: Hamid Zare
Co-authored-by: Ilia Glazkov
Co-authored-by: Jerry G
5 participants