feat: NGC+ Image Template #235

Merged on Mar 7, 2024 (31 commits). Changes shown from 27 commits.

Commits
ab233a1
we don't need 3.9 yet
MikhailKardash Jul 27, 2023
b58146d
wrong tag
MikhailKardash Jan 31, 2024
f81da5c
build hpc/ngc together and update makefile
MikhailKardash Jan 31, 2024
c214eb4
version matrix and update comment
MikhailKardash Feb 1, 2024
899829b
profiler arg relocation
MikhailKardash Feb 2, 2024
f42fa32
address some duplicates
MikhailKardash Feb 2, 2024
33858eb
formatting and libnss
MikhailKardash Feb 2, 2024
5529fd0
yaml formatting
MikhailKardash Feb 2, 2024
e656007
use actual yaml linter
MikhailKardash Feb 2, 2024
3343e67
relocate again
MikhailKardash Feb 2, 2024
af5e249
backport additional-requirements-torch and bump VERSION
MikhailKardash Feb 2, 2024
ad22068
additional-requirements for tf
MikhailKardash Feb 2, 2024
0e796c5
bash syntax
MikhailKardash Feb 2, 2024
941251f
cleanup dockerfiles, remove duplicate publishing steps, correct a doc…
MikhailKardash Feb 3, 2024
fca8aec
try different syntax
MikhailKardash Feb 5, 2024
c9c4034
semicolons
MikhailKardash Feb 5, 2024
8cae0ac
version pin and revert
MikhailKardash Feb 5, 2024
30ebec9
pip
MikhailKardash Feb 5, 2024
72c4c4c
try python 3.10
MikhailKardash Feb 5, 2024
e20cae1
maybe it's a concurrency thing
MikhailKardash Feb 5, 2024
b8116e8
no more version pin
MikhailKardash Feb 5, 2024
38358f1
ngc dockerfile cleanup
MikhailKardash Feb 6, 2024
a4a8f55
bump version file, minor formatting, publish artifacts
MikhailKardash Feb 6, 2024
dff93be
debian frontend google
MikhailKardash Feb 6, 2024
44a6fa3
google_cloud_cli...
MikhailKardash Feb 6, 2024
ec21921
cloud cli?
MikhailKardash Feb 6, 2024
46b2e32
minor cleanup
MikhailKardash Feb 6, 2024
c262fce
version-matrix update and lots of formatting
MikhailKardash Feb 7, 2024
cce01b3
unparametrize deepspeed
MikhailKardash Feb 7, 2024
3210d9b
oops
MikhailKardash Feb 8, 2024
44745fc
version bump
MikhailKardash Mar 7, 2024
34 changes: 20 additions & 14 deletions .circleci/config.yml
@@ -188,20 +188,21 @@ workflows:
with-mpi: [0, 1]
image-type:
- tf2-cpu
- tf28-cpu
- pt-cpu
- pt2-cpu
- tf2-gpu
- tf28-gpu
- pt-gpu
- pt2-gpu
- tensorflow-ngc
- pytorch13-tf210-rocm56
- pytorch20-tf210-rocm56
exclude:
- with-mpi: 1
image-type:
- pytorch13-tf210-rocm56
- pytorch20-tf210-rocm56
image-type: pytorch13-tf210-rocm56
- with-mpi: 1
image-type: pytorch20-tf210-rocm56
- with-mpi: 1
image-type: tensorflow-ngc
- build-and-publish-docker:
name: build-and-publish-docker-<<matrix.image-type>>-<<matrix.with-mpi>>
context: determined-production
@@ -219,8 +220,9 @@ workflows:
- gpu.nvidia.small.multi
with-mpi: [0]
image-type:
- deepspeed-gpu
- gpt-neox-deepspeed-gpu
- deepspeed
- gpt-neox-deepspeed
- pytorch-ngc
- publish-cloud-images:
context: determined-production
filters:
@@ -253,21 +255,24 @@ workflows:
with-mpi: [0, 1]
image-type:
- tf2-cpu
- tf28-cpu
- pt-cpu
- pt2-cpu
- tf2-gpu
- tf28-gpu
- pt-gpu
- pt2-gpu
- tensorflow-ngc
- pytorch13-tf210-rocm56
- pytorch20-tf210-rocm56
exclude:
- dev-mode: true
with-mpi: 1
image-type:
- pytorch13-tf210-rocm56
- pytorch20-tf210-rocm56
image-type: pytorch13-tf210-rocm56
- dev-mode: true
with-mpi: 1
image-type: pytorch20-tf210-rocm56
- dev-mode: true
with-mpi: 1
image-type: tensorflow-ngc

- build-and-publish-docker:
name: build-and-publish-docker-<<matrix.image-type>>-<<matrix.with-mpi>>-dev
@@ -287,8 +292,9 @@ workflows:
- gpu.nvidia.small.multi
with-mpi: [0]
image-type:
- deepspeed-gpu
- gpt-neox-deepspeed-gpu
- deepspeed
- gpt-neox-deepspeed
- pytorch-ngc

- publish-cloud-images:
name: publish-cloud-images-dev
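
Note on the `exclude` entries in the hunks above: as the diff reads, each entry names a single `image-type` for `with-mpi: 1`, and `tensorflow-ngc` has an entry of its own. The shell sketch below only illustrates that expansion, on the assumption that one `exclude` entry removes exactly one parameter combination. `expand_matrix` is a hypothetical helper, not part of the repository, and it uses a subset of the image types and ignores other matrix parameters such as `dev-mode`.

```shell
# Hypothetical helper, not part of the repository: list the
# (image-type, with-mpi) combinations left after applying the exclusions.
expand_matrix() {
    for image in tf2-gpu tensorflow-ngc pytorch13-tf210-rocm56 pytorch20-tf210-rocm56; do
        for mpi in 0 1; do
            case "${image}:${mpi}" in
                pytorch13-tf210-rocm56:1 | pytorch20-tf210-rocm56:1 | tensorflow-ngc:1)
                    continue ;;  # combination listed under exclude
            esac
            printf '%s-%s\n' "$image" "$mpi"
        done
    done
}

expand_matrix
```

Under that assumption, only `tf2-gpu` is built both with and without MPI in this subset; the ROCm and `tensorflow-ngc` images are built for `with-mpi: 0` only.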
37 changes: 2 additions & 35 deletions Dockerfile-base-cpu
@@ -4,41 +4,10 @@ ARG UBUNTU_VERSION
 RUN rm -f /etc/apt/sources.list.d/*
 ENV LANG=C.UTF-8 LC_ALL=C.UTF-8 PIP_NO_CACHE_DIR=1
 
-RUN mkdir -p /var/run/sshd
-RUN apt-get update \
-    && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
-        autoconf \
-        automake \
-        autotools-dev \
-        build-essential \
-        ca-certificates \
-        curl \
-        daemontools \
-        libkrb5-dev \
-        libssl-dev \
-        libtool \
-        git \
-        krb5-user \
-        g++ \
-        cmake \
-        make \
-        openssh-client \
-        openssh-server \
-        pkg-config \
-        wget \
-        nfs-common \
-        libnuma1 \
-        libnuma-dev \
-        libpmi2-0-dev \
-        unattended-upgrades \
-    && unattended-upgrade \
-    && rm -rf /var/lib/apt/lists/* \
-    && rm /etc/ssh/ssh_host_ecdsa_key \
-    && rm /etc/ssh/ssh_host_ed25519_key \
-    && rm /etc/ssh/ssh_host_rsa_key
-
 COPY dockerfile_scripts /tmp/det_dockerfile_scripts
 
+RUN /tmp/det_dockerfile_scripts/install_deb_packages.sh
+
 ENV PATH="/opt/conda/bin:${PATH}"
 ARG CONDA="${PATH}"
 ENV PYTHONUNBUFFERED=1 PYTHONFAULTHANDLER=1 PYTHONHASHSEED=0
@@ -84,8 +53,6 @@ ENV PATH=${PATH:-$CONDA:${WITH_MPI:+$UCX_PATH_DIR:$OMPI_PATH_DIR}}
 ENV OMPI_ALLOW_RUN_AS_ROOT ${WITH_MPI:+1}
 ENV OMPI_ALLOW_RUN_AS_ROOT_CONFIRM ${WITH_MPI:+1}
 
-
-
 # We uninstall these packages after installing. This ensures that we can
 # successfully install these packages into containers running as non-root.
 # `pip` does not uninstall dependencies, so we still have all the dependencies
38 changes: 2 additions & 36 deletions Dockerfile-base-gpu
@@ -7,45 +7,11 @@ ENV LANG=C.UTF-8 LC_ALL=C.UTF-8 PIP_NO_CACHE_DIR=1
 # We need to create sym links for the Slurm PMI headers if we are using
 # Ubuntu 18.04 because they are not installed in a standard location.
 ARG UBUNTU_VERSION
-RUN mkdir -p /var/run/sshd
-RUN apt-get update \
-    && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
-        autoconf \
-        automake \
-        autotools-dev \
-        build-essential \
-        ca-certificates \
-        curl \
-        daemontools \
-        ibverbs-providers \
-        libibverbs1 \
-        libkrb5-dev \
-        librdmacm1 \
-        libssl-dev \
-        libtool \
-        git \
-        krb5-user \
-        g++ \
-        cmake \
-        make \
-        openssh-client \
-        openssh-server \
-        pkg-config \
-        wget \
-        nfs-common \
-        libnuma1 \
-        libnuma-dev \
-        libpmi2-0-dev \
-        unattended-upgrades \
-    && unattended-upgrade \
-    && rm -rf /var/lib/apt/lists/* \
-    && rm /etc/ssh/ssh_host_ecdsa_key \
-    && rm /etc/ssh/ssh_host_ed25519_key \
-    && rm /etc/ssh/ssh_host_rsa_key \
-    && if [ "$UBUNTU_VERSION" = "ubuntu18.04" ]; then ln -s /usr/include/slurm-wlm /usr/include/slurm; fi
 
 COPY dockerfile_scripts /tmp/det_dockerfile_scripts
 
+RUN /tmp/det_dockerfile_scripts/install_deb_packages.sh
+
 ARG WITH_NCCL
 # Install debuild util, etc. for later compiling GDRcopy libraries
 RUN if [ "$WITH_NCCL" = "1" ]; then apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y devscripts debhelper; fi
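
Both base Dockerfiles swap their inline `apt-get install` list for a call to `install_deb_packages.sh`, whose contents are not shown in this view. As a rough check on what a shared script would need to cover, the sketch below compares the two inline lists from the hunks above. `cpu_packages`, `gpu_packages`, and `only_in_gpu` are hypothetical helper names introduced here for illustration; they do not come from the repository.

```shell
# Inline package list from the Dockerfile-base-cpu hunk (copied from the diff).
cpu_packages() {
    printf '%s\n' autoconf automake autotools-dev build-essential \
        ca-certificates curl daemontools libkrb5-dev libssl-dev libtool \
        git krb5-user g++ cmake make openssh-client openssh-server \
        pkg-config wget nfs-common libnuma1 libnuma-dev libpmi2-0-dev \
        unattended-upgrades
}

# Inline package list from the Dockerfile-base-gpu hunk (copied from the diff).
gpu_packages() {
    printf '%s\n' autoconf automake autotools-dev build-essential \
        ca-certificates curl daemontools ibverbs-providers libibverbs1 \
        libkrb5-dev librdmacm1 libssl-dev libtool git krb5-user g++ cmake \
        make openssh-client openssh-server pkg-config wget nfs-common \
        libnuma1 libnuma-dev libpmi2-0-dev unattended-upgrades
}

# Print the packages that appear only in the GPU list.
only_in_gpu() {
    cpu_list=" $(cpu_packages | tr '\n' ' ') "
    for pkg in $(gpu_packages); do
        case "$cpu_list" in
            *" $pkg "*) ;;                  # also present in the CPU list
            *) printf '%s\n' "$pkg" ;;      # GPU-only package
        esac
    done
}

only_in_gpu
```

The lists differ only in `ibverbs-providers`, `libibverbs1`, and `librdmacm1`. The GPU hunk additionally carries the Ubuntu 18.04 Slurm header symlink. Whether the shared script installs the union everywhere or handles these cases conditionally cannot be determined from this page.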
8 changes: 2 additions & 6 deletions Dockerfile-default-cpu
@@ -17,6 +17,7 @@ RUN if [ "$TENSORFLOW_PIP" ]; then \
         else \
             pip install $TENSORFLOW_PIP; \
         fi; \
+        pip install -r /tmp/det_dockerfile_scripts/additional-requirements-tf.txt; \
     else \
         export HOROVOD_WITH_TENSORFLOW=0; \
     fi
@@ -27,15 +28,10 @@ RUN if [ "$TORCH_PIP" ]; then \
         else \
             pip install $TORCH_PIP; \
         fi; \
+        pip install -r /tmp/det_dockerfile_scripts/additional-requirements-torch.txt; \
     fi
 RUN if [ "$TORCHVISION_PIP" ]; then pip install $TORCHVISION_PIP; fi
 
-ARG TORCH_TB_PROFILER_PIP
-RUN if [ "$TORCH_TB_PROFILER_PIP" ]; then pip install $TORCH_TB_PROFILER_PIP; fi
-
-ARG TF_PROFILER_PIP
-RUN if [ "$TF_PROFILER_PIP" ]; then python -m pip install $TF_PROFILER_PIP; fi
-
 ARG HOROVOD_WITH_TENSORFLOW
 RUN if [ "$HOROVOD_WITH_TENSORFLOW" ]; then export HOROVOD_WITH_TENSORFLOW=$HOROVOD_WITH_TENSORFLOW; fi
 
14 changes: 6 additions & 8 deletions Dockerfile-default-gpu
@@ -28,22 +28,20 @@ ARG TORCHVISION_PIP
 
 RUN if [ "$TENSORFLOW_PIP" ]; then \
         export HOROVOD_WITH_TENSORFLOW=1 \
-        && python -m pip install $TENSORFLOW_PIP; \
+        && python -m pip install $TENSORFLOW_PIP \
+        && python -m pip install -r /tmp/det_dockerfile_scripts/additional-requirements-tf.txt; \
     else \
         export HOROVOD_WITH_TENSORFLOW=0; \
     fi
-RUN if [ "$TORCH_PIP" ]; then python -m pip install $TORCH_PIP; fi
+RUN if [ "$TORCH_PIP" ]; then \
+        python -m pip install $TORCH_PIP \
+        && python -m pip install -r /tmp/det_dockerfile_scripts/additional-requirements-torch.txt; \
+    fi
 RUN if [ "$TORCHVISION_PIP" ]; then python -m pip install $TORCHVISION_PIP; fi
 
 ARG TF_CUDA_SYM
 RUN if [ "$TF_CUDA_SYM" ]; then ln -s /usr/local/cuda/lib64/libcusolver.so.11 /opt/conda/lib/python3.8/site-packages/tensorflow/python/libcusolver.so.10; fi
 
-ARG TORCH_TB_PROFILER_PIP
-RUN if [ "$TORCH_TB_PROFILER_PIP" ]; then python -m pip install $TORCH_TB_PROFILER_PIP; fi
-
-ARG TF_PROFILER_PIP
-RUN if [ "$TF_PROFILER_PIP" ]; then python -m pip install $TF_PROFILER_PIP; fi
-
 ARG TORCH_CUDA_ARCH_LIST
 ARG APEX_GIT
 RUN /tmp/det_dockerfile_scripts/install_apex.sh
24 changes: 24 additions & 0 deletions Dockerfile-ngc-hpc
@@ -0,0 +1,24 @@
ARG BASE_IMAGE
FROM ${BASE_IMAGE}

# Copy various shell scripts that group dependencies for install
COPY dockerfile_scripts /tmp/det_dockerfile_scripts

ARG AWS_PLUGIN_INSTALL_DIR=/container/aws
ARG WITH_AWS_TRACE
ARG INTERNAL_AWS_DS
ARG INTERNAL_AWS_PATH
RUN if [ "$WITH_OFI" = "1" ]; then /tmp/det_dockerfile_scripts/build_aws.sh "$WITH_OFI" "$WITH_AWS_TRACE"; fi

#USING OFI
ARG AWS_LIB_DIR=${AWS_PLUGIN_INSTALL_DIR}/lib
ENV LD_LIBRARY_PATH=${WITH_OFI:+$AWS_LIB_DIR:}$LD_LIBRARY_PATH

# Set an entrypoint that can scrape up the host libfabric.so and then
# run the user command. This is intended to enable performant execution
# on non-IB systems that have a proprietary libfabric.
RUN mkdir -p /container/bin && \
cp /tmp/det_dockerfile_scripts/scrape_libs.sh /container/bin
ENTRYPOINT ["/container/bin/scrape_libs.sh"]

RUN rm -r /tmp/*
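
The `ENV LD_LIBRARY_PATH=${WITH_OFI:+$AWS_LIB_DIR:}$LD_LIBRARY_PATH` line in this file uses the `${var:+word}` form, which in a POSIX shell expands to `word` only when `var` is set and non-empty. Below is a minimal shell sketch of that expansion. `build_ld_path` is a hypothetical function used only for demonstration, and it assumes Dockerfile variable substitution behaves like the shell form here.

```shell
# Hypothetical demo of ${var:+word}: prepend the plugin lib dir only when
# the first argument (standing in for WITH_OFI) is non-empty.
build_ld_path() {
    with_ofi=$1
    aws_lib_dir=$2
    current_path=$3
    printf '%s\n' "${with_ofi:+$aws_lib_dir:}$current_path"
}

build_ld_path 1 /container/aws/lib /usr/local/lib    # prints /container/aws/lib:/usr/local/lib
build_ld_path "" /container/aws/lib /usr/local/lib   # prints /usr/local/lib
```

In the shell form any non-empty value triggers the prefix, including `0`, whereas the `RUN` step above it tests `"$WITH_OFI" = "1"` explicitly.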
33 changes: 33 additions & 0 deletions Dockerfile-pytorch-ngc
@@ -0,0 +1,33 @@
ARG BASE_IMAGE
FROM ${BASE_IMAGE}

# NGC images contain user owned files in /usr/lib
RUN chown root:root /usr/lib

# Copy various shell scripts that group dependencies for install
COPY dockerfile_scripts /tmp/det_dockerfile_scripts

RUN /tmp/det_dockerfile_scripts/install_deb_packages.sh
RUN /tmp/det_dockerfile_scripts/add_det_nobody_user.sh
RUN /tmp/det_dockerfile_scripts/install_libnss_determined.sh

# We uninstall these packages after installing. This ensures that we can
# successfully install these packages into containers running as non-root.
# `pip` does not uninstall dependencies, so we still have all the dependencies
# installed.
RUN python -m pip install determined && python -m pip uninstall -y determined

RUN python -m pip install -r /tmp/det_dockerfile_scripts/additional-requirements-torch.txt \
-r /tmp/det_dockerfile_scripts/additional-requirements.txt \
-r /tmp/det_dockerfile_scripts/notebook-requirements.txt

ENV JUPYTER_CONFIG_DIR=/run/determined/jupyter/config
ENV JUPYTER_DATA_DIR=/run/determined/jupyter/data
ENV JUPYTER_RUNTIME_DIR=/run/determined/jupyter/runtime

RUN /tmp/det_dockerfile_scripts/install_google_cloud_sdk.sh

ARG DEEPSPEED_PIP
RUN if [ -n "$DEEPSPEED_PIP" ]; then /tmp/det_dockerfile_scripts/install_deepspeed.sh; fi

RUN rm -r /tmp/*
30 changes: 30 additions & 0 deletions Dockerfile-tensorflow-ngc
@@ -0,0 +1,30 @@
ARG BASE_IMAGE
FROM ${BASE_IMAGE}

# NGC images contain user owned files in /usr/lib
RUN chown root:root /usr/lib

# Copy various shell scripts that group dependencies for install
COPY dockerfile_scripts /tmp/det_dockerfile_scripts

RUN /tmp/det_dockerfile_scripts/install_deb_packages.sh
RUN /tmp/det_dockerfile_scripts/add_det_nobody_user.sh
RUN /tmp/det_dockerfile_scripts/install_libnss_determined.sh

# We uninstall these packages after installing. This ensures that we can
# successfully install these packages into containers running as non-root.
# `pip` does not uninstall dependencies, so we still have all the dependencies
# installed.
RUN python -m pip install determined && python -m pip uninstall -y determined

RUN python -m pip install -r /tmp/det_dockerfile_scripts/additional-requirements-tf.txt \
-r /tmp/det_dockerfile_scripts/additional-requirements.txt \
-r /tmp/det_dockerfile_scripts/notebook-requirements.txt

ENV JUPYTER_CONFIG_DIR=/run/determined/jupyter/config
ENV JUPYTER_DATA_DIR=/run/determined/jupyter/data
ENV JUPYTER_RUNTIME_DIR=/run/determined/jupyter/runtime

RUN /tmp/det_dockerfile_scripts/install_google_cloud_sdk.sh

RUN rm -r /tmp/*