[gpu] performance and functionality improvements (#1265)
* [gpu] performance and functionality improvements

* Capturing disk usage statistics to reduce excessive disk space
* created exit handler to clean up environment on completion or failure
* created prepare function to prepare for the installation
* when sufficient memory is available, configure a ramdisk
* reduce noise by turning off -x in utility functions
* added descriptive comments before the obscurely coded
  compare_versions_lte and compare_versions_lt functions
* removed some intermediate driver versions
* added cuda url for 12.6
* execute_with_retries now logs on failure, captures runtime and
  cleans before installing on debian
* saving OS installation and NV .run files and their temp files to ramdisk
* piping source .xz file directly to xz instead of saving to disk first
* new utility function "is_debuntu" checks for the frequently used
  condition of whether the running OS is either debian or ubuntu
* added support for specifying an http proxy (thank you प्रकाश)
* moving load of kernel module to later in the code and exercising
  modprobe of all modules to avoid regression
* fixed problem with attempting to fetch from incorrect vault
  directory when rocky kernel package is not found in primary repo
* using correct cran-r signing key for ubuntu18
* corrected file check condition for /etc/apt/trusted.gpg

* do not update all packages on rocky ; move preparation to prepare function

* increasing memory to make use of ramdisk

* using something a little smaller

* create mount_ramdisk function and call it ; fix up the version comparison functions ; create ge and le comparisons for OSs

* iterating better, caching results of system calls ; renamed to repair_old_backports

* comparing correct version numbers

* rocky uses a tmpfs on /tmp in the base image

* tested on rocky and ubuntu

* tested harder on rocky

* cuda 11 no longer available for debian 12

* cuda v11 no longer supported on debian12

* corrected use of ubuntu regex for rocky version

* re-enabling spark job tests

* correct a couple of edge cases

* added instructions for manually running tests

* open a monitor session by default

* cleaning up cuda and cudnn url generation

* condition better

* cleaned up generation of NVIDIA_CUDA_URL

* updated versions and GPU accelerators in the documentation

* ensure this test to be skipped based on cuda version rather than dataproc version alone

* fix for /usr/local/cuda-12.4/bin/nvcc: No such file or directory

* correcting path to run-bazel-tests.sh

* runing variable definition

* cleaned up skip conditions

* order of operations

* works with 2.0-rocky8

* remove redundant conditional check

* supported version limits are tightened up a bit ; clean up rocky vault install code

* corrected syntax errors

* failure to run dnf here should not fail the entire installer

* order matters here

* 2.2-ubuntu22 works with cuda 11, other 2.2 do not

* 2.2-ubuntu22 works with cuda 11, other 2.2 do not

* fixes ubuntu22 kernel version mismatch error

* disabling rocky9 builds due to out of date base dataproc image

* cuda 2.0 not supported in debian12

* some 2.0-rocky8 single instance tests fail

* intended to use <= and not >=

* simplify gpu resource script

* setting default discoveryScript ; testing pyspark in its own function

* remove spark: prefix from property names

* comment out quite a few tests

* new version numbers

* fixed a syntax error with documentation

* mustn't forget the commas

* half as many tasks with twice as much cpu and gpu each

* pause before first ssh ; correct variable name
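
The commit message mentions `compare_versions_lte`, `compare_versions_lt` and `is_debuntu`, but their bodies do not appear on this page. The following is a minimal sketch of how such helpers could be written, assuming GNU `sort -V` is available and that the OS is identified by the `ID` field of an os-release file. The function bodies, and the optional file-path argument to `is_debuntu`, are assumptions for illustration and not the commit's code.

```bash
#!/usr/bin/env bash
# Hypothetical re-implementations; the commit's own function bodies
# are not shown on this page.

# Succeeds (exit 0) when version $1 <= version $2 under `sort -V` ordering.
function compare_versions_lte() {
  [ "$1" = "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" ]
}

# Succeeds when version $1 < version $2: not equal, and lte holds.
function compare_versions_lt() {
  [ "$1" != "$2" ] && compare_versions_lte "$1" "$2"
}

# Succeeds when the ID field of an os-release file is debian or ubuntu.
# The path argument is an assumption, added so the check can be run
# against a fixture file instead of the host's /etc/os-release.
function is_debuntu() {
  local os_release="${1:-/etc/os-release}"
  local os_id
  os_id="$(. "${os_release}" && echo "${ID}")"
  [[ "${os_id}" == "debian" || "${os_id}" == "ubuntu" ]]
}
```

With this sketch, `compare_versions_lte 12.4 12.10` succeeds because `sort -V` compares numeric components, whereas a plain string comparison would order the two the other way.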
cjac authored Nov 28, 2024
1 parent da3d8c1 commit 169e98e
Showing 8 changed files with 735 additions and 271 deletions.
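
One bullet in the commit message says that `execute_with_retries` now logs on failure and captures runtime. The installer's version is not reproduced on this page, so the following is only a sketch of that behaviour. The attempt count of three and the `RETRY_DELAY` variable are assumptions, and the Debian-specific cleaning step mentioned in the message is left out.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: run a command up to three times, log every failed
# attempt to stderr, and report elapsed wall-clock seconds on success.
# This is not the body used by the commit.
function execute_with_retries() {
  local -r max_attempts=3          # assumed retry budget
  local -r start=${SECONDS}        # bash counter of seconds since shell start
  local attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      echo "succeeded after $(( SECONDS - start ))s: $*" >&2
      return 0
    fi
    echo "attempt ${attempt}/${max_attempts} failed: $*" >&2
    # Pause between attempts only; RETRY_DELAY is an assumed tunable.
    if (( attempt < max_attempts )); then
      sleep "${RETRY_DELAY:-5}"
    fi
  done
  return 1
}
```

A call such as `execute_with_retries apt-get install -y -qq curl` would then leave one stderr line per failed attempt, which makes transient repository errors visible in the installer log.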
40 changes: 40 additions & 0 deletions gpu/Dockerfile
@@ -0,0 +1,40 @@
# This Dockerfile builds the container from which manual tests are run
# This process needs to be executed manually from a git clone
#
# See manual-test-runner.sh for instructions

FROM gcr.io/cloud-builders/gcloud

RUN useradd -m -d /home/ia-tests -s /bin/bash ia-tests

# Installed here are packages on which the tests depend
RUN apt-get -qq update \
    && apt-get -y -qq install \
       apt-transport-https apt-utils \
       ca-certificates libmime-base64-perl gnupg \
       curl jq less screen > /dev/null 2>&1 && apt-get clean

# Install bazel signing key, repo and package
ENV bazel_kr_path=/usr/share/keyrings/bazel-release.pub.gpg
ENV bazel_repo_data="http://storage.googleapis.com/bazel-apt stable jdk1.8"

RUN /usr/bin/curl -s https://bazel.build/bazel-release.pub.gpg \
    | gpg --dearmor -o "${bazel_kr_path}" \
    && echo "deb [arch=amd64 signed-by=${bazel_kr_path}] ${bazel_repo_data}" \
    | dd of=/etc/apt/sources.list.d/bazel.list status=none \
    && apt-get update -qq

RUN apt-get autoremove -y -qq && \
    apt-get install -y -qq default-jdk python3-setuptools bazel > /dev/null 2>&1 && \
    apt-get clean

# Install here any utilities you find useful when troubleshooting
RUN apt-get -y -qq install emacs-nox vim uuid-runtime > /dev/null 2>&1 && apt-get clean

WORKDIR /init-actions

USER ia-tests
COPY --chown=ia-tests:ia-tests . ${WORKDIR}

ENTRYPOINT ["/bin/bash"]
#CMD ["/bin/bash"]
28 changes: 14 additions & 14 deletions gpu/README.md
@@ -14,12 +14,14 @@ for CUDA, the nvidia kernel driver, cuDNN, and NCCL.
Specifying a supported value for the `cuda-version` metadata variable
will select the following values for Driver, CuDNN and NCCL. At the
time of writing, the default value for cuda-version, if unspecified is
-12.4. In addition to 12.4, we have also tested with 11.8.
+12.4. In addition to 12.4, we have also tested with 11.8, 12.0 and 12.6.

-CUDA | Full Version | Driver | CuDNN | NCCL | Supported OSs
+CUDA | Full Version | Driver | CuDNN | NCCL | Tested Dataproc Image Versions
-----| ------------ | --------- | --------- | ------- | -------------------
-11.8 | 11.8.0 | 525.147 | 8.6.0.163 | 2.15.5 | All
-12.4 | 12.4.1 | 550.90.07 | 9.1.0.70 | 2.21.5 | ALL
+11.8 | 11.8.0 | 560.35.03 | 8.6.0.163 | 2.15.5 | 2.0, 2.1, 2.2-ubuntu22
+12.0 | 12.0.0 | 550.90.07 | 8.8.1.3, | 2.16.5 | 2.0, 2.1, 2.2-rocky9, 2.2-ubuntu22
+12.4 | 12.4.1 | 550.90.07 | 9.1.0.70 | 2.23.4 | 2.1-ubuntu20, 2.1-rocky8, 2.2
+12.6 | 12.6.2 | 560.35.03 | 9.5.1.17 | 2.23.4 | 2.1-ubuntu20, 2.1-rocky8, 2.2

All variants in the preceeding table have been manually tested to work
with the installer. Supported OSs at the time of writing are:
@@ -28,7 +30,6 @@ with the installer. Supported OSs at the time of writing are:
* Ubuntu 18.04, 20.04, and 22.04 LTS
* Rocky 8 and 9


## Using this initialization action

**:warning: NOTICE:** See
@@ -47,16 +48,15 @@ attached GPU adapters.
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
---master-accelerator type=nvidia-tesla-v100 \
---worker-accelerator type=nvidia-tesla-v100,count=4 \
+--master-accelerator type=nvidia-tesla-t4 \
+--worker-accelerator type=nvidia-tesla-t4,count=4 \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh
```

1. Use the `gcloud` command to create a new cluster with NVIDIA GPU drivers
and CUDA installed by initialization action as well as the GPU
monitoring service. The monitoring service is supported on Dataproc 2.0+ Debian
-and Ubuntu images. Please create a Github issue if support is needed for other
-Dataproc images.
+and Ubuntu images.

*Prerequisite:* Create GPU metrics in
[Cloud Monitoring](https://cloud.google.com/monitoring/docs/) using Google
@@ -90,8 +90,8 @@ attached GPU adapters.
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
---master-accelerator type=nvidia-tesla-v100 \
---worker-accelerator type=nvidia-tesla-v100,count=4 \
+--master-accelerator type=nvidia-tesla-t4 \
+--worker-accelerator type=nvidia-tesla-t4,count=4 \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh \
--metadata install-gpu-agent=true \
--scopes https://www.googleapis.com/auth/monitoring.write
@@ -136,12 +136,12 @@ attached GPU adapters.
#### GPU Scheduling in YARN:

YARN is the default Resource Manager for Dataproc. To use GPU scheduling feature
-in Spark, it requires YARN version >= 2.10 or >=3.1.1. If intended to use Spark
+in Spark, it requires YARN version >= 2.10 or >= 3.1.1. If intended to use Spark
with Deep Learning use case, it recommended to use YARN >= 3.1.3 to get support
-for [nvidia-docker version 2](https://github.com/NVIDIA/nvidia-docker).
+for [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit).

In current Dataproc set up, we enable GPU resource isolation by initialization
-script without NVIDIA Docker, you can find more information at
+script with NVIDIA container toolkit. You can find more information at
[NVIDIA Spark RAPIDS getting started guide](https://nvidia.github.io/spark-rapids/).

#### cuDNN
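
The README hunk above states that a supported `cuda-version` value selects a matching driver version. As a rough illustration of that mapping, the sketch below encodes the Driver column of the four-row version of the table (11.8, 12.0, 12.4, 12.6) as a `case` statement. The function name `default_driver_for_cuda` is hypothetical, and this is not how `install_gpu_driver.sh` is known to implement the selection.

```bash
#!/usr/bin/env bash
# Illustrative lookup using the Driver column of the updated README table.
# The function name is hypothetical; this is not the installer's code.
function default_driver_for_cuda() {
  case "$1" in
    11.8|12.6) echo "560.35.03" ;;
    12.0|12.4) echo "550.90.07" ;;
    *)
      echo "unsupported cuda-version: $1" >&2
      return 1
      ;;
  esac
}
```

Returning a non-zero status for unlisted versions lets a caller stop early instead of attempting to download a driver that was never tested with that CUDA release.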
11 changes: 11 additions & 0 deletions gpu/bazel.screenrc
@@ -0,0 +1,11 @@
screen -L -t 2.0-debian10 1 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.0-debian10 ; exec /bin/bash'
#screen -L -t 2.0-rocky8 2 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.0-rocky8 ; exec /bin/bash'
#screen -L -t 2.0-ubuntu18 3 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.0-ubuntu18 ; exec /bin/bash'

#screen -L -t 2.1-debian11 4 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.1-debian11 ; exec /bin/bash'
#screen -L -t 2.1-rocky8 5 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.1-rocky8 ; exec /bin/bash'
#screen -L -t 2.1-ubuntu20 6 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.1-ubuntu20 ; exec /bin/bash'

#screen -L -t 2.2-debian12 7 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.2-debian12 ; exec /bin/bash'
#screen -L -t 2.2-rocky9 8 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.2-rocky9 ; exec /bin/bash'
#screen -L -t 2.2-ubuntu22 9 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.2-ubuntu22 ; exec /bin/bash'
7 changes: 7 additions & 0 deletions gpu/env.json.sample
@@ -0,0 +1,7 @@
{
"PROJECT_ID":"example-yyyy-nn",
"PURPOSE":"cuda-pre-init",
"BUCKET":"my-bucket-name",
"IMAGE_VERSION":"2.2-debian12",
"ZONE":"us-west4-ñ"
}
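
The sample file above holds per-user settings for the manual tests, and the Dockerfile installs `jq` among the test dependencies. How the test scripts actually read the file is not shown on this page. A minimal sketch, assuming the sample is copied to `env.json` and read with `jq`, could look like this; the helper name `env_value` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of one key from an env.json-style file.
# `jq -e` makes the exit status non-zero when the key is missing or null.
function env_value() {
  local -r key="$1" file="${2:-env.json}"
  jq -e -r --arg k "${key}" '.[$k]' "${file}"
}
```

For example, `IMAGE_VERSION="$(env_value IMAGE_VERSION)"` would yield `2.2-debian12` for the sample values shown above.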