[gpu] performance and functionality improvements (#1265)
* [gpu] performance and functionality improvements
* Capturing disk usage statistics to reduce excessive disk space usage
* created exit handler to clean up environment on completion or failure
* created prepare function to prepare for the installation
* when sufficient memory is available, configure a ramdisk
* reduce noise by turning off -x in utility functions
* added descriptive comments before the obscurely coded compare_versions_lte and compare_versions_lt functions
* removed some intermediate driver versions
* added cuda url for 12.6
* execute_with_retries now logs on failure, captures runtime and cleans before installing on debian
* saving OS installation and NV .run files and their temp files to ramdisk
* piping source .xz file directly to xz instead of saving to disk first
* new utility function "is_debuntu" checks for the frequently used condition of whether the running OS is either debian or ubuntu
* added support for specifying an http proxy (thank you प्रकाश)
* moving load of kernel module to later in the code and exercising modprobe of all modules to avoid regression
* fixed problem with attempting to fetch from incorrect vault directory when rocky kernel package is not found in primary repo
* using correct cran-r signing key for ubuntu18
* corrected file check condition for /etc/apt/trusted.gpg
* do not update all packages on rocky; move preparation to prepare function
* increasing memory to make use of ramdisk
* using something a little smaller
* create mount_ramdisk function and call it; fix up the version comparison functions; create ge and le comparisons for OSs
* iterating better, caching results of system calls; renamed to repair_old_backports
* comparing correct version numbers
* rocky uses a tmpfs on /tmp in the base image
* tested on rocky and ubuntu
* tested harder on rocky
* cuda 11 no longer available for debian 12
* cuda v11 no longer supported on debian12
* corrected use of ubuntu regex for rocky version
* re-enabling spark job tests
* correct a couple of edge cases
* added instructions for manually running tests
* open a monitor session by default
* cleaning up cuda and cudnn url generation
* condition better
* cleaned up generation of NVIDIA_CUDA_URL
* updated versions and GPU accelerators in the documentation
* ensure this test is skipped based on cuda version rather than dataproc version alone
* fix for /usr/local/cuda-12.4/bin/nvcc: No such file or directory
* correcting path to run-bazel-tests.sh
* runing variable definition
* cleaned up skip conditions
* order of operations
* works with 2.0-rocky8
* remove redundant conditional check
* supported version limits are tightened up a bit; clean up rocky vault install code
* corrected syntax errors
* failure to run dnf here should not fail the entire installer
* order matters here
* 2.2-ubuntu22 works with cuda 11, other 2.2 do not
* 2.2-ubuntu22 works with cuda 11, other 2.2 do not
* fixes ubuntu22 kernel version mismatch error
* disabling rocky9 builds due to out of date base dataproc image
* cuda 2.0 not supported in debian12
* some 2.0-rocky8 single instance tests fail
* intended to use <= and not >=
* simplify gpu resource script
* setting default discoveryScript; testing pyspark in its own function
* remove spark: prefix from property names
* comment out quite a few tests
* new version numbers
* fixed a syntax error with documentation
* mustn't forget the commas
* half as many tasks with twice as much cpu and gpu each
* pause before first ssh; correct variable name
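
For readers unfamiliar with the version comparison helpers mentioned above, the following is a minimal sketch of one common way to write such comparisons in shell. It is not the code from this commit: the function names `version_le`, `version_lt`, `version_ge` and `version_gt` are hypothetical, and the sketch assumes a `sort` that supports `-V` (as GNU coreutils does).

```shell
# Hypothetical sketch; NOT the compare_versions_* code from this commit.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.

# Succeeds (exit 0) when version $1 <= version $2.
# If $1 sorts first under version ordering (or both are equal), then $1 <= $2.
version_le() {
  [ "$1" = "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" ]
}

# Succeeds when $1 < $2: less-or-equal and not identical.
version_lt() {
  [ "$1" != "$2" ] && version_le "$1" "$2"
}

# The "greater" forms simply swap the arguments.
version_ge() { version_le "$2" "$1"; }
version_gt() { version_lt "$2" "$1"; }
```

A caller might guard a version-specific step with something like `if version_ge "${CUDA_VERSION}" "12.0"; then ...; fi`, where `CUDA_VERSION` is again only an illustrative variable name. Version sort treats `12.10` as newer than `12.4`, which a plain string comparison would get wrong.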
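
The message says `execute_with_retries` now logs on failure and captures runtime. A rough sketch of what a retry wrapper with those two properties can look like follows. It is an illustration under stated assumptions, not the function from this commit: the argument convention (command passed as separate words), the `RETRY_ATTEMPTS` and `RETRY_DELAY` variables, and the log wording are all hypothetical.

```shell
# Hypothetical sketch; NOT the execute_with_retries from this commit.
# Assumed interface: execute_with_retries CMD [ARGS...]
# Assumed knobs: RETRY_ATTEMPTS (default 3), RETRY_DELAY in seconds (default 5).
execute_with_retries() {
  local max_attempts="${RETRY_ATTEMPTS:-3}"
  local delay="${RETRY_DELAY:-5}"
  local attempt=1
  local start end
  start="$(date +%s)"

  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      end="$(date +%s)"
      # Report total runtime across all attempts on stderr.
      echo "succeeded on attempt ${attempt} after $((end - start))s: $*" >&2
      return 0
    fi
    # Log each failure so the cause is visible in the installer output.
    echo "attempt ${attempt}/${max_attempts} failed: $*" >&2
    attempt=$((attempt + 1))
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay"
    fi
  done

  end="$(date +%s)"
  echo "giving up after ${max_attempts} attempts and $((end - start))s: $*" >&2
  return 1
}
```

Usage would be along the lines of `execute_with_retries apt-get install -y some-package`, with the package name standing in for whatever the installer needs. Logging goes to stderr so that the wrapped command's stdout stays clean for callers that capture it.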