Document ML-image tag/GPU type/CUDA compatibility table #390

Closed
scottyhq opened this issue Oct 14, 2022 · 11 comments · Fixed by #398

Comments

@scottyhq
Member

I think it would be very useful to have a table of CUDA versions and the Pangeo Docker images they are compatible with.

from @yuvipanda in #387 (comment)

The images have cuda libraries that are only compatible with certain GPUs (K80, T4, etc)

cuda-nvcc==11.6.124
cudatoolkit==11.7.0

It would be great to document this in the README. @ngam or @weiji14, any chance you'd like to create a short compatibility table? Or add a short note linking to the relevant NVIDIA docs?
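
For anyone who wants to check for themselves, running something like this in a terminal inside the image (a rough sketch, not an official recipe) lists the CUDA-related packages it ships:

    conda list | grep -i cuda    # shows cudatoolkit, cuda-nvcc, etc. in the active environment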

@weiji14
Member

weiji14 commented Oct 14, 2022

The K80s (Kepler generation, 2014) are a special case, since newer GPU generations (Maxwell onwards, 2015+) have better forward compatibility between the CUDA driver and cudatoolkit, see https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatibility-title. Given the discussion in 2i2c-org/infrastructure#1765 on updating all the Pangeo hub clusters to T4s (can't say I didn't see this coming), I think we can probably just leave it at that (unless we expect to be supporting K80s for much longer).

Pro tip: the general rule of thumb I follow is to always have the latest CUDA driver version, which should work with any cudatoolkit (CTK) version 6.5 or above (though I recommend at least cudatoolkit 9.0, which should work with V100s or above). However, you can't go above CUDA driver 470.57 for Tesla K80s 🙃

[screenshot: NVIDIA CUDA driver / cudatoolkit compatibility table]
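
Before picking a cudatoolkit/image version, it's worth checking what driver the node actually exposes (a rough sketch; the query fields are standard nvidia-smi options):

    nvidia-smi --query-gpu=name,driver_version --format=csv
    # a K80 node should report a driver in the 470.xx branch or older,
    # while T4s (and newer GPUs) can run more recent driver branches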

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Oct 14, 2022
Brings in pangeo-data/pangeo-docker-images#389

Based on pangeo-data/pangeo-docker-images#390,
start making T4 the default. Folks can still use K80 if they want.

This makes it easier to use CUDA based GPU accelerated code.

Follow-up to 2i2c-org#1766
@yuvipanda
Member

I'd say in that case the suggestion would be to document preferred GPUs, and say folks should prefer T4 (if that is actually the case - I don't know!). I'm not super well versed in GPU usage, so I would love some guidance that says 'just use these GPUs' from folks in the know.

@ngam
Contributor

ngam commented Oct 14, 2022

The only issue we are facing here is the driver version (something @yuvipanda fixed in other PRs). There is nothing wrong with the images themselves, just a discrepancy between the ptxas version provided by cuda-nvcc and the hardware drivers available. Downgrading cuda-nvcc like above seems to fix it, with no other changes needed. This is okay, and will likely introduce no problems, because we are only getting some small binaries from cuda-nvcc (specifically ptxas) for XLA (jax + tensorflow) to work correctly.

The issue boils down to the efficiency of compiling the computational graph: XLA throws an error when there is a discrepancy, but it really ought not to, because the discrepancy only affects parallel compilation (which is not that big of a deal for small projects). This will be fixed upstream in jax soon. It doesn't affect the tensorflow portion (as far as I can tell, because they handle this in a slightly different way).

TLDR:

  • don't worry about the specifics of cuda-nvcc; think of the image as if it didn't include cuda-nvcc at all, since it is just a temporary insertion until we find a better solution (e.g. I've asked for ptxas to be added to the regular cudatoolkit, jax will likely bundle ptxas in one way or another, etc.); see the quick check right after this list
  • the rest of the image is tightly controlled by conda-forge's global pinning
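
As a quick check of what cuda-nvcc actually contributes here (just a sketch; assumes the image's conda environment is active), the only binary that matters is ptxas:

    which ptxas       # should resolve to the conda environment inside the image
    ptxas --version   # the version jax/XLA compares against the driver's CUDA version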

--

On T4 vs K80, I don't know much more than you do, even though I dabble in GPUs regularly. It is a crazy world. I would say the following:

  • if T4 is actually cost-effective (from the table rabernat shared, it seemed to be), then it is a really good GPU with modern compute specs (SM75). K80 is just old at this point, so unless you somehow have to make use of it (e.g. grants specifying K80, getting some sort of a deal, etc.), I'd just drop it.
  • All of the drama here is about jax (and a small part of tensorflow); this should not affect others, so we could document this corner-case issue and give users enough tools to fix it themselves (e.g. conda install -c nvidia cuda-nvcc==1.2.3.4.* or setting an env variable to disable parallel compilation; a rough sketch of both follows below)
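
A rough sketch of those two user-level workarounds (the version and the flag are illustrative; double-check them against your jax version and driver):

    conda install -c nvidia "cuda-nvcc=11.6"   # pin nvcc/ptxas to a release that matches the node's driver
    export XLA_FLAGS="--xla_gpu_force_compilation_parallelism=1"   # or: turn off XLA's parallel compilation path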

@ngam
Contributor

ngam commented Oct 14, 2022

@weiji14 do you have access to these hubs? If not, the best person to document this is going to be @dhruvbalwada who will have hands-on experience with the systems/hubs in question.

@weiji14
Member

weiji14 commented Oct 14, 2022

@weiji14 do you have access to these hubs? If not, the best person to document this is going to be @dhruvbalwada who will have hands-on experience with the systems/hubs in question.

If you mean the m2lines hubs, then no. Probably best to ask @dhruvbalwada.

@scottyhq
Member Author

Thanks for that table @weiji14. I'm embarrassed to say I had to go to https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#Tesla to realize that the first letter of the short name corresponds to "Hardware Generation"...

K80 = Kepler
T4 = Turing
V100 = Volta

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Oct 18, 2022
- The bug we reported upstream to eksctl has been fixed! So eksctl
  is now responsible for setting up the GPU driver, not us!
  eksctl-io/eksctl#5277. Yay for fixing
  things upstream! This would also mean that eksctl is responsible
  for keeping these versions up to date, and not us. We bump up the
  required eksctl version to account for this.
- Based on pangeo-data/pangeo-docker-images#390
  and many other discussions (linked to from there), NVIDIA T4s are
  now preferred over older K80s. We update the AWS GPU docs to
  recognize this.
- Add PyTorch & Tensorflow images as options to the GPU profile here,
  so end users can choose!

Fixes 2i2c-org#1784
yuvipanda added a commit to yuvipanda/eksctl that referenced this issue Oct 18, 2022
0.9 is a bit over 18 months old at this point. Matching versions
of drivers, CUDA and packages can be a bit difficult with
older versions (see, for example pangeo-data/pangeo-docker-images#390).
@dhruvbalwada
Member

dhruvbalwada commented Oct 18, 2022

TBH I am very confused.

Here is the original problem (#387) - we had to downgrade the cuda-nvcc to match the driver version that is available. This was because of the XLA issue that ngam mentions above.

@yuvipanda - what is setting the driver version that we end up with (the version that shows up when one calls nvidia-smi)? Can this not be higher than 11.6 (latest), if we drop support for K80s?

@weiji14
Member

weiji14 commented Oct 18, 2022

Just to explain things a little bit: the CUDA driver is set by the system administrator (i.e. @yuvipanda), not by conda; conda sits in the pangeo-docker-image layer (i.e. the one @scottyhq is maintaining) on top of the driver. All those low-level CUDA libraries (cuda-nvcc and cudatoolkit) rely on this driver, and the high-level deep learning frameworks (e.g. jax, pytorch, etc.) are built on top. Maybe this diagram will help to make sense of it:

[diagram: GPU hardware → CUDA driver → CUDA libraries (cudatoolkit, cuda-nvcc) → deep learning frameworks (jax, pytorch, tensorflow)]

So the typical way to fix these CUDA-related problems is to start from the bottom up like so:

  1. Sort out cuda-driver version incompatibilities between K80 and T4 GPU hardware (which only @yuvipanda can fix at the sysadmin level). My recommendation is to use the highest driver version supported by the GPU hardware whenever possible, so that more software library versions (newer or older) can work. For K80s, the maximum driver version should be 470.57.02 according to the table in the earlier comment on this issue (#390 (comment)), whereas for T4 GPUs you could go up to 510.xx.xx+ if needed.
  2. Problems with cuda-nvcc and jax are fixed in the docker image (i.e. this pangeo-docker-image, or @scottyhq). There was some discussion about tying cuda-nvcc to a particular cudatoolkit version (see Optional cudatoolkit dependency, conda-forge/nvcc-feedstock#14), and @ngam can probably comment more on conda-forge's pinning strategies, but really these are all compatibility bugs that will eventually be resolved on the software side (e.g. in tensorflow, jax, etc.).
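
A bottom-up sanity check along those lines might look roughly like this (a sketch; the commands are standard, adjust to your setup):

    nvidia-smi                                       # 1. driver layer: what the node/sysadmin provides
    conda list cuda                                  # 2. library layer: cudatoolkit / cuda-nvcc inside the image
    python -c "import jax; print(jax.devices())"     # 3. framework layer: confirm jax actually sees the GPU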

Now, for the end-user scientist who just wants things to work in Oct 2022: go with something that is about two versions behind. That means:

  • T4 GPU (compute capability 7.5) instead of H100 GPU (compute capability 9.x)
  • cudatoolkit ~11.6, instead of the latest 11.8
  • Python 3.9 instead of the about-to-be-released Python 3.11
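
For example (purely illustrative versions, as of Oct 2022), a conservative environment following that rule of thumb could be pinned like:

    conda create -n gpu-conservative -c conda-forge python=3.9 cudatoolkit=11.6   # stay ~2 versions behind the latest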

But if you know what you're doing (and have your head around all these cuda-terms), then go wild 😆

@dhruvbalwada
Member

Thank you @weiji14, this is extremely helpful - glad to be learning more about this.

@yuvipanda
Member

@dhruvbalwada it's set by whatever version Google supports, and we're currently at the latest version they support! https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers has the table. T4 and K80 are handled separately, so dropping support for K80 won't do much for T4 here.

@ngam
Contributor

ngam commented Oct 18, 2022

An alternative is to build all GPU-related images on top of NGC containers. The "devel" containers have all the tools, but I don't know about the licensing, and obviously they're hugely wasteful as base images.
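
For example (tag purely illustrative; the exact tags would need checking against NGC / Docker Hub), a devel-flavoured CUDA base image would be pulled as:

    docker pull nvcr.io/nvidia/cuda:11.6.2-devel-ubuntu20.04   # ships nvcc, ptxas, headers, etc., but adds several GB to the base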
