Document ML-image tag/GPU type/CUDA compatibility table #390

Closed
scottyhq opened this issue Oct 14, 2022 · 11 comments · Fixed by #398

Comments

@scottyhq
Member

I think it would be very useful to have a table of CUDA versions and the Pangeo Docker images they are compatible with.

from @yuvipanda in #387 (comment)

The images have cuda libraries that are only compatible with certain GPUs (K80, T4, etc)

cuda-nvcc==11.6.124
cudatoolkit==11.7.0

It would be great to document this in the README. @ngam or @weiji14, any chance you'd like to create a short compatibility table? Or add a short note linking to the relevant NVIDIA docs?
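
For anyone who wants to check for themselves, running something like this in a terminal inside the image (a rough sketch, not an official recipe) lists the CUDA-related packages it ships:

    conda list | grep -i cuda    # shows cudatoolkit, cuda-nvcc, etc. in the active environment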

@weiji14
Member

weiji14 commented Oct 14, 2022

The K80s (Kepler generation, 2014) are a special case, since newer GPU generations (Maxwell onwards, 2015+) have better forward compatibility between the CUDA driver and cudatoolkit, see https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatibility-title. Given the discussion in 2i2c-org/infrastructure#1765 on updating all the Pangeo hub clusters to T4s (can't say I didn't see this coming), I think we can probably just leave it at that (unless we expect to be supporting K80s for much longer).

Pro tip: the general rule of thumb I follow is to always have the latest CUDA driver version, which should work with any cudatoolkit (CTK) version 6.5 or above (though I recommend at least cudatoolkit 9.0, which should work with V100s or above). However, you can't go above CUDA driver 470.57 for Tesla K80s 🙃

[screenshot: NVIDIA CUDA driver / cudatoolkit compatibility table]
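
Before picking a cudatoolkit/image version, it's worth checking what driver the node actually exposes (a rough sketch; the query fields are standard nvidia-smi options):

    nvidia-smi --query-gpu=name,driver_version --format=csv
    # a K80 node should report a driver in the 470.xx branch or older,
    # while T4s (and newer GPUs) can run more recent driver branches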

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Oct 14, 2022
Brings in pangeo-data/pangeo-docker-images#389

Based on pangeo-data/pangeo-docker-images#390,
start making T4 the default. Folks can still use K80 if they want.

This makes it easier to use CUDA based GPU accelerated code.

Follow-up to 2i2c-org#1766
@yuvipanda
Member

I'd say in that case the suggestion would be to document preferred GPUs, and say folks should prefer T4 (if that is actually the case - I don't know!). I'm not super well versed in GPU usage, so I would love some guidance that says 'just use these GPUs' from folks in the know.

@ngam
Contributor

ngam commented Oct 14, 2022

The only issue we are facing here is the driver version (something @yuvipanda fixed in other PRs). There is nothing wrong with the images themselves, just a discrepancy between the ptxas version provided by cuda-nvcc and the hardware drivers available. Downgrading cuda-nvcc like above seems to fix it, with no other changes needed. This is okay, and will likely introduce no problems, because we are only getting some small binaries from cuda-nvcc (specifically ptxas) for XLA (jax + tensorflow) to work correctly.

The issue boils down to the efficiency of compiling the computational graph: XLA throws an error when there is a discrepancy, but it really ought not to, because the discrepancy only affects parallel compilation (which is not that big of a deal for small projects). This will be fixed upstream in jax soon. It doesn't affect the tensorflow portion (as far as I can tell, because they handle this in a slightly different way).

TLDR:

  • don't worry about the specifics of cuda-nvcc; think of the image as if it didn't include cuda-nvcc at all, since it is just a temporary insertion until we find a better solution (e.g. I've asked for ptxas to be added to the regular cudatoolkit, jax will likely bundle ptxas in one way or another, etc.); see the quick check right after this list
  • the rest of the image is tightly controlled by conda-forge's global pinning
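
As a quick check of what cuda-nvcc actually contributes here (just a sketch; assumes the image's conda environment is active), the only binary that matters is ptxas:

    which ptxas       # should resolve to the conda environment inside the image
    ptxas --version   # the version jax/XLA compares against the driver's CUDA version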

--

On T4 vs K80, I don't know much more than you do, even though I dabble in GPUs regularly. It is a crazy world. I would say the following:

  • if T4 is actually cost-effective (from the table rabernat shared, it seemed to be), then it is a really good GPU with modern compute specs (SM75). K80 is just old at this point, so unless you somehow have to make use of it (e.g. grants specifying K80, getting some sort of a deal, etc.), I'd just drop it.
  • All of the drama here is about jax (and a small part of tensorflow); this should not affect others, so we could document this corner-case issue and give users enough tools to fix it themselves (e.g. conda install -c nvidia cuda-nvcc==1.2.3.4.* or setting an env variable to disable parallel compilation; a rough sketch of both follows below)
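
A rough sketch of those two user-level workarounds (the version and the flag are illustrative; double-check them against your jax version and driver):

    conda install -c nvidia "cuda-nvcc=11.6"   # pin nvcc/ptxas to a release that matches the node's driver
    export XLA_FLAGS="--xla_gpu_force_compilation_parallelism=1"   # or: turn off XLA's parallel compilation path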

@ngam
Contributor

ngam commented Oct 14, 2022

@weiji14 do you have access to these hubs? If not, the best person to document this is going to be @dhruvbalwada who will have hands-on experience with the systems/hubs in question.

@weiji14
Member

weiji14 commented Oct 14, 2022

@weiji14 do you have access to these hubs? If not, the best person to document this is going to be @dhruvbalwada who will have hands-on experience with the systems/hubs in question.

If you mean the m2lines hubs, then no. Probably best to ask @dhruvbalwada.

@scottyhq
Member Author

Thanks for that table @weiji14. I'm embarrassed to say I had to go to https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#Tesla to realize that the first letter of the short name corresponds to "Hardware Generation"...

K80 = Kepler
T4 = Turing
V100 = Volta

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Oct 18, 2022
- The bug we reported upstream to eksctl has been fixed! So eksctl
  is now responsible for setting up the GPU driver, not us!
  eksctl-io/eksctl#5277. Yay for fixing
  things upstream! This would also mean that eksctl is responsible
  for keeping these versions up to date, and not us. We bump up the
  required eksctl version to account for this.
- Based on pangeo-data/pangeo-docker-images#390
  and many other discussions (linked to from there), NVIDIA T4s are
  now preferred over older K80s. We update the AWS GPU docs to
  recognize this.
- Add PyTorch & Tensorflow images as options to the GPU profile here,
  so end users can choose!

Fixes 2i2c-org#1784
yuvipanda added a commit to yuvipanda/eksctl that referenced this issue Oct 18, 2022
0.9 is a bit over 18 months old at this point. Matching versions
of drivers, CUDA and packages can be a bit difficult with
older versions (see, for example pangeo-data/pangeo-docker-images#390).
@dhruvbalwada
Member

dhruvbalwada commented Oct 18, 2022

TBH I am very confused.

Here is the original problem (#387) - we had to downgrade the cuda-nvcc to match the driver version that is available. This was because of the XLA issue that ngam mentions above.

@yuvipanda - what is setting the driver version that we end up with (the version that shows up when one calls nvidia-smi)? Can this not be higher than 11.6 (latest), if we drop support for K80s?

@weiji14
Member

weiji14 commented Oct 18, 2022

Just to explain things a little bit: the CUDA driver is set by the system administrator (i.e. @yuvipanda), not by conda; conda sits in the pangeo-docker-image layer (i.e. the one @scottyhq is maintaining) on top of the driver. All those low-level CUDA libraries (cuda-nvcc and cudatoolkit) rely on this driver, and the high-level deep learning frameworks (e.g. jax, pytorch, etc.) are built on top. Maybe this diagram will help to make sense of it:

[diagram: GPU hardware → CUDA driver → CUDA libraries (cudatoolkit, cuda-nvcc) → deep learning frameworks (jax, pytorch, tensorflow)]

So the typical way to fix these CUDA-related problems is to start from the bottom up like so:

  1. Sort out cuda-driver version incompatibilities between K80 and T4 GPU hardware (which only @yuvipanda can fix at the sysadmin level). My recommendation is to use the highest driver version supported by the GPU hardware whenever possible, so that more software library versions (newer or older) can work. For K80s, the maximum driver version should be 470.57.02 according to the table in the earlier comment on this issue (#390 (comment)), whereas for T4 GPUs you could go up to 510.xx.xx+ if needed.
  2. Problems with cuda-nvcc and jax are fixed in the docker image (i.e. this pangeo-docker-image, or @scottyhq). There was some discussion about tying cuda-nvcc to a particular cudatoolkit version (see Optional cudatoolkit dependency, conda-forge/nvcc-feedstock#14), and @ngam can probably comment more on conda-forge's pinning strategies, but really these are all compatibility bugs that will eventually be resolved on the software side (e.g. in tensorflow, jax, etc.).
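
A bottom-up sanity check along those lines might look roughly like this (a sketch; the commands are standard, adjust to your setup):

    nvidia-smi                                       # 1. driver layer: what the node/sysadmin provides
    conda list cuda                                  # 2. library layer: cudatoolkit / cuda-nvcc inside the image
    python -c "import jax; print(jax.devices())"     # 3. framework layer: confirm jax actually sees the GPU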

Now, for the end-user scientist who just wants things to work in Oct 2022: go with something that is about two versions behind. That means:

  • T4 GPU (compute capability 7.5) instead of H100 GPU (compute capability 9.x)
  • cudatoolkit ~11.6, instead of the latest 11.8
  • Python 3.9 instead of the about-to-be-released Python 3.11
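
For example (purely illustrative versions, as of Oct 2022), a conservative environment following that rule of thumb could be pinned like:

    conda create -n gpu-conservative -c conda-forge python=3.9 cudatoolkit=11.6   # stay ~2 versions behind the latest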

But if you know what you're doing (and have your head around all these cuda-terms), then go wild 😆

@dhruvbalwada
Member

Thank you @weiji14, this is extremely helpful - glad to be learning more about this.

@yuvipanda
Member

@dhruvbalwada it's set by whatever version Google supports, and we're currently at the latest version they support! https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers has the table. T4 and K80 are handled separately, so dropping support for K80 won't do much for T4 here.

@ngam
Contributor

ngam commented Oct 18, 2022

An alternative is to build all GPU-related images on top of NGC containers. The "devel" containers have all the tools, but I don't know about the licensing, and obviously they're hugely wasteful as base images.
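
For example (tag purely illustrative; the exact tags would need checking against NGC / Docker Hub), a devel-flavoured CUDA base image would be pulled as:

    docker pull nvcr.io/nvidia/cuda:11.6.2-devel-ubuntu20.04   # ships nvcc, ptxas, headers, etc., but adds several GB to the base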
