Document ML-image tag/GPU type/CUDA compatibility table #390
Comments
The K80s (Kepler generation, 2014) are a special case, since newer-generation NVIDIA GPUs (Maxwell onwards, 2015+) have drivers with better forward compatibility with cudatoolkit, see https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatibility-title. Given the discussion in 2i2c-org/infrastructure#1765 on updating all the Pangeo hub clusters to T4s (can't say I didn't see this coming), I think we can probably just leave it at that (unless we expect to be supporting K80s for much longer). Pro tip: the general rule of thumb I follow is to always have the latest CUDA driver version, which should work with any cudatoolkit (CTK) version 6.5 or above (though I recommend at least cudatoolkit 9.0, which should work with V100s or above). However, you can't go above CUDA driver 470.57 for Tesla K80s 🙃
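For illustration, here is a minimal Python sketch (the helper is hypothetical, not something shipped in the images) that reads the installed driver version, so you can tell whether you are stuck on the Kepler-era 470.xx branch mentioned above:

```python
import subprocess

def driver_version() -> tuple:
    """Return the NVIDIA driver version reported by nvidia-smi as (major, minor)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    major, minor, *_ = out.strip().splitlines()[0].split(".")
    return (int(major), int(minor))

# K80s cannot go past driver 470.57, so anything requiring a newer driver
# (and hence a newer cudatoolkit build) will not work on them.
if driver_version() <= (470, 57):
    print("Kepler-era driver branch: pin cudatoolkit to a version it supports.")
```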
Brings in pangeo-data/pangeo-docker-images#389. Based on pangeo-data/pangeo-docker-images#390, start making T4 the default. Folks can still use K80 if they want. This makes it easier to use CUDA-based GPU-accelerated code. Follow-up to 2i2c-org#1766
I'd say in that case the suggestion would be to document preferred GPUs, and say folks should prefer T4 (if that is actually the case - I don't know!). I'm not super well versed in GPU usage, so would love some guidance that says 'just use these GPUs' from folks in the know.
The only issue we are facing here is the driver version (something @yuvipanda fixed in other PRs). There is nothing wrong with the images themselves, just a discrepancy between the ptxas version shipped by cuda-nvcc and the hardware drivers available; downgrading cuda-nvcc to match the driver works around it. The issue boils down to the efficiency of compiling the computational graph: XLA throws an error when there is a discrepancy, but it really ought not to, because the discrepancy only affects parallel compilation (which is not that big of a deal for small projects). This will be fixed by jax upstream soon. This doesn't affect the tensorflow portion (as far as I could tell, because they use a slightly different way of doing it). TL;DR: the images are fine, the error is overly strict, and an upstream jax fix is coming.
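As a stopgap, one workaround mentioned in the upstream jax/XLA issue threads (my reading of them, so verify against your jax version) is to disable XLA's parallel compilation, since that is the only path the version check affects:

```python
import os

# Must be set before jax initializes its GPU backend. Forcing single-threaded
# compilation sidesteps the ptxas-vs-driver version check described above.
os.environ["XLA_FLAGS"] = "--xla_gpu_force_compilation_parallelism=1"

import jax

print(jax.devices())  # should list the GPU without the ptxas version error
```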
On T4 vs K80, I don't know much more than you do, even though I dabble in GPUs regularly. It is a crazy world. I would say the following:
@weiji14 do you have access to these hubs? If not, the best person to document this is going to be @dhruvbalwada, who will have hands-on experience with the systems/hubs in question.
If you mean the m2lines hubs, then no. Probably best to ask @dhruvbalwada.
Thanks for that table @weiji14. I'm embarrassed to say I had to go to https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#Tesla to realize the first letter of the short name corresponds to "Hardware Generation"...
- The bug we reported upstream to eksctl has been fixed! So eksctl is now responsible for setting up the GPU driver, not us: eksctl-io/eksctl#5277. Yay for fixing things upstream! This would also mean that eksctl is responsible for keeping these versions up to date, and not us. We bump up the required eksctl version to account for this.
- Based on pangeo-data/pangeo-docker-images#390 and many other discussions (linked to from there), NVIDIA T4s are now preferred over the older K80s. We update the AWS GPU docs to recognize this.
- Add PyTorch & TensorFlow images as options to the GPU profile here, so end users can choose!

Fixes 2i2c-org#1784
0.9 is a bit over 18 months old at this point. Matching versions of drivers, CUDA and packages can be a bit difficult with older versions (see, for example, pangeo-data/pangeo-docker-images#390).
TBH I am very confused. Here is the original problem (#387): we had to downgrade cuda-nvcc to match the driver version that is available. This was because of the XLA issue that @ngam mentions above. @yuvipanda, what is setting the driver version that we end up with (the version that shows up when one calls nvidia-smi)? Can this not be higher than 11.6 (the latest), if we drop support for K80s?
Just to explain things a little bit: the typical way to fix these CUDA-related problems is to start from the bottom of the stack and work up, like so:

1. Check what GPU hardware you have, since that caps the maximum driver version (e.g. 470.57 for K80s).
2. Check the NVIDIA driver version reported by nvidia-smi, and upgrade it as far as the hardware allows.
3. Pick a cudatoolkit (CTK) version that the driver supports.
4. Install deep learning libraries (jax/tensorflow/pytorch) built against that cudatoolkit version.

A quick way to see all of these layers at once is sketched below.
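For illustration, a rough Python sketch (mine, not something in the images) that prints each layer of the stack so you can spot where the mismatch is:

```python
import subprocess

# Layers 1 and 2: the GPU hardware and the installed driver, via nvidia-smi.
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv"],
    text=True,
))

# Layers 3 and 4: the CUDA version a library was built against, and whether
# it can actually talk to the GPU (assumes a pytorch-enabled image).
import torch

print("torch built against CUDA:", torch.version.cuda)
print("GPU usable from torch:", torch.cuda.is_available())
```

If the driver line and the "built against" line disagree by too much, that is usually the mismatch to fix first.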
Now, for the end-user scientist who just wants things to work in Oct 2022: just go with something that is about two versions behind the latest. That means:
But if you know what you're doing (and have your head around all these CUDA terms), then go wild 😆
Thank you @weiji14, this is extremely helpful - glad to be learning more about this.
@dhruvbalwada it's set by whatever version Google supports, and we're currently at the latest version they support! https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers has the table. T4 and K80 are handled separately, so dropping support for K80 won't do much for T4 here.
An alternative is to build all GPU-related images on top of NGC containers. The "devel" containers have all the tools, but I don't know about the licensing, and obviously they're a huge waste as base images.
from @yuvipanda in #387 (comment)
The images have CUDA libraries that are only compatible with certain GPUs (K80, T4, etc.):
pangeo-docker-images/ml-notebook/packages.txt, lines 96-97 at commit 118d497
Would be great to document this in the README. @ngam or @weiji14, any chance you'd like to create a short compatibility table? Or add a short message and link to the relevant NVIDIA docs?
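As a starting point, here is a rough sketch of what such a table might look like, using only the constraints mentioned in this thread (the exact cutoffs should be verified against the NVIDIA compatibility docs before this goes in the README):

| GPU | Hardware generation | Driver / cudatoolkit notes |
|-----|---------------------|----------------------------|
| K80 | Kepler (2014) | Driver capped at 470.57; limited forward compatibility with newer cudatoolkit |
| V100 | Volta (2017) | Latest drivers OK; cudatoolkit >= 9.0 |
| T4 | Turing (2018) | Latest drivers OK; preferred going forward |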