CUDA 11 cannot be supported #536
Comments
We're updating our docker images, and we're also still in the process of fully testing torch 1.7 + cuda 11. In the meantime, can you try this docker image we just pushed to Docker Hub? The core error I am seeing here seems to come from the Megatron example code, which tries to load an extension from Apex for fused layer norm. You might try re-installing apex via these steps: https://github.com/nvidia/apex#linux
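For convenience, a sketch of the Linux install steps from the apex README linked above (commands as given there; run inside the container):

```shell
# Reinstall apex with its C++/CUDA extensions enabled, per the NVIDIA apex
# README. The --cpp_ext/--cuda_ext flags build the compiled ops, including
# the fused layer-norm extension the Megatron example tries to load.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir \
    --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```

Without the two `--global-option` flags you get a Python-only install, which is why a seemingly successful `pip install apex` can still leave the fused-layer-norm extension missing.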
The above image I linked does not include apex, since deepspeed core does not require it anymore. However, the Megatron example does rely on a few cuda/cpp extensions that are included in apex. I've pushed another docker image that installs the latest apex as well; you can grab it here:
The dockerfiles for these two images are below (the 2nd image just adds the apex lines at the end):
Thank you for your answer. I will try it and look forward to a good result.
@jeffra Hello, I used the docker image you gave above. When testing the Megatron example, I got an error running bash scripts/pretrain_gpt2.sh on a single GPU:
When I ran 'bash scripts/ds_zero-offload_10B_pretrain_gpt2_model_parallel.sh' I got another error:
I think these are version-related problems, but I don't know how to resolve them. Could you give me some suggestions? Thank you very much.
Hi @hujian233, I see. I just realized this is compute capability 8.0 (Ampere). The previous image I linked pre-builds our cpp/cuda extensions. Let's try an image where the ops will be built just-in-time (JIT) instead; I've pushed it here: We're still doing initial testing on Ampere with DeepSpeed. I just tested the previous image on an A100 and saw a similar (but not identical) error. After switching to the JIT-compiled version it worked.
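As a sketch of the difference between the two image flavors (the `DS_BUILD_OPS` switch is DeepSpeed's install-time toggle; the exact contents of the linked images are an assumption):

```shell
# DeepSpeed compiles its C++/CUDA ops just-in-time (JIT) by default, at
# first use, against the GPU it is actually running on. Pre-building
# happens at install time instead, e.g. (requires the CUDA toolkit):
#
#   DS_BUILD_OPS=1 pip install deepspeed
#
export DS_BUILD_OPS=0   # 0 / unset = JIT-compile ops at first use
echo "DS_BUILD_OPS=$DS_BUILD_OPS"
```

JIT builds sidestep the kind of mismatch seen here, because the ops are compiled for the compute capability of the local GPU rather than for whatever architectures the image happened to be built against.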
Hi @jeffra, thanks for your answer. I actually solved the above two problems yesterday, although I don't know why. I just did this:
For the second problem, I set the docker shm-size to 2048m and then the error was gone.
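A minimal sketch of where that setting goes (the image name below is a placeholder, not the image from this thread; the command is echoed rather than executed so the sketch is side-effect free):

```shell
# Docker's default /dev/shm is only 64 MB, which is too small for PyTorch
# DataLoader workers and NCCL shared-memory transport; enlarge it at run time.
DOCKER_SHM_SIZE=2048m
IMAGE=deepspeed/deepspeed:latest   # placeholder image name
echo docker run --gpus all --shm-size="$DOCKER_SHM_SIZE" -it "$IMAGE"
```

Drop the `echo` to actually start the container with the larger shared-memory segment.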
@jeffra hello, I installed the latest version of DeepSpeed, 0.3.4, and ran ds_zero-offload_10B_pretrain_gpt2_model_parallel.sh, and I got an error:
This error appeared after the upgrade; when I go back to 0.3.2 I still have it, as if I can't install directly via pip.
Hi @hujian233, in some of my local testing with this image I am also seeing strange issues with pip and conda. I believe my issue was related to the docker build installing deepspeed as root, while at runtime (in my environment at least) I am running as a user without write permissions to
@jeffra, I can't use it. This is the log from before:
@jeffra Hello, I now have some idea of the reason. I can now use it. The following is a demonstration with two different versions of environment variables: my guess is that with torch 1.7.0 and cuda 11, the cpp/cuda extensions can be built on your V100 GPU but cannot be built on my 3090 GPU. I'll find out why, but could you first send me a docker image of the latest version that contains the pre-built cpp/cuda extensions? I would also recommend pre-building the extensions when you package the images, so that people can choose whether or not to recompile via JIT. Looking forward to your reply. Thank you very much.
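One plausible mechanism for the V100-vs-3090 difference (a sketch, not something stated in this thread): the V100 is compute capability 7.0 while the RTX 3090 is 8.6, and compiled extensions only load on architectures they were built for. PyTorch's extension builder honors the `TORCH_CUDA_ARCH_LIST` environment variable, so a pre-build meant to cover both cards would set something like:

```shell
# Compile PyTorch/DeepSpeed CUDA extensions for both Volta (V100, sm_70)
# and Ampere (RTX 3090, sm_86). A build that only targets 7.0 will fail
# to load on a 3090, and Ampere targets require CUDA 11.x.
export TORCH_CUDA_ARCH_LIST="7.0;8.6"
echo "$TORCH_CUDA_ARCH_LIST"
```

Set this before `pip install` (or before the JIT compile is triggered) so the generated binaries include kernels for every GPU you intend to run on.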
@hujian233 there's a recent PR from @stas00 that might help here as well. Can you give it a try?
Actually, I don't believe the previously linked PR is related here. However, I believe PR #572 should fix your issues.
@jeffra Hi, sounds great. I tried the same thing last night using cuda 11.1 and torch 1.8.0 without the version check, and it compiled successfully. I will try the latest code, thank you very much.
@jeffra I am excited. Yeah, it worked very well on the RTX 3090 with cuda 11.0 and pytorch 1.8.0 or pytorch 1.7.0. This question can be closed. Thanks again.
I wanted to run DeepSpeed on an RTX 3090, which supports only CUDA 11. In your docker release I updated the PyTorch version to 1.7.0 and ran into an error:
How can I use it on the 3090?