[deepspeed checkpointing] AttributeError: 'NoneType' object has no attribute 'numel' #598
Hi @g-karthik, that is an unfriendly way of reporting that model parallelism is not enabled. I will file an issue to catch and report the configuration error when model parallelism is not enabled.
Ah I see, thanks @ShadenSmith! I made those changes.
Btw, as a more foundational question, why couldn't we have upfront contiguous memory allocation for activations in the case of no model parallelism? Same question for offloading activations to CPU with `cpu_checkpointing`. I understand the behavior of `partition_activations` depends on model parallelism, but it's not clear why the other features should.
Hm, can you share a bit about the model code that uses checkpointing? My first response still applies, but now I'm wondering if there's a bigger issue that we're not catching.
@ShadenSmith yep, this is kinda what it looks like inside the model's `forward`.

I assumed deepspeed's activation checkpointing was a drop-in replacement for `torch.utils.checkpoint.checkpoint`.
Ah @ShadenSmith I think I know what the issue is, but correct me if I'm wrong. The default `torch.utils.checkpoint.checkpoint` tolerates args that aren't tensors, including `None`. However, in the corresponding deepspeed checkpointing code, the checkpointed args are enumerated and `.numel()` is called on each of them, which fails when an arg is `None`. Perhaps these enumerations need to be modified and made more generic to account for cases where a layer being checkpointed could have some args as `None`.

EDIT: Yep, I ran a test where I simply replaced the `None` args and confirmed this is the problem.

Also, do you happen to have performed any side-by-side benchmarking of deepspeed's activation checkpointing against `torch.utils.checkpoint`?
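For illustration, here is a minimal sketch of the pattern being described, with hypothetical module names rather than the actual model from this thread: each block is wrapped with `deepspeed.checkpointing.checkpoint` as a drop-in for `torch.utils.checkpoint.checkpoint`, and one of the forwarded args may be `None`, which is the case that used to trip the `.numel()` call.

```python
import torch
import deepspeed


class Block(torch.nn.Module):
    """Toy stand-in for a transformer block whose forward takes an optional mask."""

    def __init__(self, hidden):
        super().__init__()
        self.ff = torch.nn.Linear(hidden, hidden)

    def forward(self, hidden_states, attention_mask=None):
        if attention_mask is not None:
            hidden_states = hidden_states + attention_mask
        return torch.nn.functional.gelu(self.ff(hidden_states))


class Model(torch.nn.Module):
    def __init__(self, hidden=64, num_layers=4):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(hidden) for _ in range(num_layers))

    def forward(self, hidden_states, attention_mask=None):
        for block in self.blocks:
            # Drop-in swap for torch.utils.checkpoint.checkpoint. Note that
            # attention_mask may be None here, which is what used to raise
            # AttributeError: 'NoneType' object has no attribute 'numel'.
            hidden_states = deepspeed.checkpointing.checkpoint(
                block, hidden_states, attention_mask
            )
        return hidden_states
```

Depending on the DeepSpeed version, `deepspeed.checkpointing.configure` may need to be called first so the checkpointing settings are picked up from the DS config; see the discussion of the config flags further down.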
@ShadenSmith happy new year! Just bumping this back up your notifications!
Hey @g-karthik, happy new year and thanks for the ping! I hope you had a nice holiday. I spent some time away to focus on python scalability :-). Thanks for the deep dive on debugging this issue. I think you're right. DeepSpeed's activation checkpointing should work as a drop-in replacement in the simple case without partitioning, etc. I don't know if we've done any benchmarking beyond large-scale runs with model parallelism, where it's critical for huge activations. Maybe @samyam knows more.
@ShadenSmith The I've been using
Hi @g-karthik I submitted a PR with a fix and corresponding unit test. Are you able to give that a try? It would be great to ensure that my new unit test covers your case.
I went ahead and merged the PR; please re-open if the issue is still present for you!
@ShadenSmith thanks for the fix! Will test this out later this week and revert back to you!
@ShadenSmith sorry for the delay, I did test this fix and it works, thanks a lot! Note that this is with Hugging Face's transformer models, without usage of any mpu object. Keeping all else equal, I do not see any meaningful difference between DeepSpeed's activation checkpointing and `torch.utils.checkpoint`.

I also have a couple of other questions: can upfront contiguous memory allocation for activations be used without model parallelism, and can activations be offloaded to CPU when `mpu` is `None`?
I'd asked these last two questions earlier on in the above thread but I didn't quite understand why they're tied to the mpu/model-parallel setup.
@ShadenSmith also take a look at this DS config JSON from the examples. Clearly, even when model-parallel degree = 1, features like `partition_activations`, `contiguous_memory_optimization`, and `cpu_checkpointing` are enabled. So shouldn't these be supported when `mpu` is `None`?
Hi @g-karthik, these features are supported if mpu=None. You should be able to turn on all of the above features you referred to without providing any mpu object, and it will default to model parallelism = 1. You can offload activation checkpoints to CPU with mpu=None; to do this you need to set both partition_activations and cpu_checkpointing to True, which can be done even when mpu is None. This is a superficial restriction in the code that we need to fix: to enable cpu_checkpointing you currently need to first enable partition_activations.
Regarding offload to CPU:
Sometimes it will help the memory allocator clean up the memory and avoid OOM or speed up memory allocations
It can be. You just need to set partition_activations to true as well. This is a superficial restriction, as you have noticed, and we need to fix it.
Can you provide some more context on this question?
This is a superficial restriction. You can get around it by just setting partition_activations to True even when running without an mpu. It will just assume MP=1, do the right thing, and allow you to offload your activations.
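To make that concrete, here is a minimal sketch of the relevant `activation_checkpointing` section of a DS config JSON (other sections such as batch size and optimizer are omitted, and the values are illustrative rather than taken from this thread):

```json
{
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": false,
    "synchronize_checkpoint_boundary": false,
    "profile": false
  }
}
```

With `partition_activations` and `cpu_checkpointing` both set to true and no mpu object passed, DeepSpeed falls back to MP=1 as described above, so activation checkpoints can be offloaded to CPU even without model parallelism.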
It should be supported with mpu=None. Are you running into errors?
@samyam thanks for your response! Yes, I am indeed running into errors when setting those flags to true, see: #284 (comment). This error is with the latest DeepSpeed. And as a reminder, I use Hugging Face's GPT-2.
@g-karthik, can you please share repro steps with HF GPT-2? I am trying to do that now.
@tjruwase I replied here: #284 (comment), we can continue this discussion on this thread since this issue has to do with deepspeed's activation checkpointing. The other thread is closely intertwined with this one though, since reproducing those claims requires deepspeed's activation checkpointing to work fine.
@g-karthik, thanks for reminding/showing me how to enable deepspeed.checkpointing in HF GPT2. However, after doing that I am still unable to repro the reported error. I am seeing some issues that suggest that you and I are running different code stacks. Below are my observations:
So how are you avoiding these two issues? Can you share more details of your repro setup? Below is my environment:
@tjruwase Indeed, my stack is different, as are my dependencies - I'm using torch 1.4, CUDA 10.1, transformers 3.1. I don't think it matters for this issue though. About 1 - I have a separate deepspeed_config.json file which has the activation_checkpointing section in it. As for 2 - I created a local copy of the HF transformers GPT-2 model code that calls deepspeed's activation checkpointing; I don't currently use the HF training scripts.
@tjruwase did you get a chance to work through this?
@g-karthik, sorry, due to our latest release I did not have bandwidth to explore the proposed workarounds. I will resume this work this week. Stay tuned.
So I took a public GPT-2 class implementation (not Megatron-LM) and I added deepspeed checkpointing to it for all 48 layers.
In my train script for this class, I added the following line:
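A hedged sketch of what such a line typically looks like, assuming the settings live in a `deepspeed_config.json` (the file name, and passing `None` for the mpu, are assumptions here rather than details from the original script):

```python
import deepspeed

# Sketch only: enable DeepSpeed activation checkpointing from the train script.
# The first argument is the mpu; None means no model parallelism (MP=1).
deepspeed.checkpointing.configure(None, deepspeed_config="deepspeed_config.json")
```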
My deepspeed config JSON is as follows:
When I try running my script, I get the following error: `AttributeError: 'NoneType' object has no attribute 'numel'`
Any ideas what's going on?
@tjruwase @ShadenSmith