Loss NAN for Deit Base #29
I also met this problem. I use 32 GPUs for training.
A simple solution for my case: I found that transformer training is sensitive to the learning rate. We must keep a very small learning rate during training (lr < 0.0015 works best for me); otherwise the gradient will become NaN. Another alternative is to disable AMP.
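For concreteness, "disabling AMP" here just means running the whole training step in FP32. A minimal sketch of one mixed-precision step where the `enabled` flag turns AMP off (the model / criterion / optimizer names are placeholders, not the repo's actual training loop):

```python
import torch

# Minimal sketch: set use_amp = False to run everything in FP32 and avoid fp16 overflow.
# model, criterion, optimizer, images, targets are placeholders, not the repo's objects.
use_amp = False
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

def train_step(model, criterion, optimizer, images, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        outputs = model(images)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()  # identity scaling when AMP is disabled
    scaler.step(optimizer)         # falls back to a plain optimizer.step() when disabled
    scaler.update()
    return loss.item()
```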
Hi, I've just run the default command-line for DeiT base with the state of the codebase as of yesterday on 16 GPUs, and training is going fine so far (we are already at epoch 100 without issues). We are using PyTorch 1.7.0 with CUDA 10.1, and the default hyperparameters work fine for us on this setup. Can you share your PyTorch / torchvision / CUDA versions?
@vtddggg Thanks for your advice, it works for me.
Hi @vtddggg @ChengyueGongR, I have also encountered the NaN issue. I wonder if there is an easy way to disable AMP?
I met the same problem.
Hi, I met the NaN problem. My versions are:
Same issue with torch 1.7.1, CUDA 11.2, torchvision 0.8.2.
Changing the learning rate of the base model degrades its performance by a clear margin.
I tried to disable AMP. As far as I know, the input would never be NaN, but my experiment shows the NaN happens in the input ...
Have you solved the problem?
Same problem for me: "Loss is nan, stop training". My packages are pytorch-1.7.1 / CUDA-10.1. Also, deit-tiny and deit-small work well, but deit-base does not.
Hi everyone, as there is no more activity, I am closing the issue; don't hesitate to reopen it if necessary.
In my experiments on the base model, I found that if I disable RepeatedAug, i.e., run with "--no-repeated-aug", the NaN error disappears. This helps even if I run with "--drop-path 0", which in my observation increases the probability of NaN under the default setting.
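For reference, my understanding is that `--no-repeated-aug` only changes which training sampler is built, roughly as in the sketch below (RASampler is the repeated-augmentation sampler shipped in this repo's samplers.py; treat the exact wiring and import path as an assumption, not the repo's verbatim code):

```python
import torch
from samplers import RASampler  # repeated-augmentation sampler from this repo (assumed import path)

def build_train_sampler(dataset_train, repeated_aug=True):
    num_tasks = torch.distributed.get_world_size()
    global_rank = torch.distributed.get_rank()
    if repeated_aug:
        # each GPU sees several differently-augmented copies of the same images per epoch
        return RASampler(dataset_train, num_replicas=num_tasks, rank=global_rank, shuffle=True)
    # --no-repeated-aug: standard distributed sampling, one copy of each image per epoch
    return torch.utils.data.DistributedSampler(
        dataset_train, num_replicas=num_tasks, rank=global_rank, shuffle=True)
```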
Hi @liuzhuang13, thanks for your suggestion. May I ask, will the performance drop with --no-repeated-aug?
Same issue when submitting the following command
to 16 GPUs (4 Slurm nodes). There seems to be a tricky bug somewhere in the code... (Edited to add more information: PyTorch 1.9, CUDA 10.2)
@ShoufaChen Yes, in my observation the performance drops from 81.5% to 81.1% when I remove repeated aug. In my later experiments, I found that if you don't use AMP, the NaN also disappears, but training will be much slower.
@ShoufaChen I just found that in DeiT it is still not converging even if I disable AMP, though it won't report NaN. The case I mentioned where disabling AMP can solve the NaN problem is for another architecture. Not sure this is useful, but just FYI.
@liuzhuang13 Great, thanks for the information.
Without AMP, the loss will not become NaN; however, training runs much more slowly. Instead, you can keep AMP enabled globally and only force the attention computation to run in FP32:

```python
import torch
import torch.nn as nn


class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # NOTE scale factor was wrong in my original version, can set manually to be compat with prev weights
        self.scale = qk_scale or head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        # Compute the attention in FP32 even when AMP is enabled elsewhere,
        # so the softmax logits cannot overflow the fp16 range and turn into NaN.
        with torch.cuda.amp.autocast(enabled=False):
            q, k, v = qkv[0].float(), qkv[1].float(), qkv[2].float()  # make torchscript happy (cannot use tensor as tuple)

            attn = (q @ k.transpose(-2, -1)) * self.scale
            attn = attn.softmax(dim=-1)
            attn = self.attn_drop(attn)

            x = (attn @ v).transpose(1, 2).reshape(B, N, C)
            x = self.proj(x)
            x = self.proj_drop(x)
        return x
```

I hope everyone who meets NaN can try this code and let me know if it works.
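For intuition on why forcing FP32 in the attention helps, here is a toy illustration (not from the original comment): logits that exceed the fp16 range overflow to inf, a softmax over infs produces NaN, while the same values stay finite in FP32.

```python
import torch

# Toy example: values beyond the fp16 max (~65504) overflow to inf in half precision.
logits = torch.full((1, 4), 70000.0)
half_logits = logits.to(torch.float16)               # [[inf, inf, inf, inf]]

# softmax over a row of infs is NaN, which then poisons the rest of the forward/backward pass
print(torch.softmax(half_logits.float(), dim=-1))    # [[nan, nan, nan, nan]]

# the same logits kept in fp32 stay finite
print(torch.softmax(logits, dim=-1))                 # [[0.25, 0.25, 0.25, 0.25]]
```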
@Andy1621 Your suggestion did not solve my case. I worked around the problem by changing the random seed.
@HeegonJin You can try using FP32, which costs more GPU memory but is stable. Or you can use a smaller learning rate or weaker data augmentation.
It seems that this issue has not been addressed. I have run training many times, and the NaN for DeiT-Base is easy to reproduce.
I have reproduced the small and tiny models but met problems reproducing the base model at 224 and 384 image sizes. With high probability, the loss becomes NaN after training for a few epochs.
My setting is 16 GPUs with a batch size of 64 on each GPU, and I do not change any hyper-parameters in run_with_submitit.py. Do you have any idea how to solve this problem? Thanks for your help.
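As a side note on this setup: the total batch size is 16 × 64 = 1024, and assuming main.py applies DeiT's linear learning-rate scaling (base lr × total batch size / 512, which is my reading of the repo), the effective learning rate lands at about 1e-3, i.e. just under the lr < 0.0015 threshold suggested above:

```python
# Back-of-the-envelope check of the effective learning rate for 16 GPUs x 64 images/GPU.
# Assumes the default base lr of 5e-4 and linear scaling by total_batch / 512 (my reading of main.py).
base_lr = 5e-4
total_batch = 16 * 64                      # 1024
scaled_lr = base_lr * total_batch / 512.0  # 0.001
print(total_batch, scaled_lr)
```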