
[WIP]: Add DINO on MMDetection 2.x #8362

Closed
wants to merge 33 commits into from

Conversation

Li-Qingyun
Copy link
Contributor

@Li-Qingyun Li-Qingyun commented Jul 13, 2022

Motivation

Implement the results in the paper "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection" with mmdetection. The original code has been released.

Remark

This is my first time submitting a PR that reproduces an algorithm, and I still lack experience. I hope that developers who follow this PR can point out problems in my implementation and share some results. I will also follow up with some of my own experimental results and the schedule of the work.

@CLAassistant
Copy link

CLAassistant commented Jul 13, 2022

CLA assistant check
All committers have signed the CLA.

@czczup
Copy link
Contributor

czczup commented Jul 25, 2022

Hi, thanks very much for your code.
May I ask if you have met this problem?

[image]

@Li-Qingyun
Copy link
Contributor Author

Hi, thanks very much for your code. May I ask if you have met this problem?

[image]

When did this happen? At the beginning, or after a few epochs of training?

@Li-Qingyun
Copy link
Contributor Author

@czczup I haven't fully trained yet, but I have tested 1000 iters of training (2-GPU dist_train). I didn't encounter it. Could you please provide more information? Thanks for your attention.

@czczup
Copy link
Contributor

czczup commented Jul 25, 2022

Hi, thanks very much for your code. May I ask if you have met this problem?
[image]

When did this happen? At the beginning, or after a few epochs of training?

Thanks for your reply. I used a single GPU to train (debug) with the config you provided. It happened at the beginning of training (50 iters).
As you can see in this log:
[image]

It seems that bbox_head.label_embedding.weight did not receive a gradient during some iterations.
I could use find_unused_parameters=True to avoid this problem, but it would affect training efficiency.
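For reference, in MMDetection 2.x this flag is set at the top level of the experiment config; it is a workaround that makes DDP tolerate parameters without gradients, at a small per-iteration cost:

```python
# top level of the experiment config (MMDetection 2.x)
# Makes MMDistributedDataParallel traverse the autograd graph each
# iteration looking for parameters that received no gradient; this
# avoids the error above but slows training slightly.
find_unused_parameters = True
```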

@Li-Qingyun
Copy link
Contributor Author

Li-Qingyun commented Jul 25, 2022

@czczup Hi, the newest config file contains other incompatibility problems.

norm_cfg=dict(type='FrozenBatchNorm2d'), # should use BN instead
conv_cfg=dict(type='Conv2d', bias=True), # should be deleted

I made these modifications on my local workstation but haven't committed them yet.
Did you meet these problems too?

@jshilong jshilong self-requested a review July 25, 2022 06:42
@Li-Qingyun
Copy link
Contributor Author

@czczup I'm working on experiments to align the training results; there are still a few modifications which I have not committed.

You can comment out the FrozenBatchNorm2d and use BN instead. I modified the bn file of mmcv to register FrozenBatchNorm2d, but I plan to delete it in the end.
In the original DINO, the convs of the neck have bias, which is not recommended in mmdetection. To align the inference accuracy, I had to modify the conv building function in mmcv. You can comment out this line.
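For reference, a minimal sketch of what the corrected neck section of the config might look like. The module name, channel numbers, and other fields here are assumptions for illustration; the point is plain BN in norm_cfg and no forced conv bias:

```python
# hypothetical corrected neck snippet (type/channels are assumptions)
neck=dict(
    type='ChannelMapper',
    in_channels=[512, 1024, 2048],   # ResNet-50 stages 2-4
    kernel_size=1,
    out_channels=256,
    act_cfg=None,
    norm_cfg=dict(type='BN'),        # plain BN instead of FrozenBatchNorm2d
    num_outs=4)                      # extra level by down-sampling stage 4
# no conv_cfg=dict(type='Conv2d', bias=True): the default conv builder is used
```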

@Li-Qingyun
Copy link
Contributor Author

@czczup The label_embedding is used in DINOHead.forward_train() and DnQueryGenerator.forward(). I haven't found the bug yet; it may take a few days.
I'm pushing full speed ahead with the training, and more details will be committed. Thanks for your attention and the discussion.

@czczup
Copy link
Contributor

czczup commented Jul 25, 2022

@Li-Qingyun
When the latest config is used, the label_embedding problem disappears. Thanks for your help.
I checked against the paper and the source code published by the authors, and it seems that most of the settings are aligned, except for the learning rate (1e-4 in the paper but 2e-4 in this config) and the noise_scale (0.5 & 0.4 in the paper but 0.5 & 1.0 in this config).
I'm not sure if there's any detail I missed. I plan to train DINO-R50 for 12 epochs with 8 GPUs today.

@Li-Qingyun
Copy link
Contributor Author

@czczup Hi, the original repo has a sh file to launch training, in which box_noise_scale is set to 1.

The description is in the last paragraph of D.3 (Supplementary Materials) of the paper:

Multi-scale setting. For our 4-scale models, we extract features from stages 2, 3, and 4 of the backbone and add an extra feature by down-sampling the output of the stage 4. An additional feature map of the backbone stage 1 is used for our
5-scale models. For hyper-parameters, we set λ1 = 1.0 and λ2 = 2.0 and use 100 CDN pairs which contain 100 positive queries and 100 negative queries.

The learning rate was indeed misaligned, thank you very much!

@czczup
Copy link
Contributor

czczup commented Jul 25, 2022

I met a new problem; it happens in the middle of training, e.g., around 2000 iterations. I'm trying to fix it.
I notice that it happens when dn_cls_scores is None, so the keys in loss_dict differ across GPUs.

AssertionError: loss log variables are different across GPUs! 
rank 2 len(log_vars): 21 keys: interm_loss_cls,interm_loss_bbox,interm_loss_iou,loss_cls,loss_bbox,loss_iou,d0.loss_cls,d0.loss_bbox,d0.loss_iou,d1.loss_cls,d1.loss_bbox,d1.loss_iou,d2.loss_cls,d2.loss_bbox,d2.loss_iou,d3.loss_cls,d3.loss_bbox,d3.loss_iou,d4.loss_cls,d4.loss_bbox,d4.loss_iou

@Li-Qingyun
Copy link
Contributor Author

@czczup Thanks! Maybe it's the special case where the current sample has no targets. I'll check.

@Li-Qingyun
Copy link
Contributor Author

@czczup Hi, I'm waiting in the queue of the lab's Slurm cluster, so I can't run experiments right away. Could you test setting filter_empty_gt=True?

data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(filter_empty_gt=True, pipeline=train_pipeline),  # Modify here
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline))

@Li-Qingyun
Copy link
Contributor Author

@zhaoguoqing12 Hi, this problem has mostly been fixed in 390b25a.

Did you hit it again on the latest version? Please check your local repository's version; the latest is d380deb. If yours is not up to date, you can update it with git pull https://github.com/Li-Qingyun/mmdetection.git add-dino. If the problem appears on the latest version, please provide your environment versions, local modifications, run command, and so on.

@zhaoguoqing12
Copy link

@zhaoguoqing12 Hi, this problem has mostly been fixed in 390b25a. Please check your local repository's version; the latest is d380deb. If yours is not up to date, you can update it with git pull https://github.com/Li-Qingyun/mmdetection.git add-dino. If the problem appears on the latest version, please provide your environment versions, local modifications, run command, and so on.

I am indeed not on the latest version, but it was not while training DINO that I encountered this problem.

@Li-Qingyun
Copy link
Contributor Author

@zhaoguoqing12 If the same problem appears in your own repository, you need to guarantee that loss_dict.keys() stays fixed. For example, in the current version of DINO, loss_dict should have 39 entries. My way of checking, placed right before loss_dict is reduced:

if len(loss_dict) != 39:
    import ipdb; ipdb.set_trace()

Stepping through the stack frames showed that the dn-related losses were missing from loss_dict.keys(), so I added a no-target branch to the function that computes the dn losses, guaranteeing that loss_dict always consists of 39 entries in every case. This workaround solves the problem for now; it may be replaced with a better approach during the review stage. (For example, when there are no targets, loss_giou and loss_bbox are naturally 0; in that case the padded loss_dn is 0 as well, but it carries no computation graph.)

I found that the original DINO code also guards against a similar error, but the loss part here was almost completely rewritten, so I did not notice it at the time.

So my suggestion is to check how the loss_dict in your error message is produced. You can locate the failing case with the check above and then pad the missing entries in the special cases.
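A minimal sketch of the padding idea described above. The key names, layer count, and helper name here are assumptions for illustration (the actual keys in this PR differ, and there are 39 of them in total); the point is that every rank ends up with an identical key set:

```python
import torch

def pad_dn_losses(loss_dict, device, num_dec_layers=6):
    """Hypothetical sketch: when a batch has no targets, fill in the
    missing dn-* entries with plain zeros (no computation graph) so
    that loss_dict has the same keys on every GPU and the cross-rank
    log_vars check passes."""
    keys = ['dn_loss_cls', 'dn_loss_bbox', 'dn_loss_iou']
    # auxiliary decoder layers d0 .. d{num_dec_layers - 2}
    for i in range(num_dec_layers - 1):
        keys += [f'd{i}.dn_loss_cls', f'd{i}.dn_loss_bbox', f'd{i}.dn_loss_iou']
    for k in keys:
        # only pad keys that the dn branch failed to produce
        loss_dict.setdefault(k, torch.zeros(1, device=device))
    return loss_dict
```

Calling this right before the losses are collected keeps len(loss_dict) constant whether or not the denoising branch ran.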

@zhaoguoqing12
Copy link

@Li-Qingyun OK, thanks. I will try it.

@Li-Qingyun
Copy link
Contributor Author

Li-Qingyun commented Aug 25, 2022

We have started to refactor the modules of DETR-like models to enhance the usability and readability of our codebase. To avoid affecting the progress of experiments in the current version and the work of those following this PR, I created a new PR for the refactor:

#9149

I'll merge the refactor PR into this PR when it's finished. Followers can continue to conduct experiments based on the current PR; the relevant bugs will continue to be fixed, and the experimental results will continue to be released.

Thank you for your attention!

@Li-Qingyun Li-Qingyun changed the title [WIP]: Add DINO [WIP]: Add DINO on MMDetection 2.x Feb 19, 2023
@Li-Qingyun
Copy link
Contributor Author

Li-Qingyun commented Feb 19, 2023

We have decided to complete the development of DINO on MMDetection 2.x in this PR. The code keeps the old style (it has been refactored into a new style in MMDetection 3.x). Users are recommended to use #9149 on MMDetection 3.x; this support is mainly for users who have to stay on old versions.

TO-DO List

  • refactor CdnQueryGenerator
  • Add DocString
    • detector class
    • Dino Transformer
    • Dino Transformer Decoder
    • DinoHead
    • CdnQueryGenerator
  • Add unit test
  • Add readme
  • Add model to zoo
  • Add metafiles

@Li-Qingyun Li-Qingyun closed this Oct 29, 2024
@Li-Qingyun Li-Qingyun mentioned this pull request Nov 8, 2024

7 participants