
Experimental results #50

Closed
lshssel opened this issue Nov 27, 2023 · 12 comments

lshssel commented Nov 27, 2023

Hi,
Because our group only assigned me a single 2080Ti, training took a long time: task 1 of MOWODB took 43 hours.
Unfortunately, wandb crashed at the 35th epoch, so its curves also stop there.
However, the program itself kept running without errors, the file "checkpoint0040.pth" was generated at the end, and task 2 trains smoothly when I start it from this file.

Below are the wandb graphs and hyperparameters. The results are not very good, so I may need to tune the parameters to get as close to the original performance as possible.

K_AP50 is 52.476, U_R50 is 21.042

[Screenshots: wandb graphs and hyperparameters]

lshssel commented Dec 2, 2023

Second experiment
[Screenshots attached]

orrzohar self-assigned this Dec 2, 2023

orrzohar commented Dec 3, 2023

Hi @lshssel,

Hmmm, batch_size=1 will be more difficult to fine-tune, but let's try.
Given your experiments, I would try:
lr=2e-5, lr_drop=60, epochs=70

The main idea is that you want the improvement to saturate and then reduce the learning rate. Continuing to train after the improvement has saturated doesn't help at all (U_R just goes down and AP50 doesn't go up), but if you set lr_drop too early (before AP50 starts to saturate), then K_AP50 is 'frozen' too soon and doesn't improve enough.
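
For concreteness, a sketch of what such a run might look like. The flag names below follow the ones mentioned in this thread and the usual Deformable-DETR-style CLI of main_open_world.py; the exact spellings are an assumption, and the dataset/task flags normally set by M_OWOD_BENCHMARK.sh are omitted:

```bash
# Sketch only: suggested schedule for a batch_size=1 run on a single 2080Ti.
# Flag names are assumed from this thread; task/dataset flags are omitted here
# and normally come from M_OWOD_BENCHMARK.sh.
python main_open_world.py \
    --lr 2e-5 \
    --lr_backbone 4e-6 \
    --lr_drop 60 \
    --epochs 70 \
    --batch_size 1
```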

Best,
Orr


lshssel commented Dec 3, 2023

Thanks for the reply, I'll try it later.

lshssel closed this as completed Dec 12, 2023
@orrzohar

Hi @lshssel,
Were your results sufficiently improved?
If so, can you give me all the details re. your system/hyperparameters so I can add them to the README for future users?
Best,
Orr


lshssel commented Dec 17, 2023

One 2080Ti; batch_size=1 for all experiments.
About epochs, the first value is the one set in "main_open_world.py" and the second is the one set in "M_OWOD_BENCHMARK.sh".
| Run  | lr   | lr_backbone | epochs (py / sh) | lr_drop | K_AP50 | U_R50 |
|------|------|-------------|------------------|---------|--------|-------|
| t1.2 | 4e-5 | 4e-6        | 51 / 41          | 35      | 58.36  | 16.50 |
| t1.3 | 2e-5 | 4e-6        | 51 / 41          | 35      | 57.99  | 19.27 |
| t1.4 | 2e-5 | 4e-6        | 56 / 46          | 40      | 57.60  | 18.55 |
| t1.6 | 2e-5 | 4e-6        | 61 / 41          | 40      | 57.17  | 19.34 |

Looking forward to your suggestions!


lshssel commented Dec 17, 2023

[Screenshots attached]

lshssel reopened this Dec 17, 2023
@orrzohar

Hi @lshssel,
I would like to try something new with you. My idea is that with a different batch size, the objectness temperature also needs to change.
Good news: no training needed. I would take the t1.2 and t1.3 checkpoints and re-evaluate them with different --obj_temp values, sweeping a few (e.g., 0.9, 1.1, 1.2; the default is 1). This should be relatively quick, since you only need to evaluate (use the --eval flag).
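
A rough sketch of such a sweep is below. The checkpoint path is a placeholder, --resume and the exact flag spellings are assumptions based on the Deformable-DETR-style CLI, and the usual dataset/task flags from M_OWOD_BENCHMARK.sh are omitted:

```bash
# Sketch only: re-evaluate an existing Task 1 checkpoint under several
# objectness temperatures, without any retraining (--eval).
for T in 0.9 1.1 1.2; do
    python main_open_world.py \
        --eval \
        --resume path/to/checkpoint0040.pth \
        --obj_temp "$T" \
        --batch_size 1
done
```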

Best,
Orr


lshssel commented Dec 18, 2023

I evaluated with t1.3's checkpoint0040 (the t1.2 checkpoint has already been deleted). During training, obj_temp=1.3 and obj_loss_coef=8e-4.
I also evaluated with different obj_loss_coef values, but nothing changed:

obj_temp=1.1 K_AP50=57.1914 U_R50=19.2453
obj_temp=1.2 K_AP50=57.6161 U_R50=19.2581
obj_temp=1.3 K_AP50=57.9826 U_R50=19.2624 obj_loss_coef=8e-4
obj_temp=1.4 K_AP50=57.9075 U_R50=19.2367
obj_temp=1.5 K_AP50=57.8653 U_R50=19.2453

obj_loss_coef=4e-4 K_AP50=57.9826 U_R50=19.2624
obj_loss_coef=8e-4 K_AP50=57.9826 U_R50=19.2624
obj_loss_coef=1.6e-3 K_AP50=57.9826 U_R50=19.2624
obj_loss_coef=4e-3 K_AP50=57.9826 U_R50=19.2624

So t1.3 is probably the best result a 2080Ti can achieve.


orrzohar commented Dec 18, 2023

Hi @lshssel,
I want to ensure you understand you don't need to train with a different obj_temp -- you can change this just for evaluation. Unfortunately, it does seem that this is the best result with batch_size=1. Perhaps we could improve it a little more, but probably not much.

I want to add this to the README. Would you mind providing all the hyperparameters you changed?


lshssel commented Dec 19, 2023

Hi,
Yes, I understand what you mean; I used different obj_temp values only for evaluation.
As mentioned earlier, changing obj_temp did not improve performance.
With batch_size=2 I get a CUDA out-of-memory error, so batch_size can only be 1 on a 2080Ti (11 GB).
My hyperparameters are:
lr=2e-5, lr_backbone=4e-6, batch_size=1; nothing else has changed.
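
In command form, the single-2080Ti recipe reported here would look roughly like this (a sketch only; the flag names are assumptions taken from elsewhere in this thread, and everything else stays at its default / as set by M_OWOD_BENCHMARK.sh):

```bash
# Sketch of the single-2080Ti (11 GB) configuration reported in this thread:
# lr=2e-5, lr_backbone=4e-6, batch_size=1, obj_temp=1.3; all other
# hyperparameters unchanged. Flag spellings are assumed, not verified.
python main_open_world.py \
    --lr 2e-5 \
    --lr_backbone 4e-6 \
    --batch_size 1 \
    --obj_temp 1.3
```
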
Thank you again for your excellent work and answering my questions.


WangPingA commented Jun 6, 2024

[Screenshot: experimental results]
Hello, I also used a 2080Ti to complete the entire experiment, following "lr=2e-5, lr_backbone=4e-6, batch_size=1, obj_temp=1.3". My results are shown in the attached figure. I don't know why some of the results are actually higher than those reported in the paper. By the way, it took me about 8 days to complete the entire experiment.

@orrzohar

Hi @WangPingA,

When you train a model with a different batch size, your results will vary, because the gradient updates are not the same. Variations of ±2 seem reasonable.

lshssel also ran experiments with a 2080Ti, and got:
[Screenshot: lshssel's results]

If you are interested in applications, then perhaps my recent work, FOMO, will interest you; it is much less compute-heavy to train and has relatively strong open-world performance by leveraging a foundation object detection model. An easy upgrade there is to switch OWL-ViT to OWLv2.

Best,
Orr
