Four 3090s cannot reproduce the authors' results, why is that? #48
Hi @Rzx520, Luckily, PROB is relatively robust and requires minimal hyperparameter tuning to match our performance, at least on all the systems I have encountered. Specifically with Titan RTX 3090, our results were already reproduced (see Issue #26). On a 3090x4 system, lr_drop needed to be increased to 40 to match our reported results. If you have a different number of GPUs, there may be a better value for your system. I am happy to help with this process, but to do so, I need to see your training curves. Best,
The result above was obtained after adjusting lr_drop to 40, so I am quite confused.
Did you use the same number of GPUs as in #26?
Yes, I also used 4 GPUs. Thank you very much. Since I turned off wandb, I had to retrain to obtain the training curves. This may take a while as the server is currently in use.
Above are the results of training with this parameter setting. @orrzohar
I am trying lr_drop=30 and will post the results when training finishes. I also wonder how the two systems differ, so I asked some questions in #26 (comment).
Above are the results of training with lr_drop = 30. @orrzohar
################ Deformable DETR ################
parser.add_argument('--lr_drop_epochs', default=None, type=int, nargs='+')
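For reference, in Deformable DETR-style training code the --lr and --lr_drop arguments are usually wired into a StepLR schedule that cuts the learning rate by 10x at epoch lr_drop. The sketch below illustrates that wiring only; it is not PROB's exact code, and the tiny model is just a stand-in for the detector.

```python
import torch

# Illustration of how --lr / --lr_drop typically feed a StepLR schedule
# in Deformable DETR-style training loops (assumed here; check PROB's main.py).
model = torch.nn.Linear(8, 8)                                # stand-in for the detector
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)   # --lr 2e-4

lr_drop = 40                                                 # --lr_drop 40
# StepLR multiplies the lr by gamma (0.1 by default) every `step_size` steps;
# stepping once per epoch means the lr is cut 10x at epoch `lr_drop`.
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=lr_drop)

for epoch in range(51):                                      # --epochs 51
    # ... one training epoch would run here ...
    lr_scheduler.step()
    if epoch in (lr_drop - 1, lr_drop):
        print(f"epoch {epoch}: lr = {optimizer.param_groups[0]['lr']:.1e}")
```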
Hi @Rzx520, I also noticed that Hatins reported similarly poorer results when using batch_size=2.
What I tried at the beginning was batch_size=3, and the results are shown above. I set batch_size to 2 because of the parameter settings of OW-DETR. @orrzohar
I have made some gains now: when I set lr to 1e-4 with lr_drop=35 and batch_size=3, the results improve, but K_AP only reached 58.3, not 59.4. Can you provide some suggestions?
Hi @Rzx520, To clarify, to run this experiment you DO NOT need to restart from scratch -- your model should have saved the checkpoint for epoch 30 and then you only need to train for the last 10 epochs after the lr_drop. Just make sure the lr is indeed lowered. Best, |
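A rough sketch of what resuming from the epoch-30 checkpoint and verifying the lowered lr could look like. The path, filename, and checkpoint keys ('model', 'optimizer', 'lr_scheduler', 'epoch') follow the usual Deformable DETR convention and are assumptions here; adjust them to whatever PROB actually saved.

```python
import torch

# Hypothetical sketch: resume from the epoch-30 checkpoint instead of retraining
# from scratch, and double-check that the lr really was dropped.
ckpt_path = "exps/PROB/checkpoint0029.pth"        # hypothetical path/filename
checkpoint = torch.load(ckpt_path, map_location="cpu")

print("checkpoint epoch:", checkpoint["epoch"])   # expect roughly epoch 29/30

# After rebuilding model / optimizer / lr_scheduler exactly as in training:
# model.load_state_dict(checkpoint["model"])
# optimizer.load_state_dict(checkpoint["optimizer"])
# lr_scheduler.load_state_dict(checkpoint["lr_scheduler"])
# start_epoch = checkpoint["epoch"] + 1            # then train only the remaining epochs

# Sanity check: past the lr_drop epoch, the optimizer's lr should be ~10x smaller.
for i, group in enumerate(checkpoint["optimizer"]["param_groups"]):
    print(f"param group {i}: lr = {group['lr']:.1e}")
```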
I am trying lr_drop=30 and will present the results here. @orrzohar
Hi @Rzx520, |
lr_drop=30 and parser.add_argument('--eval_every', default=1, type=int) @orrzohar This is not as good as lr_drop = 35. System: Linux Ubuntu 5.15.0-86-generic #96~20.04.1-Ubuntu
Hi @Rzx520, --lr 2e-4, --lr_drop 40, --epochs 51, --batch_size 2 -> AP50=58.4, U_R=16.5. Is that correct? Also, have you tried (as lr_drop 35->30 had an adverse effect): Best,
Yes, I did. AP50=58.1, U_R=19.5 @orrzohar One issue: for the results above, epochs was not the default value but 41.
Hi @Rzx520, Are the results above for: And of course the hyperparameters are changed -- you changed the batch size as it did not fit on your GPUs. This will change other hyperparameters.
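As a back-of-the-envelope illustration of how the per-GPU batch size interacts with other hyperparameters, the sketch below computes the effective (global) batch size and applies the common linear lr-scaling heuristic. This is a general rule of thumb (Goyal et al., 2017), not something the PROB authors prescribe, and the numbers are just the two configurations discussed in this thread.

```python
# Why changing the per-GPU batch size touches other hyperparameters:
# the effective (global) batch size changes with it, and a common heuristic
# rescales the learning rate linearly. Treat the output as a starting point only.
def effective_batch(num_gpus: int, batch_per_gpu: int) -> int:
    """Global batch size seen by the optimizer in data-parallel training."""
    return num_gpus * batch_per_gpu

base = effective_batch(4, 2)        # 4 GPUs x batch_size 2, one config tried in this thread
other = effective_batch(4, 3)       # 4 GPUs x batch_size 3, the other config tried here

base_lr = 2e-4                      # the --lr value quoted above
scaled_lr = base_lr * other / base  # heuristic rescaling when the global batch grows
print(f"effective batch {base} -> {other}; heuristic lr {base_lr:.0e} -> {scaled_lr:.1e}")
```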
Hi @Rzx520, |
I used four cards with batch_size = 3; the result is:
{"train_lr": 1.999999999999943e-05, "train_class_error": 15.52755644357749, "train_grad_norm": 119.24543388206256, "train_loss": 5.189852057201781, "train_loss_bbox": 0.2700958194790585, "train_loss_bbox_0": 0.29624945830832017, "train_loss_bbox_1": 0.27978440371434526, "train_loss_bbox_2": 0.275065722955665, "train_loss_bbox_3": 0.27241891570675625, "train_loss_bbox_4": 0.27063051075218725, "train_loss_ce": 0.18834440561282928, "train_loss_ce_0": 0.27234036786085974, "train_loss_ce_1": 0.23321395799885028, "train_loss_ce_2": 0.20806531186409408, "train_loss_ce_3": 0.19453731594314128, "train_loss_ce_4": 0.18820172232765492, "train_loss_giou": 0.3351372324140976, "train_loss_giou_0": 0.3679243937037491, "train_loss_giou_1": 0.3483400315024699, "train_loss_giou_2": 0.34171414935044225, "train_loss_giou_3": 0.3379105142249501, "train_loss_giou_4": 0.3368650070453053, "train_loss_obj_ll": 0.02471167313379382, "train_loss_obj_ll_0": 0.034151954339996814, "train_loss_obj_ll_1": 0.03029250531194649, "train_loss_obj_ll_2": 0.0288731191750343, "train_loss_obj_ll_3": 0.028083207809715446, "train_loss_obj_ll_4": 0.026900355121292352, "train_cardinality_error_unscaled": 0.44506890101437985, "train_cardinality_error_0_unscaled": 0.6769398279525907, "train_cardinality_error_1_unscaled": 0.5726976196583499, "train_cardinality_error_2_unscaled": 0.4929900999093851, "train_cardinality_error_3_unscaled": 0.46150593285633223, "train_cardinality_error_4_unscaled": 0.45256225438417086, "train_class_error_unscaled": 15.52755644357749, "train_loss_bbox_unscaled": 0.054019163965779084, "train_loss_bbox_0_unscaled": 0.059249891647616536, "train_loss_bbox_1_unscaled": 0.055956880831476395, "train_loss_bbox_2_unscaled": 0.055013144572493046, "train_loss_bbox_3_unscaled": 0.054483783067331704, "train_loss_bbox_4_unscaled": 0.05412610215448962, "train_loss_ce_unscaled": 0.09417220280641464, "train_loss_ce_0_unscaled": 0.13617018393042987, "train_loss_ce_1_unscaled": 0.11660697899942514, "train_loss_ce_2_unscaled": 0.10403265593204704, "train_loss_ce_3_unscaled": 0.09726865797157064, "train_loss_ce_4_unscaled": 0.09410086116382746, "train_loss_giou_unscaled": 0.1675686162070488, "train_loss_giou_0_unscaled": 0.18396219685187454, "train_loss_giou_1_unscaled": 0.17417001575123495, "train_loss_giou_2_unscaled": 0.17085707467522113, "train_loss_giou_3_unscaled": 0.16895525711247505, "train_loss_giou_4_unscaled": 0.16843250352265265, "train_loss_obj_ll_unscaled": 30.889592197686543, "train_loss_obj_ll_0_unscaled": 42.68994404527915, "train_loss_obj_ll_1_unscaled": 37.86563257517548, "train_loss_obj_ll_2_unscaled": 36.09139981038161, "train_loss_obj_ll_3_unscaled": 35.10401065181873, "train_loss_obj_ll_4_unscaled": 33.62544476769816, "test_metrics": {"WI": 0.05356004827184098, "AOSA": 5220.0, "CK_AP50": 58.3890380859375, "CK_P50": 25.75118307055908, "CK_R50": 71.51227713815234, "K_AP50": 58.3890380859375, "K_P50": 25.75118307055908, "K_R50": 71.51227713815234, "U_AP50": 2.7862398624420166, "U_P50": 0.409358215516747, "U_R50": 16.530874785591767}, "test_coco_eval_bbox": [14.451444625854492, 14.451444625854492, 77.8148193359375, 57.15019607543945, 66.93928527832031, 49.282108306884766, 27.985671997070312, 70.54130554199219, 55.28901290893555, 82.7206039428711, 26.307403564453125, 65.15182495117188, 21.9127197265625, 77.91541290283203, 73.61457061767578, 67.8846206665039, 49.1287841796875, 36.78118896484375, 69.1879653930664, 53.060150146484375, 79.1402359008789, 59.972835540771484, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.7862398624420166], "epoch": 40, "n_parameters": 39742295}
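Since wandb was turned off, training curves can still be reconstructed from log lines like the one above, assuming the trainer appends one JSON dict per epoch to a log.txt file in the output directory (the path below is hypothetical); a minimal sketch:

```python
import json
import matplotlib.pyplot as plt

# Rebuild training curves from Deformable DETR-style per-epoch JSON log lines
# ("train_loss", "test_metrics", "epoch", ...) without wandb.
log_path = "exps/PROB/log.txt"      # hypothetical path

epochs, train_loss = [], []
eval_epochs, k_ap50, u_r50 = [], [], []
with open(log_path) as f:
    for line in f:
        rec = json.loads(line)
        epochs.append(rec["epoch"])
        train_loss.append(rec["train_loss"])
        metrics = rec.get("test_metrics")
        if metrics:                 # eval may not run every epoch, depending on --eval_every
            eval_epochs.append(rec["epoch"])
            k_ap50.append(metrics["K_AP50"])
            u_r50.append(metrics["U_R50"])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(epochs, train_loss)
ax1.set_xlabel("epoch"); ax1.set_ylabel("train_loss")
ax2.plot(eval_epochs, k_ap50, label="K_AP50")
ax2.plot(eval_epochs, u_r50, label="U_R50")
ax2.set_xlabel("epoch"); ax2.legend()
fig.savefig("training_curves.png")
```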
The authors' results are:
U-R: 19.4, K-AP: 59.5
Why can't I reach the authors' performance?
@Hatins @orrzohar
Originally posted by @Rzx520 in #26 (comment)