Four 3090s cannot reproduce the authors' results, why is that? #48
Hi @Rzx520, Luckily, PROB is relatively robust and requires minimal hyperparameter tuning to match our performance, at least on all the systems I have encountered. Specifically with Titan RTX 3090, our results were already reproduced (see Issue #26). On a 3090x4 system, lr_drop needed to be increased to 40 to match our reported results. If you have a different number of GPUs, there may be a better value for your system. I am happy to help with this process, but to do so, I need to see your training curves. Best,
The result above was obtained after adjusting lr_drop to 40, so I am quite confused.
Did you use the same number of GPUs as in #26?
Yes, I also used 4 GPUs. Thank you very much. Since I turned off wandb, I had to retrain to obtain the training curves. This may take a while as the server is currently in use.
Above are the results of training with this parameter setting. @orrzohar
I am trying lr_drop=30 and will post the results when training finishes. I also wonder how the two systems differ, so I asked some questions in #26 (comment).
Above are the results of training with lr_drop = 30. @orrzohar
################ Deformable DETR ################
parser.add_argument('--lr_drop_epochs', default=None, type=int, nargs='+')
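For reference, in Deformable DETR-style training code the --lr and --lr_drop arguments are usually wired into a StepLR schedule that cuts the learning rate by 10x at epoch lr_drop. The sketch below illustrates that wiring only; it is not PROB's exact code, and the tiny model is just a stand-in for the detector.

```python
import torch

# Illustration of how --lr / --lr_drop typically feed a StepLR schedule
# in Deformable DETR-style training loops (assumed here; check PROB's main.py).
model = torch.nn.Linear(8, 8)                                # stand-in for the detector
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)   # --lr 2e-4

lr_drop = 40                                                 # --lr_drop 40
# StepLR multiplies the lr by gamma (0.1 by default) every `step_size` steps;
# stepping once per epoch means the lr is cut 10x at epoch `lr_drop`.
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=lr_drop)

for epoch in range(51):                                      # --epochs 51
    # ... one training epoch would run here ...
    lr_scheduler.step()
    if epoch in (lr_drop - 1, lr_drop):
        print(f"epoch {epoch}: lr = {optimizer.param_groups[0]['lr']:.1e}")
```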
Hi @Rzx520, I also noticed that Hatins reported similarly poorer results when using batch_size=2.
What I tried at the beginning was batch_size=3, and the results are shown above. I set batch_size to 2 because of the parameter settings of OW-DETR. @orrzohar
I have made some gains now: when I set lr to 1e-4 with lr_drop=35 and batch_size=3, the results improve, but K_AP only reached 58.3, not 59.4. Can you provide some suggestions?
Hi @Rzx520, To clarify, to run this experiment you DO NOT need to restart from scratch -- your model should have saved the checkpoint for epoch 30 and then you only need to train for the last 10 epochs after the lr_drop. Just make sure the lr is indeed lowered. Best, |
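A rough sketch of what resuming from the epoch-30 checkpoint and verifying the lowered lr could look like. The path, filename, and checkpoint keys ('model', 'optimizer', 'lr_scheduler', 'epoch') follow the usual Deformable DETR convention and are assumptions here; adjust them to whatever PROB actually saved.

```python
import torch

# Hypothetical sketch: resume from the epoch-30 checkpoint instead of retraining
# from scratch, and double-check that the lr really was dropped.
ckpt_path = "exps/PROB/checkpoint0029.pth"        # hypothetical path/filename
checkpoint = torch.load(ckpt_path, map_location="cpu")

print("checkpoint epoch:", checkpoint["epoch"])   # expect roughly epoch 29/30

# After rebuilding model / optimizer / lr_scheduler exactly as in training:
# model.load_state_dict(checkpoint["model"])
# optimizer.load_state_dict(checkpoint["optimizer"])
# lr_scheduler.load_state_dict(checkpoint["lr_scheduler"])
# start_epoch = checkpoint["epoch"] + 1            # then train only the remaining epochs

# Sanity check: past the lr_drop epoch, the optimizer's lr should be ~10x smaller.
for i, group in enumerate(checkpoint["optimizer"]["param_groups"]):
    print(f"param group {i}: lr = {group['lr']:.1e}")
```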
I am trying lr_drop=30 and will present the results here. @orrzohar
Hi @Rzx520, |
lr_drop=30 and parser.add_argument('--eval_every', default=1, type=int) @orrzohar This is not as good as lr_drop = 35. System: Linux Ubuntu 5.15.0-86-generic #96~20.04.1-Ubuntu
Hi @Rzx520, --lr 2e-4, --lr_drop 40, --epochs 51, --batch_size 2 -> AP50=58.4, U_R=16.5. Is that correct? Also, have you tried (as lr_drop 35->30 had an adverse effect): Best,
Yes, I did. AP50=58.1, U_R=19.5 @orrzohar One issue: for the results above, epochs was not the default value but 41.
Hi @Rzx520, Are the results above for: And of course the hyperparameters are changed -- you changed the batch size as it did not fit on your GPUs. This will change other hyperparameters.
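As a back-of-the-envelope illustration of how the per-GPU batch size interacts with other hyperparameters, the sketch below computes the effective (global) batch size and applies the common linear lr-scaling heuristic. This is a general rule of thumb (Goyal et al., 2017), not something the PROB authors prescribe, and the numbers are just the two configurations discussed in this thread.

```python
# Why changing the per-GPU batch size touches other hyperparameters:
# the effective (global) batch size changes with it, and a common heuristic
# rescales the learning rate linearly. Treat the output as a starting point only.
def effective_batch(num_gpus: int, batch_per_gpu: int) -> int:
    """Global batch size seen by the optimizer in data-parallel training."""
    return num_gpus * batch_per_gpu

base = effective_batch(4, 2)        # 4 GPUs x batch_size 2, one config tried in this thread
other = effective_batch(4, 3)       # 4 GPUs x batch_size 3, the other config tried here

base_lr = 2e-4                      # the --lr value quoted above
scaled_lr = base_lr * other / base  # heuristic rescaling when the global batch grows
print(f"effective batch {base} -> {other}; heuristic lr {base_lr:.0e} -> {scaled_lr:.1e}")
```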
Hi @Rzx520, |
I used four cards with batch_size = 3; the result is:
{"train_lr": 1.999999999999943e-05, "train_class_error": 15.52755644357749, "train_grad_norm": 119.24543388206256, "train_loss": 5.189852057201781, "train_loss_bbox": 0.2700958194790585, "train_loss_bbox_0": 0.29624945830832017, "train_loss_bbox_1": 0.27978440371434526, "train_loss_bbox_2": 0.275065722955665, "train_loss_bbox_3": 0.27241891570675625, "train_loss_bbox_4": 0.27063051075218725, "train_loss_ce": 0.18834440561282928, "train_loss_ce_0": 0.27234036786085974, "train_loss_ce_1": 0.23321395799885028, "train_loss_ce_2": 0.20806531186409408, "train_loss_ce_3": 0.19453731594314128, "train_loss_ce_4": 0.18820172232765492, "train_loss_giou": 0.3351372324140976, "train_loss_giou_0": 0.3679243937037491, "train_loss_giou_1": 0.3483400315024699, "train_loss_giou_2": 0.34171414935044225, "train_loss_giou_3": 0.3379105142249501, "train_loss_giou_4": 0.3368650070453053, "train_loss_obj_ll": 0.02471167313379382, "train_loss_obj_ll_0": 0.034151954339996814, "train_loss_obj_ll_1": 0.03029250531194649, "train_loss_obj_ll_2": 0.0288731191750343, "train_loss_obj_ll_3": 0.028083207809715446, "train_loss_obj_ll_4": 0.026900355121292352, "train_cardinality_error_unscaled": 0.44506890101437985, "train_cardinality_error_0_unscaled": 0.6769398279525907, "train_cardinality_error_1_unscaled": 0.5726976196583499, "train_cardinality_error_2_unscaled": 0.4929900999093851, "train_cardinality_error_3_unscaled": 0.46150593285633223, "train_cardinality_error_4_unscaled": 0.45256225438417086, "train_class_error_unscaled": 15.52755644357749, "train_loss_bbox_unscaled": 0.054019163965779084, "train_loss_bbox_0_unscaled": 0.059249891647616536, "train_loss_bbox_1_unscaled": 0.055956880831476395, "train_loss_bbox_2_unscaled": 0.055013144572493046, "train_loss_bbox_3_unscaled": 0.054483783067331704, "train_loss_bbox_4_unscaled": 0.05412610215448962, "train_loss_ce_unscaled": 0.09417220280641464, "train_loss_ce_0_unscaled": 0.13617018393042987, "train_loss_ce_1_unscaled": 0.11660697899942514, "train_loss_ce_2_unscaled": 0.10403265593204704, "train_loss_ce_3_unscaled": 0.09726865797157064, "train_loss_ce_4_unscaled": 0.09410086116382746, "train_loss_giou_unscaled": 0.1675686162070488, "train_loss_giou_0_unscaled": 0.18396219685187454, "train_loss_giou_1_unscaled": 0.17417001575123495, "train_loss_giou_2_unscaled": 0.17085707467522113, "train_loss_giou_3_unscaled": 0.16895525711247505, "train_loss_giou_4_unscaled": 0.16843250352265265, "train_loss_obj_ll_unscaled": 30.889592197686543, "train_loss_obj_ll_0_unscaled": 42.68994404527915, "train_loss_obj_ll_1_unscaled": 37.86563257517548, "train_loss_obj_ll_2_unscaled": 36.09139981038161, "train_loss_obj_ll_3_unscaled": 35.10401065181873, "train_loss_obj_ll_4_unscaled": 33.62544476769816, "test_metrics": {"WI": 0.05356004827184098, "AOSA": 5220.0, "CK_AP50": 58.3890380859375, "CK_P50": 25.75118307055908, "CK_R50": 71.51227713815234, "K_AP50": 58.3890380859375, "K_P50": 25.75118307055908, "K_R50": 71.51227713815234, "U_AP50": 2.7862398624420166, "U_P50": 0.409358215516747, "U_R50": 16.530874785591767}, "test_coco_eval_bbox": [14.451444625854492, 14.451444625854492, 77.8148193359375, 57.15019607543945, 66.93928527832031, 49.282108306884766, 27.985671997070312, 70.54130554199219, 55.28901290893555, 82.7206039428711, 26.307403564453125, 65.15182495117188, 21.9127197265625, 77.91541290283203, 73.61457061767578, 67.8846206665039, 49.1287841796875, 36.78118896484375, 69.1879653930664, 53.060150146484375, 79.1402359008789, 59.972835540771484, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.7862398624420166], "epoch": 40, "n_parameters": 39742295}
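Since wandb was turned off, training curves can still be reconstructed from log lines like the one above, assuming the trainer appends one JSON dict per epoch to a log.txt file in the output directory (the path below is hypothetical); a minimal sketch:

```python
import json
import matplotlib.pyplot as plt

# Rebuild training curves from Deformable DETR-style per-epoch JSON log lines
# ("train_loss", "test_metrics", "epoch", ...) without wandb.
log_path = "exps/PROB/log.txt"      # hypothetical path

epochs, train_loss = [], []
eval_epochs, k_ap50, u_r50 = [], [], []
with open(log_path) as f:
    for line in f:
        rec = json.loads(line)
        epochs.append(rec["epoch"])
        train_loss.append(rec["train_loss"])
        metrics = rec.get("test_metrics")
        if metrics:                 # eval may not run every epoch, depending on --eval_every
            eval_epochs.append(rec["epoch"])
            k_ap50.append(metrics["K_AP50"])
            u_r50.append(metrics["U_R50"])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(epochs, train_loss)
ax1.set_xlabel("epoch"); ax1.set_ylabel("train_loss")
ax2.plot(eval_epochs, k_ap50, label="K_AP50")
ax2.plot(eval_epochs, u_r50, label="U_R50")
ax2.set_xlabel("epoch"); ax2.legend()
fig.savefig("training_curves.png")
```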
The authors' results are:
U-R: 19.4, K-AP: 59.5
Why can't I reach the authors' performance?
@Hatins @orrzohar
Originally posted by @Rzx520 in #26 (comment)