Why can't models trained on two A100s achieve your performance? #47
Hi @Rzx520,
I just reproduced your model; there is no special context. @orrzohar
You mean you used the weights I released?
No, I trained it with two A100s, and the dataset is the one described in the instructions you provided.
In general, it is impossible to answer this question without more context, as I do not have access to 80GB A100s and cannot run the experiment myself to see the variation. Even if you took my pre-trained weights and only evaluated them on a different system, you would see some variation of +/-1, as validated by others in previous issues.
To answer as best I can: if you re-trained from scratch on a different system, even more variation is to be expected, and some hyperparameter tuning would most likely be needed. You may need to change the number of epochs you train for, the epoch at which you drop the lr, the lr itself, etc. The hyperparameters I used on my system will definitely not be optimal on all systems. Judging from your numbers, you may be over-training, but I can only tell that from the training curves themselves.
As you are using 80GB GPUs, you can probably increase the batch size, which will make training more efficient and stable. However, when changing the batch size you will need to change the lr as well. NNs as a whole are affected by the training schedule, and if you use a different system, that schedule changes, which affects the final state of the model -- unless the model is very simple/insensitive, or robust to hyperparameters/randomness in training.
If you would like, please share the relevant information and I would be happy to help you optimize the hyperparameters to get similar results.
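(For reference, the batch-size/lr relationship mentioned above is often handled with a simple linear scaling rule. The sketch below is not taken from this repository's code; the baseline batch size and learning rate are assumptions for illustration, so substitute the values from your own config.)

# A minimal sketch (not from this repository) of the linear lr-scaling
# heuristic: if you enlarge the effective batch size on 80GB A100s,
# scale the base learning rate by roughly the same factor.
# The baseline values below are assumptions -- check the repo's configs
# for the actual defaults.

base_lr = 2e-4        # assumed baseline learning rate
base_batch = 2 * 4    # assumed: batch size 2 per GPU on 4 GPUs
new_batch = 4 * 2 * 2 # e.g. doubling the per-GPU batch size on 2 x A100 80GB

scaled_lr = base_lr * new_batch / base_batch
print(f"suggested lr for effective batch size {new_batch}: {scaled_lr:.2e}")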
When I used two A100s, the parameter settings remained unchanged. I used the code and parameters you provided, including the batch size, lr, and other parameters. The only change was from your 4 GPUs to 2 GPUs. @orrzohar
You also used a different server, with different GPUs (I used A100 40GB GPUs) and a different number of GPUs, all of which will affect training. There are probably countless other differences which I do not know about (OS, dependencies, etc.). Even as time has passed, some default Python dependencies may have changed. When using a different system, there will be some variation and you will need to tune the hyperparameters. This is, in no small part, why we publish the weights of the models. When re-training from scratch, I cannot guarantee complete replication of performance. With some hyperparameter tuning, you should be able to get within +/-1 of the reported values (see #26, where results were reproduced on 3090s).
Looking at your training curves, I would probably reduce the lr_drop to ~125k iterations and proportionally reduce the overall number of epochs. You can over-train the second stage of the training after the lr_drop and select the optimal model in terms of U_R/K_mAP. This should be relatively easy, as the model saves checkpoints every few epochs, so you could restart from a checkpoint rather than all the way from scratch.
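(As a rough aid for translating the ~125k-iteration suggestion into an epoch-based lr_drop setting, here is a small sketch. It is not part of this repository; the dataset size and batch sizes are hypothetical placeholders, so plug in your own numbers.)

# A small helper (not part of this repository) to convert a target
# iteration count into an epoch index for an epoch-based lr_drop flag.
# All numbers below are placeholders for illustration only.

def lr_drop_epoch(target_iters: int, num_train_images: int,
                  batch_per_gpu: int, num_gpus: int) -> int:
    """Return the epoch at which roughly target_iters iterations have elapsed."""
    iters_per_epoch = num_train_images // (batch_per_gpu * num_gpus)
    return max(1, round(target_iters / iters_per_epoch))

# Hypothetical example: 16k training images, batch size 2 per GPU, 2 GPUs
print(lr_drop_epoch(125_000, 16_000, 2, 2))  # -> 31, i.e. try dropping the lr around there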
Hi @Rzx520, Did the performance improve when reducing lr_drop? Best, |
Yes, there was indeed some improvement.
Hi @Rzx520, |
A100 (80GB); nothing else has changed.
{"K_AP50": 59.38904571533203, "K_P50": 21.074942637087915, "K_R50": 72.52758104006436, "U_AP50": 0.6464414000511169, "U_P50": 0.4288344914478119, "U_R50": 16.88679245283019, "epoch": 40}, "test_coco_eval_bbox": [14.671942710876465, 14.671942710876465, 78.46551513671875, 58.18337631225586, 64.30726623535156, 50.592430114746094, 29.676156997680664, 71.94124603271484, 56.22311782836914, 82.22350311279297, 27.28054428100586, 71.0342788696289, 22.341707229614258, 82.27958679199219, 71.79204559326172, 68.34331512451172, 49.77190017700195, 35.397483825683594, 71.02239227294922, 50.98625564575195, 83.90058135986328, 62.01821517944336, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6464414000511169], "epoch": 40, "n_parameters": 39742295}@orrzohar Thanks