Why can't models trained on two A100s achieve your performance? #47
Hi @Rzx520,
I just reproduced your model; there is no special context. @orrzohar
You mean you used the weights I released?
No, I trained it with two A100s, and the dataset is the one described in the instructions you provided.
In general, it is impossible to answer this question without more context, as I do not have access to 80GB A100s and cannot run the experiment myself to see the variation. Even if you took my pre-trained weights and only evaluated them on a different system, you would see some variation of +/-1, as validated by others in previous issues.
To answer as best I can: if you re-trained from scratch on a different system, even more variation is to be expected, and some hyperparameter tuning would most likely be needed. You may need to change the number of epochs you train for, the epoch at which you drop the lr, the lr itself, etc. The hyperparameters I used on my system will definitely not be optimal on all systems. Judging from your numbers, you may be over-training, but I can only tell that from the training curves themselves.
As you are using 80GB GPUs, you can probably increase the batch size, which will make training more efficient and stable. However, when changing the batch size you will need to change the lr as well. NNs as a whole are affected by the training schedule, and if you use a different system, that schedule changes, which affects the final state of the model -- unless the model is very simple/insensitive, or robust to hyperparameters/randomness in training.
If you would like, please share the relevant information and I would be happy to help you optimize the hyperparameters to get similar results.
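(For reference, the batch-size/lr relationship mentioned above is often handled with a simple linear scaling rule. The sketch below is not taken from this repository's code; the baseline batch size and learning rate are assumptions for illustration, so substitute the values from your own config.)

# A minimal sketch (not from this repository) of the linear lr-scaling
# heuristic: if you enlarge the effective batch size on 80GB A100s,
# scale the base learning rate by roughly the same factor.
# The baseline values below are assumptions -- check the repo's configs
# for the actual defaults.

base_lr = 2e-4        # assumed baseline learning rate
base_batch = 2 * 4    # assumed: batch size 2 per GPU on 4 GPUs
new_batch = 4 * 2 * 2 # e.g. doubling the per-GPU batch size on 2 x A100 80GB

scaled_lr = base_lr * new_batch / base_batch
print(f"suggested lr for effective batch size {new_batch}: {scaled_lr:.2e}")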
When I used two A100s, the parameter settings remained unchanged. I used the code and parameters you provided, including the batch size, lr, and other parameters. The only change was from your 4 GPUs to 2 GPUs. @orrzohar
You also used a different server, with different GPUs (I used A100 40GB GPUs) and a different number of GPUs, all of which will affect training. There are probably countless other differences which I do not know about (OS, dependencies, etc.). Even as time has passed, some default Python dependencies may have changed. When using a different system, there will be some variation and you will need to tune the hyperparameters. This is, in no small part, why we publish the weights of the models. When re-training from scratch, I cannot guarantee complete replication of performance. With some hyperparameter tuning, you should be able to get within +/-1 of the reported values (see #26, where results were reproduced on 3090s).
Looking at your training curves, I would probably reduce the lr_drop to ~125k iterations and proportionally reduce the overall number of epochs. You can over-train the second stage of the training after the lr_drop and select the optimal model in terms of U_R/K_mAP. This should be relatively easy, as the model saves checkpoints every few epochs, so you could restart from a checkpoint rather than all the way from scratch.
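(As a rough aid for translating the ~125k-iteration suggestion into an epoch-based lr_drop setting, here is a small sketch. It is not part of this repository; the dataset size and batch sizes are hypothetical placeholders, so plug in your own numbers.)

# A small helper (not part of this repository) to convert a target
# iteration count into an epoch index for an epoch-based lr_drop flag.
# All numbers below are placeholders for illustration only.

def lr_drop_epoch(target_iters: int, num_train_images: int,
                  batch_per_gpu: int, num_gpus: int) -> int:
    """Return the epoch at which roughly target_iters iterations have elapsed."""
    iters_per_epoch = num_train_images // (batch_per_gpu * num_gpus)
    return max(1, round(target_iters / iters_per_epoch))

# Hypothetical example: 16k training images, batch size 2 per GPU, 2 GPUs
print(lr_drop_epoch(125_000, 16_000, 2, 2))  # -> 31, i.e. try dropping the lr around there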
Hi @Rzx520, Did the performance improve when reducing lr_drop? Best, |
Yes, there was indeed some improvement.
Hi @Rzx520, |
A100 (80GB); nothing else has changed.
{"K_AP50": 59.38904571533203, "K_P50": 21.074942637087915, "K_R50": 72.52758104006436, "U_AP50": 0.6464414000511169, "U_P50": 0.4288344914478119, "U_R50": 16.88679245283019, "epoch": 40}, "test_coco_eval_bbox": [14.671942710876465, 14.671942710876465, 78.46551513671875, 58.18337631225586, 64.30726623535156, 50.592430114746094, 29.676156997680664, 71.94124603271484, 56.22311782836914, 82.22350311279297, 27.28054428100586, 71.0342788696289, 22.341707229614258, 82.27958679199219, 71.79204559326172, 68.34331512451172, 49.77190017700195, 35.397483825683594, 71.02239227294922, 50.98625564575195, 83.90058135986328, 62.01821517944336, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6464414000511169], "epoch": 40, "n_parameters": 39742295}@orrzohar Thanks