The adversarial training script is showing strange trend #8
Comments
Any help in this regard is highly appreciated. Is something happening after the 30th epoch?
Hi, thanks for your interest in our work. Yes, this log doesn't look right to me. Can you try a larger batch size (4096, for example)? You can use the gradient accumulation that we provide in the code to mimic the large batch size.
Hi, we located the problem: Line 510 in 99bd87d
should be changed to use update_freq, where update_freq is your number of gradient-accumulation steps. update_freq should be set to 8 in your case to maintain the 4096 total batch size. Sorry, we previously temporarily changed our code for a fixed setting on a particular machine, but this should be passed as an argument. We will fix this. Please let me know if you encounter any further problems; I'll be happy to help!
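As a rough sketch of what gradient accumulation does (hypothetical helper names, not the repository's actual training loop): gradients from update_freq micro-batches are averaged, and the optimizer step is only taken once per update_freq iterations, which mimics a single step over one large batch:

```python
def train_with_accumulation(grads_per_step, update_freq):
    """Toy illustration of gradient accumulation (hypothetical helper,
    not the repository's actual code): average gradients over
    update_freq micro-batches before applying one optimizer step."""
    applied_updates = []
    accum = 0.0
    for i, g in enumerate(grads_per_step, start=1):
        accum += g / update_freq           # scale each micro-batch gradient
        if i % update_freq == 0:           # step only every update_freq iters
            applied_updates.append(accum)  # behaves like one large-batch step
            accum = 0.0
    return applied_updates

# Eight micro-batch gradients of 1.0 with update_freq=8 act like a single
# large-batch gradient of 1.0:
print(train_with_accumulation([1.0] * 8, 8))  # -> [1.0]
```

In a real PyTorch loop, the same idea usually appears as dividing the loss by update_freq and calling optimizer.step() and optimizer.zero_grad() only every update_freq iterations.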
So, we are good with a batch size of 64 if this line is changed, right?
Oh, I just found out that you shrank the total batch size (--nproc_per_node=4) in your original script, right? We set --nproc_per_node=8 in https://github.com/ytongbai/ViTs-vs-CNNs/blob/99bd87d1ea3a59724887b1b84fe6cda43267ed70/script/advdeit.sh That means you used half the batch size that we did. Can you try to maintain the original total batch size first?
I am now using exactly your settings: python -m torch.distributed.launch --nproc_per_node=8 --master_port=12349 --use_env main_adv_deit.py --model deit_tiny_patch16_224_adv --batch-size=128 --data-path /datasets/imagenet-ilsvrc2012 --attack-iter 1 --attack-epsilon 4 --attack-step-size 4 --epoch 100 --reprob 0 --no-repeated-aug --sing singln --drop 0 --drop-path 0 --start_epoch 0 --warmup-epochs 10 --cutmix 0 --output_dir save/deit_adv/deit_tiny_patch16_224, but I was just curious what the issue is with this update_freq. I don't see it being used to divide the total dataset, so how are you maintaining a virtual batch size of 4096 here? I understand 8 x 128 x 4 = 4096, but I did not see update_freq used anywhere.
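The effective batch size arithmetic being discussed can be written out explicitly (a minimal sketch with a hypothetical helper name; the repository itself does not define this function):

```python
def effective_batch_size(nproc_per_node, per_gpu_batch, update_freq):
    """Total samples contributing to one optimizer step:
    (number of GPU processes) x (per-GPU batch size) x (gradient
    accumulation steps). Illustrative helper, not the repo's API."""
    return nproc_per_node * per_gpu_batch * update_freq

# The maintainers' setting: 8 GPUs x 128 per GPU x update_freq 4
print(effective_batch_size(8, 128, 4))  # -> 4096

# The poster's 4-GPU run would need update_freq = 8 to match:
print(effective_batch_size(4, 128, 8))  # -> 4096
```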
Yes, try this first and check whether the curve looks healthy. Please ignore that for now. I thought you had already tried a 1024 batch size and it collapsed, so I was thinking this current line is not flexible enough for you to perform gradient accumulation for an even larger batch size. But try this first, and I'll be happy to help if you have other questions :)
Thanks for your quick response. Hope this works!
Hi, the adversarial training script is showing a strange trend: after a certain number of epochs, top-1 accuracy has fallen to 1.6% from around 21%. Is this normal?
I used the script for adv training as:
python -m torch.distributed.launch --nproc_per_node=4 --master_port=5672 --use_env main_adv_deit.py --model deit_small_patch16_224_adv --batch-size 128 --data-path /datasets/imagenet-ilsvrc2012 --attack-iter 1 --attack-epsilon 4 --attack-step-size 4 --epoch 100 --reprob 0 --no-repeated-aug --sing singln --drop 0 --drop-path 0 --start_epoch 0 --warmup-epochs 10 --cutmix 0 --output_dir save/deit_adv/deit_small_patch16_224
Here is the training log (till 40 epochs):
{"train_lr": 1.0000000000000031e-06, "train_loss": 6.885785259502969, "test_0_loss": 6.7725973782139715, "test_0_acc1": 0.806, "test_0_acc5": 2.804, "test_5_loss": 6.844994894602477, "test_5_acc1": 0.55, "test_5_acc5": 1.958, "epoch": 0, "n_parameters": 22050664}
{"train_lr": 1.0000000000000031e-06, "train_loss": 6.846427675869634, "test_0_loss": 6.689390176393554, "test_0_acc1": 1.192, "test_0_acc5": 4.378, "test_5_loss": 6.844994894602477, "test_5_acc1": 0.55, "test_5_acc5": 1.958, "epoch": 1, "n_parameters": 22050664}
{"train_lr": 0.00020090000000000288, "train_loss": 6.701197089479981, "test_0_loss": 5.865309043488896, "test_0_acc1": 5.43, "test_0_acc5": 14.672, "test_5_loss": 6.844994894602477, "test_5_acc1": 0.55, "test_5_acc5": 1.958, "epoch": 2, "n_parameters": 22050664}
{"train_lr": 0.00040079999999998546, "train_loss": 6.543532955179588, "test_0_loss": 5.340847122768371, "test_0_acc1": 9.812, "test_0_acc5": 23.782, "test_5_loss": 6.844994894602477, "test_5_acc1": 0.55, "test_5_acc5": 1.958, "epoch": 3, "n_parameters": 22050664}
{"train_lr": 0.0006006999999999715, "train_loss": 6.4769038225916455, "test_0_loss": 5.03732673006796, "test_0_acc1": 13.248, "test_0_acc5": 29.602, "test_5_loss": 6.844994894602477, "test_5_acc1": 0.55, "test_5_acc5": 1.958, "epoch": 4, "n_parameters": 22050664}
{"train_lr": 0.0008006000000000287, "train_loss": 6.315360357340196, "test_0_loss": 5.300121459301969, "test_0_acc1": 10.944, "test_0_acc5": 25.546, "test_5_loss": 6.55244832365313, "test_5_acc1": 2.756, "test_5_acc5": 7.6525, "epoch": 5, "n_parameters": 22050664}
{"train_lr": 0.0010004999999999689, "train_loss": 6.190600837687318, "test_0_loss": 4.9149362563476755, "test_0_acc1": 14.35, "test_0_acc5": 31.418, "test_5_loss": 6.55244832365313, "test_5_acc1": 2.756, "test_5_acc5": 7.6525, "epoch": 6, "n_parameters": 22050664}
{"train_lr": 0.0012004000000000647, "train_loss": 6.088374964529566, "test_0_loss": 5.50498779843575, "test_0_acc1": 10.254, "test_0_acc5": 24.242, "test_5_loss": 6.55244832365313, "test_5_acc1": 2.756, "test_5_acc5": 7.6525, "epoch": 7, "n_parameters": 22050664}
{"train_lr": 0.0014002999999999238, "train_loss": 6.08913704293142, "test_0_loss": 4.774700349977363, "test_0_acc1": 14.72, "test_0_acc5": 32.384, "test_5_loss": 6.55244832365313, "test_5_acc1": 2.756, "test_5_acc5": 7.6525, "epoch": 8, "n_parameters": 22050664}
{"train_lr": 0.0016001999999999618, "train_loss": 6.150533516344121, "test_0_loss": 5.227625224198276, "test_0_acc1": 10.67, "test_0_acc5": 25.058, "test_5_loss": 6.55244832365313, "test_5_acc1": 2.756, "test_5_acc5": 7.6525, "epoch": 9, "n_parameters": 22050664}
{"train_lr": 0.0018001000000000126, "train_loss": 6.101692359891536, "test_0_loss": 5.141786843786161, "test_0_acc1": 11.414, "test_0_acc5": 26.346, "test_5_loss": 6.756372647642403, "test_5_acc1": 2.309, "test_5_acc5": 6.5675, "epoch": 10, "n_parameters": 22050664}
{"train_lr": 0.001951301233713633, "train_loss": 6.093319233182332, "test_0_loss": 4.774902591320924, "test_0_acc1": 14.368, "test_0_acc5": 31.786, "test_5_loss": 6.756372647642403, "test_5_acc1": 2.309, "test_5_acc5": 6.5675, "epoch": 11, "n_parameters": 22050664}
{"train_lr": 0.001941176365109525, "train_loss": 6.128870297345421, "test_0_loss": 5.251185640492503, "test_0_acc1": 11.492, "test_0_acc5": 26.726, "test_5_loss": 6.756372647642403, "test_5_acc1": 2.309, "test_5_acc5": 6.5675, "epoch": 12, "n_parameters": 22050664}
{"train_lr": 0.0019301276034588222, "train_loss": 6.053121808859752, "test_0_loss": 4.758080562108309, "test_0_acc1": 16.252, "test_0_acc5": 34.882, "test_5_loss": 6.756372647642403, "test_5_acc1": 2.309, "test_5_acc5": 6.5675, "epoch": 13, "n_parameters": 22050664}
{"train_lr": 0.0019181658525555538, "train_loss": 6.0439764577136055, "test_0_loss": 4.586510399862962, "test_0_acc1": 16.69, "test_0_acc5": 35.526, "test_5_loss": 6.756372647642403, "test_5_acc1": 2.309, "test_5_acc5": 6.5675, "epoch": 14, "n_parameters": 22050664}
{"train_lr": 0.0019053029172036828, "train_loss": 5.91496213320062, "test_0_loss": 4.488940908904268, "test_0_acc1": 17.398, "test_0_acc5": 36.698, "test_5_loss": 7.4814025707452325, "test_5_acc1": 1.2555, "test_5_acc5": 4.2435, "epoch": 15, "n_parameters": 22050664}
{"train_lr": 0.0018915514915675221, "train_loss": 6.002524321551898, "test_0_loss": 4.450921233922186, "test_0_acc1": 17.934, "test_0_acc5": 37.114, "test_5_loss": 7.4814025707452325, "test_5_acc1": 1.2555, "test_5_acc5": 4.2435, "epoch": 16, "n_parameters": 22050664}
{"train_lr": 0.0018769251466436458, "train_loss": 5.878266204508851, "test_0_loss": 4.308091710831062, "test_0_acc1": 20.404, "test_0_acc5": 41.2, "test_5_loss": 7.4814025707452325, "test_5_acc1": 1.2555, "test_5_acc5": 4.2435, "epoch": 17, "n_parameters": 22050664}
{"train_lr": 0.0018614383168689135, "train_loss": 5.789360093222343, "test_0_loss": 4.410817133793065, "test_0_acc1": 18.154, "test_0_acc5": 38.082, "test_5_loss": 7.4814025707452325, "test_5_acc1": 1.2555, "test_5_acc5": 4.2435, "epoch": 18, "n_parameters": 22050664}
{"train_lr": 0.0018451062858745686, "train_loss": 5.750880390286541, "test_0_loss": 4.467262921391278, "test_0_acc1": 19.266, "test_0_acc5": 39.462, "test_5_loss": 7.4814025707452325, "test_5_acc1": 1.2555, "test_5_acc5": 4.2435, "epoch": 19, "n_parameters": 22050664}
{"train_lr": 0.0018279451714032378, "train_loss": 5.764791792602562, "test_0_loss": 4.67896575738586, "test_0_acc1": 17.392, "test_0_acc5": 37.15, "test_5_loss": 7.557907587735987, "test_5_acc1": 0.9905, "test_5_acc5": 3.1575, "epoch": 20, "n_parameters": 22050664}
{"train_lr": 0.0018099719094030393, "train_loss": 5.759131700348416, "test_0_loss": 4.419680974762636, "test_0_acc1": 19.966, "test_0_acc5": 40.798, "test_5_loss": 7.557907587735987, "test_5_acc1": 0.9905, "test_5_acc5": 3.1575, "epoch": 21, "n_parameters": 22050664}
{"train_lr": 0.0017912042373137494, "train_loss": 5.710006896111605, "test_0_loss": 4.2751427415236405, "test_0_acc1": 20.356, "test_0_acc5": 41.114, "test_5_loss": 7.557907587735987, "test_5_acc1": 0.9905, "test_5_acc5": 3.1575, "epoch": 22, "n_parameters": 22050664}
{"train_lr": 0.0017716606765619972, "train_loss": 5.68051082098322, "test_0_loss": 4.154385426833091, "test_0_acc1": 21.638, "test_0_acc5": 43.102, "test_5_loss": 7.557907587735987, "test_5_acc1": 0.9905, "test_5_acc5": 3.1575, "epoch": 23, "n_parameters": 22050664}
{"train_lr": 0.0017513605142823508, "train_loss": 5.693617649811158, "test_0_loss": 4.25816687512535, "test_0_acc1": 20.994, "test_0_acc5": 41.96, "test_5_loss": 7.557907587735987, "test_5_acc1": 0.9905, "test_5_acc5": 3.1575, "epoch": 24, "n_parameters": 22050664}
{"train_lr": 0.0017303237842843694, "train_loss": 5.6821105527839695, "test_0_loss": 4.421267043301026, "test_0_acc1": 19.116, "test_0_acc5": 39.094, "test_5_loss": 8.897946262237587, "test_5_acc1": 0.514, "test_5_acc5": 1.892, "epoch": 25, "n_parameters": 22050664}
{"train_lr": 0.001708571247280513, "train_loss": 5.69677297047955, "test_0_loss": 4.398700253595852, "test_0_acc1": 19.178, "test_0_acc5": 39.536, "test_5_loss": 8.897946262237587, "test_5_acc1": 0.514, "test_5_acc5": 1.892, "epoch": 26, "n_parameters": 22050664}
{"train_lr": 0.0016861243703990647, "train_loss": 5.740358965097666, "test_0_loss": 4.446112109237348, "test_0_acc1": 19.972, "test_0_acc5": 40.84, "test_5_loss": 8.897946262237587, "test_5_acc1": 0.514, "test_5_acc5": 1.892, "epoch": 27, "n_parameters": 22050664}
{"train_lr": 0.0016630053059970855, "train_loss": 5.712303198760838, "test_0_loss": 4.1932648324234245, "test_0_acc1": 21.566, "test_0_acc5": 42.98, "test_5_loss": 8.897946262237587, "test_5_acc1": 0.514, "test_5_acc5": 1.892, "epoch": 28, "n_parameters": 22050664}
{"train_lr": 0.0016392368698000565, "train_loss": 5.74558376472631, "test_0_loss": 4.124165606513972, "test_0_acc1": 21.932, "test_0_acc5": 43.39, "test_5_loss": 8.897946262237587, "test_5_acc1": 0.514, "test_5_acc5": 1.892, "epoch": 29, "n_parameters": 22050664}
{"train_lr": 0.0016148425183847566, "train_loss": 5.3731044158518175, "test_0_loss": 4.680003530995935, "test_0_acc1": 15.374, "test_0_acc5": 33.588, "test_5_loss": 11.426024395688863, "test_5_acc1": 0.1075, "test_5_acc5": 0.45, "epoch": 30, "n_parameters": 22050664}
{"train_lr": 0.0015898463260310706, "train_loss": 4.259690835869474, "test_0_loss": 5.981102620495181, "test_0_acc1": 5.786, "test_0_acc5": 14.55, "test_5_loss": 11.426024395688863, "test_5_acc1": 0.1075, "test_5_acc5": 0.45, "epoch": 31, "n_parameters": 22050664}
{"train_lr": 0.0015642729609628443, "train_loss": 4.075305948785836, "test_0_loss": 5.933592574686403, "test_0_acc1": 4.598, "test_0_acc5": 13.066, "test_5_loss": 11.426024395688863, "test_5_acc1": 0.1075, "test_5_acc5": 0.45, "epoch": 32, "n_parameters": 22050664}
{"train_lr": 0.001538147661004018, "train_loss": 4.167220209940351, "test_0_loss": 6.295307500501207, "test_0_acc1": 3.228, "test_0_acc5": 9.566, "test_5_loss": 11.426024395688863, "test_5_acc1": 0.1075, "test_5_acc5": 0.45, "epoch": 33, "n_parameters": 22050664}
{"train_lr": 0.001511496208671658, "train_loss": 4.134730825523774, "test_0_loss": 5.972504679850104, "test_0_acc1": 3.806, "test_0_acc5": 11.758, "test_5_loss": 11.426024395688863, "test_5_acc1": 0.1075, "test_5_acc5": 0.45, "epoch": 34, "n_parameters": 22050664}
{"train_lr": 0.0014843449057311518, "train_loss": 4.365966309007885, "test_0_loss": 6.5958606600380065, "test_0_acc1": 2.156, "test_0_acc5": 7.3, "test_5_loss": 12.839466273136347, "test_5_acc1": 0.002, "test_5_acc5": 0.0035, "epoch": 35, "n_parameters": 22050664}
{"train_lr": 0.00145672054724078, "train_loss": 4.49492947772729, "test_0_loss": 6.905164708865429, "test_0_acc1": 1.588, "test_0_acc5": 5.264, "test_5_loss": 12.839466273136347, "test_5_acc1": 0.002, "test_5_acc5": 0.0035, "epoch": 36, "n_parameters": 22050664}
{"train_lr": 0.0014286503951072877, "train_loss": 4.562651729769558, "test_0_loss": 6.958603466617245, "test_0_acc1": 1.594, "test_0_acc5": 5.226, "test_5_loss": 12.839466273136347, "test_5_acc1": 0.002, "test_5_acc5": 0.0035, "epoch": 37, "n_parameters": 22050664}
{"train_lr": 0.0014001621511816529, "train_loss": 4.620032903101804, "test_0_loss": 6.883705623624269, "test_0_acc1": 1.946, "test_0_acc5": 5.582, "test_5_loss": 12.839466273136347, "test_5_acc1": 0.002, "test_5_acc5": 0.0035, "epoch": 38, "n_parameters": 22050664}
{"train_lr": 0.0013712839299212382, "train_loss": 4.635755813831715, "test_0_loss": 7.244386745887312, "test_0_acc1": 0.964, "test_0_acc5": 3.976, "test_5_loss": 12.839466273136347, "test_5_acc1": 0.002, "test_5_acc5": 0.0035, "epoch": 39, "n_parameters": 22050664}
{"train_lr": 0.0013420442306441068, "train_loss": 4.83734727265547, "test_0_loss": 7.3013705145603405, "test_0_acc1": 1.686, "test_0_acc5": 4.768, "test_5_loss": 14.802937962195847, "test_5_acc1": 0.0, "test_5_acc5": 0.0, "epoch": 40, "n_parameters": 22050664}