
The adversarial training script is showing strange trend #8

Closed
ksouvik52 opened this issue Jul 22, 2022 · 8 comments

Comments

@ksouvik52

Hi, the adversarial training script is showing a strange trend: after a certain number of epochs, the top-1 accuracy has fallen to 1.6% from around 21%. Is this normal?

I used the script for adv training as:
python -m torch.distributed.launch --nproc_per_node=4 --master_port=5672 --use_env main_adv_deit.py --model deit_small_patch16_224_adv --batch-size 128 --data-path /datasets/imagenet-ilsvrc2012 --attack-iter 1 --attack-epsilon 4 --attack-step-size 4 --epoch 100 --reprob 0 --no-repeated-aug --sing singln --drop 0 --drop-path 0 --start_epoch 0 --warmup-epochs 10 --cutmix 0 --output_dir save/deit_adv/deit_small_patch16_224

Here is the training log (up to epoch 40):
{"train_lr": 1.0000000000000031e-06, "train_loss": 6.885785259502969, "test_0_loss": 6.7725973782139715, "test_0_acc1": 0.806, "test_0_acc5": 2.804, "test_5_loss": 6.844994894602477, "test_5_acc1": 0.55, "test_5_acc5": 1.958, "epoch": 0, "n_parameters": 22050664}
{"train_lr": 1.0000000000000031e-06, "train_loss": 6.885785259502969, "test_0_loss": 6.7725973782139715, "test_0_acc1": 0.806, "test_0_acc5": 2.804, "test_5_loss": 6.844994894602477, "test_5_acc1": 0.55, "test_5_acc5": 1.958, "epoch": 0, "n_parameters": 22050664}
{"train_lr": 1.0000000000000031e-06, "train_loss": 6.846427675869634, "test_0_loss": 6.689390176393554, "test_0_acc1": 1.192, "test_0_acc5": 4.378, "test_5_loss": 6.844994894602477, "test_5_acc1": 0.55, "test_5_acc5": 1.958, "epoch": 1, "n_parameters": 22050664}
{"train_lr": 0.00020090000000000288, "train_loss": 6.701197089479981, "test_0_loss": 5.865309043488896, "test_0_acc1": 5.43, "test_0_acc5": 14.672, "test_5_loss": 6.844994894602477, "test_5_acc1": 0.55, "test_5_acc5": 1.958, "epoch": 2, "n_parameters": 22050664}
{"train_lr": 0.00040079999999998546, "train_loss": 6.543532955179588, "test_0_loss": 5.340847122768371, "test_0_acc1": 9.812, "test_0_acc5": 23.782, "test_5_loss": 6.844994894602477, "test_5_acc1": 0.55, "test_5_acc5": 1.958, "epoch": 3, "n_parameters": 22050664}
{"train_lr": 0.0006006999999999715, "train_loss": 6.4769038225916455, "test_0_loss": 5.03732673006796, "test_0_acc1": 13.248, "test_0_acc5": 29.602, "test_5_loss": 6.844994894602477, "test_5_acc1": 0.55, "test_5_acc5": 1.958, "epoch": 4, "n_parameters": 22050664}
{"train_lr": 0.0008006000000000287, "train_loss": 6.315360357340196, "test_0_loss": 5.300121459301969, "test_0_acc1": 10.944, "test_0_acc5": 25.546, "test_5_loss": 6.55244832365313, "test_5_acc1": 2.756, "test_5_acc5": 7.6525, "epoch": 5, "n_parameters": 22050664}
{"train_lr": 0.0010004999999999689, "train_loss": 6.190600837687318, "test_0_loss": 4.9149362563476755, "test_0_acc1": 14.35, "test_0_acc5": 31.418, "test_5_loss": 6.55244832365313, "test_5_acc1": 2.756, "test_5_acc5": 7.6525, "epoch": 6, "n_parameters": 22050664}
{"train_lr": 0.0012004000000000647, "train_loss": 6.088374964529566, "test_0_loss": 5.50498779843575, "test_0_acc1": 10.254, "test_0_acc5": 24.242, "test_5_loss": 6.55244832365313, "test_5_acc1": 2.756, "test_5_acc5": 7.6525, "epoch": 7, "n_parameters": 22050664}
{"train_lr": 0.0014002999999999238, "train_loss": 6.08913704293142, "test_0_loss": 4.774700349977363, "test_0_acc1": 14.72, "test_0_acc5": 32.384, "test_5_loss": 6.55244832365313, "test_5_acc1": 2.756, "test_5_acc5": 7.6525, "epoch": 8, "n_parameters": 22050664}
{"train_lr": 0.0016001999999999618, "train_loss": 6.150533516344121, "test_0_loss": 5.227625224198276, "test_0_acc1": 10.67, "test_0_acc5": 25.058, "test_5_loss": 6.55244832365313, "test_5_acc1": 2.756, "test_5_acc5": 7.6525, "epoch": 9, "n_parameters": 22050664}
{"train_lr": 0.0018001000000000126, "train_loss": 6.101692359891536, "test_0_loss": 5.141786843786161, "test_0_acc1": 11.414, "test_0_acc5": 26.346, "test_5_loss": 6.756372647642403, "test_5_acc1": 2.309, "test_5_acc5": 6.5675, "epoch": 10, "n_parameters": 22050664}
{"train_lr": 0.001951301233713633, "train_loss": 6.093319233182332, "test_0_loss": 4.774902591320924, "test_0_acc1": 14.368, "test_0_acc5": 31.786, "test_5_loss": 6.756372647642403, "test_5_acc1": 2.309, "test_5_acc5": 6.5675, "epoch": 11, "n_parameters": 22050664}
{"train_lr": 0.001941176365109525, "train_loss": 6.128870297345421, "test_0_loss": 5.251185640492503, "test_0_acc1": 11.492, "test_0_acc5": 26.726, "test_5_loss": 6.756372647642403, "test_5_acc1": 2.309, "test_5_acc5": 6.5675, "epoch": 12, "n_parameters": 22050664}
{"train_lr": 0.0019301276034588222, "train_loss": 6.053121808859752, "test_0_loss": 4.758080562108309, "test_0_acc1": 16.252, "test_0_acc5": 34.882, "test_5_loss": 6.756372647642403, "test_5_acc1": 2.309, "test_5_acc5": 6.5675, "epoch": 13, "n_parameters": 22050664}
{"train_lr": 0.0019181658525555538, "train_loss": 6.0439764577136055, "test_0_loss": 4.586510399862962, "test_0_acc1": 16.69, "test_0_acc5": 35.526, "test_5_loss": 6.756372647642403, "test_5_acc1": 2.309, "test_5_acc5": 6.5675, "epoch": 14, "n_parameters": 22050664}
{"train_lr": 0.0019053029172036828, "train_loss": 5.91496213320062, "test_0_loss": 4.488940908904268, "test_0_acc1": 17.398, "test_0_acc5": 36.698, "test_5_loss": 7.4814025707452325, "test_5_acc1": 1.2555, "test_5_acc5": 4.2435, "epoch": 15, "n_parameters": 22050664}
{"train_lr": 0.0018915514915675221, "train_loss": 6.002524321551898, "test_0_loss": 4.450921233922186, "test_0_acc1": 17.934, "test_0_acc5": 37.114, "test_5_loss": 7.4814025707452325, "test_5_acc1": 1.2555, "test_5_acc5": 4.2435, "epoch": 16, "n_parameters": 22050664}
{"train_lr": 0.0018769251466436458, "train_loss": 5.878266204508851, "test_0_loss": 4.308091710831062, "test_0_acc1": 20.404, "test_0_acc5": 41.2, "test_5_loss": 7.4814025707452325, "test_5_acc1": 1.2555, "test_5_acc5": 4.2435, "epoch": 17, "n_parameters": 22050664}
{"train_lr": 0.0018614383168689135, "train_loss": 5.789360093222343, "test_0_loss": 4.410817133793065, "test_0_acc1": 18.154, "test_0_acc5": 38.082, "test_5_loss": 7.4814025707452325, "test_5_acc1": 1.2555, "test_5_acc5": 4.2435, "epoch": 18, "n_parameters": 22050664}
{"train_lr": 0.0018451062858745686, "train_loss": 5.750880390286541, "test_0_loss": 4.467262921391278, "test_0_acc1": 19.266, "test_0_acc5": 39.462, "test_5_loss": 7.4814025707452325, "test_5_acc1": 1.2555, "test_5_acc5": 4.2435, "epoch": 19, "n_parameters": 22050664}
{"train_lr": 0.0018279451714032378, "train_loss": 5.764791792602562, "test_0_loss": 4.67896575738586, "test_0_acc1": 17.392, "test_0_acc5": 37.15, "test_5_loss": 7.557907587735987, "test_5_acc1": 0.9905, "test_5_acc5": 3.1575, "epoch": 20, "n_parameters": 22050664}
{"train_lr": 0.0018099719094030393, "train_loss": 5.759131700348416, "test_0_loss": 4.419680974762636, "test_0_acc1": 19.966, "test_0_acc5": 40.798, "test_5_loss": 7.557907587735987, "test_5_acc1": 0.9905, "test_5_acc5": 3.1575, "epoch": 21, "n_parameters": 22050664}
{"train_lr": 0.0017912042373137494, "train_loss": 5.710006896111605, "test_0_loss": 4.2751427415236405, "test_0_acc1": 20.356, "test_0_acc5": 41.114, "test_5_loss": 7.557907587735987, "test_5_acc1": 0.9905, "test_5_acc5": 3.1575, "epoch": 22, "n_parameters": 22050664}
{"train_lr": 0.0017716606765619972, "train_loss": 5.68051082098322, "test_0_loss": 4.154385426833091, "test_0_acc1": 21.638, "test_0_acc5": 43.102, "test_5_loss": 7.557907587735987, "test_5_acc1": 0.9905, "test_5_acc5": 3.1575, "epoch": 23, "n_parameters": 22050664}
{"train_lr": 0.0017513605142823508, "train_loss": 5.693617649811158, "test_0_loss": 4.25816687512535, "test_0_acc1": 20.994, "test_0_acc5": 41.96, "test_5_loss": 7.557907587735987, "test_5_acc1": 0.9905, "test_5_acc5": 3.1575, "epoch": 24, "n_parameters": 22050664}
{"train_lr": 0.0017303237842843694, "train_loss": 5.6821105527839695, "test_0_loss": 4.421267043301026, "test_0_acc1": 19.116, "test_0_acc5": 39.094, "test_5_loss": 8.897946262237587, "test_5_acc1": 0.514, "test_5_acc5": 1.892, "epoch": 25, "n_parameters": 22050664}
{"train_lr": 0.001708571247280513, "train_loss": 5.69677297047955, "test_0_loss": 4.398700253595852, "test_0_acc1": 19.178, "test_0_acc5": 39.536, "test_5_loss": 8.897946262237587, "test_5_acc1": 0.514, "test_5_acc5": 1.892, "epoch": 26, "n_parameters": 22050664}
{"train_lr": 0.0016861243703990647, "train_loss": 5.740358965097666, "test_0_loss": 4.446112109237348, "test_0_acc1": 19.972, "test_0_acc5": 40.84, "test_5_loss": 8.897946262237587, "test_5_acc1": 0.514, "test_5_acc5": 1.892, "epoch": 27, "n_parameters": 22050664}
{"train_lr": 0.0016630053059970855, "train_loss": 5.712303198760838, "test_0_loss": 4.1932648324234245, "test_0_acc1": 21.566, "test_0_acc5": 42.98, "test_5_loss": 8.897946262237587, "test_5_acc1": 0.514, "test_5_acc5": 1.892, "epoch": 28, "n_parameters": 22050664}
{"train_lr": 0.0016392368698000565, "train_loss": 5.74558376472631, "test_0_loss": 4.124165606513972, "test_0_acc1": 21.932, "test_0_acc5": 43.39, "test_5_loss": 8.897946262237587, "test_5_acc1": 0.514, "test_5_acc5": 1.892, "epoch": 29, "n_parameters": 22050664}
{"train_lr": 0.0016148425183847566, "train_loss": 5.3731044158518175, "test_0_loss": 4.680003530995935, "test_0_acc1": 15.374, "test_0_acc5": 33.588, "test_5_loss": 11.426024395688863, "test_5_acc1": 0.1075, "test_5_acc5": 0.45, "epoch": 30, "n_parameters": 22050664}
{"train_lr": 0.0015898463260310706, "train_loss": 4.259690835869474, "test_0_loss": 5.981102620495181, "test_0_acc1": 5.786, "test_0_acc5": 14.55, "test_5_loss": 11.426024395688863, "test_5_acc1": 0.1075, "test_5_acc5": 0.45, "epoch": 31, "n_parameters": 22050664}
{"train_lr": 0.0015642729609628443, "train_loss": 4.075305948785836, "test_0_loss": 5.933592574686403, "test_0_acc1": 4.598, "test_0_acc5": 13.066, "test_5_loss": 11.426024395688863, "test_5_acc1": 0.1075, "test_5_acc5": 0.45, "epoch": 32, "n_parameters": 22050664}
{"train_lr": 0.001538147661004018, "train_loss": 4.167220209940351, "test_0_loss": 6.295307500501207, "test_0_acc1": 3.228, "test_0_acc5": 9.566, "test_5_loss": 11.426024395688863, "test_5_acc1": 0.1075, "test_5_acc5": 0.45, "epoch": 33, "n_parameters": 22050664}
{"train_lr": 0.001511496208671658, "train_loss": 4.134730825523774, "test_0_loss": 5.972504679850104, "test_0_acc1": 3.806, "test_0_acc5": 11.758, "test_5_loss": 11.426024395688863, "test_5_acc1": 0.1075, "test_5_acc5": 0.45, "epoch": 34, "n_parameters": 22050664}
{"train_lr": 0.0014843449057311518, "train_loss": 4.365966309007885, "test_0_loss": 6.5958606600380065, "test_0_acc1": 2.156, "test_0_acc5": 7.3, "test_5_loss": 12.839466273136347, "test_5_acc1": 0.002, "test_5_acc5": 0.0035, "epoch": 35, "n_parameters": 22050664}
{"train_lr": 0.00145672054724078, "train_loss": 4.49492947772729, "test_0_loss": 6.905164708865429, "test_0_acc1": 1.588, "test_0_acc5": 5.264, "test_5_loss": 12.839466273136347, "test_5_acc1": 0.002, "test_5_acc5": 0.0035, "epoch": 36, "n_parameters": 22050664}
{"train_lr": 0.0014286503951072877, "train_loss": 4.562651729769558, "test_0_loss": 6.958603466617245, "test_0_acc1": 1.594, "test_0_acc5": 5.226, "test_5_loss": 12.839466273136347, "test_5_acc1": 0.002, "test_5_acc5": 0.0035, "epoch": 37, "n_parameters": 22050664}
{"train_lr": 0.0014001621511816529, "train_loss": 4.620032903101804, "test_0_loss": 6.883705623624269, "test_0_acc1": 1.946, "test_0_acc5": 5.582, "test_5_loss": 12.839466273136347, "test_5_acc1": 0.002, "test_5_acc5": 0.0035, "epoch": 38, "n_parameters": 22050664}
{"train_lr": 0.0013712839299212382, "train_loss": 4.635755813831715, "test_0_loss": 7.244386745887312, "test_0_acc1": 0.964, "test_0_acc5": 3.976, "test_5_loss": 12.839466273136347, "test_5_acc1": 0.002, "test_5_acc5": 0.0035, "epoch": 39, "n_parameters": 22050664}
{"train_lr": 0.0013420442306441068, "train_loss": 4.83734727265547, "test_0_loss": 7.3013705145603405, "test_0_acc1": 1.686, "test_0_acc5": 4.768, "test_5_loss": 14.802937962195847, "test_5_acc1": 0.0, "test_5_acc5": 0.0, "epoch": 40, "n_parameters": 22050664}

@ksouvik52
Author

Any help in this regard is highly appreciated. Is something going wrong after the 30th epoch?

@ytongbai
Owner

Hi, thanks for your interest in our work. Yes, this log doesn't look right to me. Can you try a larger batch size (4096, for example)? You can use the gradient accumulation we provide in the code to mimic the large batch size.
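
For reference, gradient accumulation in PyTorch looks roughly like the sketch below; the model, data, and loop here are illustrative stand-ins, not the repo's actual training code:

import torch
import torch.nn as nn

# Illustrative stand-ins for the real model and data loader.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
update_freq = 8  # micro-batches accumulated per optimizer step

optimizer.zero_grad()
for step in range(32):
    images = torch.randn(128, 10)             # per-GPU micro-batch of 128
    targets = torch.randint(0, 2, (128,))
    loss = criterion(model(images), targets)
    (loss / update_freq).backward()           # scale so accumulated grads average out
    if (step + 1) % update_freq == 0:
        optimizer.step()                      # one update per update_freq micro-batches
        optimizer.zero_grad()

With a per-GPU batch of 128 on 4 GPUs and update_freq = 8, each optimizer step then covers 128 x 4 x 8 = 4096 samples.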

@ytongbai
Owner

Hi, we located the problem:
Can you try to change this line:

linear_scaled_lr = args.lr * args.batch_size * utils.get_world_size() * 4 / args.adjust_lr

to:
linear_scaled_lr = args.lr * args.batch_size * utils.get_world_size() * args.update_freq / args.adjust_lr

where update_freq is your number of gradient-accumulation steps.

In your case, update_freq should be set to 8 to maintain the 4096 total batch size.
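
As a rough sketch of what the corrected line computes (base_lr and adjust_lr below are assumed defaults, not verified against the repo's argument parser; the batch/GPU numbers come from the command above):

# Assumed defaults -- check the repo's argparser: lr=5e-4, adjust_lr=512.
base_lr = 5e-4       # args.lr (assumption)
adjust_lr = 512.0    # args.adjust_lr (assumption)
batch_size = 128     # per-GPU --batch-size from the command above
world_size = 4       # --nproc_per_node=4
update_freq = 8      # accumulation steps, replacing the hardcoded 4

linear_scaled_lr = base_lr * batch_size * world_size * update_freq / adjust_lr
print(linear_scaled_lr)  # 0.004 with these assumed values (total batch 128*4*8 = 4096)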

Sorry, we temporarily changed our code for a fixed setting on a particular machine, but this should be passed as an argument. We will fix this. Please let me know if you encounter further problems; I'll be happy to help!

@ksouvik52
Author

ksouvik52 commented Jul 24, 2022

So we are good with a batch size of 64 if this line is changed, right?
As per my understanding, you are saying args.batch_size * utils.get_world_size() * args.update_freq should be 4096, right? If so, I think for a per-GPU batch size of 64 with 4 GPUs, update_freq should be 16, right?
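
(A quick check of that arithmetic against the corrected formula, as a standalone snippet rather than repo code:)

per_gpu_batch, world_size, update_freq = 64, 4, 16
assert per_gpu_batch * world_size * update_freq == 4096  # 64 x 4 x 16 = 4096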

@ytongbai
Owner

Oh, I just noticed that you shrank the total batch size (--nproc_per_node=4) in your original script, right?

We set --nproc_per_node=8 in https://github.com/ytongbai/ViTs-vs-CNNs/blob/99bd87d1ea3a59724887b1b84fe6cda43267ed70/script/advdeit.sh

That means you used half of that total batch size.

Can you first try keeping the original total batch size?

@ksouvik52
Author

ksouvik52 commented Jul 24, 2022

I am now using exactly your settings:

python -m torch.distributed.launch --nproc_per_node=8 --master_port=12349 --use_env main_adv_deit.py --model deit_tiny_patch16_224_adv --batch-size=128 --data-path /datasets/imagenet-ilsvrc2012 --attack-iter 1 --attack-epsilon 4 --attack-step-size 4 --epoch 100 --reprob 0 --no-repeated-aug --sing singln --drop 0 --drop-path 0 --start_epoch 0 --warmup-epochs 10 --cutmix 0 --output_dir save/deit_adv/deit_tiny_patch16_224

But I was just curious: what is the issue with this update_freq? I don't see it being used to divide the total dataset, so how are you maintaining a virtual batch size of 4096 here? I understand 8x128x4 = 4096, but I did not see update_freq used anywhere.

@ytongbai
Owner

Yeah, try this first and check whether the curve looks healthy.

Please ignore update_freq for now. I thought you had already tried a 1024 batch size and it collapsed, so I was thinking the current line is not flexible enough for you to use gradient accumulation for an even larger batch size.

But try this first, and I'll be happy to help if you still have other questions :)

@ksouvik52
Author

Thanks for your quick response. Hope this works!
