accuracy #2

Open
SophieZhou opened this issue Mar 16, 2018 · 24 comments
@SophieZhou

I used the default parameters in your code, but I did not get your results. The results are as follows:

  • Prec@1 62.320 Prec@5 84.862: the top-1 accuracy is only 62.32 and the top-5 only 84.86, and I do not know why.

Test: [0/196] Time 5.409 (5.409) Loss 0.8694 (0.8694) Prec@1 79.297 (79.297) Prec@5 92.578 (92.578)
Test: [10/196] Time 0.603 (1.098) Loss 1.4834 (1.1204) Prec@1 60.938 (71.804) Prec@5 85.156 (89.666)
Test: [20/196] Time 2.040 (1.014) Loss 1.2063 (1.1246) Prec@1 76.953 (72.024) Prec@5 87.500 (89.844)
Test: [30/196] Time 0.090 (0.925) Loss 1.1544 (1.1004) Prec@1 67.578 (72.228) Prec@5 91.406 (90.373)
Test: [40/196] Time 0.090 (0.895) Loss 1.1766 (1.1700) Prec@1 69.141 (69.769) Prec@5 91.406 (90.139)
Test: [50/196] Time 0.139 (0.864) Loss 0.8123 (1.1642) Prec@1 79.297 (69.447) Prec@5 95.703 (90.640)
Test: [60/196] Time 0.145 (0.877) Loss 1.4835 (1.1626) Prec@1 60.938 (69.454) Prec@5 90.234 (90.843)
Test: [70/196] Time 0.501 (0.863) Loss 1.1081 (1.1471) Prec@1 72.266 (70.054) Prec@5 91.797 (91.065)
Test: [80/196] Time 1.244 (0.866) Loss 2.0545 (1.1744) Prec@1 50.000 (69.517) Prec@5 79.297 (90.615)
Test: [90/196] Time 2.430 (0.871) Loss 2.7312 (1.2613) Prec@1 37.891 (67.801) Prec@5 69.141 (89.423)
Test: [100/196] Time 0.107 (0.851) Loss 2.3366 (1.3372) Prec@1 42.969 (66.286) Prec@5 73.047 (88.285)
Test: [110/196] Time 0.100 (0.852) Loss 1.3854 (1.3681) Prec@1 67.578 (65.819) Prec@5 86.328 (87.767)
Test: [120/196] Time 0.099 (0.847) Loss 2.1421 (1.3998) Prec@1 53.516 (65.357) Prec@5 75.391 (87.206)
Test: [130/196] Time 0.653 (0.844) Loss 1.3761 (1.4418) Prec@1 67.188 (64.474) Prec@5 87.891 (86.650)
Test: [140/196] Time 0.102 (0.834) Loss 1.7194 (1.4745) Prec@1 58.984 (63.860) Prec@5 82.031 (86.212)
Test: [150/196] Time 0.096 (0.832) Loss 1.7810 (1.5061) Prec@1 66.016 (63.351) Prec@5 81.250 (85.741)
Test: [160/196] Time 0.468 (0.830) Loss 1.4580 (1.5287) Prec@1 69.141 (62.963) Prec@5 85.156 (85.355)
Test: [170/196] Time 1.068 (0.833) Loss 1.2060 (1.5562) Prec@1 69.922 (62.358) Prec@5 90.234 (84.937)
Test: [180/196] Time 0.259 (0.826) Loss 1.4454 (1.5751) Prec@1 59.766 (61.991) Prec@5 90.234 (84.647)
Test: [190/196] Time 0.212 (0.827) Loss 1.5322 (1.5684) Prec@1 57.812 (62.089) Prec@5 88.281 (84.757)

  • Prec@1 62.320 Prec@5 84.862
@ericsun99
Owner

Hi, this is my test log; please use it for reference.
Test: [0/98] Time 99.469 (99.469) Loss 0.6380 (0.6380) Prec@1 82.812 (82.812) Prec@5 95.703 (95.703)
Test: [10/98] Time 0.166 (9.192) Loss 0.8281 (0.8153) Prec@1 79.102 (78.516) Prec@5 93.750 (93.821)
Test: [20/98] Time 0.222 (4.897) Loss 0.7882 (0.8287) Prec@1 77.930 (78.125) Prec@5 94.531 (94.085)
Test: [30/98] Time 5.631 (3.810) Loss 1.0537 (0.8360) Prec@1 72.656 (77.640) Prec@5 92.773 (94.418)
Test: [40/98] Time 0.143 (4.345) Loss 1.5852 (0.8507) Prec@1 62.891 (77.553) Prec@5 86.328 (94.150)
Test: [50/98] Time 0.159 (3.524) Loss 1.2438 (0.9598) Prec@1 68.555 (75.314) Prec@5 89.453 (92.785)
Test: [60/98] Time 0.145 (3.394) Loss 1.7359 (1.0162) Prec@1 54.297 (74.318) Prec@5 81.445 (91.983)
Test: [70/98] Time 0.231 (3.321) Loss 1.1273 (1.0696) Prec@1 73.047 (73.228) Prec@5 91.211 (91.299)
Test: [80/98] Time 0.154 (3.168) Loss 1.3778 (1.1144) Prec@1 68.359 (72.377) Prec@5 88.477 (90.688)
Test: [90/98] Time 0.940 (3.096) Loss 1.2233 (1.1465) Prec@1 69.336 (71.585) Prec@5 91.797 (90.284)

  • Prec@1 71.806 Prec@5 90.410

@sunwillz

Hello, thanks for sharing. I could not reproduce your results either. Could you please describe your training environment, e.g. how many GPUs you used and how long training took?

@austingg

The MobileNetV2 top-1 accuracy has been updated to 72.0% in the latest version of the paper.

@blueardour

I achieved 66% top-1 and 88% top-5 with SGD.
The initial lr was 0.045, with a plateau-detection learning-rate drop strategy. However, I could not get the 72% result, so I am eager to know how to reach the paper's accuracy. Any experience is welcome.
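
For reference, a plateau-detect-and-drop policy like the one blueardour describes can be written with PyTorch's ReduceLROnPlateau. This is only a sketch under assumed settings (the factor, patience, momentum, and weight-decay values are illustrative, not blueardour's exact ones), and validate() is a hypothetical helper:

```python
import torch
import torchvision

# Sketch of a "plateau detection and drop" lr policy (illustrative values only).
model = torchvision.models.mobilenet_v2()
optimizer = torch.optim.SGD(model.parameters(), lr=0.045,
                            momentum=0.9, weight_decay=4e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=5, min_lr=1e-8)

for epoch in range(150):
    # ... run one training epoch here ...
    val_top1 = validate(model)   # hypothetical evaluation helper
    scheduler.step(val_top1)     # drop lr 10x when top-1 stalls for 5 epochs
```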

@Coderx7

Coderx7 commented Sep 16, 2018

@blueardour, @ericsun99: same here, I can't get past 66% top-1 with SGD.
Are you using the very same script and hyperparameters?
(By the way, how many epochs does it take to reach 72% top-1?)
Any help is greatly appreciated.

@blueardour

blueardour commented Sep 17, 2018

@Coderx7 Hi, I made another attempt and this time got 68% top-1 and 88% top-5. The only hyperparameter that differed from my last run was the mini-batch size, now 256. It is still lower than the 72% top-1 accuracy.

During training, the accuracy became stable after 3 days (reaching 68% top-1), and I kept training; even after 10 days it did not improve. I used a plateau-based learning-rate decrease: the initial lr was 0.01 and it finally decayed to 1e-8. I used RandomResizedCrop and RandomHorizontalFlip for data augmentation.

What's interesting is that every image classification network I have trained, for example resnet18, resnet50, and xception, hit the same problem: the accuracy I got was always about 2 percent lower than the paper's. Sigh~~~

I guess the cause is probably not the optimizer or the weight initialization but the data augmentation. Another important factor is the mini-batch size.

Any tricks for recovering the last few percent of accuracy are welcome.
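
The augmentation mentioned above (RandomResizedCrop plus RandomHorizontalFlip) is the standard torchvision ImageNet training pipeline. A minimal sketch for reference; the normalization constants are the usual ImageNet values, an assumption here rather than something stated in the comment:

```python
import torchvision.transforms as transforms

# Standard ImageNet-style training augmentation (mean/std are the common
# ImageNet constants, assumed here rather than quoted from the thread).
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])

# Typical validation counterpart: resize the short side to 256, center-crop 224.
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])
```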

@blueardour

A fixed-step learning rate policy helps: multiplying the lr by 0.98 after each epoch improved the accuracy by 1~2 points.
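
In PyTorch this fixed-step policy is just an exponential decay of the learning rate. A sketch assuming the paper's initial lr of 0.045 and commonly used SGD settings (the momentum and weight decay here are assumptions, not necessarily blueardour's values):

```python
import torch
import torchvision

# Multiply the learning rate by 0.98 once per epoch.
model = torchvision.models.mobilenet_v2()
optimizer = torch.optim.SGD(model.parameters(), lr=0.045,
                            momentum=0.9, weight_decay=4e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)

for epoch in range(400):       # a slow 0.98 decay needs many epochs, per the thread
    # ... run one training epoch here ...
    scheduler.step()           # lr becomes 0.045 * 0.98 ** (epoch + 1)
```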

@zeyu-liu

A fixed-step learning rate policy helps: multiplying the lr by 0.98 after each epoch improved the accuracy by 1~2 points.

Did you reproduce the model with 72% accuracy? If yes, could you share the hyper-parameter settings?

@blueardour

No, I didn't reach 72%, but I got fairly close. Since the training was several months ago, I don't remember the exact precision; as far as I recall it was about 71%, less than 1% below the authors'.

The key point is to decay the lr slowly, by 0.98 each epoch, and to wait a long time. I reached 68% quickly, within the first three days, but getting to 71% took another three days.

@CF2220160244

Hello @blueardour, your training skill is excellent. I trained on one 1080Ti with batch size 96, lr 0.045, weight decay 0.00004, decaying the lr by 0.98 each epoch; after 2 days I only got 67%. Could you tell me your GPU, batch size, and initial lr?
I am a student at Beijing Institute of Technology. Thank you very much!

@blueardour

I tried both 128 and 256 for the batch size, and 5e-4 and 5e-5 for the weight decay. Neither seemed to benefit accuracy.
The initial lr was the same, 0.045, decayed by 0.98 every epoch.

I trained on a P100. As I mentioned, reaching 68% accuracy was easy; the final 2~3% cost another several days. To squeeze out more precision, I advise spending more time on training if you think it is worthwhile.

@CF2220160244

Thank you very much! @blueardour

@Coderx7

Coderx7 commented Dec 22, 2018

@CF2220160244: hey, would you do us a favor and keep us updated on how your attempt turned out?

@itsliupeng

mobilenet_v2 1.0 top1: 0.716
mobilenet_v2 1.4 top1: 0.749

80 1080Ti GPUs, label smoothing, inception_preprocessing, 120 epochs, cosine lr
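
For readers wanting to reproduce the label-smoothing part in PyTorch, a minimal sketch follows. The 0.1 smoothing factor is the value popularized by the Bag-of-Tricks paper cited later in this thread; it is an assumption here, not something itsliupeng states:

```python
import torch
import torch.nn.functional as F

# Built-in option (PyTorch >= 1.10): cross-entropy with label smoothing.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

# Equivalent manual form for older PyTorch versions: mix the one-hot target
# with a uniform distribution over the classes.
def smoothed_cross_entropy(logits, target, smoothing=0.1):
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=target.unsqueeze(1)).squeeze(1)
    uniform = -log_probs.mean(dim=-1)
    return ((1.0 - smoothing) * nll + smoothing * uniform).mean()
```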

@Coderx7

Coderx7 commented Jan 4, 2019

@itsliupeng what was your learning rate and batch size?
80(!) 1080Tis?
Could you kindly also share your training script and logs?
What was your PyTorch version, by the way?

@itsliupeng

@itsliupeng what was your learning rate and batch size?
80(!) 1080Tis?
Could you kindly also share your training script and logs?
What was your PyTorch version, by the way?

Sorry, I don't use PyTorch; I use Horovod + TensorFlow, with 8 machines and 8 1080Ti GPUs per machine.
The batch size is 64 per GPU, so the total batch size is 4k.

MobileNet_v1 can also reach top-1 0.7323 this way, but I cannot reproduce the top-1 reported in the ShuffleNet V2 paper.

@Coderx7

Coderx7 commented Jan 7, 2019

@itsliupeng Thanks a lot for the further clarification, it helps a lot. By the way, could you share your TensorFlow training script? It would be greatly appreciated.

@itsliupeng

@Coderx7
Sorry, the code is based on our internal framework, a wrapper around Horovod and TensorFlow, but it uses no special tricks. The cosine lr and label smoothing are simply borrowed from the MXNet/GluonCV recipe: https://gluon-cv.mxnet.io/model_zoo/classification.html, https://arxiv.org/abs/1812.01187.

@Coderx7

Coderx7 commented Jan 7, 2019

@itsliupeng : Thanks a lot :) I really appreciate your kind and helpful response.

@mathmanu

mathmanu commented Jan 16, 2019

@itsliupeng Thank you for sharing this information. Being able to train MobileNets in 120 epochs is a wonderful thing.
I have a couple of questions.

  1. What was the initial learning rate?
  2. How was the weight update done after backpropagation? Was it a single weight update for the entire 4K-image batch, as is usually done in PyTorch, or did your framework use some other kind of asynchronous update specific to TensorFlow?

@itsliupeng

@mathmanu

  1. The learning rate warms up linearly from 0 to 1.6 over the first 5 epochs, then follows a cosine schedule (see the sketch below).
    [image: learning rate schedule]
  2. Nothing special; just like PyTorch DataParallel, it is a synchronous update on every batch. I use SGD with momentum 0.9.
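
The warmup-then-cosine schedule from point 1 can be approximated per epoch with a LambdaLR in PyTorch. A sketch under assumptions: the peak lr of 1.6 and 120 epochs come from this thread, SGD with momentum 0.9 is as stated above, while the weight decay and exact warmup shape are guesses:

```python
import math
import torch
import torchvision

peak_lr, warmup_epochs, total_epochs = 1.6, 5, 120

model = torchvision.models.mobilenet_v2()
optimizer = torch.optim.SGD(model.parameters(), lr=peak_lr,
                            momentum=0.9, weight_decay=4e-5)

def lr_factor(epoch):
    """Multiplier on peak_lr: linear warmup for 5 epochs, then cosine decay."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(total_epochs):
    # ... run one training epoch with the large-batch loader here ...
    scheduler.step()
```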

@mathmanu

Thanks. I have read about the use of high learning rates (0.5 or 0.6) in the ShuffleNetV2 and Squeeze-and-Excitation papers, but this is even higher. It motivates me to try it. Maybe warmup is the key to using such high rates.
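
One plausible reading of the 1.6 peak, offered as an observation rather than something stated in this thread: it matches the linear scaling rule of Goyal et al. ("Accurate, Large Minibatch SGD"), where a reference lr of 0.1 at batch size 256 is scaled in proportion to the global batch size, and warmup is exactly what makes such a large rate usable early in training:

```python
# Linear scaling rule: lr grows proportionally with the global batch size.
base_lr, base_batch = 0.1, 256   # common ImageNet reference point (assumption)
global_batch = 8 * 8 * 64        # 8 machines x 8 GPUs x 64 per GPU = 4096
scaled_lr = base_lr * global_batch / base_batch
print(scaled_lr)                 # 1.6, matching the warmup target mentioned above
```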

@mathmanu

@itsliupeng I have one doubt. In the MobileNetV2 paper the learning rate is kept at 0.045 even though there are 16 (asynchronous) GPU workers, each with a batch size of 96. My question is: why isn't the learning rate scaled up the way you have done?

"MobileNetV2: Inverted Residuals and Linear Bottlenecks", https://arxiv.org/pdf/1801.04381.pdf
6.1. ImageNet Classification
Training setup: We train our models using TensorFlow[31]. We use the standard RMSPropOptimizer with both decay and momentum set to 0.9. We use batch normalization after every layer, and the standard weight decay is set to 0.00004. Following MobileNetV1[27] setup we use initial learning rate of 0.045, and learning rate decay rate of 0.98 per epoch. We use 16 GPU asynchronous workers, and a batch size of 96.
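
Translated into PyTorch terms, the quoted setup corresponds roughly to the sketch below. This is an approximation: the paper uses TensorFlow's RMSPropOptimizer, and mapping its decay parameter to PyTorch's alpha (and keeping the default epsilon) is an assumption on my part:

```python
import torch
import torchvision

# Rough PyTorch analogue of the quoted MobileNetV2 training setup.
model = torchvision.models.mobilenet_v2()
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.045,
                                alpha=0.9,        # ~ TF RMSProp "decay" (assumed mapping)
                                momentum=0.9,
                                weight_decay=4e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)  # 0.98 per epoch
```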

@mathmanu

Never mind, I read about asynchronous updates here:
https://blog.skymind.ai/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks/
The learning rate they used is probably applied to the gradients within each worker.
