Batch normalization layer with test and examples #1965
Nice! I guess cifar_baseline is using ReLU instead of Sigmoid? Do you have any training examples using ReLU + BN? Does it converge much faster than without BN, as stated in the paper?
@ducha-aiki, as for the fixed mean and variance for inference, I think we can hack that by using two extra variables to keep track of the (exponential) moving average of the mean and variance, and use those instead of the current batch_mean and batch_variance for normalization in the TEST phase. Besides, the current implementation keeps two "copies" of the blob (buffer_blob and x_norm), which might be a bit memory-consuming for a big deep net. It might be worth considering switching to a for-loop rather than the current BLAS vectorization, as @Russell91 did in his initial commit.
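To make the "two extra variables" idea concrete, here is a minimal single-feature sketch in plain Python. It is not the PR's code: the class name `BatchNorm1D`, the `decay` parameter, and the string phases are assumptions made for illustration only. In TRAIN it normalizes with the batch statistics and updates the moving averages; in TEST it normalizes with the stored averages.

```python
import math


class BatchNorm1D:
    """Toy single-feature batch norm with moving-average statistics (hypothetical)."""

    def __init__(self, decay=0.1, eps=1e-5):
        self.decay = decay        # weight given to the newest batch statistic (assumed name)
        self.eps = eps
        self.gamma = 1.0          # learned scale, kept fixed in this sketch
        self.beta = 0.0           # learned shift, kept fixed in this sketch
        self.moving_mean = 0.0    # the two "extra variables"
        self.moving_var = 1.0

    def forward(self, batch, phase):
        if phase == "TRAIN":
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Exponential moving average of the batch statistics.
            self.moving_mean += self.decay * (mean - self.moving_mean)
            self.moving_var += self.decay * (var - self.moving_var)
        else:
            # TEST: use the stored statistics, so the output no longer
            # depends on which samples share the batch.
            mean, var = self.moving_mean, self.moving_var
        inv_std = 1.0 / math.sqrt(var + self.eps)
        return [self.gamma * (x - mean) * inv_std + self.beta for x in batch]


bn = BatchNorm1D(decay=0.5)
for _ in range(50):
    bn.forward([1.0, 3.0], "TRAIN")
single = bn.forward([3.0], "TEST")  # a batch of one, normalized with stored statistics
```

With stored statistics, a batch of one in TEST still produces a meaningful output, because nothing is estimated from the test batch itself.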
I used a different "stage" at the beginning of the TEST "phase" to compute the mean and variance from a few training mini-batches. A moving average might work during training, as mentioned in Section 3.1 of the paper.
@weiliu89,
Yes. I referred to how Caffe handles phase, and added several stages so that bn_layer does different things in different stages. This is the easiest way I can think of to implement Algorithm 2. It seems to be working well, but I haven't debugged it yet.
@ducha-aiki does the master branch support the GoogLeNet network now?
@weiliu89, we will add some graphs with ReLU-CIFAR later. For now, the BN model converges very fast, but is a bit less accurate than the non-BN one. If we make the model a bit deeper than cifar-baseline, then BN converges faster and to a more accurate result. And thank you for the stage-phase suggestion. @sunbaigui, sure, if you sponsor us with some GPUs. When I trained my modification of GoogLeNet, it took 3 weeks. Even if it were 7 times faster, that is too much GPU time for us to spend :) @ChenglongChen we will think about a loop-based implementation. However, you are welcome to make a PR into this branch :)
cifar_baseline is example/cifar/train_full.sh. I also trained a variation of vgg16 on CIFAR with and without batch normalization. For the no_bn net, base_lr=0.001 causes the net to diverge. For the bn net, the lr is a first guess, so maybe with a bigger lr it will converge faster and better. @ChenglongChen @weiliu89 @ducha-aiki About the test phase: I tested cifar_vgg16 with a range of batch sizes (2-250) in the test phase and found very small changes in accuracy (with batch size 2, accuracy is only 1% less than with 250).
@shelhamer @longjon @jeffdonahue Could you please review this PR?
I think the current PR doesn't compute the mean and variance from training images (or a moving mean and variance) during the testing phase; instead it computes the mean and variance from the test mini-batch, which I think is not exactly what the paper describes. I am not sure how much it affects test accuracy.
@jjkjkj For the CIFAR experiment, did you try comparing BN added before versus after every ReLU? Does that matter?
@yangyi02 Yes, I tried and found no difference (with examples/cifar10/cifar10_full_train_test.prototxt). As I said, I think this net is a bad example for batch normalization.
To feed the evaluation network a mean/var, would setting mean = beta, var = (1/gamma)^2 be OK? (Are the learned beta and gamma similar to the true mean/var?)
@ducha-aiki when batchsize=1 in the test phase, can it also work?
@justfortest1 no.
@weiliu89 Could you please share your implementation? I think different stages should be considered.
@lsy1993311 My Caffe version is old, I am not familiar with how to upload the code, and I haven't tested it. The high-level idea is to add set_stage(string) and stage() to include/caffe/common.hpp (refer to set_phase() and phase() in the same file). Then, in src/caffe/solver.cpp, I add a function at the beginning of TestAll() which computes the mean & std. In that function, I set the phase to TRAIN and use two stages via set_stage() as described before. The first stage is called "aggregation"; it runs several iterations of the Forward pass to aggregate the mean & std from a few mini-batches. The second stage is called "finalize"; it computes the final mean & std by dividing by the number of mini-batches you have passed. Finally, in batch_norm_layer, I can call Caffe::stage() and implement some additional logic to handle the different stages (e.g. "aggregate" and "finalize"). I won't go into the details of how to do it, as it should be trivial. However, I don't have time to really debug this thoroughly. One thing to note is that what I described above needs to compute the mean & std every time I call TestAll(), which might not be necessary because it costs extra computation during training. Alternatively, you can call the function only in Snapshot() and use a moving mean & std during training as described in the paper (you can set a different stage for doing this during training).
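The two stages described above can be sketched roughly as follows. This is an illustration in plain Python, not the Caffe code being discussed: `StatAggregator`, `aggregate`, and `finalize` are hypothetical names, all mini-batches are assumed to have the same size, and the optional `m / (m - 1)` factor is the unbiased variance correction from Algorithm 2 of the paper.

```python
class StatAggregator:
    """Accumulates per-batch mean/variance, then averages them (hypothetical helper)."""

    def __init__(self):
        self.mean_sum = 0.0
        self.var_sum = 0.0
        self.num_batches = 0
        self.batch_size = None

    def aggregate(self, batch):
        # "aggregation" stage: one Forward pass contributes one mean/variance pair.
        m = len(batch)
        mean = sum(batch) / m
        var = sum((x - mean) ** 2 for x in batch) / m
        self.mean_sum += mean
        self.var_sum += var
        self.num_batches += 1
        self.batch_size = m  # assumes every batch has the same size

    def finalize(self, unbiased=True):
        # "finalize" stage: divide by the number of mini-batches seen.
        if self.num_batches == 0:
            raise ValueError("no mini-batches were aggregated")
        mean = self.mean_sum / self.num_batches
        var = self.var_sum / self.num_batches
        m = self.batch_size
        if unbiased and m > 1:
            var *= m / (m - 1)  # correction used in the paper's Algorithm 2
        return mean, var


agg = StatAggregator()
for batch in ([1.0, 3.0], [2.0, 4.0]):
    agg.aggregate(batch)
mean, var = agg.finalize()
```

The returned pair would then be frozen and used for normalization in the TEST phase instead of the statistics of each test mini-batch.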
I have been testing that version on my data set using the VGG16 model. It works and speeds up convergence.
I have just found @ChenglongChen's implementation, which implements BN correctly in the TEST phase. It saves the mean and variance in the BN layer. It looks like a better implementation because it does not need to change the Solver code. But it does not calculate the mean and variance over all the training data; it only updates the statistics using, e.g., S_{t+1} = decay * Y_{t+1} + (1 - decay) * S_{t}, where decay is a parameter. What do you think about such an implementation? @ChenglongChen, does it work better than mini-batch statistics? More information here:
Sorry guys. I have been caught up with work at the moment, so I don't have time to test it out thoroughly. The use of an exponentially weighted moving average (EWMA) is simply due to the fact that BN tends to keep the distribution of activations stable (?). Algorithm 2 in the paper is a bit complicated:
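For reference, the EWMA update quoted above, S_{t+1} = decay * Y_{t+1} + (1 - decay) * S_{t}, can be written out in a few lines of Python. The helper names below are made up for this illustration. The second function shows why the result differs from an average over all training data: each older observation is weighted by a further factor of (1 - decay).

```python
def ewma_update(state, observation, decay):
    """One step of S_{t+1} = decay * Y_{t+1} + (1 - decay) * S_t."""
    return decay * observation + (1.0 - decay) * state


def ewma_weights(num_steps, decay):
    """Effective weight of each of the last num_steps observations, newest first."""
    return [decay * (1.0 - decay) ** k for k in range(num_steps)]


# Feed three identical batch means into a state that starts at zero.
state = 0.0
for batch_mean in [4.0, 4.0, 4.0]:
    state = ewma_update(state, batch_mean, decay=0.5)

weights = ewma_weights(3, decay=0.5)
```

A plain average over the three observations would give 4.0 immediately, while the EWMA still carries a contribution from its initial state, so the choice of decay and of the initial value both matter early in training.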
@lsy1993311 As expected: slightly faster initial training but strong overfitting (which is natural when parameters >> dataset). So BN does not always remove the need for dropout.
What about removing the BN layers in the testing phase? I mean that no normalization/reconstruction would be used during testing.
@weiliu89, thanks for the catch! @borisgin Hi Boris, thanks for the observation. It is interesting that it is very architecture-dependent: we have tried it on other architectures and there was no difference, as stated in the original paper. Still a lot of room for exploring :)
weiliu89>> I think that current PR doesn't compute mean and variance from training images (or moving mean and variance) during testing phase, but it compute mean and variance from test mini-batch, which I think is not exactly the same as described in the paper. I am not sure how much it affects the test accuracy.
FWIW I agree, that's different from the paper's description. Also, what if the test data contains only one sample (single-image inference)?
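The single-sample concern can be checked with a few lines of plain Python. The function below is a generic batch-statistics normalization written for this illustration (it is not taken from the PR): with a batch of one, the sample equals the batch mean, so the normalized value is zero and the output collapses to beta regardless of the input.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a list of scalars using the statistics of that same batch."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


# Two very different inputs give the same output when each is its own batch.
out_a = batch_norm([7.3])
out_b = batch_norm([-100.0])
```

This is why inference on a single image needs statistics that come from somewhere other than the test batch, such as stored training statistics.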
Hi @ducha-aiki and others, thanks for your excellent work! I have a question about your code. Could you please explain the purpose of the code from line 153 to line 185 in the function "void DataLayer::InternalThreadEntry()" in data_layer.cpp? I just cannot figure out why this code should be there when "datum.encoded()" is true. Thanks a lot in advance!
@AIROBOTAI it is needed for data shuffling. With batch normalization, it is important that the network doesn't see the same images together in a batch, so these lines implement shuffling.
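As a rough picture of what a shuffling pool does, here is a generic sketch in Python. It is an assumption about the general technique, not a transcription of the lines in data_layer.cpp: the function name `shuffle_pool` and its parameters are invented for this example. Items read sequentially fill a fixed-size pool; once the pool is full, each new item replaces a randomly chosen pool entry, which is emitted instead, so neighbours in the database are unlikely to land in the same batch.

```python
import random


def shuffle_pool(stream, pool_size, rng=None):
    """Yield the items of `stream` in a locally shuffled order (hypothetical helper)."""
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    if rng is None:
        rng = random.Random()
    pool = []
    for item in stream:
        if len(pool) < pool_size:
            pool.append(item)       # still filling the pool
            continue
        idx = rng.randrange(pool_size)
        yield pool[idx]             # emit a random pooled item...
        pool[idx] = item            # ...and put the new item in its place
    rng.shuffle(pool)               # drain whatever is left at the end
    yield from pool


shuffled = list(shuffle_pool(range(100), pool_size=10, rng=random.Random(0)))
```

Every item is emitted exactly once, and a larger pool gives a better approximation of a full shuffle at the cost of more memory.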
@ducha-aiki Thanks for your prompt reply! But what should I do if my datum is NOT encoded? I checked the return value of datum.encoded() and found it to be false. So in this case, those lines of shuffling code will be skipped.
@AIROBOTAI then you can regenerate the LMDB with the encoded key, or add the same lines to the unencoded branch of the if :)
One alternative to true shuffling is to do random skips. The DataLayer has … Hope that helps.
@ducha-aiki I have modified some seemingly confusing parts of your code to make it more straightforward to me. Now it works, thanks again!
@waldol1 thanks for your suggestion! Your method seems easier to use than the shuffling pool proposed in this pull request. I'd also like to know whether you have measured the test accuracy using your method and how it performs. @ducha-aiki what are your comments on this new shuffling method? Thanks to you all!
I haven't tested this shuffling method with regard to BatchNorm, but it …
@ducha-aiki @shelhamer - is there still a plan to pull this in?
@talda This is still too young to be reviewed. It is only six months old. :)
@bhack actually, this PR is not needed at all, if we believe Google.
@bhack too bad GitHub does not have a like/upvote button for comments. I would definitely upvote your previous comment.
@ducha-aiki Can I classify a single image using this PR's modifications and batch normalization?
@erogol No.
@ducha-aiki I applied a moving average and it now works.
Added bn_layer.[cpp/cu] with corresponding hpp file. Performs batch-normalization with in-place scale/shift. Originally created by ducha-aiki: https://github.com/ducha-aiki ChenglongChen: https://github.com/ChenglongChen Russell91: https://github.com/Russell91 jjkjkj: https://github.com/jjkjkj detailed discussion of this implementation can be found at: BVLC#1965
Implemented batch normalization layer (see http://arxiv.org/abs/1502.03167) based on @ChenglongChen and @Russell91 code with fixes and improvements.
Also added a shuffling pool by @jjkjkj for the input data in the data_layer, so that the same files do not end up together in the same batch. Tests pass and the branch is rebased on master.
To illustrate the effectiveness, there are two examples of a CIFAR-10 classifier with sigmoid non-linearity, with and without batch normalization.