Call for Contribution #17
I'd be happy to add PRs, but I need to know what format would suit this repo.
Generally, applications are one-offs: they build a specific model, load weights, and then provide the model to the end user.
The models I have are builder types, meaning there are multiple parameters which must be supplied.
On top of that, the models are not trained on ImageNet, only CIFAR-10 or CIFAR-100. Also, these weights are for the mid-level models, since I don't have a big enough GPU to train the largest models in the papers.
Any suggestions on these issues?
Also, while we are discussing this, what about the callback builder for Snapshot Ensemble? It is achieved via a callback list and not a single callback.
Should I keep the builder class to simplify usage, or just provide the callbacks and the schedule separately, so that the user must take the trouble to set up the model themselves? Personally, I prefer the builder approach.
Somshubra Majumdar
On Feb 11, 2017 22:33, "Junwei Pan" wrote:
Contributions w.r.t. the following layers and examples are welcomed:
- Layers
- Layer Normalization <https://arxiv.org/abs/1607.06450>
- PELU <https://arxiv.org/abs/1605.09332>
- Examples
- DenseNets <https://arxiv.org/abs/1608.06993v2>
- Residual of Residual Networks <https://arxiv.org/pdf/1608.02908v1.pdf>
- Wide ResNets <https://arxiv.org/abs/1605.07146v1>
- ResNet in ResNet <https://arxiv.org/abs/1603.08029>
Seems that @titu1994 <https://github.com/titu1994> has already taken the first 3 examples, as mentioned in this issue <#10>.
|
I think you can provide builder types, defining several functions with different parameters, such as def wrn_28_10(): return wrn.create_wide_residual_network(ip, nb_classes=10, N=4, k=10, dropout=0.0, verbose=1), and so on (a sketch follows below).
I think that's fine.
Let's discuss this on the original issue |
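A minimal sketch of the builder wrappers suggested above (illustrative only; it assumes a wide_residual_network module exposing create_wide_residual_network with the signature shown in the snippet, and the wrapper names are hypothetical):

    # Fixed-configuration wrappers around a generic builder function.
    import wide_residual_network as wrn  # assumed module name

    def wrn_28_10(ip, nb_classes=10):
        # WRN-28-10: depth parameter N=4, width k=10, no dropout
        return wrn.create_wide_residual_network(ip, nb_classes=nb_classes,
                                                N=4, k=10, dropout=0.0, verbose=1)

    def wrn_16_8(ip, nb_classes=10):
        # WRN-16-8: depth parameter N=2, width k=8, no dropout
        return wrn.create_wide_residual_network(ip, nb_classes=nb_classes,
                                                N=2, k=8, dropout=0.0, verbose=1)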
@kemaswill Thanks for the clarification. Should I add an example script of how to train the DenseNet model as well? As in a script which shows how the DenseNet-40-12 model was trained on the CIFAR-10 dataset? I'm just about ready to make a PR for DenseNet. |
Yes, go ahead. |
@kemaswill I was wondering, isn't BatchNormalization with mode=1 the same as Layer Normalization? Or am I missing something here? |
@kemaswill I am attempting to implement the Batch Renormalization paper, but I have a few questions:
So, should I use K.mean and K.sqrt(K.var(...)) instead? I ask because I am quite sure that will be a huge performance bottleneck. On the flip side, if I use the Theano backend, normalize_batch_in_training() simply calls the Theano batch_normalization_train function, which provides high performance, but then I can't perform renormalization (since it returns an already batch-normalized x). Is it worth the performance penalty on the Theano backend to manually compute the mean and std? On TensorFlow, I can get the mean and std using tf.nn.moments(), but I can't see any similar function for Theano.
EDIT: For reference, this is the current implementation: https://github.com/titu1994/keras-contrib/blob/batch_renorm/keras_contrib/layers/normalization.py
Also, I am wondering about the updates to the running mean and standard deviation. They are inside the if mode == 0 block. Should I move them out of there and modify the momentum delta? The paper suggests a fairly high momentum. |
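For illustration, a minimal sketch of what the manual computation could look like (a hypothetical helper, not the code in the linked branch; it only assumes the standard keras.backend functions K.mean, K.var, K.sqrt, K.clip and K.stop_gradient):

    from keras import backend as K

    def renorm_correction(x, running_mean, running_std, reduction_axes,
                          r_max, d_max, epsilon=1e-5):
        # Batch statistics computed with the generic backend ops.
        batch_mean = K.mean(x, axis=reduction_axes)
        batch_std = K.sqrt(K.var(x, axis=reduction_axes) + epsilon)
        # r and d are clipped and treated as constants in the backward pass,
        # as prescribed by the Batch Renormalization paper.
        r = K.stop_gradient(K.clip(batch_std / running_std, 1. / r_max, r_max))
        d = K.stop_gradient(K.clip((batch_mean - running_mean) / running_std,
                                   -d_max, d_max))
        x_hat = (x - batch_mean) / batch_std
        return x_hat * r + d, batch_mean, batch_std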
In any case, K.mean and K.std call the corresponding Theano and TensorFlow functions, so you should always use them if needed. |
@the-moliver Wouldn't this be less efficient than simply computing the mean and std manually (getting K.mean() and K.var() and then computing the std)? We would be computing the matrix multiplications with gamma and beta inside normalize_batch_in_training. Also, changing a core parameter of a layer using a callback is not a solution, especially when you consider the fact that more than one BatchRenormalization layer may be added. It is not feasible to keep track of multiple layers and manually update a parameter on each. |
It may actually be more efficient, since K.normalize_batch_in_training() uses specialized Theano and TensorFlow calls, and returns the mean and variance, which can be used to compute d and r. You'll probably have to use something like the trick I mentioned if you want to use the specialized calls. But it may also be better to just implement it directly with the K.mean and K.std/K.var functions. It's hard to know without testing. Yeah, I agree it's a little cumbersome, but given that you'd probably want all BatchRenormalization layers to change their parameters together, it should be possible to have a callback change all of them. @bobchennan did something like this with dropout rates: keras-team/keras#3424 |
@the-moliver For now I guess I will keep the standard calls (K.mean and K.var), but your approach is also valid. If I remember correctly, Theano has direct CUDA calls for batchnorm, so the batchnorm itself would be faster, but then performing more operations on that output may end up costing us. Can't tell without performance tests. As for changing r_max and d_max, that seems like a very hackish way of doing things. Perhaps it is better to just stick with the final r_max and d_max mentioned in the paper (r_max = 3 and d_max = 5). According to the paper, leaving them at high values from the start leads to problems. Specifically, from the paper:
|
Alternatively, it may be possible to accomplish essentially the same thing by changing one line in the BatchNormalization layer: |
    from keras.callbacks import Callback
    from keras import backend as K

    class AnnealedDropoutCallback(Callback):
        def __init__(self, N):
            # N: number of epochs over which the dropout ratio is annealed to 0
            self.N = N

        def on_epoch_end(self, epoch, logs={}):
            v = max(0, 1 - 1.0 * epoch / self.N)
            # print("epoch", epoch, "update", v)
            for layer in self.model.layers:
                # MyDropout is the custom Dropout subclass from the linked issue,
                # which stores its rate in a backend variable `ratio`
                if type(layer) == MyDropout:
                    K.set_value(layer.ratio, v)

Should be similar. |
@the-moliver That portion of the code is checking whether Keras is training or testing. In mode 0, BN uses the batch statistics at training time but not at testing time. I don't quite understand what your code would do. |
@bobchennan I was trying to avoid having a separate callback just for a special class. @fchollet Is there any plan in Keras 2 to support such manipulations? In that case, I'll hold back on the implementation of this algorithm until Keras 2 is formalized. |
@titu1994 In the original code |
@the-moliver I get that, but what would such a weighted combination accomplish? It would not be equivalent to the paper. |
@titu1994 Exactly, I was just proposing it as an alternative and simpler implementation. |
@the-moliver Oh, OK. More important than that is trying to set the values of r_max and d_max. If it is going to be a part of the Keras 2 spec, we can use the callback approach for the time being, and when Keras 2 is finalized we can drop support for the callback and switch to the native implementation. Is that alright? If not, we can continue using the callback approach. In the meantime, I think setting r_max = 3 and d_max = 5 as the initial values will avoid problems if the user forgets to set the values via the callback. It won't be exactly according to the paper, but at least it won't cause exploding gradients. Is that alright? |
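For the callback route, a sketch of what a scheduler might look like (illustrative only; it assumes the layer stores r_max and d_max as backend variables, as in the linked branch, and the linear ramp is just an example, not the schedule from the paper):

    from keras.callbacks import Callback
    from keras import backend as K

    class RenormMaxScheduler(Callback):
        def __init__(self, saturate_epoch=5):
            # Ramp r_max from 1 to 3 and d_max from 0 to 5 over the first
            # `saturate_epoch` epochs, then hold them at the paper's final values.
            super(RenormMaxScheduler, self).__init__()
            self.saturate_epoch = saturate_epoch

        def on_epoch_begin(self, epoch, logs=None):
            t = min(1.0, epoch / float(self.saturate_epoch))
            r_max, d_max = 1.0 + 2.0 * t, 5.0 * t
            for layer in self.model.layers:
                # Update every renormalization layer in the model at once.
                if layer.__class__.__name__ == 'BatchRenormalization':
                    K.set_value(layer.r_max, r_max)
                    K.set_value(layer.d_max, d_max)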
@titu1994 BatchNormalization with mode=0, 1 or 2 all use the mean and variance of the summed inputs to a neuron over a mini-batch of training cases to do normalization, whereas Layer Normalization uses the mean and variance of all of the summed inputs to the neurons in a layer on a single training case. |
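To make the distinction concrete, a small NumPy illustration (not library code): for a (batch, features) activation matrix, batch normalization reduces over the batch axis, while layer normalization reduces over the feature axis of each individual sample.

    import numpy as np

    x = np.random.randn(32, 64)  # (batch, features)

    # Batch-norm style statistics: one mean/std per feature, across the batch.
    bn_mean, bn_std = x.mean(axis=0), x.std(axis=0)  # shape (64,)
    x_bn = (x - bn_mean) / (bn_std + 1e-5)

    # Layer-norm style statistics: one mean/std per sample, across the features.
    ln_mean = x.mean(axis=1, keepdims=True)          # shape (32, 1)
    ln_std = x.std(axis=1, keepdims=True)
    x_ln = (x - ln_mean) / (ln_std + 1e-5)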
Oh OK. Thanks for the clarification. |
@kemaswill @the-moliver So I implemented the algorithm, and to me it looks like it follows the paper exactly in all regards, other than the updating of r_max and d_max.
Edit: See below for an update. |
@kemaswill @the-moliver So I managed to fix the training problem (i.e. the fact that r_max and d_max need to be adjusted per iteration). In fact, I had to settle for per-epoch updating of the max values, even though the paper suggests specific iterations at which r_max and d_max should reach their maximum possible value. How I did it: the plot below shows the value as t increases. By the 5th-6th epoch, the max value has been reached, and it avoids the gradient explosion problem now. Ignore the derivative part. New issues:
EDIT: Fixed the problem. Sadly, the naming of the original batch norm variables was incorrect: the running std was actually computed as the running variance, but the variable was named running_std. Since the name is erroneous, I was applying updates equivalent to the running std to a variable which should hold running variance values. Now that it is fixed, I re-ran the mode 2 tests. It works the same as before, even improving slightly (the 1st epoch is 68.7% instead of 67.1%, the 5th epoch is 84.54% instead of 84.34%), and mode 0 works now as well, but it learns slowly compared to mode 2 (training accuracy fluctuates between 71-76% in the first 10 epochs, similar to the BatchNormalization implementation with mode=0).
EDIT: Did some research and found that it was the same as batch norm but with the mean and std calculated for each sample rather than each channel. Fixed this, and now mode 1 works as well.
But from what I am seeing of the speed with which networks learn using batch renorm, it is more than worth the speed loss (in mode 2, for now). Not even DenseNet-28-8 or a modified Inception-ResNet v2 learns this fast on CIFAR-10 or 100 (with mode 2). For comparison, on the Theano backend (speaking of validation accuracy scores), Inception-ResNet v2 (modified to handle 32x32 images) reaches 59% accuracy in the first 2 epochs, whereas DenseNet-28-8 reaches 61% in 2 epochs. This one reaches 79% in the second epoch with WRN-16-4, one of the smallest WRNs I've ever tested. The WRN-16-4 with normal batch normalization gets just 56% accuracy after 2 epochs! In just 5 epochs, WRN-16-4 with Batch Renormalization beat the score that the WRN-16-4 model with normal Batch Normalization reached after 100 epochs (84.33% vs 84.19%)!
The code is currently at: https://github.com/titu1994/keras-contrib/blob/batch_renorm/keras_contrib/layers/normalization.py Below is the training history for batchnorm and batch renorm. (Note: both models overfit because I did not apply either image augmentation or dropout, just batchnorm or batch renorm on the wide resnet.) As you can see, batch renorm (mode 2) did better by a small margin, and achieved a higher score far faster than with batchnorm. Will be uploading the training graph of batch renorm with mode 0 in a few hours, once training is over. |
@farizrahman4u @kemaswill @the-moliver On a final note, should I put BRn (Batch Renormalization) in Keras (the Keras 2 branch) or in keras-contrib? BRn is super useful for fast training, especially in the case of mode 2, and BatchNormalization layers can simply be swapped out for BatchRenormalization layers for an improvement. For now, I think I'll add a PR here, and when Keras 2 is finalized perhaps this layer can be merged there as well. @fchollet In your opinion, should this be in keras-contrib for now and be merged into Keras 2 once it is finalized, or merged into Keras 2 directly? I believe the layer should not have any conflicts with the changes that will be in Keras 2. It can be swapped in as a replacement for the BN layer pretty much everywhere. Edit: Cleaned up a few sentences to remove confusion. |
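As a usage note, the drop-in swap described above would look something like this (the keras_contrib import path is an assumption based on the linked branch layout):

    from keras.models import Sequential
    from keras.layers import Dense
    from keras_contrib.layers import BatchRenormalization  # assumed import path

    model = Sequential()
    model.add(Dense(64, input_shape=(32,)))
    # Used exactly where a BatchNormalization layer would normally go.
    model.add(BatchRenormalization())
    model.add(Dense(10, activation='softmax'))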
Closing as outdated, feel free to open another issue with another call for contributions if you want. |