This repository has been archived by the owner on Nov 3, 2022. It is now read-only.

Call for Contribution #17

Closed
junwei-pan opened this issue Feb 12, 2017 · 24 comments

Comments

@junwei-pan
Contributor

junwei-pan commented Feb 12, 2017

Contributions with respect to the following layers and examples are welcome:

@titu1994
Contributor

titu1994 commented Feb 12, 2017 via email

@junwei-pan
Contributor Author

The models I have are builder types, meaning there are multiple parameters
which must be supplied.

I think you can provide builder-type wrappers, defining several functions with different parameters, such as def wrn_28_10(): return wrn.create_wide_residual_network(ip, nb_classes=10, N=4, k=10, dropout=0.0, verbose=1), and so on (a minimal sketch follows).
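Purely as an illustration of that suggestion, a hedged sketch of such builder wrappers (the import path, input handling, and the second wrapper are assumptions, not the repository's actual layout):

# Hypothetical builder-style wrappers; the module name and input handling are
# illustrative assumptions.
from keras.layers import Input
import wide_residual_network as wrn  # assumed module exposing create_wide_residual_network

def wrn_28_10(nb_classes=10):
    # WRN-28-10 for CIFAR-sized images: N=4 (depth 6N+4 = 28), width k=10
    ip = Input(shape=(32, 32, 3))
    return wrn.create_wide_residual_network(ip, nb_classes=nb_classes, N=4, k=10,
                                            dropout=0.0, verbose=1)

def wrn_16_4(nb_classes=10):
    # WRN-16-4: N=2 (depth 16), width k=4
    ip = Input(shape=(32, 32, 3))
    return wrn.create_wide_residual_network(ip, nb_classes=nb_classes, N=2, k=4,
                                            dropout=0.0, verbose=1)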

On top of that, the models are not trained on ImageNet. Only Cifar 10 or

I think that's fine.

Also, while we are discussing this, what about the callback builder for
Snapshot Ensemble?

Let's discuss this on the original issue

@titu1994
Contributor

titu1994 commented Feb 12, 2017

@kemaswill Thanks for the clarification.

Should I add an example script of how to train the DenseNet model as well? As in a script which shows how the DenseNet-40-12 model was trained on the CIFAR-10 dataset?

I'm just about ready to make a PR for DenseNet.

@junwei-pan
Contributor Author

Yes, go ahead.

@titu1994
Contributor

@kemaswill I was wondering, isn't BatchNormalization with mode=1 the same as Layer Normalization? Or am I missing something here?

@titu1994
Contributor

titu1994 commented Feb 15, 2017

@kemaswill I am attempting to implement the Batch Renormalization paper, but I have a few questions:

  1. How do I get the current iteration number in the layer? The paper mentions gradually altering r_max and d_max after a certain number of iterations, but I don't think Keras layers receive any information about the current iteration / epoch number.

  2. I would need to replicate a large portion of K.normalize_batch_in_training(), because currently it computes and returns (gamma * ((x - mean) / std)) + beta, whereas what this algorithm actually needs is (((x - mean) / std) * r + d) * gamma + beta.

So, should I use K.mean and K.sqrt(K.var(...)) instead? Because I am quite sure that will be a huge performance bottleneck. On the flip side, if I use the theano backend, normalize_batch_in_training() simply calls the theano batch_normalization_train function, which provides high performance, but I can't perform renormalization then (since it returns a batch normalized x).

Is it worth the performance penalty on theano backend to manually compute the mean and std? On tensorflow, I can get the mean and std using tf.nn.moments() but I can't see any similar function for theano.

EDIT: For reference, this is the current implementation https://github.com/titu1994/keras-contrib/blob/batch_renorm/keras_contrib/layers/normalization.py

Also, I am wondering about the updates to the running mean and standard deviation. They are inside the if mode == 0 block. So should I move them out of there and modify the momentum delta? Since the paper suggests a fairly high momentum.
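For context only, a minimal hedged sketch of what computing the renormalization directly from K.mean / K.var could look like (this is not the linked implementation; the argument names, shapes, and clipping of r and d are assumptions):

from keras import backend as K

# Sketch of the batch renormalization training-time transform using generic
# backend ops; running_mean, running_std, gamma, beta, r_max and d_max are
# assumed to already exist as variables of compatible shape.
def batch_renorm_train(x, gamma, beta, running_mean, running_std,
                       r_max, d_max, axis=0, epsilon=1e-5):
    mean = K.mean(x, axis=axis, keepdims=True)
    std = K.sqrt(K.var(x, axis=axis, keepdims=True) + epsilon)

    # r and d pull the batch statistics toward the running statistics;
    # gradients must not flow through them, hence stop_gradient.
    r = K.stop_gradient(K.clip(std / running_std, 1.0 / r_max, r_max))
    d = K.stop_gradient(K.clip((mean - running_mean) / running_std, -d_max, d_max))

    x_hat = (x - mean) / std
    return gamma * (x_hat * r + d) + beta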

@the-moliver
Collaborator

@titu1994

  1. I think you should be able to change those during training using callbacks.
  2. You could probably use the output of K.normalize_batch_in_training() as such:
out = (gamma * ((x - mean) / std)) + beta
new_out = r*out + gamma*d - beta*(r-1)

because
new_out = (((x - mean) / std) * r + d) * gamma + beta

In any case, K.mean and K.std call the corresponding Theano and TensorFlow functions, so you can always use them if needed.
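A quick hedged sanity check of that identity with NumPy (purely illustrative values):

import numpy as np

# Verify that r*out + gamma*d - beta*(r - 1) equals
# (((x - mean) / std) * r + d) * gamma + beta for arbitrary test values.
rng = np.random.RandomState(0)
x = rng.randn(8, 4)
mean, std = x.mean(axis=0), x.std(axis=0)
gamma, beta = rng.randn(4), rng.randn(4)
r, d = rng.uniform(0.5, 2.0, 4), rng.uniform(-1.0, 1.0, 4)

out = gamma * ((x - mean) / std) + beta
lhs = r * out + gamma * d - beta * (r - 1)
rhs = (((x - mean) / std) * r + d) * gamma + beta
print(np.allclose(lhs, rhs))  # True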

@titu1994
Contributor

titu1994 commented Feb 15, 2017

@the-moliver Wouldn't this be less efficient than simply computing the mean and std manually (getting K.mean() and K.var() and then computing the std)? Because we would also be paying for the gamma and beta multiplications inside normalize_batch_in_training?

Also, changing a core parameter of a layer using a callback is not a solution, especially when you consider the fact that more than one BatchRenormalization layer may be added. It is not feasible to keep track of multiple layers and manually update a parameter on each of them.

@the-moliver
Collaborator

the-moliver commented Feb 15, 2017

It may actually be more efficient, since K.normalize_batch_in_training() uses specialized theano and tensorflow calls and returns the mean and variance, which can be used to compute d and r. You'll probably have to use something like the trick I mentioned if you want to use the specialized calls. But it may also be better to just implement it directly with the K.mean and K.std/K.var functions. It's hard to know without testing.

Yeah, I agree it's a little cumbersome, but given that you'd probably want all BatchRenormalization layers to change their parameter together, it should be possible to have a callback change all of them. @bobchennan did something like this with dropout rates: keras-team/keras#3424
Not sure if there is a better solution...

@titu1994
Contributor

titu1994 commented Feb 15, 2017

@the-moliver For now I guess I will keep the standard calls (K.mean and K.var), but your approach is also valid.

If I remember correctly, theano has direct cuda calls for batchnorm, so the batchnorm itself would be faster, but then performing more operations on that output may end up costing us. Can't tell without performance tests.

As for changing the r_max and d_max, that seems like a very hackish way of doing things. Perhaps it is better to just stick with the final r_max and d_max mentioned in the paper (r_max = 3 and d_max = 5).

According to the paper, leaving it at a high value from the start leads to problems. Specifically, from the paper:

However, at the beginning of training, when the learning rate was larger, 
it proved important to increase rmax slowly: otherwise, large gradients 
were observed to suddenly and severely increase the loss.

@the-moliver
Collaborator

Alternatively, it may be possible to accomplish essentially the same thing by changing one line in the BatchNormalization layer:

x_normed = K.in_train_phase(x_normed, x_normed_running)

to

x_normed = K.in_train_phase((1 - param) * x_normed + param * x_normed_running, x_normed_running)

with param controlling the tradeoff between normalizing with batch statistics and normalizing with running statistics.

@bobchennan

from keras import backend as K
from keras.callbacks import Callback

class AnnealedDropoutCallback(Callback):
    """Linearly anneal the dropout ratio of every MyDropout layer to 0 over N epochs."""
    def __init__(self, N):
        super(AnnealedDropoutCallback, self).__init__()
        self.N = N

    def on_epoch_end(self, epoch, logs=None):
        # The ratio decays from 1 toward 0 as training progresses.
        v = max(0.0, 1.0 - float(epoch) / self.N)
        for layer in self.model.layers:
            # MyDropout is a custom dropout layer whose ratio is a backend variable.
            if isinstance(layer, MyDropout):
                K.set_value(layer.ratio, v)

Should be similar.
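For the batch renormalization case discussed above, a hedged adaptation could look like the sketch below. It assumes, hypothetically, that the BatchRenormalization layer stores r_max and d_max as backend variables, and it simply ramps them toward the paper's final values of 3 and 5 over the first few epochs:

from keras import backend as K
from keras.callbacks import Callback

# Hypothetical schedule callback; the r_max / d_max attributes are assumptions
# about the layer, not a documented API.
class RenormScheduleCallback(Callback):
    def __init__(self, ramp_epochs=5):
        super(RenormScheduleCallback, self).__init__()
        self.ramp_epochs = ramp_epochs

    def on_epoch_end(self, epoch, logs=None):
        progress = min(1.0, float(epoch + 1) / self.ramp_epochs)
        for layer in self.model.layers:
            if layer.__class__.__name__ == 'BatchRenormalization':
                K.set_value(layer.r_max, 1.0 + 2.0 * progress)  # ramps 1 -> 3
                K.set_value(layer.d_max, 5.0 * progress)        # ramps 0 -> 5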

@titu1994
Contributor

@the-moliver That portion of the code is checking whether Keras is in training or testing mode. In mode 0, batch statistics are used at training time but not at testing time.

I don't quite understand what your code would do.

@titu1994
Contributor

@bobchennan I was trying to avoid having a separate callback for just a special class.

@fchollet Is there any plan in Keras 2 to support such manipulations? I'll hold back on the implementation of this algorithm in that case, until Keras 2 is formalized.

@the-moliver
Collaborator

@titu1994 In the original code, x_normed and x_normed_running are always both computed, and then one is chosen as the output by K.in_train_phase depending on whether it's train or test time. My code outputs a weighted combination of x_normed and x_normed_running during training, and x_normed_running at test time.

@titu1994
Contributor

titu1994 commented Feb 16, 2017

@the-moliver I get that, but what would such weighted combination accomplish? It would not be equivalent to the paper.

@the-moliver
Collaborator

the-moliver commented Feb 16, 2017

@titu1994 Exactly, I was just proposing it as an alternative and simpler implementation.
Edit: Sorry, misread. It's not quite mathematically identical, but works out to be pretty close and probably has similar behavior.
Edit: I guess I didn't misread, you added a "not" ;-)

@titu1994
Contributor

titu1994 commented Feb 16, 2017

@the-moliver Oh ok. More important than that is trying to set the value of r_max and d_max.

If it is going to be a part of the Keras 2 spec, we can for the time being use the callback approach and then when Keras 2 is finalized we can drop support for the callback and switch to the native implementation. Is that alright?

If not, we can continue using the callback approach.

In the meantime, I think setting r_max = 3 and d_max = 5 as the initial values will avoid a problem if the user forgets to set the values via the callback. It won't be exactly according to the paper, but at least it won't cause exploding gradients to occur. Is that alright?

@junwei-pan
Contributor Author

I was wondering, isn't BatchNormalization with mode=1 the same as Layer Normalization? Or am I missing something here?

@titu1994 BatchNormalization with mode=0, 1, or 2 all use the mean and variance of the summed inputs to a neuron over a mini-batch of training cases to do the normalization, whereas layer normalization uses the mean and variance of all of the summed inputs to the neurons in a layer on a single training case.
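A tiny NumPy illustration of that distinction (the shapes are arbitrary):

import numpy as np

# For activations of shape (batch, features), batch normalization computes
# statistics per neuron across the mini-batch, while layer normalization
# computes them per sample across the neurons of the layer.
x = np.random.randn(32, 64)                       # (batch_size, num_neurons)

bn_mean, bn_var = x.mean(axis=0), x.var(axis=0)   # shape (64,): one per neuron
ln_mean, ln_var = x.mean(axis=1), x.var(axis=1)   # shape (32,): one per sample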

@titu1994
Contributor

Oh OK. Thanks for the clarification.

@titu1994
Contributor

titu1994 commented Feb 18, 2017

@kemaswill @the-moliver So I implemented the algorithm, and to me it looks like it follows the paper exactly in all regards, other than updating the r_max and d_max.

In a few words, it is performing exceedingly poorly. I am testing it on a WRN-16-4 model with the theano backend (it works for tensorflow as well) on the CIFAR-10 dataset. The same model with BatchNormalization performs decently (I am only training for 100 iterations with Adam, not decreasing the learning rate or doing a full run of 300 epochs), but the Batch Renormalization model is failing miserably.
I think a few reasons may be :

1) The paper mentions explicit updates to the running averages (mean and standard deviation), and then goes on to say that the rest is updated via gradient optimization. I am doing that by pulling the update call out of the inner if mode == 0 block into the general flow of execution. That should be correct, right?

2) The loss initially increases during the first 3-5 iterations, and validation accuracy is terrible/constant for the first few epochs. Is this solely due to the initial r_max and d_max values being so high? To me, it looks like the gradients in the first couple of iterations are too strong due to the high learning rate.

What could be the possible errors?

Edit : See below for update.

@titu1994
Contributor

titu1994 commented Feb 19, 2017

@kemaswill @the-moliver So I managed to fix the training problem (i.e. the fact that r_max and d_max need to be adjusted per iteration). In fact, I had to settle for per-epoch updating of the max values, even though the paper suggests specific iterations at which r_max and d_max should reach their maximum possible values.

How I did it: I used the biological population growth formula (basically a sigmoid with extra parameters),
N(t) = K / (1 + (K / N_0 - 1) * e^(-r * t)),
as a function to compute the value at epoch t. It has the useful property that the same equation, with different initial values, yields two curves, both of which are well suited to smoothly ramping the current values of r_max and d_max up to their next higher state.

The plot below shows the value as t increases. By the 5th-6th epoch, the max value has been reached, and the gradient explosion problem is now avoided. (Ignore the derivative part of the plot.)

[Figure: biological growth rate curve used to schedule r_max and d_max]
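A hedged sketch of such a schedule (the growth rate and initial values below are illustrative assumptions; only the final caps of 3 and 5 come from the paper):

import numpy as np

# Logistic ("biological population growth") schedule:
# N(t) = K / (1 + (K / N_0 - 1) * exp(-r * t))
def logistic_schedule(t, cap, initial, rate):
    return cap / (1.0 + (cap / initial - 1.0) * np.exp(-rate * t))

for epoch in range(8):
    r_max = logistic_schedule(epoch, cap=3.0, initial=1.0, rate=1.0)   # ramps 1 -> 3
    d_max = logistic_schedule(epoch, cap=5.0, initial=0.05, rate=1.5)  # ramps ~0 -> 5
    print(epoch, round(r_max, 3), round(d_max, 3))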

New issues:

  1. Mode 2 Batch Renormalization is learning fast. Too fast. I mean to say that with the same WRN model with Batch Renorm and mode 2, I was able to get 67% validation accuracy in the 1st epoch, with training accuracy at just 51%. That is in comparison to ~36% validation accuracy with normal batch normalization and a training accuracy of 42%. This may be because CIFAR-10 is too easy to train a network on (WRN-16-4 has 2.8 million parameters in it, but it can't over/underfit in just 1 epoch, I hope).

  2. With mode 0, training accuracy rises drastically, to near 51%, but the validation score is very poor, at just 21-35% even over the next 5-6 epochs. By this time, training accuracy has already hit 94%, which is weird. This may be due to how inference is formulated in mode 0 vs mode 2. In mode 0, I am returning the non-renormalized result (the simple batch norm result, as suggested by the paper) as the output of K.in_train_phase. When I use mode 2, I am sending the renormalized output.

Perhaps the weights are adjusting to handle just the renormalized values? I am stopping gradient flow from r and d with K.stop_gradient, so it can't be the case that the gradients of r and d are affecting the training process.

EDIT: Fixed the problem. Sadly, the naming of the original batch norm variables was incorrect: the running std was actually computed as the running variance, but the variable was named running_std. Because of the erroneous name, I was applying updates meant for the running std to a variable which should hold running variance values. Now that it is fixed, I re-ran the mode 2 tests. It works the same as before (even improving slightly: the 1st epoch is 68.7% instead of 67.1%, the 5th epoch is 84.54% instead of 84.34%), and mode 0 works now as well, but it learns slowly compared to mode 2 (training accuracy fluctuates between 71-76% in the first 10 epochs, similar to the Batchnorm implementation with mode=0).

  3. I've removed support for sample-wise batch renormalization. The paper doesn't mention it, and I can't understand how it would work. For now, it simply throws a ValueError.

EDIT: Did some research and found that it is the same as batch norm, but with the mean and std calculated for each sample rather than for each channel. Fixed this, and now mode 1 works as well.

  4. Performance: With simple batch normalization, each epoch on my laptop takes 137 seconds on average (980M) with a batch size of 128. With batch renormalization, it takes 142 (mode 2) to 152 (mode 0) seconds on average. This is by far the fastest implementation I was able to make. I tried the other two changes suggested by @the-moliver, but they require far more time per epoch (183 seconds with the first change, 169 seconds with the second). In any case, I did expect it to be a bit slower than Batchnorm, simply because it performs more operations.

But from what I am seeing, the speed with which networks learn using batch renorm is more than worth the speed loss (in mode 2, for now). Not even DenseNet-28-8 or a modified Inception ResNet v2 learns this fast on CIFAR-10 or CIFAR-100 (with mode 2). For comparison, on the theano backend (speaking of validation accuracy scores), Inception ResNet v2 (modified to handle 32x32 images) reaches 59% accuracy in the first 2 epochs, whereas DenseNet-28-8 reaches 61% in 2 epochs. This one reaches 79% in the second epoch with WRN-16-4, one of the smallest WRNs I've ever tested. The WRN-16-4 with normal Batch Normalization gets just 56% accuracy after 2 epochs! In just 5 epochs, WRN-16-4 with Batch Renormalization beat the score that the WRN-16-4 model with normal Batch Normalization reached after 100 epochs (84.33% vs 84.19%)!

Any suggestions for points 2 and 3?

The code is currently at : https://github.com/titu1994/keras-contrib/blob/batch_renorm/keras_contrib/layers/normalization.py

Below is the training history for batchnorm and batch renorm. (Note: both models overfit because I did not apply image augmentation or dropout, just batchnorm or batch renorm on the wide resnet.) As you can see, batch renorm (mode 2) did better by a small margin, and achieved a higher score far faster than batchnorm.
[Figure: training history, batchnorm vs batch renorm]

Will be uploading the training graph of batch renorm with mode 0 in a few hours once training is over.

@titu1994
Contributor

titu1994 commented Feb 19, 2017

@farizrahman4u @kemaswill @the-moliver On a final note, should I put BRn (Batch Renormalization) in Keras (the Keras 2 branch) or in keras_contrib? BRn is super useful for fast training, especially in the case of mode 2, and BatchNormalization layers can simply be swapped out for BatchRenormalization layers for an improvement. For now, I think I'll add a PR, and when Keras 2 is finalized perhaps this layer can be merged there as well.

@fchollet In your opinion, should this be in keras contrib for now and once Keras 2 is finalized be merged there, or merged into Keras 2 directly? I believe that the layer should not have any conflicts with the changes that will be there in Keras 2. It can be swapped in as a replacement for BN layer pretty much everywhere.

Edit : Cleaned up a few sentences to remove confusion.

@gabrieldemarmiesse
Contributor

Closing as outdated, feel free to open another issue with another call for contributions if you want.
