This repository has been archived by the owner on Nov 3, 2022. It is now read-only.

Call for Contribution #17

Closed
junwei-pan opened this issue Feb 12, 2017 · 24 comments

Comments

@junwei-pan
Contributor

junwei-pan commented Feb 12, 2017

Contributions with respect to the following layers and examples are welcome:

@titu1994
Contributor

titu1994 commented Feb 12, 2017 via email

@junwei-pan
Contributor Author

The models I have are builder types, meaning there are multiple parameters
which must be supplied.

I think you can provide builder-type wrappers, defining several functions with different parameters, such as def wrn_28_10(): return wrn.create_wide_residual_network(ip, nb_classes=10, N=4, k=10, dropout=0.0, verbose=1), and so on (a minimal sketch follows).
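Purely as an illustration of that suggestion, a hedged sketch of such builder wrappers (the import path, input handling, and the second wrapper are assumptions, not the repository's actual layout):

# Hypothetical builder-style wrappers; the module name and input handling are
# illustrative assumptions.
from keras.layers import Input
import wide_residual_network as wrn  # assumed module exposing create_wide_residual_network

def wrn_28_10(nb_classes=10):
    # WRN-28-10 for CIFAR-sized images: N=4 (depth 6N+4 = 28), width k=10
    ip = Input(shape=(32, 32, 3))
    return wrn.create_wide_residual_network(ip, nb_classes=nb_classes, N=4, k=10,
                                            dropout=0.0, verbose=1)

def wrn_16_4(nb_classes=10):
    # WRN-16-4: N=2 (depth 16), width k=4
    ip = Input(shape=(32, 32, 3))
    return wrn.create_wide_residual_network(ip, nb_classes=nb_classes, N=2, k=4,
                                            dropout=0.0, verbose=1)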

On top of that, the models are not trained on ImageNet. Only Cifar 10 or

I think that's fine.

Also, while we are discussing this, what about the callback builder for
Snapshot Ensemble?

Let's discuss this on the original issue

@titu1994
Contributor

titu1994 commented Feb 12, 2017

@kemaswill Thanks for the clarification.

Should I add an example script of how to train the DenseNet model as well? As in a script which shows how the DenseNet-40-12 model was trained on the CIFAR-10 dataset?

I'm just about ready to make a PR for DenseNet.

@junwei-pan
Contributor Author

Yes, go ahead.

@titu1994
Contributor

@kemaswill I was wondering, isn't BatchNormalization with mode=1 the same as Layer Normalization? Or am I missing something here?

@titu1994
Contributor

titu1994 commented Feb 15, 2017

@kemaswill I am attempting to implement the Batch Renormalization paper, but I have a few questions:

  1. How do I get the current iteration number in the layer? The paper mentions gradually altering r_max and d_max after a certain number of iterations, but I don't think Keras layers receive any information about the current iteration / epoch number.

  2. I would need to replicate a large portion of K.normalize_batch_in_training(), because currently it computes and returns (gamma * ((x - mean) / std)) + beta, whereas what this algorithm actually needs is (((x - mean) / std) * r + d) * gamma + beta.

So, should I use K.mean and K.sqrt(K.var(...)) instead? Because I am quite sure that will be a huge performance bottleneck. On the flip side, if I use the theano backend, normalize_batch_in_training() simply calls the theano batch_normalization_train function, which provides high performance, but I can't perform renormalization then (since it returns a batch normalized x).

Is it worth the performance penalty on theano backend to manually compute the mean and std? On tensorflow, I can get the mean and std using tf.nn.moments() but I can't see any similar function for theano.

EDIT: For reference, this is the current implementation https://github.com/titu1994/keras-contrib/blob/batch_renorm/keras_contrib/layers/normalization.py

Also, I am wondering about the updates to the running mean and standard deviation. They are inside the if mode == 0 block. So should I move them out of there and modify the momentum delta? Since the paper suggests a fairly high momentum.
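For context only, a minimal hedged sketch of what computing the renormalization directly from K.mean / K.var could look like (this is not the linked implementation; the argument names, shapes, and clipping of r and d are assumptions):

from keras import backend as K

# Sketch of the batch renormalization training-time transform using generic
# backend ops; running_mean, running_std, gamma, beta, r_max and d_max are
# assumed to already exist as variables of compatible shape.
def batch_renorm_train(x, gamma, beta, running_mean, running_std,
                       r_max, d_max, axis=0, epsilon=1e-5):
    mean = K.mean(x, axis=axis, keepdims=True)
    std = K.sqrt(K.var(x, axis=axis, keepdims=True) + epsilon)

    # r and d pull the batch statistics toward the running statistics;
    # gradients must not flow through them, hence stop_gradient.
    r = K.stop_gradient(K.clip(std / running_std, 1.0 / r_max, r_max))
    d = K.stop_gradient(K.clip((mean - running_mean) / running_std, -d_max, d_max))

    x_hat = (x - mean) / std
    return gamma * (x_hat * r + d) + beta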

@the-moliver
Collaborator

@titu1994

  1. I think you should be able to change those during training using callbacks.
  2. You could probably use the output of K.normalize_batch_in_training() as such:
out = (gamma * ((x - mean) / std)) + beta
new_out = r*out + gamma*d - beta*(r-1)

because
new_out = (((x - mean) / std) * r + d) * gamma + beta

In any case, K.mean and K.std call the corresponding Theano and TensorFlow functions, so you can always use them if needed.
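A quick hedged sanity check of that identity with NumPy (purely illustrative values):

import numpy as np

# Verify that r*out + gamma*d - beta*(r - 1) equals
# (((x - mean) / std) * r + d) * gamma + beta for arbitrary test values.
rng = np.random.RandomState(0)
x = rng.randn(8, 4)
mean, std = x.mean(axis=0), x.std(axis=0)
gamma, beta = rng.randn(4), rng.randn(4)
r, d = rng.uniform(0.5, 2.0, 4), rng.uniform(-1.0, 1.0, 4)

out = gamma * ((x - mean) / std) + beta
lhs = r * out + gamma * d - beta * (r - 1)
rhs = (((x - mean) / std) * r + d) * gamma + beta
print(np.allclose(lhs, rhs))  # True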

@titu1994
Contributor

titu1994 commented Feb 15, 2017

@the-moliver Wouldn't this be less efficient than simply computing the mean and std manually (getting K.mean() and K.var() and then computing the std)? Because we would also be paying for the gamma and beta multiplications inside normalize_batch_in_training?

Also, changing a core parameter of a layer using a callback is not a solution, especially when you consider the fact that more than one BatchRenormalization layer may be added. It is not feasible to keep track of multiple layers and manually update a parameter on each of them.

@the-moliver
Collaborator

the-moliver commented Feb 15, 2017

It may actually be more efficient, since K.normalize_batch_in_training() uses specialized theano and tensorflow calls and returns the mean and variance, which can be used to compute d and r. You'll probably have to use something like the trick I mentioned if you want to use the specialized calls. But it may also be better to just implement it directly with the K.mean and K.std/K.var functions. It's hard to know without testing.

Yeah, I agree it's a little cumbersome, but given that you'd probably want all BatchRenormalization layers to change their parameter together, it should be possible to have a callback change all of them. @bobchennan did something like this with dropout rates: keras-team/keras#3424
Not sure if there is a better solution...

@titu1994
Contributor

titu1994 commented Feb 15, 2017

@the-moliver For now I guess I will keep the standard calls (K.mean and K.var), but your approach is also valid.

If I remember correctly, theano has direct cuda calls for batchnorm, so the batchnorm itself would be faster, but then performing more operations on that output may end up costing us. Can't tell without performance tests.

As for changing the r_max and d_max, that seems like a very hackish way of doing things. Perhaps it is better to just stick with the final r_max and d_max mentioned in the paper (r_max = 3 and d_max = 5).

According to the paper, leaving it at a high value from the start leads to problems. Specifically, from the paper:

However, at the beginning of training, when the learning rate was larger, 
it proved important to increase rmax slowly: otherwise, large gradients 
were observed to suddenly and severely increase the loss.

@the-moliver
Collaborator

Alternatively, it may be possible to accomplish essentially the same thing by changing one line in the BatchNormalization layer:

x_normed = K.in_train_phase(x_normed, x_normed_running)

to

x_normed = K.in_train_phase((1 - param) * x_normed + param * x_normed_running, x_normed_running)

with param controlling the tradeoff between normalizing with batch statistics and normalizing with running statistics.

@bobchennan

from keras import backend as K
from keras.callbacks import Callback

class AnnealedDropoutCallback(Callback):
    """Linearly anneal the dropout ratio of every MyDropout layer to 0 over N epochs."""
    def __init__(self, N):
        super(AnnealedDropoutCallback, self).__init__()
        self.N = N

    def on_epoch_end(self, epoch, logs=None):
        # The ratio decays from 1 toward 0 as training progresses.
        v = max(0.0, 1.0 - float(epoch) / self.N)
        for layer in self.model.layers:
            # MyDropout is a custom dropout layer whose ratio is a backend variable.
            if isinstance(layer, MyDropout):
                K.set_value(layer.ratio, v)

Should be similar.
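For the batch renormalization case discussed above, a hedged adaptation could look like the sketch below. It assumes, hypothetically, that the BatchRenormalization layer stores r_max and d_max as backend variables, and it simply ramps them toward the paper's final values of 3 and 5 over the first few epochs:

from keras import backend as K
from keras.callbacks import Callback

# Hypothetical schedule callback; the r_max / d_max attributes are assumptions
# about the layer, not a documented API.
class RenormScheduleCallback(Callback):
    def __init__(self, ramp_epochs=5):
        super(RenormScheduleCallback, self).__init__()
        self.ramp_epochs = ramp_epochs

    def on_epoch_end(self, epoch, logs=None):
        progress = min(1.0, float(epoch + 1) / self.ramp_epochs)
        for layer in self.model.layers:
            if layer.__class__.__name__ == 'BatchRenormalization':
                K.set_value(layer.r_max, 1.0 + 2.0 * progress)  # ramps 1 -> 3
                K.set_value(layer.d_max, 5.0 * progress)        # ramps 0 -> 5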

@titu1994
Contributor

@the-moliver That portion of the code is checking whether Keras is in training or testing mode. In mode 0, batch statistics are used at training time but not at testing time.

I don't quite understand what your code would do.

@titu1994
Contributor

@bobchennan I was trying to avoid having a separate callback for just a special class.

@fchollet Is there any plan in Keras 2 to support such manipulations? I'll hold back on the implementation of this algorithm in that case, until Keras 2 is formalized.

@the-moliver
Collaborator

@titu1994 In the original code, x_normed and x_normed_running are always both computed, and then one is chosen as the output by K.in_train_phase depending on whether it's train or test time. My code outputs a weighted combination of x_normed and x_normed_running during training, and x_normed_running at test time.

@titu1994
Contributor

titu1994 commented Feb 16, 2017

@the-moliver I get that, but what would such weighted combination accomplish? It would not be equivalent to the paper.

@the-moliver
Collaborator

the-moliver commented Feb 16, 2017

@titu1994 Exactly, I was just proposing it as an alternative and simpler implementation.
Edit: Sorry, misread. It's not quite mathematically identical, but works out to be pretty close and probably has similar behavior.
Edit: I guess I didn't misread, you added a "not" ;-)

@titu1994
Contributor

titu1994 commented Feb 16, 2017

@the-moliver Oh ok. More important than that is trying to set the value of r_max and d_max.

If it is going to be a part of the Keras 2 spec, we can for the time being use the callback approach and then when Keras 2 is finalized we can drop support for the callback and switch to the native implementation. Is that alright?

If not, we can continue using the callback approach.

In the meantime, I think setting r_max = 3 and d_max = 5 as the initial values will avoid a problem if the user forgets to set the values via the callback. It won't be exactly according to the paper, but at least it won't cause exploding gradients to occur. Is that alright?

@junwei-pan
Contributor Author

I was wondering, isn't BatchNormalization with mode=1 the same as Layer Normalization? Or am I missing something here?

@titu1994 BatchNormalization with mode=0, 1, or 2 all use the mean and variance of the summed inputs to a neuron over a mini-batch of training cases to do the normalization, whereas layer normalization uses the mean and variance of all of the summed inputs to the neurons in a layer on a single training case.
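A tiny NumPy illustration of that distinction (the shapes are arbitrary):

import numpy as np

# For activations of shape (batch, features), batch normalization computes
# statistics per neuron across the mini-batch, while layer normalization
# computes them per sample across the neurons of the layer.
x = np.random.randn(32, 64)                       # (batch_size, num_neurons)

bn_mean, bn_var = x.mean(axis=0), x.var(axis=0)   # shape (64,): one per neuron
ln_mean, ln_var = x.mean(axis=1), x.var(axis=1)   # shape (32,): one per sample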

@titu1994
Contributor

Oh OK. Thanks for the clarification.

@titu1994
Contributor

titu1994 commented Feb 18, 2017

@kemaswill @the-moliver So I implemented the algorithm, and to me it looks like it follows the paper exactly in all regards, other than updating the r_max and d_max.

In a few words, it is performing exceedingly poorly. I am testing it on a WRN-16-4 model with the theano backend (it works for tensorflow as well) on the CIFAR-10 dataset. The same model with BatchNormalization performs decently (I am only training for 100 iterations with Adam, not decreasing the learning rate or doing a full run of 300 epochs), but the Batch Renormalization model is failing miserably.
I think a few reasons may be :

1) The paper mentions explicit updates to the running averages (mean and standard deviation), and then goes on to say that the rest is updated via gradient optimization. I am doing that by pulling the update call out of the inner if mode == 0 block into the general flow of execution. That should be correct, right?

2) The loss initially increases during the first 3-5 iterations, and validation accuracy is terrible/constant for the first few epochs. Is this solely due to the initial r_max and d_max values being so high? To me, it looks like the gradients in the first couple of iterations are too strong due to the high learning rate.

What could be the possible errors?

Edit : See below for update.

@titu1994
Contributor

titu1994 commented Feb 19, 2017

@kemaswill @the-moliver So I managed to fix the training problem (i.e. the fact that r_max and d_max need to be adjusted per iteration). In fact, I had to settle for per-epoch updating of the max values, even though the paper suggests specific iterations at which r_max and d_max should reach their maximum possible values.

How I did it: I used the biological population growth formula (basically a sigmoid with extra parameters),
N(t) = K / (1 + (K / N_0 - 1) * e^(-r * t)),
as a function to compute the value at epoch t. It has the useful property that the same equation, with different initial values, yields two curves, both of which are well suited to smoothly ramping the current values of r_max and d_max up to their next higher state.

The plot below shows the value as t increases. By the 5th-6th epoch, the max value has been reached, and the gradient explosion problem is now avoided. (Ignore the derivative part of the plot.)

[Figure: biological growth rate curve used to schedule r_max and d_max]
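A hedged sketch of such a schedule (the growth rate and initial values below are illustrative assumptions; only the final caps of 3 and 5 come from the paper):

import numpy as np

# Logistic ("biological population growth") schedule:
# N(t) = K / (1 + (K / N_0 - 1) * exp(-r * t))
def logistic_schedule(t, cap, initial, rate):
    return cap / (1.0 + (cap / initial - 1.0) * np.exp(-rate * t))

for epoch in range(8):
    r_max = logistic_schedule(epoch, cap=3.0, initial=1.0, rate=1.0)   # ramps 1 -> 3
    d_max = logistic_schedule(epoch, cap=5.0, initial=0.05, rate=1.5)  # ramps ~0 -> 5
    print(epoch, round(r_max, 3), round(d_max, 3))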

New issues:

  1. Mode 2 Batch Renormalization is learning fast. Too fast. I mean to say that with the same WRN model with Batch Renorm and mode 2, I was able to get 67% validation accuracy in the 1st epoch, with training accuracy at just 51%. That is in comparison to ~36% validation accuracy with normal batch normalization and a training accuracy of 42%. This may be because CIFAR-10 is too easy to train a network on (WRN-16-4 has 2.8 million parameters in it, but it can't over/underfit in just 1 epoch, I hope).

  2. With mode 0, training accuracy rises drastically, to near 51%, but the validation score is very poor, at just 21-35% even over the next 5-6 epochs. By this time, training accuracy has already hit 94%, which is weird. This may be due to how inference is formulated in mode 0 vs mode 2. In mode 0, I am returning the non-renormalized result (the simple batch norm result, as suggested by the paper) as the output of K.in_train_phase. When I use mode 2, I am sending the renormalized output.

Perhaps the weights are adjusting to handle just the renormalized values? I am stopping gradient flow from r and d with K.stop_gradient, so it can't be the case that the gradients of r and d are affecting the training process.

EDIT: Fixed the problem. Sadly, the naming of the original batch norm variables was incorrect: the running std was actually computed as the running variance, but the variable was named running_std. Because of the erroneous name, I was applying updates meant for the running std to a variable which should hold running variance values. Now that it is fixed, I re-ran the mode 2 tests. It works the same as before (even improving slightly: the 1st epoch is 68.7% instead of 67.1%, the 5th epoch is 84.54% instead of 84.34%), and mode 0 works now as well, but it learns slowly compared to mode 2 (training accuracy fluctuates between 71-76% in the first 10 epochs, similar to the Batchnorm implementation with mode=0).

  3. I've removed support for sample-wise batch renormalization. The paper doesn't mention it, and I can't understand how it would work. For now, it simply throws a ValueError.

EDIT: Did some research and found that it is the same as batch norm, but with the mean and std calculated for each sample rather than for each channel. Fixed this, and now mode 1 works as well.

  4. Performance: With simple batch normalization, each epoch on my laptop takes 137 seconds on average (980M) with a batch size of 128. With batch renormalization, it takes 142 (mode 2) to 152 (mode 0) seconds on average. This is by far the fastest implementation I was able to make. I tried the other two changes suggested by @the-moliver, but they require far more time per epoch (183 seconds with the first change, 169 seconds with the second). In any case, I did expect it to be a bit slower than Batchnorm, simply because it performs more operations.

But from what I am seeing, the speed with which networks learn using batch renorm is more than worth the speed loss (in mode 2, for now). Not even DenseNet-28-8 or a modified Inception ResNet v2 learns this fast on CIFAR-10 or CIFAR-100 (with mode 2). For comparison, on the theano backend (speaking of validation accuracy scores), Inception ResNet v2 (modified to handle 32x32 images) reaches 59% accuracy in the first 2 epochs, whereas DenseNet-28-8 reaches 61% in 2 epochs. This one reaches 79% in the second epoch with WRN-16-4, one of the smallest WRNs I've ever tested. The WRN-16-4 with normal Batch Normalization gets just 56% accuracy after 2 epochs! In just 5 epochs, WRN-16-4 with Batch Renormalization beat the score that the WRN-16-4 model with normal Batch Normalization reached after 100 epochs (84.33% vs 84.19%)!

Any suggestions for points 2 and 3?

The code is currently at : https://github.com/titu1994/keras-contrib/blob/batch_renorm/keras_contrib/layers/normalization.py

Below is the training history for batchnorm and batch renorm. (Note: both models overfit because I did not apply image augmentation or dropout, just batchnorm or batch renorm on the wide resnet.) As you can see, batch renorm (mode 2) did better by a small margin, and achieved a higher score far faster than batchnorm.
[Figure: training history, batchnorm vs batch renorm]

Will be uploading the training graph of batch renorm with mode 0 in a few hours once training is over.

@titu1994
Contributor

titu1994 commented Feb 19, 2017

@farizrahman4u @kemaswill @the-moliver On a final note, should I put BRn (Batch Renormalization) in Keras (the Keras 2 branch) or in keras_contrib? BRn is super useful for fast training, especially in the case of mode 2, and BatchNormalization layers can simply be swapped out for BatchRenormalization layers for an improvement. For now, I think I'll add a PR, and when Keras 2 is finalized perhaps this layer can be merged there as well.

@fchollet In your opinion, should this be in keras contrib for now and once Keras 2 is finalized be merged there, or merged into Keras 2 directly? I believe that the layer should not have any conflicts with the changes that will be there in Keras 2. It can be swapped in as a replacement for BN layer pretty much everywhere.

Edit : Cleaned up a few sentences to remove confusion.

@gabrieldemarmiesse
Contributor

Closing as outdated, feel free to open another issue with another call for contributions if you want.
