Decouple the computational batch size and minibatch size by accumulating gradients #1977
Conversation
Decouple the computational batch size and minibatch size by accumulating gradients

* shelhamer/accum-grad:
  accumulate gradients in cudnn conv layer
  accumulate gradients in (de)conv layers
  accumulate gradients in inner product layer
  zero-init param diffs in gradient checker
  zero-init param diffs and accumulate gradients
836a2d9 to 05d2bc4
@@ -477,7 +502,8 @@ void SGDSolver<Dtype>::ComputeUpdateValue() {
     case Caffe::CPU:
       for (int param_id = 0; param_id < net_params.size(); ++param_id) {
         // Compute the value to history, and then copy them to the blob's diff.
-        Dtype local_rate = rate * net_params_lr[param_id];
+        Dtype local_rate = rate * net_params_lr[param_id]
+            / this->param_.iter_size();
I think this does not work correctly. Dividing by iter_size should be applied before accumulating parameter decays.
Multiplying local_decay by iter_size should be okay?

Dtype local_decay = weight_decay * net_params_weight_decay[param_id] * this->param_.iter_size();
Ah... good point.
To clarify: the local_decay needs to be multiplied by the iter_size because the update will include the product of local_rate and local_decay. That is, the update at https://github.com/BVLC/caffe/blob/master/src/caffe/solver.cpp#L497-L499 is computed after weight decay is included at https://github.com/BVLC/caffe/blob/master/src/caffe/solver.cpp#L479-L483. As is, weight decay is defined per iteration, so it should not be scaled by the effective batch size of batch_size * iter_size.
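For concreteness, a minimal standalone sketch of how the two scalings interact (plain C++ with made-up names, not the actual Caffe solver code): dividing local_rate by iter_size normalizes the accumulated gradient, while multiplying local_decay by iter_size keeps weight decay defined per iteration.

#include <cstdio>

// Illustrative only: how the SGD update term is assembled when gradients
// are accumulated over iter_size passes. local_rate is divided by iter_size
// to normalize the accumulated gradient, so local_decay must be multiplied
// by iter_size to keep decay per iteration rather than per effective batch.
float update_value(float accum_diff,   // gradient summed over iter_size passes
                   float weight, float rate, float lr_mult,
                   float weight_decay, float decay_mult, int iter_size) {
  float local_rate  = rate * lr_mult / iter_size;
  float local_decay = weight_decay * decay_mult * iter_size;
  return local_rate * (accum_diff + local_decay * weight);
}

int main() {
  // A mean gradient of 1.0 from one full-batch pass vs. the same mean
  // gradient accumulated over 4 quarter-batch passes (so the sum is 4.0):
  // both give the same update, and the decay contribution is unchanged.
  float full  = update_value(1.0f, 1.0f, 0.01f, 1.0f, 0.0005f, 1.0f, 1);
  float accum = update_value(4.0f, 1.0f, 0.01f, 1.0f, 0.0005f, 1.0f, 4);
  printf("full = %g, accumulated = %g\n", full, accum);
  return 0;
}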
Commented on the diff. By the way, I don't understand very well what @jeffdonahue mentions. Is there any relation between this PR and weight sharing? Gradient accumulations among shared parameters are computed independently (since ...). Anyway, the idea of always accumulating is very good (less memory) if the issues are solved. Both issues do not matter to me since I usually use SGDSolver, and I will notice the behavior change of ...
Oops, I haven't actually stepped through this myself, but I think you're totally right @tnarihi -- there shouldn't be an issue with weight sharing in this implementation. I was confusing it with my version -- I had rebased my RNN PR (#1873) on this, and then just threw the additional changes to ... Besides the other issues Takuya mentioned, I now think this is strictly good (i.e. it doesn't break anything that works now) and should be merged. Maybe I'll write a new PR based on this, or a commit to append to this one, that does ...
I see. Thanks Jeff! Sharing the diff for weight sharing is nice for memory consumption. I think to restrict all ... The other thing is, I think, to notify developers (especially developers working on PRs regarding layers that have parameter updates) that ...
Another good point. At some point I had modified the gradient checker to check accumulation (by adding some random noise to the param blob diffs, calling ...).
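For illustration, a minimal sketch of that kind of accumulation check (a toy backward function in plain C++, not the actual gradient checker code): seed the diff with known noise, run backward, and require that the result is noise plus the gradient, which a layer that overwrites its diff would fail.

#include <cassert>
#include <vector>

// Toy stand-in for a layer backward pass that accumulates into the diff.
void toy_backward(const std::vector<float>& grad, std::vector<float>* diff) {
  for (size_t i = 0; i < grad.size(); ++i) (*diff)[i] += grad[i];
}

int main() {
  std::vector<float> grad  = {0.5f, -1.0f, 2.0f};
  std::vector<float> noise = {0.25f, 0.5f, -0.75f};
  std::vector<float> diff  = noise;         // seed diffs with known noise
  toy_backward(grad, &diff);                // backward must add, not overwrite
  for (size_t i = 0; i < grad.size(); ++i)
    assert(diff[i] == noise[i] + grad[i]);  // accumulation preserves the seed
  return 0;
}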
That sounds like a nice idea. Abstracting ...
Actually I have thought of a simpler way to implement this that is independent of gradient accumulation. Maybe it is too tricky, maybe not. Will update. (I am still mildly in favor of always accumulating gradients, disallowing different ...)
@shelhamer @jeffdonahue @longjon what is happening with this PR? I think we need to find a solution and merge it as soon as possible. Actually I thought it was already merged since the solution has been around for a while.
Accumulating gradients includes subtleties with regard to scaling gradients and hyperparameters w.r.t. the effective batch size vs. the computational batch size. For merge, this needs a test that compares the updates computed by ...
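A rough sketch of what such a test could check (plain C++ with a toy model, not the Caffe test that was eventually written): one step over the full batch should match one step built from accumulated, 1/iter_size-normalized partial passes.

#include <cassert>
#include <cmath>
#include <vector>

// Toy model: loss = 0.5 * (w * x)^2 per sample, so dloss/dw = (w * x) * x.
// mean_grad mimics a Net forward/backward that averages over its own batch.
float mean_grad(const std::vector<float>& xs, float w) {
  float g = 0.f;
  for (float x : xs) g += (w * x) * x;
  return g / xs.size();
}

int main() {
  std::vector<float> batch = {1, 2, 3, 4, 5, 6, 7, 8};
  const float w0 = 0.5f, lr = 0.01f;

  // batch_size 8, iter_size 1: one pass over the whole batch.
  float w_full = w0 - lr * mean_grad(batch, w0);

  // batch_size 2, iter_size 4: accumulate per-pass gradients, then normalize.
  const int iter_size = 4;
  float accum = 0.f;
  for (int i = 0; i < iter_size; ++i) {
    std::vector<float> sub(batch.begin() + 2 * i, batch.begin() + 2 * (i + 1));
    accum += mean_grad(sub, w0);          // each pass normalized by its own batch size
  }
  float w_accum = w0 - lr * (accum / iter_size);  // solver normalizes by iter_size

  assert(std::fabs(w_full - w_accum) < 1e-6f);    // updates must agree
  return 0;
}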
Right, this should be fine after @shelhamer's list. My idea for a simpler implementation did not pan out; it would only have worked for SGD with momentum.
0e7a078 makes the normalization for accumulation more obvious and fixes the issue with AdaGrad by normalizing the gradient before the update and history are computed. However, when gradients are accumulated there's overhead for this separate scaling step. The time to update CaffeNet parameters for ...
Merging for control of memory usage now that this is simple and tested. @sguada sorry for the wait!
I think the PReLU layer also needs to accumulate the gradients. @shelhamer
Here is my implementation of PReLU gradient accumulation: tnarihi@4d3fbd5
Oh sorry, I missed the cherry pick + commit ID. That'll work fine.
if (this->param_.iter_size() == 1) { return; }
// Scale gradient to counterbalance accumulation.
const vector<shared_ptr<Blob<Dtype> > >& net_params = this->net_->params();
const Dtype accum_normalization = Dtype(1.) / this->param_.iter_size();
Is this normalization correct? Doing this will reduce the gradient by a factor of iter_size compared to computing the gradient over an entire batch. If I'm interpreting this correctly, learning rates should be multiplied by iter_size to overcome this existing code. Or: is the learning rate automatically scaled by the batch size elsewhere, and this code is necessary to account for the effective increase in the batch size?
It is done this way due to the separation of Net and Solver, but it is correct. Net normalizes by the (computational) batch size, but only Solver knows about iter_size, so it does the portion of the normalization needed to handle accumulation.
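Put another way (a hedged arithmetic sketch, not Caffe's blob/math API): the loss layer's averaging and the solver's scaling compose to 1/(batch_size * iter_size), i.e. normalization by the effective batch size.

#include <cstdio>

int main() {
  const int batch_size = 4, iter_size = 4;
  // Net / loss layer: each forward/backward averages over its own batch.
  double net_scale = 1.0 / batch_size;
  // Solver: before the update, accumulated diffs are scaled by 1/iter_size.
  double solver_scale = 1.0 / iter_size;
  // Combined, a single sample contributes with weight 1/(batch_size*iter_size),
  // exactly as it would in one pass over the effective batch of 16.
  printf("combined scale = %g = 1/%d\n", net_scale * solver_scale,
         batch_size * iter_size);
  return 0;
}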
Accumulate gradients across batches through the iter_size solver field. With this setting batch_size: 16 with iter_size: 1 and batch_size: 4 with iter_size: 4 are equivalent (see the configuration sketch after this list). This is the master edition of #1663.

- normalize gradients by local_rate and local_decay according to iter_size
- iter_size ...
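For reference, a hedged sketch of setting this up programmatically through the generated SolverParameter protobuf (iter_size is the solver field added by this PR; the net filename and other values are made up for illustration):

#include "caffe/proto/caffe.pb.h"

// Build a solver config whose effective batch size is 16 even though the
// net's data layer only loads batch_size: 4 per forward/backward pass.
caffe::SolverParameter accumulating_solver() {
  caffe::SolverParameter param;
  param.set_net("train_val.prototxt");  // hypothetical net with batch_size: 4
  param.set_base_lr(0.01f);
  param.set_iter_size(4);               // accumulate 4 passes per update
  return param;
}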
Historical context:
From @longjon
From @jeffdonahue
From @shelhamer