Decouple the computational batch size and minibatch size by accumulating gradients #1977
Conversation
Decouple the computational batch size and minibatch size by accumulating gradients

* shelhamer/accum-grad:
  accumulate gradients in cudnn conv layer
  accumulate gradients in (de)conv layers
  accumulate gradients in inner product layer
  zero-init param diffs in gradient checker
  zero-init param diffs and accumulate gradients
836a2d9 to 05d2bc4
@@ -477,7 +502,8 @@ void SGDSolver<Dtype>::ComputeUpdateValue() {
     case Caffe::CPU:
       for (int param_id = 0; param_id < net_params.size(); ++param_id) {
         // Compute the value to history, and then copy them to the blob's diff.
-        Dtype local_rate = rate * net_params_lr[param_id];
+        Dtype local_rate = rate * net_params_lr[param_id]
+            / this->param_.iter_size();
I think this does not work correctly. Dividing by iter_size should be applied before accumulating parameter decays.
Multiplying local_decay by iter_size should be okay?

Dtype local_decay = weight_decay * net_params_weight_decay[param_id] * this->param_.iter_size();
Ah... good point.
To clarify: the local_decay needs to be multiplied by the iter_size because the update will include the product of local_rate and local_decay. That is, the update at https://github.com/BVLC/caffe/blob/master/src/caffe/solver.cpp#L497-L499 is computed after weight decay is included at https://github.com/BVLC/caffe/blob/master/src/caffe/solver.cpp#L479-L483. As is, weight decay is defined per iteration, so it should not be scaled by the effective batch size of batch_size * iter_size.
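For concreteness, a minimal standalone sketch of how the two scalings interact (plain C++ with made-up names, not the actual Caffe solver code): dividing local_rate by iter_size normalizes the accumulated gradient, while multiplying local_decay by iter_size keeps weight decay defined per iteration.

#include <cstdio>

// Illustrative only: how the SGD update term is assembled when gradients
// are accumulated over iter_size passes. local_rate is divided by iter_size
// to normalize the accumulated gradient, so local_decay must be multiplied
// by iter_size to keep decay per iteration rather than per effective batch.
float update_value(float accum_diff,   // gradient summed over iter_size passes
                   float weight, float rate, float lr_mult,
                   float weight_decay, float decay_mult, int iter_size) {
  float local_rate  = rate * lr_mult / iter_size;
  float local_decay = weight_decay * decay_mult * iter_size;
  return local_rate * (accum_diff + local_decay * weight);
}

int main() {
  // A mean gradient of 1.0 from one full-batch pass vs. the same mean
  // gradient accumulated over 4 quarter-batch passes (so the sum is 4.0):
  // both give the same update, and the decay contribution is unchanged.
  float full  = update_value(1.0f, 1.0f, 0.01f, 1.0f, 0.0005f, 1.0f, 1);
  float accum = update_value(4.0f, 1.0f, 0.01f, 1.0f, 0.0005f, 1.0f, 4);
  printf("full = %g, accumulated = %g\n", full, accum);
  return 0;
}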
Commented on the diff. By the way, I don't understand very well what @jeffdonahue mentions. Is there any relation between this PR and weight sharing? Gradient accumulations among shared parameters are computed independently (since ...). Anyway, the idea of always accumulating is very good (less memory) if the issues are solved. Both issues do not matter to me since I usually use SGDSolver, and I will notice the behavior change of ...
Oops, I haven't actually stepped through this myself, but I think you're totally right @tnarihi -- there shouldn't be an issue with weight sharing in this implementation. I was confusing it with my version -- I had rebased my RNN PR (#1873) on this, and then just threw the additional changes to ... Besides the other issues Takuya mentioned, I now think this is strictly good (i.e. it doesn't break anything that works now) and should be merged. Maybe I'll write a new PR based on this, or a commit to append to this one, that does ...
I see. Thanks Jeff! Sharing the diff for weight sharing is nice for memory consumption. I think to restrict all ... The other thing is, I think, to notify developers (especially developers working on PRs regarding layers that have parameter updates) that ...
Another good point. At some point I had modified the gradient checker to check accumulation (by adding some random noise to the param blob diffs, calling ...).
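For illustration, a minimal sketch of that kind of accumulation check (a toy backward function in plain C++, not the actual gradient checker code): seed the diff with known noise, run backward, and require that the result is noise plus the gradient, which a layer that overwrites its diff would fail.

#include <cassert>
#include <vector>

// Toy stand-in for a layer backward pass that accumulates into the diff.
void toy_backward(const std::vector<float>& grad, std::vector<float>* diff) {
  for (size_t i = 0; i < grad.size(); ++i) (*diff)[i] += grad[i];
}

int main() {
  std::vector<float> grad  = {0.5f, -1.0f, 2.0f};
  std::vector<float> noise = {0.25f, 0.5f, -0.75f};
  std::vector<float> diff  = noise;         // seed diffs with known noise
  toy_backward(grad, &diff);                // backward must add, not overwrite
  for (size_t i = 0; i < grad.size(); ++i)
    assert(diff[i] == noise[i] + grad[i]);  // accumulation preserves the seed
  return 0;
}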
That sounds like a nice idea. Abstracting ...
Actually I have thought of a simpler way to implement this that is independent of gradient accumulation. Maybe it is too tricky, maybe not. Will update. (I am still mildly in favor of always accumulating gradients, disallowing different ...)
@shelhamer @jeffdonahue @longjon what is happening with this PR? I think we need to find a solution and merge it as soon as possible. Actually I thought it was already merged since the solution has been around for a while.
Accumulating gradients includes subtleties with regard to scaling gradients and hyperparameters w.r.t. the effective batch size vs. the computational batch size. For merge, this needs a test that compares the updates computed by ...
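A rough sketch of what such a test could check (plain C++ with a toy model, not the Caffe test that was eventually written): one step over the full batch should match one step built from accumulated, 1/iter_size-normalized partial passes.

#include <cassert>
#include <cmath>
#include <vector>

// Toy model: loss = 0.5 * (w * x)^2 per sample, so dloss/dw = (w * x) * x.
// mean_grad mimics a Net forward/backward that averages over its own batch.
float mean_grad(const std::vector<float>& xs, float w) {
  float g = 0.f;
  for (float x : xs) g += (w * x) * x;
  return g / xs.size();
}

int main() {
  std::vector<float> batch = {1, 2, 3, 4, 5, 6, 7, 8};
  const float w0 = 0.5f, lr = 0.01f;

  // batch_size 8, iter_size 1: one pass over the whole batch.
  float w_full = w0 - lr * mean_grad(batch, w0);

  // batch_size 2, iter_size 4: accumulate per-pass gradients, then normalize.
  const int iter_size = 4;
  float accum = 0.f;
  for (int i = 0; i < iter_size; ++i) {
    std::vector<float> sub(batch.begin() + 2 * i, batch.begin() + 2 * (i + 1));
    accum += mean_grad(sub, w0);          // each pass normalized by its own batch size
  }
  float w_accum = w0 - lr * (accum / iter_size);  // solver normalizes by iter_size

  assert(std::fabs(w_full - w_accum) < 1e-6f);    // updates must agree
  return 0;
}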
Right, this should be fine after @shelhamer's list. My idea for a simpler implementation did not pan out; it would only have worked for SGD with momentum.
0e7a078 makes the normalization for accumulation more obvious and fixes the issue with AdaGrad by normalizing the gradient before the update and history are computed. However, when gradients are accumulated there's overhead for this separate scaling step. The time to update CaffeNet parameters for ...
Merging for control of memory usage now that this is simple and tested. @sguada sorry for the wait!
I think the PReLU layer also needs to accumulate the gradients. @shelhamer
Here is my implementation of PReLU gradient accumulation: tnarihi@4d3fbd5
Oh sorry, I missed the cherry pick + commit ID. That'll work fine.
if (this->param_.iter_size() == 1) { return; }
// Scale gradient to counterbalance accumulation.
const vector<shared_ptr<Blob<Dtype> > >& net_params = this->net_->params();
const Dtype accum_normalization = Dtype(1.) / this->param_.iter_size();
Is this normalization correct? Doing this will reduce the gradient by a factor of iter_size compared to computing the gradient over an entire batch. If I'm interpreting this correctly, learning rates should be multiplied by iter_size to overcome this existing code. Or: is the learning rate automatically scaled by the batch size elsewhere, and this code is necessary to account for the effective increase in the batch size?
It is done this way due to the separation of Net and Solver, but it is correct. Net normalizes by the (computational) batch size, but only Solver knows about iter_size, so it does the portion of the normalization needed to handle accumulation.
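Put another way (a hedged arithmetic sketch, not Caffe's blob/math API): the loss layer's averaging and the solver's scaling compose to 1/(batch_size * iter_size), i.e. normalization by the effective batch size.

#include <cstdio>

int main() {
  const int batch_size = 4, iter_size = 4;
  // Net / loss layer: each forward/backward averages over its own batch.
  double net_scale = 1.0 / batch_size;
  // Solver: before the update, accumulated diffs are scaled by 1/iter_size.
  double solver_scale = 1.0 / iter_size;
  // Combined, a single sample contributes with weight 1/(batch_size*iter_size),
  // exactly as it would in one pass over the effective batch of 16.
  printf("combined scale = %g = 1/%d\n", net_scale * solver_scale,
         batch_size * iter_size);
  return 0;
}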
Accumulate gradients across batches through the iter_size solver field. With this setting batch_size: 16 with iter_size: 1 and batch_size: 4 with iter_size: 4 are equivalent (see the configuration sketch after this list). This is the master edition of #1663.

- normalize gradients by local_rate and local_decay according to iter_size
- iter_size ...
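For reference, a hedged sketch of setting this up programmatically through the generated SolverParameter protobuf (iter_size is the solver field added by this PR; the net filename and other values are made up for illustration):

#include "caffe/proto/caffe.pb.h"

// Build a solver config whose effective batch size is 16 even though the
// net's data layer only loads batch_size: 4 per forward/backward pass.
caffe::SolverParameter accumulating_solver() {
  caffe::SolverParameter param;
  param.set_net("train_val.prototxt");  // hypothetical net with batch_size: 4
  param.set_base_lr(0.01f);
  param.set_iter_size(4);               // accumulate 4 passes per update
  return param;
}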
Historical context:
From @longjon
From @jeffdonahue
From @shelhamer