Loss generalization #686
Conversation
This is rebased and fairly well-tested. In addition to the several new unit tests, I've verified (seeded) ImageNet training behaves as before (with and without an ACCURACY layer), and verified many variations of lenet training that should be equivalent are equivalent (including the two versions of the SOFTMAX_LOSS I pasted in the original comment above). I hope someone will get a chance to take a look at this at some point soonish to avoid constant rebases. I know it's a lot of code so I understand it might be a little while before someone has the time though -- sorry about that. Possible disadvantages of this PR that I've thought of are the following:
I think code that only uses public Caffe interfaces (including the C library and prototxts) will be completely unaffected.
This is trivial, but can you fix your commit messages? They don't have headers and are just long lines.
This is cosmetic, but it seems to me like … All the same, we've said again and again that now is the time to fix interfaces, so I don't have strong feelings on this.
Yup, I'll clean up the commits. My reasoning for choosing … Open to suggestions on names and overall design -- including switching back to the old way, where each child explicitly calls the parent …
- … in the objective function.
- Check that the loss and gradients throughout the net are appropriately scaled for a few `loss_weight` values, assuming a default weight of 1 in the loss layer only. Also modify `test_gradient_check_util` to associate a loss of 2 rather than 1 with the top blob, so that loss layer tests fail if they don't scale their diffs.
- … its elements are summed with a scalar coefficient. Forward for layers no longer returns a loss; instead all loss layers must have top blobs. Existing loss layers are given a top blob automatically by `Net::Init`, with an associated `top_loss_weight` of 1 (set in `LossLayer::FurtherSetUp`). Due to the increased amount of common `SetUp` logic, the `SetUp` interface is modified such that all subclasses should normally override `FurtherSetUp` only, which is called by `SetUp`.
- …s for it. Test that we can call backward with an `ACCURACY` layer. This currently fails, but should be possible now that we explicitly associate a loss weight with each top blob.
- … used to compute the loss.
- … backward pass when input into a loss.
After discussing with @shelhamer, I've changed the name of the function that layers will now override from …
Loss generalization
This PR generalizes the loss to allow any top blob to produce a loss `L = a * (blob_0 + blob_1 + blob_2 + ... + blob_{N-1})` with some scalar coefficient `a`.

This is accomplished by changing the interface of `Forward_{cpu,gpu}` implemented by layers: they become `void Forward_{cpu,gpu}` rather than `Dtype Forward_{cpu,gpu}`. The current loss layers now all produce a singleton top blob (and don't return a value), which I think they all already did because of @sguada's changes. To allow for backwards compatibility, in the sense that users can still use a loss layer without explicitly specifying a top blob, I added a layer property `bool AutoTopBlobs()` to automatically create the `MinTopBlobs()` or `ExactNumTopBlobs()` required by that layer -- currently only the loss layers override `AutoTopBlobs()` to return true.

To implement the scalar coefficient, you add a proto field `loss_weight` specifying a float for each top blob to your `LayerParameter` definition. For example:
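(A minimal sketch of such an old-style definition; the layer name and the `ip2`/`label` blob names are borrowed from the LeNet example purely for illustration and are not taken from this PR.)

```
# Old-style loss layer: no explicit top blob and no loss_weight.
layers {
  name: "loss"
  type: SOFTMAX_LOSS
  bottom: "ip2"    # predictions (illustrative blob name)
  bottom: "label"  # ground-truth labels (illustrative blob name)
}
```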
That's the "old" way of specifying a `SOFTMAX_LOSS` layer. It still works -- it has an implicit `top` blob with an implicit `loss_weight` of 1. It's equivalent to this:
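(Again a sketch with the same assumed blob names, with the implicit pieces written out.)

```
# Equivalent explicit form: the top blob and the unit loss weight are spelled out.
layers {
  name: "loss"
  type: SOFTMAX_LOSS
  bottom: "ip2"
  bottom: "label"
  top: "loss"
  loss_weight: 1
}
```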
If you'd instead specified `loss_weight: 2`, that would have the exact same effect as doubling your `base_lr` and halving your `weight_decay` (I confirmed this with `lenet_consolidated_solver.prototxt`, which sets a seed -- the training losses were always exactly doubled; test losses were always the same since I didn't set `loss_weight: 2` in the test net). So the `loss_weight` coefficients don't give you any extra power if you only have one loss, but if you have multiple losses, you may want these extra parameters to scale the different losses appropriately.
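As a purely hypothetical sketch of that multi-loss case (the second loss, its `EUCLIDEAN_LOSS` type, and all blob names are invented for illustration), a secondary loss could be down-weighted relative to the primary one:

```
# Hypothetical two-loss fragment: the secondary loss contributes at 0.1x.
layers {
  name: "cls_loss"
  type: SOFTMAX_LOSS
  bottom: "ip2"
  bottom: "label"
  top: "cls_loss"
  loss_weight: 1      # default for *_LOSS layers
}
layers {
  name: "recon_loss"
  type: EUCLIDEAN_LOSS
  bottom: "decode"
  bottom: "data"
  top: "recon_loss"
  loss_weight: 0.1    # scaled down relative to the classification loss
}
```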
`*_LOSS` layers are the only ones that have a default non-zero `loss_weight` (of 1) -- all other layers have `loss_weight: 0` by default, but as long as they can perform `Backward` they can now produce a loss. I'm not entirely sure how useful this will be, but it seemed like a pretty elegant generalization to me and required little extra work. The only layers whose backward passes actually had to change were the LOSS layers themselves. The scale parameter is stored in the `diff()` of the top blob -- in the case of the loss layers that top blob is a singleton, so the loss layers had to be modified to multiply their gradients by the scale specified in the singleton top blob diff, but all the other layers already knew how to backprop their diffs and could be used as is. The only annoying thing was that to get top blobs to be both inputs to other layers and losses, I had to use split layers, as it's functionally the same thing as sending the output to two different layers (I have to accumulate my diff from my direct loss and from any layers I output to).

Another nice thing about this is that it allows you to put an `ACCURACY` layer in a train net in a non-hacky way. Since the accuracy layer produces 0 loss, the net is able to figure out that it can skip running Backward through the accuracy layer. (The exception to this would be if you tried to specify `loss_weight: <something != 0>` in your `ACCURACY` layer, in which case it appropriately breaks.) I added an `ACCURACY` layer to the `lenet_consolidated_solver.prototxt` train net as a preview of this.
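A sketch of what that train-net accuracy layer could look like (blob names again illustrative); with the default `loss_weight` of 0, the net skips Backward through it:

```
# ACCURACY layer in the train net: loss_weight defaults to 0,
# so no backward pass is run through this layer.
layers {
  name: "accuracy"
  type: ACCURACY
  bottom: "ip2"
  bottom: "label"
  top: "accuracy"
}
```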