
Loss generalization #686

Merged: jeffdonahue merged 10 commits into BVLC:dev on Aug 13, 2014

Conversation

jeffdonahue
Contributor

This PR generalizes the loss so that any top blob can produce a loss L = a * (blob_0 + blob_1 + ... + blob_{N-1}), where blob_0, ..., blob_{N-1} are the N elements of the blob and a is a scalar coefficient.

This is accomplished by changing the interface of Forward_{cpu,gpu} implemented by layers: they become void Forward_{cpu,gpu} rather than Dtype Forward_{cpu,gpu}. The loss layers now all produce a singleton top blob (and don't return a value), which I think they already did because of @sguada's changes. For backwards compatibility -- so users can still use a loss layer without explicitly specifying a top blob -- I added a layer property bool AutoTopBlobs(); when it returns true, the MinTopBlobs() or ExactNumTopBlobs() required by that layer are created automatically. Currently only the loss layers override AutoTopBlobs() to return true.

To set the scalar coefficient, you add the new loss_weight proto field to your layer definition, specifying one float per top blob. For example:

  layers {
    name: "loss"
    type: SOFTMAX_LOSS
    bottom: "ip2"
    bottom: "label"
  }

That's the "old" way of specifying a SOFTMAX_LOSS layer. It still works -- it has an implicit top blob with an implicit loss_weight of 1. It's equivalent to this:

  layers {
    name: "loss"
    type: SOFTMAX_LOSS
    bottom: "ip2"
    bottom: "label"
    top: "softmax_error"
    loss_weight: 1
  }

If you'd instead specified loss_weight: 2, that would have the exact same effect as doubling your base_lr and halving your weight_decay (I confirmed this with lenet_consolidated_solver.prototxt, which sets a seed -- the training losses were always exactly doubled; test losses were unchanged since I didn't set loss_weight: 2 in the test net). So the loss_weight coefficients don't give you any extra power if you only have one loss, but with multiple losses these extra parameters let you scale the different losses appropriately.
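
For example, a net with two losses might weight them like this (a hypothetical sketch; the layer names, blob names, and coefficients below are made up, not taken from this PR):

  layers {
    name: "loss_main"
    type: SOFTMAX_LOSS
    bottom: "ip2"
    bottom: "label"
    top: "main_loss"
    loss_weight: 1
  }
  layers {
    name: "loss_aux"
    type: EUCLIDEAN_LOSS
    bottom: "aux_pred"
    bottom: "aux_label"
    top: "aux_loss"
    # Down-weight the auxiliary loss relative to the main loss.
    loss_weight: 0.1
  }

The net's total loss would then be main_loss + 0.1 * aux_loss.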

*_LOSS layers are the only ones with a non-zero default loss_weight (of 1); all other layers default to loss_weight: 0, but as long as they can perform Backward they can now produce a loss. I'm not entirely sure how useful this will be, but it seemed like a pretty elegant generalization and required little extra work. The only layers whose backward passes actually had to change were the LOSS layers themselves: the scale coefficient is stored in the diff() of the top blob, and since a loss layer's top blob is a singleton, each loss layer had to be modified to multiply its gradients by the scale given in that singleton diff. All other layers already knew how to backprop their diffs and could be used as is. The only annoying thing was that for a top blob to be both an input to other layers and a loss, I had to use split layers, since that is functionally the same as sending the output to two different layers (the diff must be accumulated from the direct loss and from any layers the blob feeds).
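
To sketch what the generalization allows (again with made-up layer names and an arbitrary coefficient), giving any layer's top blob a nonzero loss_weight adds the sum of that blob's elements, times the coefficient, to the net's loss; because "ip1" below also feeds another layer, a split layer would be inserted so both diffs are accumulated:

  layers {
    name: "ip1"
    type: INNER_PRODUCT
    bottom: "data"
    top: "ip1"
    # Adds 0.01 * (sum of the elements of "ip1") to the total loss.
    loss_weight: 0.01
    inner_product_param { num_output: 100 }
  }
  layers {
    name: "ip2"
    type: INNER_PRODUCT
    bottom: "ip1"
    top: "ip2"
    inner_product_param { num_output: 10 }
  }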

Another nice thing about this is that it allows you to put an ACCURACY layer in a train net in a non-hacky way. Since the accuracy layer produces 0 loss, the net is able to figure out that it can skip running Backward through the accuracy layer. (The exception to this would be if you tried to specify loss_weight: <something != 0> in your ACCURACY layer, in which case it appropriately breaks.) I added an ACCURACY layer to the lenet_consolidated_solver.prototxt train net as a preview of this.
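
For reference, such a train-net ACCURACY layer is written like any other layer (a minimal sketch with assumed blob names; the actual addition is in lenet_consolidated_solver.prototxt):

  layers {
    name: "accuracy"
    type: ACCURACY
    bottom: "ip2"
    bottom: "label"
    top: "accuracy"
    # loss_weight defaults to 0, so Backward is skipped for this layer.
  }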

@jeffdonahue
Contributor Author

This is rebased and fairly well-tested. In addition to the several new unit tests, I've verified (seeded) ImageNet training behaves as before (with and without an ACCURACY layer), and verified many variations of lenet training that should be equivalent are equivalent (including the two versions of the SOFTMAX_LOSS I pasted in the original comment above).

I hope someone will get a chance to take a look at this at some point soonish to avoid constant rebases. I know it's a lot of code so I understand it might be a little while before someone has the time though -- sorry about that.

Possible disadvantages of this PR that I've thought of are the following:

  • it breaks the method signature (changes return type) of Forward_{cpu,gpu} -- but note that these are the protected methods called only by the public Forward, so it only requires people who have written their own layers to change their return type to void and put the loss computation result into the top blob rather than returning it.
  • it changes the SetUp protocol -- layers now implement their layer-specific setup in FurtherSetUp, which is called by the base SetUp. I did this because there is now more generic setup for each layer that is a pain to have to remember in every layer; to keep the current protocol, every layer implementing SetUp would have to add code to both the beginning and the end of its implementation. My personal preference is to just rename the overridden method rather than require so much of each layer's SetUp method, but I won't do that unilaterally -- what do other people think? As with the previous disadvantage, this only requires changes from people implementing their own layers (and here the change is just renaming the method from SetUp to FurtherSetUp).

I think code that only uses public Caffe interfaces (including the C library and prototxts) will be completely unaffected.

shelhamer self-assigned this on Jul 29, 2014
@shelhamer
Member

This is trivial but can you fix your commit messages? They don't have headers and are just long lines.

@shelhamer
Member

This is cosmetic but it seems to me like SetUp() should keep its name and original purpose as layer initialization and PreSetUp() should prepare the infrastructure. I only say this because SetUp() was the interface method exposed for layer development. SetUp() reads more naturally to me as a method to override.

All the same, we've said again and again that now is the time to fix interfaces, so I don't have strong feelings on this.

@jeffdonahue
Contributor Author

Yup, I'll clean up the commits.

My reasoning for choosing SetUp as the parent method name was that the parent SetUp now calls the child layer's setup method (rather than the other way around, as before), so unless we were going to break the public Layer interface, the Layer-implemented method has to be named SetUp and the method for child classes to override has to be named something else. FurtherSetUp is one option, but I agree it's not very natural.

Open to suggestions on names and overall design -- including switching back to the old way, where each child explicitly calls the parent Layer<Dtype>::SetUp(...) at the beginning of its SetUp. But in that case they'd also now have to call the parent's Layer<Dtype>::PostSetUp(...) at the end, which starts to get to the point where I'd prefer it to be an automated thing (but there could definitely be a better way to do things like this in C++ for all I know).

Commit messages attached to this PR (several truncated by the page, marked with "…"):

  • Check that the loss and gradients throughout the net are appropriately scaled for a few loss_weight values, assuming a default weight of 1 in the loss layer only. Also modify test_gradient_check_util to associate a loss of 2 rather than 1 with the top blob, so that loss layer tests fail if they don't scale their diffs.
  • …its elements are summed with a scalar coefficient.
  • Forward for layers no longer returns a loss; instead all loss layers must have top blobs. Existing loss layers are given a top blob automatically by Net::Init, with an associated top_loss_weight of 1 (set in LossLayer::FurtherSetUp). Due to the increased amount of common SetUp logic, the SetUp interface is modified such that all subclasses should normally override FurtherSetUp only, which is called by SetUp.
  • …s for it. Test that we can call backward with an ACCURACY layer. This currently fails, but should be possible now that we explicitly associate a loss weight with each top blob.

@jeffdonahue
Contributor Author

After discussing with @shelhamer, I've changed the name of the function that layers will now override from FurtherSetUp to LayerSetUp. Merging this momentarily.

jeffdonahue added a commit that referenced this pull request Aug 13, 2014
jeffdonahue merged commit 34831e2 into BVLC:dev on Aug 13, 2014
jeffdonahue deleted the loss-generalization branch on August 13, 2014 at 22:44