Question regarding paper #1

Closed
melgor opened this issue Jul 21, 2017 · 8 comments

melgor commented Jul 21, 2017

I am really grateful that @wy1iu released the code. You and @ydwen are really pushing face verification forward.

I have some questions regarding the paper and the code:

  1. What is the major change between L-Softmax and A-Softmax? From the equations it looks like L-Softmax works with the norms of the weights, while A-Softmax replaces the weights with normalized weights, right? If that is true, was the main motivation Section 3.3 of "Large-Margin Softmax Loss for Convolutional Neural Networks"?
  2. Could you explain how you chose the function ψ (which replaces cos(θ))?
  3. In both papers you use a series expansion of cos(mθ) (Eq. 7 in the Large-Margin paper), right? What was the idea behind using a different degree of expansion depending on the margin value? Why not use the same one for all margins?
  4. Here is my intuition about both papers: in effect we just scale the output of the linear layer by a matrix of ones that carries different values (< 1) at the target classes. Both papers propose a different scaling method (with a theoretical explanation). Maybe it is possible to write an implementation that just uses such a scale matrix. I need to think about it, since there are many non-linear operations involved.
  5. I was thinking about using center loss with cosine similarity, but then I realized that it is equivalent to a softmax layer without bias (and softmax also compares features to the other class centers, not only the target one, so it makes the features even better). Do you agree with my interpretation?

wy1iu commented Jul 21, 2017

Hi, thank you for your interest in our work. I am happy to answer your questions. :)

  1. A-Softmax normalizes the weights and zeros out the biases in the final FC layer, which makes the loss penalize only the angles. In contrast, L-Softmax does not necessarily normalize the weights and zero out the biases, so it does not necessarily penalize the angles, although in the toy examples it did. The main difference is clearly described in the SphereFace paper.

  2. It is natural to preserve the $[0,\frac{\pi}{m}]$ part of the function $\cos(m\theta)$. So all we need to do is design the $[\frac{\pi}{m},\pi]$ part. In fact, the design of this part is not very crucial as long as it is monotonically decreasing.

  3. It is simply a decomposition of $\cos(m\theta)$. When $m$ changes, the decomposition changes too (see the sketch after this list).

  4. Of course there can be a different interpretation (something like a matrix form). However, as you mentioned, the nonlinearity may be difficult to model.

  5. I kind of agree. Softmax without biases may do the same job as center loss, although the back-prop dynamics may be different. Thus adding center loss may help a lot at the beginning, but will improve things less and less as training goes on. However, combining center loss and softmax loss still makes sense to me.
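
For readers following along, here is a minimal NumPy sketch of how I read answers 2 and 3 (my own code, not from this repo): ψ keeps $\cos(m\theta)$ on $[0,\frac{\pi}{m}]$ and is extended piecewise as $\psi(\theta) = (-1)^k \cos(m\theta) - 2k$ for $\theta \in [\frac{k\pi}{m}, \frac{(k+1)\pi}{m}]$, which makes it monotonically decreasing on $[0,\pi]$; the per-margin "decomposition" simply rewrites $\cos(m\theta)$ as a polynomial in $\cos\theta$, a different polynomial for each $m$, which is why the degree changes with the margin.

```python
import numpy as np

def psi(theta, m):
    """Piecewise target function: psi(theta) = (-1)^k * cos(m*theta) - 2k
    on each segment theta in [k*pi/m, (k+1)*pi/m], k = 0, ..., m-1."""
    k = np.floor(theta * m / np.pi).astype(int)
    return (-1.0) ** k * np.cos(m * theta) - 2.0 * k

# cos(m*theta) rewritten as a polynomial in c = cos(theta); the polynomial
# (and hence the degree of the "decomposition") differs for each margin m.
cos_m_theta = {
    2: lambda c: 2 * c**2 - 1,
    3: lambda c: 4 * c**3 - 3 * c,
    4: lambda c: 8 * c**4 - 8 * c**2 + 1,
}

theta = np.linspace(0.0, np.pi, 10001)
for m, poly in cos_m_theta.items():
    # the polynomial form of cos(m*theta) is exact, not an approximation
    assert np.allclose(np.cos(m * theta), poly(np.cos(theta)))
    vals = psi(theta, m)
    # psi coincides with cos(m*theta) on [0, pi/m] ...
    head = theta <= np.pi / m
    assert np.allclose(vals[head], np.cos(m * theta[head]))
    # ... and is monotonically decreasing on the whole of [0, pi]
    assert np.all(np.diff(vals) <= 1e-9)
```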


melgor commented Jul 24, 2017

I have a question about your nice implementation of MarginInnerProductLayer.
It is very efficient, much more so than directly using the formulas from the paper.

I almost understand the idea behind it, but I still cannot figure out how you found the formulas for sign_1 and the others.
It is a very interesting way of replacing any for/while loop for finding the value of k. Could you explain how you found such formulas, or point me to what field I should study to get an intuition for them?

wy1iu closed this as completed Aug 10, 2017

melgor commented Aug 10, 2017

Could you explain how you got the approximation for this equation?


ydwen commented Aug 10, 2017

Hi melgor. I am not sure I have understood exactly what you are asking.
I guess you are confused by the implementation: why didn't we follow the equations in the paper exactly when implementing the layer?
The answer is efficiency. It is an alternative implementation and there is no approximation in our code. sign_1 and the others are intermediate variables, designed to avoid repeated computation. It may not be the optimal way, but it is a trade-off between speed and memory.


wy1iu commented Aug 10, 2017

Sorry for missing your question @melgor. As ydwen mentioned, our implementation is efficient in the sense that we store some intermediate computation results for subsequent reuse (similar to the idea of dynamic programming). It basically trades memory for speed. Most importantly, this implementation is entirely equivalent to the original formulation in the paper (no approximation happens).
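
As an illustration of that reuse, here is a minimal sketch of my own (not the repo's Caffe layer; the names `forward_cache` and `backward_from_cache` are hypothetical): for m = 4 the target value ψ(θ) can be built from cos θ and a few sign variables, all computed once in the forward pass and reused by the backward pass, and the result matches the direct piecewise formula exactly.

```python
import numpy as np

def forward_cache(c):
    """psi(theta) for m = 4, computed from c = cos(theta) alone (no arccos).
    Every intermediate is stored so the backward pass can reuse it."""
    c2 = c ** 2
    sign_0 = np.sign(c)
    sign_3 = sign_0 * np.sign(2 * c2 - 1)                 # = (-1)^k (see the check further below)
    sign_4 = 2 * sign_0 + sign_3 - 3                      # = -2k
    psi = sign_3 * (8 * c2 ** 2 - 8 * c2 + 1) + sign_4    # = (-1)^k cos(4*theta) - 2k
    return psi, {"c": c, "c2": c2, "sign_3": sign_3}

def backward_from_cache(cache):
    """d psi / d c, written purely in terms of cached quantities (nothing recomputed)."""
    c, c2, sign_3 = cache["c"], cache["c2"], cache["sign_3"]
    return sign_3 * (32 * c * c2 - 16 * c)

# sanity check at a few angles away from the segment boundaries k*pi/4
theta = np.array([0.1, 0.5, 1.0, 1.3, 1.8, 2.2, 2.7, 3.0])
c = np.cos(theta)
psi, cache = forward_cache(c)

# the no-arccos result matches the direct piecewise definition exactly
k = np.floor(4 * theta / np.pi).astype(int)
assert np.allclose(psi, (-1.0) ** k * np.cos(4 * theta) - 2 * k)

# the cached backward matches a numerical derivative w.r.t. cos(theta)
h = 1e-6
num = (forward_cache(c + h)[0] - forward_cache(c - h)[0]) / (2 * h)
assert np.allclose(backward_from_cache(cache), num, atol=1e-4)
```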

wy1iu reopened this Aug 10, 2017

melgor commented Aug 10, 2017

Thanks for the answer. I was just trying to derive your equations from the original ones in the paper and could not get exactly the same result. (I am doing it as an exercise, since your implementation is much faster than the naive one.)

nyyznyyz1991 commented

@wy1iu @melgor
Thanks for your discussion. The implementation of sign_3 and sign_4 (with m = 4) is impressive and elegant: it gets rid of computing theta via arccos and avoids repeated computation. How did you deduce the formulas?
sign_3 = sign_0 * sign(2 * cos_theta_quadratic_ - 1)
sign_4 = 2 * sign_0 + sign_3 - 3
Is there any explanation about it?
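
Not the authors' derivation, but here is how I convinced myself that the two lines above encode $k = \lfloor 4\theta/\pi \rfloor$: $\mathrm{sign}(\cos\theta)$ tells you whether $\theta$ lies in the left half $[0, \frac{\pi}{2})$, $\mathrm{sign}(2\cos^2\theta - 1) = \mathrm{sign}(\cos 2\theta)$ tells you whether $\theta$ lies in the outer quarters $[0, \frac{\pi}{4}) \cup (\frac{3\pi}{4}, \pi]$, and together they pin down the segment, so sign_3 reproduces the $(-1)^k$ factor and sign_4 the $-2k$ offset of $\psi(\theta) = (-1)^k \cos(4\theta) - 2k$. A quick check (my own code, not from the repo):

```python
import numpy as np

# one representative angle per segment [k*pi/4, (k+1)*pi/4), k = 0..3
for k, theta in enumerate([0.1, 1.0, 2.0, 2.8]):
    c = np.cos(theta)
    sign_0 = np.sign(c)                      # +1 iff theta < pi/2, i.e. k in {0, 1}
    sign_3 = sign_0 * np.sign(2 * c**2 - 1)  # sign(cos(theta)) * sign(cos(2*theta))
    sign_4 = 2 * sign_0 + sign_3 - 3
    assert sign_3 == (-1) ** k               # the (-1)^k factor of psi
    assert sign_4 == -2 * k                  # the -2k offset of psi
```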


amirhfarzaneh commented Aug 24, 2018

Can someone please explain why the psi function has to be monotonically decreasing?
@wy1iu , @melgor
