In the paper, the weights are the solution to Equation (8), which minimizes the squared Frobenius norms of the weighted RFF covariance matrices for each pair of features, subject to the constraint that the weights form a probability distribution.
In the code, the weight_learner function appears to solve this problem by running gradient descent on a modified objective that combines the squared Frobenius norms of the weighted RFF covariance matrices with an Lp norm of the weight vector. What is the purpose of the Lp norm on the weight vector, given that the weights are already produced by a softmax over logits and are therefore a probability vector?
Is this intended to keep the logits from diverging to infinity? If that is the aim, why not regularize the magnitude of the logits directly?
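For reference, here is a minimal sketch of the objective as I understand it from the description above: softmax-parameterized weights, pairwise weighted-RFF covariance Frobenius norms, and the extra Lp term. All names (`rff`, `weighted_cov_frobenius`, `toy_weight_learner`) and hyperparameters are hypothetical and are not taken from the repo's `weight_learner`; this is only meant to pin down which term I am asking about.

```python
import torch

def rff(x, num_features=5, scale=1.0):
    """Random Fourier features for an (n, 1) feature column.
    Hypothetical helper; the repo's RFF construction may differ."""
    w = torch.randn(1, num_features) * scale
    b = torch.rand(1, num_features) * 2 * torch.pi
    return torch.sqrt(torch.tensor(2.0 / num_features)) * torch.cos(x @ w + b)

def weighted_cov_frobenius(u, v, w):
    """Squared Frobenius norm of the weighted cross-covariance of two RFF maps,
    where w is a probability vector over the n samples."""
    mu_u = (w[:, None] * u).sum(0, keepdim=True)
    mu_v = (w[:, None] * v).sum(0, keepdim=True)
    cov = (w[:, None] * (u - mu_u)).t() @ (v - mu_v)
    return (cov ** 2).sum()

def toy_weight_learner(features, steps=100, lr=0.1, reg=1.0, p=2):
    """Gradient-descent sketch of the modified objective described above:
    sum of pairwise weighted-RFF covariance Frobenius norms plus an Lp norm
    of the softmax weights (the term this issue is asking about)."""
    n, d = features.shape
    logits = torch.zeros(n, requires_grad=True)
    opt = torch.optim.SGD([logits], lr=lr)
    rffs = [rff(features[:, j:j + 1]) for j in range(d)]
    for _ in range(steps):
        w = torch.softmax(logits, dim=0)          # already a probability vector
        loss = sum(weighted_cov_frobenius(rffs[i], rffs[j], w)
                   for i in range(d) for j in range(i + 1, d))
        loss = loss + reg * torch.norm(w, p=p)    # the extra Lp regularizer
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits, dim=0).detach()
```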