\section{The square loss exemplar machine}\label{lsesvm}
In this section, we revisit the exemplar SVM model as presented by \cite{Efros11} and introduce its square loss version.
\subsection{Exemplar SVMs} \label{esvm}
We are given base features in $\RR^d$ at training time: one positive example $x_0$ in $\RR^d$ and a set of negative examples $X = [x_1, x_2,\ldots,x_n]$ in $\RR^{d\times n}$, each column of $X$ representing one example by a vector in $\RR^d$.
Any feature vector representation can be used.
We are also given a loss function $l:\{-1, 1\}\times \RR\rightarrow\RR^+$. Learning an E-SVM from these examples amounts to minimizing the function
\begin{equation}
J(\omega, \nu) = \theta \ l(1, \omega^Tx_0+\nu) +\dfrac{1}{n}\sumi l(-1, \omega^T x_i+\nu)+\dfrac{\lambda}{2}\|\omega\|^2, \label{eq:first}
\end{equation}
w.r.t. $\omega$ in $\RR^d$ and $\nu$ in $\RR$. In Equation (\ref{eq:first}), $\lambda$ and $\theta$ are respectively a regularization parameter on $\omega$ and a positive scalar adjusting the weight of the positive exemplar.
%\footnote{We could also acknowledge a parameter $\theta$ other than $\frac{1}{n}$ as a regularization to the error of each negative example. But this parameters seems to be less important to cross-validate. Also, setting $\theta=\frac{1}{n}$ simplify Equation (\ref{omega:solution}) and allows the use of Woodbury identity.}
The exemplar SVM of $x_0$ with respect to $X$ is the classifier $\omega^\star(x_0,X)$ that minimizes the loss function $J$:
\begin{equation}
\big(\omega^\star(x_0, X), \nu^\star(x_0, X)\big) = \underset{(\omega,\nu)\in\RR^d\times\RR}{\argmin} \ J(\omega, \nu). \label{omega:first}
\end{equation}
To shorten the notation, we refer to these solutions as $\omega^\star$ and $\nu^\star$ when their arguments are implicit.
Applications of the E-SVM use the hinge loss $l(y,\hat{y}) = \max(0, 1-y\hat{y})$, for which $J$ is convex. The solution of Equation (\ref{omega:first}) can thus be found by stochastic gradient descent.
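For concreteness, a minimal NumPy sketch of this hinge-loss formulation, minimized here by full-batch subgradient descent rather than its stochastic variant, could look as follows (the function name, step size and iteration count are illustrative choices, not part of the original method):
\begin{verbatim}
import numpy as np

def esvm_hinge_gd(x0, X, theta=1.0, lam=1e-2, n_iters=500, lr=1e-2):
    # x0: positive exemplar, shape (d,); X: negatives, shape (d, n).
    d, n = X.shape
    w, nu = np.zeros(d), 0.0
    for _ in range(n_iters):
        # Gradient of the regularizer (lambda/2)*||w||^2.
        g_w, g_nu = lam * w, 0.0
        # Positive term: theta * max(0, 1 - (w^T x0 + nu)).
        if 1.0 - (w @ x0 + nu) > 0.0:
            g_w -= theta * x0
            g_nu -= theta
        # Negative terms: (1/n) * sum_i max(0, 1 + w^T x_i + nu).
        active = (1.0 + X.T @ w + nu) > 0.0
        g_w += X[:, active].sum(axis=1) / n
        g_nu += active.sum() / n
        w, nu = w - lr * g_w, nu - lr * g_nu
    return w, nu
\end{verbatim}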
\subsection{Square loss function}\label{SLEM}
Now, let us study the same learning problem for the square-loss function $l(y,\hat{y}) = \frac{1}{2}(y-\hat{y})^2$. As in the hinge-loss case, minimizing Equation (\ref{eq:first}) is a convex problem. It is now, however, a ridge regression problem, whose solution can be written in closed form as
\begin{align}
\begin{cases}
\vspace{3 mm}
\omega^\star &= \dfrac{2\theta}{\theta+1}U^{-1}(x_0-\mu), \\
%\vspace{3 mm}
\nu^\star &= \dfrac{\theta-1}{\theta+1}-\dfrac{1}{\theta+1}(\theta x_0+\mu)^T\omega^\star,
\end{cases}
\label{omega:solution}
\end{align}
%\begin{align}
%\nu^\star &= \dfrac{\theta-1}{\theta+1}-\dfrac{1}{\theta+1}(\theta x_0+\mu)^T\omega^\star, \label{eq:nustar}\\
%\omega^\star &= \dfrac{2\theta}{\theta+1}U^{-1}(x_0-\mu) \label{eq:wstar}
%\end{align}
where:
\begin{align}
\begin{cases}
&\mu = \frac{1}{n}\sum_{i=1}^n x_i,\\
&U = \dfrac{1}{n}XX^T-\mu\mu^T+\dfrac{\theta}{\theta+1}(x_0-\mu)(x_0-\mu)^T+\lambda\mathrm{Id}_d. \label{eq:U}
\end{cases}
\end{align}
Equation (\ref{omega:solution}) shows how to solve (\ref{eq:first}) in closed form to obtain the SLEM vector $\omega^\star(x_0,X)$ of $x_0$.
Replacing the hinge loss by the square loss thus yields a more compact solution, though not necessarily a more efficient one.
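The closed form of Equation (\ref{omega:solution}) is straightforward to implement; a minimal NumPy sketch (with a hypothetical function name) is:
\begin{verbatim}
import numpy as np

def slem_closed_form(x0, X, theta=1.0, lam=1e-2):
    # Closed-form SLEM: exemplar x0 of shape (d,), negatives X of shape (d, n).
    d, n = X.shape
    mu = X.mean(axis=1)
    delta = x0 - mu
    # Matrix U of the closed-form solution.
    U = ((X @ X.T) / n - np.outer(mu, mu)
         + (theta / (theta + 1.0)) * np.outer(delta, delta)
         + lam * np.eye(d))
    w = (2.0 * theta / (theta + 1.0)) * np.linalg.solve(U, delta)
    nu = (theta - 1.0) / (theta + 1.0) - (theta * x0 + mu) @ w / (theta + 1.0)
    return w, nu
\end{verbatim}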
\subsection{Woodbury identity and matching scores}
Let us define $A = \frac{1}{n}XX^T-\mu\mu^T +\lambda\mathrm{Id}_d$ and assume its inverse $A^{-1}$ is known; note that $A$ does not depend on the exemplar $x_0$, so $A^{-1}$ can be computed once for all exemplars.
%$A$ is the covariance matrix of the negative samples $X$\footnote{added of a small term proportional to the identity matrix to insure $A$ is positive-definite.}. Let us assume its inverse $A^{-1}$ is known.
Matrix $U$ now reads $U = A + \frac{\theta}{\theta+1}\delta\delta^T$, where $\delta=x_0-\mu$. The Woodbury identity \cite{woodbury} gives us
\begin{equation}
U^{-1} = A^{-1} -\dfrac{\theta}{\theta\delta^TA^{-1}\delta+ \theta+1}A^{-1}\delta\delta^T A^{-1}. \label{invU}
\end{equation}
Substituting (\ref{invU}) in (\ref{omega:solution}) yields
\begin{equation}
\begin{split}
\omega^\star &= \dfrac{2\theta}{\theta+1}U^{-1}\delta \\
&= \dfrac{2\theta}{\theta+1}\left[A^{-1}\delta - \dfrac{\theta}{\theta\delta^TA^{-1}\delta+ \theta+1} A^{-1}\delta\,(\delta^TA^{-1}\delta)\right]\\
&= \dfrac{2\theta}{\theta\delta^TA^{-1}\delta+ \theta+1} A^{-1}\delta.\label{Wood:omega}
\end{split}
\end{equation}
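In practice, obtaining $\omega^\star$ for a new exemplar then only requires products with the precomputed $A^{-1}$. A minimal sketch of this update (hypothetical function name, matching Equation (\ref{Wood:omega})):
\begin{verbatim}
import numpy as np

def slem_from_A_inv(x0, mu, A_inv, theta=1.0):
    # omega* from a precomputed A^{-1}, via the rank-one (Woodbury) update.
    delta = x0 - mu
    A_inv_delta = A_inv @ delta
    scale = 2.0 * theta / (theta * (delta @ A_inv_delta) + theta + 1.0)
    return scale * A_inv_delta
\end{verbatim}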
Note that the positive sample weight $\theta$ does not influence the direction of the optimal vector $\omega^\star$, only its norm. This means that if search and ranking are based on the normalized feature $\frac{1}{\|\omega^\star\|}\omega^\star$, $\theta$ does not influence the matching score of the SLEM vectors of two different images.
%Indeed, if $\omega$ and $\omega'$ are the $d$-dimensional SLEM vectors of $x$ and $x'$, respectively, their cosine similarity reads:
%we can denote $s$ the matching score scalar function defined in $\RR^d\times \RR^d$, which is given by
%\begin{equation}
%s(\omega, \omega') = \dfrac{\omega^T \omega'}{\|\omega\| \|\omega'\|} = \frac{(x-\mu)^TA^{-2}(x'-\mu)}{\|A^{-1}(x-\mu)\|\|A^{-1}(x'-\mu)\|},\label{match:score}
%\end{equation}
%which does not depend on the value of $\theta$.
The weight of the positive sample can therefore be fixed to any positive value without changing the matching scores, which sets the SLEM apart from the E-SVM, where this parameter must be cross-validated \cite{Efros11,ZePe15}.
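As a quick numerical illustration of this $\theta$-independence, one can compare the cosine similarity of the normalized SLEM vectors of two samples for two different values of $\theta$, reusing the \texttt{slem\_closed\_form} sketch above (the synthetic data below is purely illustrative):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 500
X = rng.normal(size=(d, n))                  # synthetic negatives
x, x_prime = rng.normal(size=d), rng.normal(size=d)

def matching_score(theta):
    w, _ = slem_closed_form(x, X, theta=theta, lam=1e-2)
    w_prime, _ = slem_closed_form(x_prime, X, theta=theta, lam=1e-2)
    return (w @ w_prime) / (np.linalg.norm(w) * np.linalg.norm(w_prime))

print(matching_score(0.5), matching_score(10.0))   # equal up to rounding
\end{verbatim}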
\subsection{LDA and SLEM}
Let us now revisit the SLEM problem and suppose that we have multiple positive samples. It can be shown that, in this case, the corresponding linear classifier of Equation (\ref{eq:first}) for the square loss is also given by
(\ref{omega:solution}), where $x_0$ now denotes the center of mass
of the positive samples, {\em if} these samples have the {\em same} covariance matrix $\Sigma$ as the negative samples $X$.
This equal-covariance assumption is of course quite restrictive, and probably unrealistic in general. It is interesting to note, however, that this is exactly the assumption made by linear discriminant analysis (LDA). As shown in~\cite{Hastie2009} for example, LDA can be seen as a (non-regularized) linear classifier with decision function $\omega'^T z+ b'$, where $z$ is a sample in
$\RR^d$, and
\begin{equation}
\left\{\begin{array}{l}
\displaystyle \omega'=\Sigma^{-1}(x_0-\mu),\\
\displaystyle b'=-\frac{1}{2}(x_0+\mu)^T \omega'.
\end{array}\right.
\label{eq:lda}
\end{equation}
Note that when $\lambda=0$ in SLEM (no regularization) and $\theta=1$ (a choice that entails no loss of generality, since $\theta$ only affects the norm of $\omega^\star$), we have $U=\Sigma+\frac{1}{2}(x_0-\mu)(x_0-\mu)^T$ and $U\omega^\star=x_0-\mu$, so that
\begin{align}
\Sigma \omega^\star & = U\omega^\star -\frac{1}{2}[(x_0-\mu)^T \omega^\star] (x_0-\mu) =
\left[1-\frac{1}{2}(x_0-\mu)^T \omega^\star\right](x_0-\mu) \\
&=\left[1-\frac{1}{2}(x_0-\mu)^T \omega^\star\right]\Sigma \omega',
\end{align}
and thus $\omega^\star$ and $\omega'$ have the same direction. In other words, the SLEM can be seen as a regularized generalization of LDA, which it recovers (up to scale) when the regularization parameter $\lambda$ is zero.
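This equivalence is easy to check numerically on synthetic data for which the empirical covariance $\Sigma$ of the negatives is invertible, reusing the \texttt{slem\_closed\_form} sketch above with $\lambda=0$ (data and names below are purely illustrative):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 1000                               # n >> d so Sigma is invertible
X = rng.normal(size=(d, n))                  # synthetic negatives
x0 = rng.normal(size=d) + 1.0                # synthetic positive exemplar

mu = X.mean(axis=1)
Sigma = (X @ X.T) / n - np.outer(mu, mu)
w_lda = np.linalg.solve(Sigma, x0 - mu)      # LDA direction
w_slem, _ = slem_closed_form(x0, X, theta=1.0, lam=0.0)

cos = (w_lda @ w_slem) / (np.linalg.norm(w_lda) * np.linalg.norm(w_slem))
print(cos)                                   # ~1.0: same direction
\end{verbatim}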
A similar observation was made in \cite{Koba15}.
Many interesting properties of LDA have been rediscovered for classification tasks \cite{GMPD12,HMR12} and, more recently, for image retrieval with non-embedded image encoders \cite{babenko15}.
%%% Local Variables:
%%% TeX-master: "main_eccv"
%%% End: