\section{Experimental Evaluation}
\label{eval}
%% Low-rank, mAP vs. r', n=50e3
%% Temps,
%% Spatial Pyramid.
\subsection{Datasets and evaluation protocol} \label{eval:protocol}
\subsubsection{Holidays and Oxford5k.} We use two standard image retrieval datasets in our experiments. The first is the INRIA \emph{Holidays} dataset \cite{holidays}, which consists of 1491 images divided into 500 groups of matching images. %If the dataset of an experiment is not mentioned in what follows this means, by default, that it is performed on \emph{Holidays}.
The second is the \emph{Oxford5k} dataset \cite{oxford}, which consists of 5062 images divided into 55 groups of matching images.
In both datasets, each group contains one query image, and the other images in the group are the only correct answers to that query. We use the mAP computation procedures provided with each dataset to evaluate performance.
\subsubsection{Flickr1M.} Besides these two datasets, we also use 1 million random distractor images downloaded from \emph{Flickr} to carry out large-scale experiments, following the standard protocol \cite{holidays,Delhumeau2013,ZePe15}.
%For each query image, we calculate its similarity to all other images in the database and rank them, in decreasing order.
%The average precision of a group is derived from the ranking of the images of the group for the similarity with the corresponding query image.
%The final mean average precision (mAP) for a dataset is the mean of the average precision over all its groups.
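For reference, a minimal NumPy sketch of this mAP computation is given below; in practice we rely on the official per-dataset evaluation scripts, so the sketch is purely illustrative.
\begin{verbatim}
import numpy as np

def mean_average_precision(S, relevant):
    # S: (q, p) similarities between the q query images
    #    and all p database images.
    # relevant: list of q sets of correct database indices.
    aps = []
    for i, rel in enumerate(relevant):
        order = np.argsort(-S[i])  # rank by decreasing similarity
        hits, ap = 0, 0.0
        for rank, idx in enumerate(order, start=1):
            if idx in rel:         # precision at each relevant hit
                hits += 1
                ap += hits / rank
        aps.append(ap / max(len(rel), 1))
    return float(np.mean(aps))
\end{verbatim}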
\subsubsection{Negative pool.} As a pool of negative images for building the ESVM and SLEM representations, we use the Flickr100k collection \cite{oxford}, composed of $10^5$ random Flickr images disjoint from those in Flickr1M. For the full-rank decomposition, we use only a subset of it, containing between 6000 and 15000 images:
as stated in Section \ref{low-rank} and further discussed in Section \ref{time-scale}, a full-rank decomposition does not scale well to larger numbers of negative samples. For the low-rank decomposition, we use all 100000 images.
%REMOVED
%At evaluation time, for a dataset that consists of $p$ images and $q$ query images, we calculate its $p\times q$ \emph{similarity matrix} $S$, where each of its $q$ columns is the matching scores of the query image with all the $p$ images.
\subsection{Which kernel to choose?}
We test four different kernels. The first three -- linear, Gaussian, and polynomial -- are defined below:
\begin{align}
&k_{linear}(x,y) = x^Ty, \label{k:lin}\\
&k_{Gauss}(x,y) = \exp(-\gamma\|x-y \|^2), \label{k:rbf}\\
&k_{poly}(x,y) = x^Ty+\lambda(x^Ty)^2. \label{k:poly}
%&k_{SPM}(x,y) = \dfrac{1}{2^L}\mathcal{I}(H^0_x, H^0_y) +\sum_{l=1}^L\dfrac{1}{2^{L-l+1}}\mathcal{I}(H^{l,\gamma}_x, H^{l,\gamma}_y). \label{k:spm}
\end{align}
%\JZ{We give results with spp, we should add it. Makes it clear that we use a wide range of kernels.}
%\JZ{The text below could be removed. But some brief text for SPM is a good idea, as it is less standard. I think SPP and SPM are not the same thing... We should use SPM... SPP is a finite dimensional feature map.}
% \textbf{Linear SLEM} The linear kernel of Equation (\ref{k:lin}) is the first default choice and equivalent to the non-kernelized version of SLEM.
% In the remaining of this paper, we reference to linear SLEM when we use the non-kernelized SLEM.
% \textbf{Gaussian SLEM} The radial basis function kernel of Equation (\ref{k:rbf}) is a
% well known reproducing kernel, used for classification with support vector machines.
% \textbf{Polynomial SLEM} The polynomial kernel of Equation (\ref{k:poly}) is a reproducing kernel often used in natural language processing.
%\textbf{SPM SLEM} The spatial pyramid matching kernel of $L+1$ levels in Equation \ref{k:spm} take as input a set of local descriptors and its location in pyramidal bins \cite{spk}.
Note that the Gaussian and polynomial kernels each require a scalar parameter ($\gamma$ and $\lambda$, respectively), which we set by cross-validation.
Besides the three kernels described above, we also carry out experiments using the spatial pyramid matching (SPM) kernel.
%The SPM kernel of $L+1$ levels in Equation \eqref{k:spm} takes as input a set of local descriptors and computes a similarity score that takes into account their geometrical layout \cite{spk}.\JZ{Need to define $H_x^{l,\gamma}$ ...}
%its location in pyramidal bins \cite{spk}.
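For concreteness, a minimal NumPy sketch of the three vectorial kernels above is given next; the parameter defaults are placeholders, not the cross-validated values.
\begin{verbatim}
import numpy as np

def k_linear(x, y):
    return x @ y

def k_gauss(x, y, gamma=1.0):
    # gamma is set by cross-validation; 1.0 is a placeholder
    return np.exp(-gamma * np.sum((x - y) ** 2))

def k_poly(x, y, lam=1.0):
    # lambda is set by cross-validation; 1.0 is a placeholder
    s = x @ y
    return s + lam * s ** 2
\end{verbatim}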
\subsection{Base visual features}
We test our feature encoder on three different base features $x\in\mathbb{R}^d$: the hand-crafted VLAD image representation \cite{Delhumeau2013} and two learned features derived from the activation coefficients of deep convolutional neural networks \cite{Krizhevsky2012,babenko15}.
We use the same VLAD variant of \cite{Delhumeau2013} as \cite{ZePe15}, which relies on densely extracted RootSIFT \cite{3things} local descriptors, per-cluster PCA-based rotations, and root normalization. Like \cite{ZePe15}, we use $64$ clusters, for a final feature size of $8192$.
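As an illustration, a schematic NumPy version of this VLAD aggregation follows; this is a sketch rather than our actual implementation, and the codebook and per-cluster PCA rotations $\Phi_k^T$ are assumed pre-trained.
\begin{verbatim}
import numpy as np

def vlad(descs, codebook, rotations):
    # descs: RootSIFT descriptors (N, 128)
    # codebook: cluster centers (K, 128)
    # rotations: per-cluster PCA rotations (K, 128, 128)
    K, d = codebook.shape
    # hard-assign each descriptor to its nearest codeword
    assign = np.argmin(
        ((descs[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
    out = np.zeros(K * d)
    for k in range(K):
        res = descs[assign == k] - codebook[k]
        if len(res):
            res /= np.linalg.norm(res, axis=1, keepdims=True) + 1e-12
            out[k * d:(k + 1) * d] = rotations[k] @ res.sum(axis=0)
    # power normalization sign(v)|v|^0.5, then l2 normalization
    out = np.sign(out) * np.sqrt(np.abs(out))
    return out / (np.linalg.norm(out) + 1e-12)
\end{verbatim}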
The first CNN-based feature consists of the activation coefficients of the previous-to-last layer of the architecture of \cite{Krizhevsky2012}, based on a publicly available pre-trained model \cite{jia2014caffe}. This is the same feature considered in \cite{ZePe15}, first used as an image retrieval feature in \cite{Sharif}. We refer to it as FC-CNN.
The second CNN-based feature, the SPoC feature of \cite{babenko15}, is tailored specifically to image retrieval. It consists of a spatially weighted sum-pooling of the activations of the last convolutional layer of a 19-layer CNN \cite{Simonyan2014}.
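In schematic form, the SPoC pooling can be sketched as follows, using the Gaussian center prior and parameter choices of \cite{babenko15}; the activation tensor layout is our assumption, and the PCA+whitening post-processing of \cite{babenko15} is omitted.
\begin{verbatim}
import numpy as np

def spoc(acts, sigma=None):
    # acts: last conv-layer activations, shape (W, H, D),
    # e.g. (37, 37, 512) for 586x586 input images.
    W, H, D = acts.shape
    if sigma is None:
        sigma = H / 3.0  # sigma = H/3
    w = np.arange(1, W + 1)[:, None]
    h = np.arange(1, H + 1)[None, :]
    # Gaussian center prior weighting each spatial position
    alpha = np.exp(-((w - W / 2.0) ** 2 + (h - H / 2.0) ** 2)
                   / (2.0 * sigma ** 2))
    phi = (acts * alpha[:, :, None]).sum(axis=(0, 1))
    return phi / np.linalg.norm(phi)
\end{verbatim}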
%REMOVED
% We revisit the VLAD feature presented in \cite{VLAD} as an example of a hand-crafted representation. First we extract a set $\mathcal{F}$ of local descriptors of the image $I$. We use the 128 dimension RootSIFT \cite{3things} descriptors, extracted densely.
% Then, we hard-assign each descriptor $f$ in $\mathcal{F}$ to the closest among $K$ pre-trained codewords $c_k$, $k\in\{1\cdots K\}$,
% %one of $K$ set $\mathcal{C}_k$ of descriptors associated to codewords $\{c_k\}_{1\leq k\leq K}$,
% and map $f$ to a $\RR^{128K}$ vector
% \begin{equation}
% \phi^{VL}_1(f) = \left[0 \cdots 0\quad \Phi_k^T\frac{(f-c_k)}{\|f-c_k\|} \quad 0 \cdots 0\right],
% \end{equation}
% where $\Phi_k$ is a $128\times 128$ PCA matrix learned on training features mapped to $k$-the codeword.
% %associated to descriptors in $\mathcal{C}_k$.
% The final VLAD representation is the power-normalization and $l_2$ normalization of the sum-pooling of $\phi^{VL}_1$:
% \begin{equation}
% \phi^{VL}_2(I) \propto \mathrm{power}\big(\sum_{f\in \mathcal{F}}\phi^{VL}_1(f)\big),~\|\phi^{VL}_2(I)\|=1,
% \end{equation}
% with scalar power normalization $\mathrm{power}(v)=\mathrm{sign}(v)|v|^{0.5}$ applied component-wise.
% In experiments, we use $K=64$ codewords learned on images from Flickr. Our VLAD representation has $d=8192$ dimensions.
%REMOVED
% Convolutional features obtained from very deep convolutional neural networks (CNNs) have been shown to work as good local descriptors for matching \cite{SimonZisser15}. The SPoC representation \cite{babenko15} is a weighted sum-pooling of the activations of the last convolutional layer of a 19-layer CNN. For a given input image $I$, the activations in this layer are organized over a $W\times H$ spatial grid and over $d$ channels. Each position $(w,h)\in \{1 \cdots W\}\times \{1\cdots H\}$ can thus be equipped with a descriptor $f_{h,w}(I)\in\RR^D$.
% %Indeed, if our last convolutional layer has $D$ neurons, and each neuron a $W\times H$ map of activations of this neurons to the image $I$,
% %each pair $(w,h)$ with $w$ in $\{1,2,..., W\}$ and $h$ in $\{1,2,...,H\}$ can be associated to a descriptor $f_{(h,w)}$ in $\RR^D$ of the responses of each neuron at coordinate $(w,h)$ of the maps.
% We then sum-pool these descriptors, weighted accordingly to their distance to the center of the image:
% \begin{equation}
% \phi^{SPoC}(I) = \sum_{w=1}^W\sum_{h=1}^H \alpha_{w,h}f_{w,h}(I),
% \end{equation}
% where
% \begin{equation}
% \alpha_{w,h} = \exp \left(-\dfrac{(w-W/2)^2+(h-H/2)^2}{2\sigma^2}\right).
% \end{equation}
% In our experiments, we follow the implementation details of \cite{babenko15}: We resize all images to $586\times 586$ pixels before feeding them to the network. The last convolutional layer has $d=512$ channels with activation maps of size $(W,H)=(37,37)$. We also set $\sigma=\frac{H}{3}$.
% Finally, we also test a more convencional CNN feature, less deep and using two fully connected layers after the convolutional layers.
% Our final representation is a $4096$ non-negative feature. We based out implementation on CAFFE \cite{jia2014caffe}.
% \subsection{Implementation details}
% We cross validate $\gamma$ and $\lambda$ for Poly and Gaussian SLEM.
%\subsection{Full rank results}
\subsection{Comparison to other enhancement methods using various base features}
In Table \ref{fullrank:results}, we test our method using the three recent base features discussed above (VLAD \cite{Delhumeau2013}, SPoC \cite{babenko15}, and FC-CNN \cite{Sharif}) on two datasets (six combinations in total). We further compare against two feature enhancement strategies: {\it (i)} the ESVM method of \cite{ZePe15}, and {\it (ii)} the PCA+whitening strategy shown in \cite{babenko15} to drastically improve the retrieval results for SPoC. In our experiments, PCA dimensionality reduction worsens the results, and hence we use the full PCA rotation matrix in the PCA+whitening strategy.
%Hence we compare the improvements of PCA plus whitening, without compression, with our method.
Note that our method improves the performance of the base feature for all three features tested, on both the Holidays and Oxford5k datasets. It further outperforms both the ESVM and PCA+whitening enhancement strategies in 5 out of 6 cases, by up to 2.8 mAP points (for SPoC CNN on Oxford5k). Although nonlinear SLEM achieves the highest performance, linear SLEM nonetheless outperforms both competing enhancement methods in 4 out of 6 cases shown in the table.
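For completeness, the PCA+whitening baseline can be sketched as follows; this is a minimal version assuming scikit-learn, keeping the full rotation (no dimensionality reduction), which requires at least $d$ training features.
\begin{verbatim}
import numpy as np
from sklearn.decomposition import PCA

def pca_whiten(train_feats, feats):
    # learn a full-rank whitening rotation on held-out features,
    # apply it, and re-normalize each feature to unit l2 norm
    d = train_feats.shape[1]
    pca = PCA(n_components=d, whiten=True).fit(train_feats)
    out = pca.transform(feats)
    return out / np.linalg.norm(out, axis=1, keepdims=True)
\end{verbatim}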
%Linear SLEM performs similarly to ESVM despite being much more time efficient (see discussion in section \ref{time-scale}).
%Gaussian SLEM and Polynomial SLEM outperform all methods, for all datasets and all image representations.
%The results are presented in Table \ref{fullrank:results}.
\begin{table*}[t]
\begin{center}
\caption{Mean average precision (mAP) results for the INRIA Holidays and Oxford5k datasets, expressed as percentages. In this table, we present our results for VLAD \cite{Delhumeau2013}, sum-pooling of convolutional features (SPoC CNN) \cite{babenko15}, and activation coefficients from the previous-to-last CNN layer (FC CNN) of the architecture in \cite{Krizhevsky2012}.}
\setlength{\tabcolsep}{.2em}
%\begin{tabular}{|c|c|c|c|c|c|c|c|}
\small
\begin{tabular}{c@{\hskip 2em}ccc@{\hskip 2em}ccc}
\toprule
Dataset & \multicolumn{3}{c}{\textbf{Holidays}} & \multicolumn{3}{c}{\textbf{Oxford5k}}\\
\midrule
Method, features & VLAD & SPoC CNN & FC CNN & VLAD & SPoC CNN & FC CNN\\
\midrule
Base feature & 72.7 & 73.1 & 68.2 & 46.3 & 54.4 & 40.6\\
%Whitening & - & - & - & - & -\\
PCA+whitening & 69.4 & 77.5 & 69.2 & 50.9 & 63.7 & 45.0 \\
ESVM & \textbf{77.5} & 79.9 & 71.8 & 57.5 & 62.1 & 44.6\\
Linear SLEM & 78.0 & 78.3 & 72.1 & \textbf{59.3} & 64.1 & 45.5\\
Gaussian SLEM & 78.1 & 81.4 & \textbf{72.9} & 59.0 & \textbf{64.9} & 46.1\\
Poly SLEM & 78.1 & \textbf{82.0} & \textbf{72.9} & \textbf{59.3} & 64.8 & 46.0\\
\bottomrule
\end{tabular}
\end{center}
\label{fullrank:results}
\end{table*}
\subsection{Time scalability} \label{time-scale}
In Figure \ref{fullrank:results}, we evaluate \textit{(left)} the mAP performance and \textit{(right)} the time efficiency of (non-)linear SLEM as a function of the negative pool size $n$, comparing it to ESVM. To compute mAP for ESVM, we use $T=10^5$ iterations for all values of $n$, which favours its performance. Timings for ESVM are those reported in \cite{ZePe15}, but assuming $T=\tfrac{10}{6}n$ (the factor $10/6$ is the one used in \cite{ZePe15}).
For all features, we report only the timings related to ESVM or SLEM feature extraction. For ESVM and linear SLEM, this includes only the computation of the vector of weights $\omega^\star$. For nonlinear SLEM, it includes the computation of all components of Equation \eqref{eq:nl_sim} that involve the query vector $x_0$ (i.e., $\hat\beta$, $\beta_0$, $k(x_0, x_0^\prime)$, $u$, and $v$).
Note that linear SLEM enjoys a considerable computational advantage over ESVM, close to two orders of magnitude, while under-performing it by less than two mAP points. Nonlinear SLEM, on the other hand, uniformly outperforms ESVM in terms of mAP, but incurs a computational penalty for larger negative pool sizes. The increased complexity of the Gaussian and polynomial kernels is expected: storing and solving an $n\times n$ system does not scale to large numbers of negative samples $n$.
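To make the query-time cost of linear SLEM concrete, a schematic least-squares exemplar solve is sketched below. This is a simplified square-loss objective (unit labels, no bias term, uniform weighting), not the exact closed form derived earlier; note that, since the negative pool is fixed across queries, $X^\top X$ differs from a precomputable matrix only by the rank-one term $x_0x_0^\top$.
\begin{verbatim}
import numpy as np

def linear_slem(x0, X_neg, lam=1.0):
    # x0: query feature (d,); X_neg: negative pool (n, d)
    # label +1 for the query, -1 for every negative,
    # with a ridge penalty lam (simplified objective)
    X = np.vstack([x0, X_neg])
    y = np.concatenate([[1.0], -np.ones(len(X_neg))])
    A = X.T @ X + lam * np.eye(X.shape[1])  # d x d system
    w = np.linalg.solve(A, X.T @ y)
    return w / np.linalg.norm(w)
\end{verbatim}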
%In this section we compare the time efficiency of our method to that of E-SVM, as discuss which method to use accordingly with the number of positive and negative samples.
%REMOVED this bc now it does change with n...
%In Figure \ref{fullrank:results}, we see that Linear SLEM efficiency does not change with $n$.
%Indeed, if $d$ is the dimension of the base representation, $A$ is a $d\times d$ matrix for Linear SLEM, whereas for a full rank kernel, $A$ is a $n\times n$.
%The reason for the increased complexity for Gaussian and polynomial kernels follows since storing and solving a $n\times n$ system does not scale for large number of negative samples.
%But retrieval results in Figure \ref{fullrank:results} suggest we can benefit from larger sets of negative samples. %Linear SLEM and ESVM scales well, but do not perform so well.
\vspace{3 mm}
\input{figuren.tex}
%\subsection{Low-rank decomposition evaluation}
%\begin{figure}[!h]
%\centering
%\begin{subfigure}[b]{0.48\textwidth}
%\includegraphics[width=\textwidth]{linear_decomposition_nolog.png}
%\end{subfigure}
%\begin{subfigure}[b]{0.48\textwidth}
%\includegraphics[width=\textwidth]{rbf_decomposition_nolog.png}
%\end{subfigure}
%\caption{Comparison between full rank and low rank. In blue, mAP results for full rank SLEM. %In red, mAP results for low rank decomposition of SLEM, varying the rank $r'$.
%At the left, linear SLEM results;at the right, Gaussian SLEM results. In this experiment, $n=10281$.}
%\label{no.ker.vs.linear2}
%\end{figure}
\subsection{Spatial pyramid matching kernel}
\begin{table}[!t]
\centering
\caption{Results for the spatial pyramid matching kernel compared with a bag-of-visual-words (BoW) baseline built from the same local descriptors.}
%\begin{tabular}{|c|c|c|c|c|}
\setlength{\tabcolsep}{.5em}
\begin{tabular}{ccccc}
\toprule
Dataset & \multicolumn{2}{c}{\textbf{Holidays}} & \multicolumn{2}{c}{\textbf{Oxford5k}}\\
\midrule
Method, features & RootSIFT & CNN & RootSIFT & CNN \\
\midrule
Baseline (BoW) & 37.4 & 49.8 & 31.1 & 32.4 \\
SP SLEM & \textbf{66.9} & \textbf{70.2} & \textbf{42.4} & \textbf{45.0} \\
\bottomrule
\end{tabular}
\label{tab:spk}
\end{table}
All the kernel functions considered so far take as input image representations in a fixed-size vector space. In this section, we instead use the spatial pyramid kernel \cite{GrauDa05}, which takes as input a set of local descriptors together with their coordinates in the image.
We revisit the spatial pyramid matching scheme presented in \cite{spk}, which uses the generalized histogram intersection function.
Let $X$ and $Y$ be the sets of local descriptors of a pair of images.
Each local descriptor is assigned to one of $M$ codewords $\{c_1,c_2,\ldots,c_M\}$, so that $X=\bigcup_{1\leq m \leq M}X_m$ and $Y=\bigcup_{1\leq m \leq M}Y_m$, where $X_m$ and $Y_m$ are the sets of descriptors assigned to $c_m$.
For each level $l$ in $\{0,1,\ldots,L\}$, we divide the image into a $2^l\times 2^l$ grid and denote by $H^{(l,m)}_X$ and $H^{(l,m)}_Y$ the $4^l$-dimensional histograms of occurrences of the visual word $c_m$ in the bins of this grid.
The kernel $K_{sp}(X,Y)$ is then the sum, over all codewords, of the histogram intersections of $H^{(l,m)}_X$ and $H^{(l,m)}_Y$ for $l=0,\ldots,L$, weighted proportionally to the number of bins at each level; explicitly:
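\begin{align}
K_{sp}(X,Y) &= \sum_{m=1}^M \kappa^L(X_m, Y_m), \quad\text{where}\nonumber\\
\kappa^L(X_m,Y_m) &= \dfrac{1}{2^L}\,\mathcal{I}^{0}(X_m,Y_m) + \sum_{l=1}^L\dfrac{1}{2^{L-l+1}}\,\mathcal{I}^{l}(X_m,Y_m), \label{kappa}\\
\mathcal{I}^{l}(X_m,Y_m) &= \sum_{i=1}^{4^l}\min\big(H^{(l,m)}_X(i),\, H^{(l,m)}_Y(i)\big),\nonumber
\end{align}
with $H^{(l,m)}_X(i)$ denoting the count in the $i$-th bin of $H^{(l,m)}_X$.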
We test this kernel on RootSIFT descriptors extracted at multiple scales, as in \cite{spk}, learning the codewords with $k$-means. Inspired by the results of \cite{SPPCNN}, we also test it on the activations of the last convolutional layer of the network of \cite{SimonZisser15}, where each neuron of the last layer corresponds to a codeword. The results are presented in Table \ref{tab:spk}.
\subsection{Large-scale experiments}
\input{figuredistrac.tex}
In Figure \ref{vlad:oxford}, we evaluate the performance of the SLEM feature in large-scale image search by embedding the Holidays dataset within a large set of distractor images mined from Flickr, as per the standard large-scale evaluation protocol \cite{holidays,Delhumeau2013,ZePe15}. Since the best base feature in \cite{ZePe15} is VLAD, we use it in our evaluation, even though non-linear and linear SLEM perform comparably with it (see Table \ref{fullrank:results}). Note, however, that for larger numbers of distractor images the polynomial SLEM variant does give an advantage relative to linear SLEM. Overall, (non-)linear SLEM outperforms the ESVM feature by close to $2$ mAP points, regardless of the number of distractor images.
%% Local Variables:
%% TeX-master: "main_eccv"
%% End: