Date: March 9th, 2016
Title: Aggregating local descriptors into a compact image representation
Author: Hervé Jégou et al.
Novelties:
1. Proposes a new, sparser, and more structured vector representation for aggregating local descriptors.
2. Splits a large vector into subvectors to improve search performance.
Contributions:
This paper introduces a better approach to the similar-image search problem.
To judge whether a method is good, it considers three constraints: search accuracy, search efficiency, and memory usage.
Technical Summary:
1. Image vector representation
The method is similar in spirit to the Fisher kernel approach; it proposes a vector representation called VLAD, short for "vector of locally aggregated descriptors."
For a codebook C = {c_1, ..., c_k} of k visual words (obtained as in BOF) and local descriptors x of d dimensions, the method represents an image in D = k×d dimensions. The components of the VLAD vector v are:

  v_{i,j} = Σ_{x : NN(x) = c_i} (x_j − c_{i,j})

where i = 1...k, j = 1...d, and NN(x) denotes the nearest visual word of x. In other words, each block of v accumulates the residuals of the descriptors assigned to visual word c_i. The vector v is then L2-normalized.
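The aggregation step can be sketched in NumPy as follows. This is a minimal illustrative implementation, not the authors' code; it assumes the codebook of visual words has already been learned (e.g. by k-means), and the function name is my own:

```python
import numpy as np

def vlad(descriptors, centroids):
    """Aggregate local descriptors into an L2-normalized VLAD vector.

    descriptors: (n, d) array of local descriptors for one image.
    centroids:   (k, d) codebook of visual words.
    Returns a D = k*d dimensional VLAD vector.
    """
    k, d = centroids.shape
    # Assign each descriptor to its nearest visual word.
    dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    v = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assign == i]
        if len(members):
            # Accumulate the residuals x - c_i for descriptors assigned to word i.
            v[i] = (members - centroids[i]).sum(axis=0)
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

Note that a visual word with no assigned descriptors simply contributes a zero block, which is one source of the sparsity visible in the figure.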
The figure in the paper shows images and their corresponding VLAD descriptors for k=16; blue bars denote positive values and red bars negative ones.
We can observe that the descriptors are sparse and structured, which suggests they can be encoded compactly.
2. From vectors to codes
There are two steps:
1) a projection (PCA) to reduce the dimensionality;
2) a quantization to index the vectors.
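The first step, the PCA projection, can be sketched as below. This is an illustrative version using an SVD of the centered data; the paper's exact training setup may differ, and the function name is my own:

```python
import numpy as np

def pca_project(X, d_out):
    """Reduce row vectors X of shape (n, D) to d_out dimensions via PCA.

    Returns the projected data, the (D, d_out) projection matrix, and the mean.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Right singular vectors of the centered data are the principal axes.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:d_out].T  # (D, d_out) projection onto the top principal components
    return Xc @ W, W, mean
```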
Quantizing the vectors turns exact search into approximate nearest-neighbor search; the paper uses the asymmetric distance computation (ADC) approach:

  NN_a(x) = a-argmin_i ||x − q(y_i)||²

where a means we want the a-th nearest neighbor, q(·) is the quantization function, x is the query vector, and the y_i are the database vectors.
Note that x is not quantized in this formula, so there is no approximation error on the query side.
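The asymmetry can be sketched with a single (non-product) quantizer: only the database vectors are stored as centroid indices, while the query stays un-quantized. A minimal NumPy illustration (names are my own):

```python
import numpy as np

def adc_search(query, codebook, codes, topk=5):
    """Asymmetric distance computation with a single quantizer.

    query:    (d,) un-quantized query vector x.
    codebook: (K, d) centroids of the quantizer q().
    codes:    (n,) stored centroid index q(y_i) for each database vector y_i.
    Returns the indices and distances of the topk approximate neighbors.
    """
    # Distance from the query to every centroid, computed once...
    d2 = ((codebook - query) ** 2).sum(axis=1)
    # ...then looked up per stored database code: no error on the query side.
    dists = d2[codes]
    order = np.argsort(dists)[:topk]
    return order, dists[order]
```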
Representing the vectors with a single quantizer at the required precision would need an enormous codebook, so the vector is split into m subvectors, each quantized separately, and q(x) is represented by the concatenation:

  q(x) = (q_1(x¹), ..., q_m(x^m))

The squared distance in the ADC formula then decomposes into a sum over the m subvectors, so each term can be read from a precomputed look-up table of squared distances, which greatly speeds up search.
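The look-up-table trick can be sketched as follows: build one table of query-subvector-to-centroid squared distances per subquantizer, then score every database vector by summing m table entries. An illustrative NumPy version (names and shapes are my own assumptions):

```python
import numpy as np

def pq_adc_distances(query, subcodebooks, codes):
    """Product-quantized ADC distances via per-subvector look-up tables.

    query:        (d,) query vector, split into m subvectors of d//m dims.
    subcodebooks: (m, K, d//m) one codebook per subquantizer q_j.
    codes:        (n, m) stored centroid index per database vector and subvector.
    Returns approximate squared distances ||x - q(y_i)||^2 for all i.
    """
    m, K, ds = subcodebooks.shape
    sub = query.reshape(m, ds)
    # Look-up tables: squared distance from each query subvector to each centroid.
    tables = ((subcodebooks - sub[:, None, :]) ** 2).sum(axis=2)  # (m, K)
    # The full squared distance decomposes into a sum of m table entries.
    return tables[np.arange(m), codes].sum(axis=1)
```

Each query thus costs m×K centroid distances up front, after which every database vector is scored with only m table look-ups and additions.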
Experiments:
Mainly on three datasets:
1. The INRIA Holidays dataset, 1491 holiday images.
2. The UKB dataset, 2250 images.
3. 10M images from Flickr.
The first two datasets show that VLAD outperforms the Fisher kernel representation at every dimensionality; the Flickr experiment shows that VLAD is significantly better at large scale as well.