Thursday, June 2, 2016

[ammai] Sequence to Sequence – Video to Text

Title: Sequence to Sequence – Video to Text

Author: Subhashini Venugopalan, et al.



Novelties:

This paper presents an approach to video captioning using a simple LSTM model.

Contributions:

The LSTM model can handle variable input and output lengths.
Its structure is very simple, yet it performs well.

Technical Summary:

LSTM is a popular method for sequence-to-sequence tasks such as speech recognition and machine translation.
They applied LSTM to video captioning, which can also be viewed as a sequence-to-sequence task.
Their architecture, S2VT, is as follows:
They used a two-layer LSTM with an encoding stage and a decoding stage:
The encoding stage uses RGB frame features extracted from fc7 and optical flow features from fc6 as input, and concatenates the output with padding to form the input for the decoding stage. The decoding stage starts with <BOS> and terminates with <EOS>.
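A tiny sketch of how the two stages share a single pass over the LSTM stack. The token names are my own shorthand; this only illustrates the padding scheme, not the actual network:

```python
# token names are my own shorthand, not fixed by the paper
PAD, BOS = "<pad>", "<BOS>"

def s2vt_io_pairs(frames, caption):
    """Return (frame_input, word_input) pairs, one per LSTM time step.

    Encoding stage: frame features enter the first LSTM layer while
    the second layer sees only padding. Decoding stage: the frame
    input is padded and the second layer is fed <BOS> followed by the
    caption words (the model is trained to emit the caption shifted
    by one step, ending with <EOS>).
    """
    pairs = [(f, PAD) for f in frames]            # encoding stage
    pairs += [(PAD, w) for w in [BOS] + caption]  # decoding stage
    return pairs

pairs = s2vt_io_pairs(["f1", "f2", "f3"], ["a", "man", "runs"])
```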

Experiments:

The experiments are on MSVD (Microsoft Video Description Corpus), MPII-MD (MPII Movie Description Corpus), and M-VAD (Montreal Video Annotation Dataset), evaluated with METEOR (Metric for Evaluation of Translation with Explicit Ordering).


Wednesday, May 25, 2016

[ammai] Deep neural networks for acoustic modeling in speech recognition

Title: Deep neural networks for acoustic modeling in speech recognition

Author: Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury



Novelties:

This paper applies deep neural network architectures to acoustic modeling.

Contributions:

The standard approach to speech recognition mostly uses hidden Markov models (HMMs) to capture temporal information.
Gaussian mixture models (GMMs) are used to model how well each HMM state fits a frame of acoustic input; the combination is known as the GMM-HMM system.
This paper proposes a method that combines deep belief nets with HMMs, and shows the DNN-HMM system is better than the GMM-HMM system in many aspects.

Technical Summary:

The restricted Boltzmann machine (RBM) contains a visible layer and a stochastic binary hidden layer, connected by undirected connections. After training the current RBM, its hidden-layer activations serve as the input data for the next RBM.
After that, the stack of RBMs can be viewed as a deep belief net (DBN) by fixing the direction of the undirected connections.
The final step is to add a softmax output layer on top.
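The greedy layer-wise stacking above can be sketched in a few lines. This is a minimal numpy sketch of one-step contrastive divergence (CD-1) with biases omitted for brevity; it illustrates the idea rather than the paper's full training recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """CD-1 training for a binary RBM (biases omitted).
    Returns the weight matrix and a function mapping visible
    activations to hidden probabilities (the next layer's input)."""
    n_visible = data.shape[1]
    W = rng.normal(0, 0.01, (n_visible, n_hidden))
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(v0 @ W)                        # positive phase
        h0_sample = (rng.random(h0.shape) < h0) * 1.0
        v1 = sigmoid(h0_sample @ W.T)               # reconstruction
        h1 = sigmoid(v1 @ W)                        # negative phase
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)
    return W, lambda v: sigmoid(v @ W)

# Greedy stacking: each RBM's hidden probabilities become the
# training data for the next RBM, forming the DBN.
data = (rng.random((32, 20)) < 0.5) * 1.0
W1, up1 = train_rbm(data, 10)
W2, up2 = train_rbm(up1(data), 5)
```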

Experiments:

The testing is on the TIMIT dataset, which is small enough to try different methods and design details on.
The results show that DNN methods outperform the older methods on most aspects, despite being harder to parallelize.

Wednesday, May 18, 2016

[ammai] Text Understanding from Scratch

Title: Text Understanding from Scratch

Author: Xiang Zhang, Yann LeCun



Novelties:

This paper applies an image-style deep learning method to text understanding.

Contributions:

This paper introduces a character-based CNN for text classification from scratch.
The method is capable of classifying text in different languages and web posts.
The method needs no knowledge about the syntactic or semantic structure of a language.

Technical Summary:

The main idea is to treat a sentence as an image.
The key module is a standard CNN: a temporal convolutional module with 1-D kernels.
There are strides, max-pooling, and ReLUs, which are common in image tasks.

1. Quantization
To treat a sentence as an image, they convert it into a 1-by-l image with m channels, where m is the alphabet size (70 in this paper). That is, each character is quantized into an m-dimensional vector with 1-of-m encoding.
They quantize the characters in backward order, similar to common practice in LSTM methods.
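The quantization step is easy to sketch. The alphabet below is illustrative (the paper's 70-character set is similar but includes a few more symbols such as newline); out-of-alphabet characters and padding become all-zero columns:

```python
import numpy as np

# illustrative alphabet; not exactly the paper's 70-character set
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}"

def quantize(text, length):
    """1-of-m encode `text` into an m x length matrix, reading the
    characters in backward order as the paper does. Characters not
    in the alphabet (and padding) become all-zero columns."""
    m = len(ALPHABET)
    out = np.zeros((m, length))
    for col, ch in enumerate(reversed(text[:length])):
        idx = ALPHABET.find(ch)
        if idx >= 0:
            out[idx, col] = 1.0
    return out

q = quantize("hello", 8)
```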

2. Model Design
They design two networks, each with 6 convolutional layers and 3 fully-connected layers; the difference between them is the number of hidden units and frame sizes: one uses 1024 and the other 256.
The model is shown above.

3. Data Augmentation
They use a thesaurus for data augmentation. That is, each word has a probability, drawn from a geometric distribution, of being replaced by one of its synonyms. The synonyms are ranked, and one is picked with a probability based on another geometric distribution.
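A toy version of this augmentation scheme, with a hypothetical two-entry thesaurus (the actual thesaurus, ranking, and distribution parameters are the paper's, not shown here):

```python
import random

# toy thesaurus; entries and their ranking are hypothetical
THESAURUS = {
    "big": ["large", "huge", "giant"],
    "fast": ["quick", "rapid", "speedy"],
}

def augment(words, p=0.5, q=0.5, seed=0):
    """Replace words with synonyms. Whether a word is replaced and
    which synonym rank is chosen both follow (truncated) geometric
    distributions, echoing the paper's scheme."""
    rng = random.Random(seed)
    out = []
    for w in words:
        syns = THESAURUS.get(w)
        if syns and rng.random() < p:           # replace with prob ~ p
            rank = 0
            while rank + 1 < len(syns) and rng.random() < q:
                rank += 1                       # deeper rank with prob ~ q
            out.append(syns[rank])
        else:
            out.append(w)
    return out

aug = augment(["the", "big", "dog", "runs", "fast"])
```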

Experiments:

1. DBpedia Ontology Classification
They pick 14 non-overlapping classes (company, educational institution, artist, ...) and choose 40000 training samples and 5000 testing samples for each. They concatenate the title and the contents, and truncate the result to 1014 characters. They test different models, with and without the thesaurus, and compare with bag-of-words and word2vec baselines.

2. Amazon Review Sentiment Analysis
It contains 34,686,770 reviews from 6,643,669 users in the Amazon review dataset.
The reviews are labeled by the score the original user gave, from 1 to 5.
They ran both a full test and a polarity test on the dataset.
The full test classifies a review into one of the 5 scores; the polarity test labels reviews with scores 4-5 as 'positive' and scores 1-2 as 'negative,' so the classifier only has to decide whether a review is positive or not.

3. Sogou News Categorization
It contains Chinese news with 5 categories.
They use pypinyin and jieba to transliterate the Chinese characters into the Latin alphabet and classify the news the same way as in the above experiments.

4. Others
They also tried Yahoo! Answers and English News.

Wednesday, May 11, 2016

[ammai] DeepFace: Closing the Gap to Human-Level Performance in Face Verification

Title: DeepFace: Closing the Gap to Human-Level Performance in Face Verification

Author: Yaniv Taigman, et al.



Novelties:

This paper presents some new methods for face verification:
1. 3D model-based alignment
2. a large-capacity feedforward model

Contributions:

This paper introduces DeepFace which is close to human level accuracy on face recognition.

Technical Summary:

In modern face recognition, there are four stages: detect => align => represent => classify.
Two of these stages are significantly improved in this paper.

1. Face Alignment
They use a system with 3D models based on fiducial points to do face alignment.
They first detect 6 fiducial points on the 2D crop; this is (a) in the following figure.
They manually place 67 anchor points on the 3D shape (b) so the detected fiducial points can be linked with their 3D references. Then they use frontalization (g) to get a frontalized crop of the face.

2. Representation
The architecture is illustrated in the following figure.
They train a DNN for the multi-class face recognition task.
The input to C1 is a 3D-aligned, RGB-channel face image of size 152x152, shown at the leftmost part of the figure. It is followed by a max-pooling layer M2 and a convolutional layer C3. M2 enhances robustness against local translations. Note that there are no more pooling layers among the following convolutional layers, because they want to keep precise position information.
Then there are three locally connected layers, L4, L5, and L6, which preserve different local statistics.
The last part is two fully connected layers, F7 and F8, followed by a softmax to classify the input image.

They also test an end-to-end metric learning approach, a Siamese network: two input images are processed by two copies of the network described above.
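The Siamese comparison boils down to a weighted L1 distance between the two feature vectors, with the weights learned end to end. A minimal sketch (the features and weights here are random stand-ins, not the paper's learned ones):

```python
import numpy as np

def siamese_distance(f1, f2, alpha):
    """Weighted L1 distance between two face feature vectors.
    In the paper the weights alpha are learned; here they are given."""
    return float(np.sum(alpha * np.abs(f1 - f2)))

rng = np.random.default_rng(1)
f_a, f_b = rng.random(8), rng.random(8)   # stand-in feature vectors
alpha = np.ones(8)                        # stand-in learned weights
d_same = siamese_distance(f_a, f_a, alpha)   # identical faces -> 0
d_diff = siamese_distance(f_a, f_b, alpha)
```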


Experiments:

There are three datasets: SFC, LFW, and YTF.

The SFC dataset is the Social Face Classification dataset. It contains 4,030 people with 800 to 1,200 faces each.
They leave 5% out for testing. They trained three networks with different numbers of faces per person: 1.5K, 3K, and 4K. The error only slightly increases from 7.0% to 8.7%, which means the method is suitable for large data.
Then they use 10%, 20%, and 50% of the total data. The error rate keeps decreasing, so the network still gains information from more data instead of overfitting early.
Finally, they remove layers to verify the necessity of the network's depth.

The LFW dataset is the Labeled Faces in the Wild dataset. It contains 13,233 photos of 5,749 celebrities, tested in 6,000 pairs.
There are three performance measures:
1. The Restricted Protocol: There is only same and not same labels in training.
2. The Unrestricted Protocol: There are additional training pairs accessible in training.
3. The Unsupervised Setting: No training whatsoever is performed on LFW images.
They use the Siamese Network structure to learn a verification metric. The results show that DeepFace method advances the state-of-the-art.

The YTF dataset is the YouTube Faces dataset. It contains 3,425 YouTube videos. It can be seen as an LFW-style benchmark focused on video.


Wednesday, May 4, 2016

[ammai] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Date: May 5th, 2016

Title: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Author: Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun



Novelties:

Introduces region proposal networks (RPNs) that share convolutional layers with the detector to reduce the marginal cost of computing proposals.

Contributions:

Improves region-based CNNs with RPNs.

Technical Summary:

A region proposal network outputs a set of rectangular object proposals, each with an objectness score, from an input image. The goal is to share computation and conv layers with the Fast R-CNN object detection network.
In this work, they investigate ZF and VGG models.
The network is implemented as a 3x3 conv layer followed by two sibling 1x1 conv layers. The network is shared across spatial locations since it operates in a sliding-window fashion. Each sliding position uses k=9 anchors, giving translation-invariant proposals.

They then learn conv layers shared between the RPN and Fast R-CNN with an alternating optimization. The steps are: train the RPN from an ImageNet pre-trained model; train Fast R-CNN using the proposals generated by the RPN; fix the shared conv layers and fine-tune the layers unique to the RPN; and finally fine-tune the fc layers of Fast R-CNN.
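The k=9 anchors come from combining 3 scales with 3 aspect ratios at each sliding position. A rough sketch of anchor generation (the base size and exact scale values here follow the common 3x3 setup and are assumptions, not taken verbatim from the paper):

```python
import numpy as np

def make_anchors(base=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate the k = len(scales) * len(ratios) anchor boxes
    centred at one sliding-window position, as (x1, y1, x2, y2)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # keep the anchor area at (base * s)^2 while varying
            # the aspect ratio h / w = r
            area = (base * s) ** 2
            w = np.sqrt(area / r)
            h = w * r
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)

anchors = make_anchors()   # 9 anchors for one position
```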

Experiments:

The dataset is PASCAL VOC 2007, which contains 5k trainval images and 5k test images over 20 categories. The RPN with Fast R-CNN reaches 59.9% mAP with up to 300 proposals, while also being much faster.

The SS (Selective Search) proposal step takes 1.51 s on average, Fast R-CNN with VGG-16 takes 320 ms, and their system takes only 198 ms in total.

Wednesday, April 27, 2016

[ammai] DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING

Date: April 21st, 2016

Title: DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING

Author: Song Han, Huizi Mao, William J. Dally



Novelties:

Compress deep neural networks to fit in embedded systems without loss of accuracy.

Contributions:

They use three methods to compress deep neural networks:
1. Pruning the weights.
2. Quantizing the weights so they can be represented with fewer bits.
3. Huffman-encoding the weights.

Technical Summary:


They use three steps to compress the network: pruning, quantization, and Huffman coding.


The first step is network pruning: connections whose weights are smaller than a threshold are pruned. After pruning, the sparse network is 9x smaller for AlexNet and 13x smaller for VGG-16.
Then they use the compressed sparse row/column format to reduce the number of values stored. Finally, they store each index as its difference from the previous position to compress further.
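The relative-index trick can be sketched as follows; the span value and filler convention here are simplified stand-ins for the paper's fixed bit-width encoding:

```python
def encode_relative(indices, span=8):
    """Store positions of nonzero weights as differences from the
    previous position; when a gap exceeds `span` (what a fixed
    bit-width can represent), insert filler entries with zero
    weight, as the paper describes."""
    out, prev = [], 0
    for idx in indices:
        diff = idx - prev
        while diff > span:
            out.append((span, 0.0))   # filler: max jump, zero weight
            diff -= span
        out.append((diff, None))      # None marks a real weight slot
        prev = idx
    return out

enc = encode_relative([3, 5, 20])
```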

The second step is quantization and weight sharing. They distribute the weights into bins, and the weights in the same bin share a single value, so they perform k-means clustering on the weights of each layer. They tried three centroid initialization methods: Forgy (random), density-based, and linear.
The results showed that linear initialization performed the best, because it linearly splits the [min, max] range of the original weights. This preserves the large weights, which have more influence in the network.
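A compact sketch of the weight-sharing step with linear initialization. The toy weight vector and bit width are mine; the point is that weights in one cluster collapse to a single shared value:

```python
import numpy as np

def quantize_weights(w, n_bits=2, iters=20):
    """Cluster weights into 2^n_bits shared values with 1-D k-means,
    using linear initialization over [min, max]. Returns per-weight
    cluster indices and the shared centroid values."""
    k = 2 ** n_bits
    centroids = np.linspace(w.min(), w.max(), k)     # linear init
    for _ in range(iters):
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = w[idx == c].mean()
    return idx, centroids

w = np.array([-1.0, -0.9, 0.0, 0.05, 0.9, 1.0, 0.95])
idx, centroids = quantize_weights(w)   # 7 weights -> 4 shared values
```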

The last step is Huffman coding, which saves a further 20%~30% of storage.

Experiments:


They achieve 35x to 49x compression. They tried LeNet-300-100 and LeNet-5 on MNIST, AlexNet on ImageNet, and VGG-16 on ImageNet.

Thursday, April 21, 2016

[ammai2016] Two-Stream Convolutional Networks for Action Recognition in Videos

Date: April 21st, 2016

Title: Two-Stream Convolutional Networks for Action Recognition in Videos

Author: Karen Simonyan, Andrew Zisserman



Novelties:

Provides a model that incorporates spatial and temporal recognition streams based on ConvNets.


Contributions:

1. The paper provides a deep video classification model.
2. It shows that a temporal ConvNet trained on optical flow performs better than one trained on raw stacked frames.


Technical Summary:


They decompose video into spatial and temporal components and combine them by late fusion.


The spatial stream captures the objects and scenes in an individual frame.
It follows recent advances in large-scale image recognition, along with a pre-training step.

The temporal stream captures the motion of the observer and the objects.
They use optical flow stacking: the input is a set of displacement vector fields between consecutive frames. The displacement vector at (u,v) denotes the motion at point (u,v) between those frames. The horizontal and vertical components of the vectors are treated as separate image channels.
They also introduce trajectory stacking: for a stack of length L, it traces a specific point (u,v) along its trajectory, recording its motion vector for L frames.
In the following figure, the left side shows displacement vectors, while the right side shows trajectory vectors.
Moreover, these methods can be applied bi-directionally.
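Assembling the temporal stream's input is just channel stacking: L flow fields become a 2L-channel tensor. A small sketch with zero flows as placeholders (the 224x224 input size matches common ConvNet practice, an assumption on my part):

```python
import numpy as np

def stack_flow(flows):
    """Stack L consecutive optical flow fields into one 2L-channel
    input for the temporal ConvNet: horizontal and vertical
    components become separate channels."""
    channels = []
    for flow in flows:                   # each flow: (H, W, 2) = (dx, dy)
        channels.append(flow[:, :, 0])   # horizontal component
        channels.append(flow[:, :, 1])   # vertical component
    return np.stack(channels, axis=0)    # shape (2L, H, W)

L, H, W = 10, 224, 224                   # L=10 as in the experiments
flows = [np.zeros((H, W, 2)) for _ in range(L)]
x = stack_flow(flows)
```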

For the fusion, they tried several methods, including a "slow fusion" architecture.

Experiments:

There are two video datasets, UCF-101 and HMDB-51.
UCF-101 contains 13K videos with 180 frames on average, annotated into 101 classes; HMDB-51 contains 6.8K videos, annotated into 51 classes.
They take L=10, pre-train on ILSVRC, and use multi-task learning on the temporal stream.

Wednesday, April 6, 2016

[ammai2016] A Bayesian Hierarchical Model for Learning Natural Scene Categories

Date: April 7th, 2016

Title: A Bayesian Hierarchical Model for Learning Natural Scene Categories

Author: Fei-Fei Li and Pietro Perona



Novelties:

Previous work on natural scene categorization mostly needs experts to label the training data.
This paper introduces an unsupervised way to reach the same goal.


Contributions:

There are three main contributions of this work:
1. The algorithm provides a way to learn scenes without supervision.
2. The algorithm framework is flexible.
3. The algorithm can group these categories into a sensible hierarchy, just like humans.

Technical Summary:

The main idea is to classify a scene by extracting its features, representing the image as a bag of codewords (i.e., local patches), learning a Bayesian hierarchical model, and deciding which category has the highest likelihood.
The flow chart is:
They describe images with local features instead of global features. Previous work on natural scenes mostly focused on the latter, but they show that the former is more robust to spatial variations and occlusions.
This is the codebook obtained in their work. Most of the codewords are simple orientation and illumination patterns; this property is similar to the human visual system's.
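The bag-of-codewords representation amounts to nearest-codeword assignment plus a histogram. A toy numpy sketch with a two-word codebook and 2-D patch features standing in for real local descriptors:

```python
import numpy as np

def codeword_histogram(patch_features, codebook):
    """Represent an image as a bag of codewords: assign every local
    patch feature to its nearest codeword and count occurrences."""
    # pairwise distances: (n_patches, n_codewords)
    d = np.linalg.norm(patch_features[:, None, :] - codebook[None, :, :], axis=2)
    assignments = d.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook))
    return hist / hist.sum()             # normalized histogram

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])   # toy 2-word codebook
patches = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.8], [0.0, 0.2]])
h = codeword_histogram(patches, codebook)
```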

Experiments:

Their dataset contains 13 categories with hundreds of images each; 100 images are randomly selected from each category for training.

By branching the categories with a distance measure between models, a dendrogram is obtained; the closest models on the leftmost are all indoor scenes.

Wednesday, March 23, 2016

[ammai2016] Nonlinear Dimensionality Reduction by Locally Linear Embedding

Date: March 24th, 2016

Title: Nonlinear Dimensionality Reduction by Locally Linear Embedding

Author: Sam T. Roweis and Lawrence K. Saul


Novelties:

1. Introduces LLE (locally linear embedding), an unsupervised, neighborhood-preserving embedding.

Contributions:

1. The method is straightforward; the only free parameter is the number of neighbors, K.
2. LLE does not have to be rerun from scratch when more output dimensions are requested.

Technical Summary:

LLE is a method for mapping high-dimensional inputs into a low-dimensional space while preserving neighborhoods.
The method, based on geometric intuition, reconstructs each data point from its neighbors.
The reconstruction error adds up the squared distances between all the data points and their reconstructions, that is: ε(W) = Σ_i | X_i − Σ_j W_ij X_j |².
W is the weight matrix we compute; W_ij is the contribution of the j-th data point to the reconstruction of the i-th data point.
That is, if the j-th data point is not a neighbor of the i-th data point, W_ij must be 0, and the weights reconstructing each point are normalized to sum to 1 (Σ_j W_ij = 1).
X is then mapped into low-dimensional vectors Y by minimizing, with the weights W fixed, the embedding cost Φ(Y) = Σ_i | Y_i − Σ_j W_ij Y_j |².
The steps are illustrated in the following figure:
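The whole procedure fits in a short numpy sketch: find K neighbors, solve the constrained least-squares problem for W, then take the bottom eigenvectors of (I − W)ᵀ(I − W). The regularization term is a standard stability trick, not spelled out in this note:

```python
import numpy as np

def lle(X, K=5, d=2, reg=1e-3):
    """Minimal LLE sketch: neighbor search, reconstruction weights,
    then embedding via the bottom eigenvectors of (I-W)^T (I-W)."""
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dists)[1:K + 1]          # skip the point itself
        Z = X[nbrs] - X[i]
        C = Z @ Z.T                                # local covariance
        C += reg * np.trace(C) * np.eye(K)         # regularize for stability
        w = np.linalg.solve(C, np.ones(K))
        W[i, nbrs] = w / w.sum()                   # weights sum to 1
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]                        # drop the constant eigenvector

rng = np.random.default_rng(0)
X = rng.random((40, 3))
Y = lle(X)                                         # 3-D points -> 2-D embedding
```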

Wednesday, March 16, 2016

[ammai2016] Online Dictionary Learning for Sparse Coding

Date: March 17th, 2016

Title: Online Dictionary Learning for Sparse Coding

Author: Julien Mairal, et al


Novelties:

1. Introduces an online dictionary learning method that can be applied to large datasets.

Contributions:

There are three contributions:
1. They cast the dictionary learning problem as the optimization of a smooth nonconvex objective function over a convex set.
2. They propose an iterative online algorithm to solve it.
3. Experiments show the algorithm is faster than the state of the art.

Technical Summary:

The online dictionary learning algorithm:
They define a quadratic surrogate function:
This function aggregates past information and upper-bounds the empirical cost function.
Since the values of the function in neighboring iterations are close, D_t can be obtained using the previous iterate as a warm restart:
They then introduce some practical improvements to this algorithm.
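A much-simplified sketch of the loop: sparse-code one sample, accumulate the sufficient statistics A and B, then update the dictionary column by column from the previous D (the warm restart). I use ISTA for the sparse coding step where the paper uses LARS; all sizes and parameters are toy choices:

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def online_dict_learning(X, k=8, lam=0.1, steps=50, ista_iters=30):
    """Simplified online dictionary learning: per-sample sparse
    coding (ISTA standing in for LARS), accumulation of A and B,
    and block coordinate descent on the dictionary columns."""
    m = X.shape[1]
    rng = np.random.default_rng(0)
    D = rng.normal(size=(m, k))
    D /= np.linalg.norm(D, axis=0)            # unit-norm columns
    A = np.zeros((k, k))
    B = np.zeros((m, k))
    for t in range(steps):
        x = X[t % len(X)]
        step = 1.0 / (np.linalg.norm(D.T @ D, 2) + 1e-6)
        a = np.zeros(k)
        for _ in range(ista_iters):           # sparse coding step
            a = soft_threshold(a - step * (D.T @ (D @ a - x)), step * lam)
        A += np.outer(a, a)                   # aggregate past information
        B += np.outer(x, a)
        for j in range(k):                    # dictionary update (warm restart)
            if A[j, j] > 1e-10:
                u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
                D[:, j] = u / max(np.linalg.norm(u), 1.0)
    return D

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 12))
D = online_dict_learning(X)
```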

Experiments:

The dataset consists of patches from the Berkeley segmentation dataset: 1,000,000 for training and 250,000 for testing.
In the figure, the top compares the online and batch settings; the bottom compares their method with stochastic gradient approaches.

[ammai2016] Iterative Quantization: A Procrustean Approach to Learning Binary Codes

Date: March 17th, 2016

Title: Iterative Quantization: A Procrustean Approach to Learning Binary Codes

Author: Yunchao Gong and Svetlana Lazebnik


Novelties:

1. Introduces a way to rotate the data while preserving its locality structure.

Contributions:

This paper introduces a better way to learn binary codes, called ITQ.
To judge whether a method is good, it considers three criteria: the length of the code, the Hamming distance between similar images, and the efficiency.
The ITQ method can operate on unsupervised data embeddings (PCA) or supervised data embeddings (CCA); that is, the method can work with any projection.

Technical Summary:

For the unsupervised case, they first do dimensionality reduction with PCA and zero-center the data.

If we simply binarize the PCA-aligned data, the result might look like (a), which splits a cluster into two different parts. ITQ first rotates the data, as in (c), so that points in the same cluster map to the same code.
That is, we want a rotation with smaller quantization loss: Q(B, R) = ||B − VR||_F².
B is the target coding matrix, containing n codes of length c; V is the projected data; and R starts as a random c-by-c orthogonal matrix.
Then two steps are repeated cyclically: "fix R and update B" and "fix B and update R."

1. Fix R and update B:
This step minimizes Q(B, R); equivalently, it maximizes Σ_ij B_ij Ṽ_ij, where Ṽ denotes VR. This gives B = sgn(VR).

2. Fix B and update R:
With B fixed, R is obtained from B via the SVD (the classic orthogonal Procrustes solution).
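The two alternating steps fit in a few lines of numpy. A minimal sketch, assuming V is already zero-centred and PCA-projected (the random data below just exercises the shapes):

```python
import numpy as np

def itq(V, c=4, iters=20, seed=0):
    """ITQ alternating minimization of ||B - VR||_F^2:
    B = sgn(VR) with R fixed, then the orthogonal Procrustes
    update of R from the SVD of V^T B with B fixed."""
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.normal(size=(c, c)))   # random rotation init
    for _ in range(iters):
        B = np.sign(V @ R)                          # fix R, update B
        U, _, Wt = np.linalg.svd(V.T @ B)           # fix B, update R
        R = U @ Wt
    return np.sign(V @ R), R

rng = np.random.default_rng(1)
V = rng.normal(size=(50, 4))    # stand-in for zero-centred PCA data
B, R = itq(V)
```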

Experiments:


The unsupervised datasets are subsets of the Tiny Images dataset: the first is a version of the CIFAR dataset, the second a larger subset. The images in these datasets are 32 x 32 pixels.
They evaluate by nearest-neighbor search and by the average precision of the top 500 ranked images for each query.
Comparing mAP with other methods, PCA-ITQ performs very well.

Then they run ITQ with CCA and show that CCA-ITQ on the clean dataset beats the baseline, and CCA-ITQ on the noisy dataset is far better than PCA-ITQ.


Wednesday, March 9, 2016

[ammai2016] Aggregating local descriptors into a compact image representation

Date: March 9th, 2016
Title: Aggregating local descriptors into a compact image representation
Author: Herve Jegou, et al


Novelties:

1. Uses a new, sparser, and more structured representation for descriptors.
2. Splits a large vector into pieces to enhance performance.

Contributions:

This paper introduces a better way to handle similar-image search.
To judge whether a method is good, it considers three constraints: the search accuracy, the efficiency, and the memory usage.

Technical Summary:

1. Image vector representation
The method is somewhat like the Fisher kernel approach; it proposes a vector representation called VLAD, the abbreviation of "vector of locally aggregated descriptors."
For a codebook C of k visual words (from BoF) and local descriptors x of d dimensions, the method represents the image in D = k×d dimensions. The components of the VLAD vector v look like: v_{i,j} = Σ_{x : NN(x)=c_i} (x_j − c_{i,j}),
where i = 1...k and j = 1...d, and NN(x) denotes the nearest visual word of x. The vector v is then L2-normalized.

The figure shows images and their corresponding VLAD descriptors for k=16; blue lines mean positive values, while red means negative ones.
We can observe that the descriptors are sparse and clustered, so they can be encoded further.
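The residual-aggregation formula above translates directly into code. A toy sketch with a two-word codebook and 2-D descriptors (real VLAD uses SIFT-like descriptors and larger k):

```python
import numpy as np

def vlad(descriptors, codebook):
    """Compute a VLAD vector: for each visual word c_i, sum the
    residuals (x - c_i) of descriptors assigned to it, then flatten
    to k*d dimensions and L2-normalize."""
    k, d = codebook.shape
    v = np.zeros((k, d))
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    nn = dists.argmin(axis=1)                       # NN(x): nearest visual word
    for i in range(k):
        v[i] = (descriptors[nn == i] - codebook[i]).sum(axis=0)
    norm = np.linalg.norm(v)
    return (v / norm).ravel() if norm > 0 else v.ravel()

codebook = np.array([[0.0, 0.0], [2.0, 2.0]])       # k=2 visual words, d=2
descs = np.array([[0.5, 0.0], [2.0, 2.5], [1.8, 2.0]])
vec = vlad(descs, codebook)                         # k*d = 4 dimensions
```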

2. From vectors to codes
There are two steps:
1) use a projection to reduce the dimensions (PCA)
2) use a quantization to index the vectors.

To quantize the vectors while still finding approximate nearest neighbors, the paper uses ADC (asymmetric distance computation): NN_a(x) = argmin_i ||x − q(y_i)||²,
where q() is the quantization function, x is the query vector, and the y_i are the database vectors.
Note that x is not quantized in this formula, so there is no approximation error on the query side.
The quantized vectors might require many bits, so each vector is split into m subvectors, and q(y) is represented by m sub-quantizers.
Then the squared distance in ADC decomposes over the subvectors, and look-up tables of the per-subvector squared distances make the computation fast.
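The lookup-table trick can be sketched on a toy setup: 4-dimensional vectors, m = 2 subvectors, and 2 centroids per sub-codebook (all sizes here are illustrative):

```python
import numpy as np

def adc_distance(x, codes, codebooks):
    """Asymmetric distance: the query x stays unquantized; each
    database vector is stored as m sub-codebook indices. Precompute
    a table of squared distances from each query subvector to every
    centroid; each database distance is then just m table lookups."""
    m = len(codebooks)
    sub = np.split(x, m)
    # tables[j][c] = squared distance from query subvector j to centroid c
    tables = [((codebooks[j] - sub[j]) ** 2).sum(axis=1) for j in range(m)]
    return np.array([sum(tables[j][code[j]] for j in range(m))
                     for code in codes])

# toy setup: 4-dim vectors, m=2 subvectors, 2 centroids per sub-codebook
codebooks = [np.array([[0.0, 0.0], [1.0, 1.0]]),
             np.array([[0.0, 0.0], [1.0, 1.0]])]
codes = np.array([[0, 0], [1, 1]])     # two quantized database vectors
x = np.array([0.0, 0.0, 1.0, 1.0])     # unquantized query
dists = adc_distance(x, codes, codebooks)
```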

Experiments:

Mainly on three datasets:
1. The INRIA Holidays dataset, 1,491 holiday images.
2. The UKB dataset, 2,250 images.
3. 10M images from Flickr.
The first two datasets show that VLAD performs better than the Fisher kernel representation at any dimensionality; the last shows that on a large dataset, VLAD is significantly better, too.