Wednesday, April 27, 2016

[ammai] DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING

Date: April 21st, 2016

Title: DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING

Author: Song Han, Huizi Mao, William J. Dally



Novelties:

Compress deep neural networks so that they fit in embedded systems without loss of accuracy.

Contributions:

They use three methods to compress deep neural networks:
1. Pruning small weights.
2. Quantizing the remaining weights so they can be represented with fewer bits.
3. Huffman coding the quantized weights.

Technical Summary:


They compress the network in three steps: pruning, quantization, and Huffman coding.


The first step is network pruning: remove connections whose weights fall below a threshold. Pruning yields a sparse network that is 9x smaller for AlexNet and 13x smaller for VGG-16.
They then store the sparse weights in compressed sparse row/column (CSR/CSC) format to keep only the nonzero entries. Finally, they encode each index as the difference from the previous nonzero position to compress further.
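The pruning and relative-index encoding above can be sketched as follows; this is a minimal illustration with a made-up weight matrix and a hand-picked threshold, not the paper's exact storage layout (which also bounds the index gap to a fixed bit width):

```python
import numpy as np

def prune(W, threshold):
    """Zero out connections whose absolute weight is below the threshold."""
    return np.where(np.abs(W) < threshold, 0.0, W)

def relative_index_encode(W):
    """Keep only nonzero weights; store each index as the gap from the
    previous nonzero index in flattened order."""
    flat = W.ravel()
    idx = np.flatnonzero(flat)
    gaps = np.diff(idx, prepend=0)  # differences between consecutive nonzero positions
    return gaps, flat[idx]

# Illustrative dense weights.
W = np.array([[0.01, -0.8, 0.0],
              [0.5, -0.02, 0.3]])
Wp = prune(W, threshold=0.1)
gaps, values = relative_index_encode(Wp)
```

Storing `gaps` instead of absolute indices keeps the numbers small, which is what makes the later entropy coding effective.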

The second step is quantization and weight sharing. They cluster the weights into bins, and weights in the same bin share a single value, so they perform k-means clustering on the weights of each layer. They tried three centroid initialization methods: Forgy (random), density-based, and linear.
Linear initialization performed best because it linearly splits the [min, max] range of the original weights. This preserves the large weights, which have more influence on the network.
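A minimal sketch of weight sharing via k-means with linear centroid initialization; the cluster count k=4 and the random weights are illustrative, and real use would fine-tune the shared values afterwards:

```python
import numpy as np

def kmeans_share(weights, k, iters=20):
    """1-D k-means over a layer's weights with linear initialization:
    centroids start evenly spaced over [min, max]."""
    centroids = np.linspace(weights.min(), weights.max(), k)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        assign = np.argmin(np.abs(weights[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of its assigned weights.
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = weights[assign == j].mean()
    return centroids, assign

w = np.random.randn(1000)                  # stand-in for one layer's weights
centroids, assign = kmeans_share(w, k=4)
shared = centroids[assign]                 # every weight replaced by its cluster's value
```

After sharing, each weight is stored as a small cluster index (here 2 bits for k=4) plus one codebook of k values per layer.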

The last step is Huffman coding, which saves another 20%~30% of storage.
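The saving comes from giving frequent cluster indices shorter codes. A minimal Huffman sketch over an illustrative stream of quantized indices:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    # Heap entries: (frequency, unique tiebreak, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing their codes with 0/1.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

stream = [0, 0, 0, 0, 1, 1, 2, 3]          # made-up quantized cluster indices
codes = huffman_codes(stream)
bits = sum(len(codes[s]) for s in stream)  # total coded length in bits
```

Here the fixed-width encoding would need 2 bits per symbol (16 bits total), while the Huffman code needs 14, because the common index 0 gets a 1-bit code.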

Experiments:


Overall they save 35x to 49x of storage. They evaluated LeNet-300-100 and LeNet-5 on MNIST, and AlexNet and VGG-16 on ImageNet.

Thursday, April 21, 2016

[ammai2016] Two-Stream Convolutional Networks for Action Recognition in Videos

Date: April 21st, 2016

Title: Two-Stream Convolutional Networks for Action Recognition in Videos

Author: Karen Simonyan, Andrew Zisserman



Novelties:

Provide a model that incorporates spatial and temporal recognition streams based on ConvNets.


Contributions:

1. The algorithm provides a deep video classification model.
2. It shows that a temporal ConvNet trained on dense optical flow performs well even with limited training data.


Technical Summary:


They decompose the video into spatial and temporal components and combine them by late fusion.


The spatial stream recognizes the objects and scenes in an individual frame.
It follows recent advances in large-scale image recognition, including a pre-training step.

The temporal stream represents the motion of the observer and the objects.
They use optical flow stacking: the input is a set of displacement vector fields between pairs of consecutive frames. The displacement vector at point (u,v) denotes the motion of that point between the two frames, and the horizontal and vertical components of the field can be treated as image channels.
They also introduce trajectory stacking: for a stack of depth L, it traces a point (u,v) along its motion trajectory, recording the displacement vector for L frames.
[Figure: displacement-vector (optical flow) stacking on the left; trajectory stacking on the right.]
Both stacking methods can also be computed bi-directionally.
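Optical flow stacking can be sketched as follows; the random arrays stand in for real flow fields (which would come from a dense optical flow method), and the point is just how the 2L-channel ConvNet input is assembled:

```python
import numpy as np

L, H, W = 10, 224, 224
# One (dx, dy) displacement field per consecutive frame pair.
flows = [np.random.randn(H, W, 2) for _ in range(L)]

channels = []
for f in flows:
    channels.append(f[:, :, 0])  # horizontal displacement as one channel
    channels.append(f[:, :, 1])  # vertical displacement as the next channel
stack = np.stack(channels, axis=-1)  # temporal-stream input of shape (H, W, 2L)
```

With L=10 (the value used in the experiments below), the temporal ConvNet sees a 20-channel input volume per sample.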

For the fusion, they tried several methods, including averaging the softmax scores of the two streams and training a linear SVM on stacked L2-normalized softmax scores.
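The simplest of these, averaging, can be sketched in a few lines; the score vectors here are made up for illustration:

```python
import numpy as np

# Per-class softmax scores from each stream (illustrative, 3 classes).
spatial = np.array([0.7, 0.2, 0.1])
temporal = np.array([0.3, 0.6, 0.1])

# Late fusion by averaging, then pick the highest-scoring class.
fused = (spatial + temporal) / 2
pred = int(np.argmax(fused))
```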

Experiments:

There are two video datasets, UCF-101 and HMDB-51.
UCF-101 contains 13K videos with 180 frames on average, annotated into 101 classes; HMDB-51 contains 6.8K videos, annotated into 51 classes.
They take L=10, pre-train on ILSVRC, and use multi-task learning to train the temporal stream on both datasets.

Wednesday, April 6, 2016

[ammai2016] A Bayesian Hierarchical Model for Learning Natural Scene Categories

Date: April 7th, 2016

Title: A Bayesian Hierarchical Model for Learning Natural Scene Categories

Author: Li Fei-Fei, Pietro Perona



Novelties:

Most previous works on natural scene categorization require experts to label the training data.
This paper introduces an unsupervised way to reach the same goal.


Contributions:

There are three main contributions in this work:
1. The algorithm provides a way to learn scenes without supervision.
2. The algorithm framework is flexible.
3. The algorithm can group these categories into a sensible hierarchy, just like humans.

Technical Summary:

The main idea is to classify a scene by extracting its features, representing the image as a bag of codewords (i.e. local patches), learning a Bayesian hierarchical model for each category, and choosing the category with the highest likelihood.
[Figure: flow chart of the algorithm.]
They describe patches with local features instead of global features. Previous works on natural scenes mostly focused on the latter, but they show that the former is more robust to spatial variations and occlusions.
[Figure: the codebook obtained in their work.] Most of the codewords are simple orientation and illumination patterns, a property similar to the human visual system's.
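The bag-of-codewords step can be sketched as follows; the random codebook and patch descriptors are stand-ins (in the paper the codebook is learned and descriptors come from local patches), and the point is the quantize-then-count representation:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((50, 128))  # 50 codewords, 128-D descriptors (illustrative)
patches = rng.standard_normal((200, 128))  # local patch descriptors from one image

# Assign each patch to its nearest codeword by Euclidean distance.
d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
words = d.argmin(axis=1)

# The image is represented as a histogram of codeword counts.
hist = np.bincount(words, minlength=50)
```

This histogram is the "bag" that the Bayesian hierarchical model is learned over; spatial layout of the patches is deliberately discarded.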

Experiments:

Their dataset contains 13 categories with hundreds of images each; they randomly select 100 images from each category for training.

By branching the categories with a distance measure between models, they obtain a dendrogram; the closest models on the leftmost are all indoor scenes.