
[ammai2016] Two-Stream Convolutional Networks for Action Recognition in Videos

Date: April 21st, 2016

Title: Two-Stream Convolutional Networks for Action Recognition in Videos

Author: Karen Simonyan, Andrew Zisserman



Novelties:

Proposes a model that incorporates spatial and temporal recognition streams based on ConvNets.


Contributions:

1. The algorithm provides a deep model for video classification.
2. It shows that a temporal ConvNet trained on multi-frame dense optical flow achieves very good performance despite limited training data.


Technical Summary:


They decompose the video into spatial and temporal components and combine them by late fusion.


The spatial stream captures the objects and scenes depicted in individual frames.
It follows recent advances in large-scale image recognition and includes a pre-training step.

The temporal stream captures the motion of the observer and of the objects.
They use optical flow stacking: the motion is represented as a set of displacement vector fields between pairs of consecutive frames. The displacement vector at (u, v) denotes the motion of that point between the two frames, and the horizontal and vertical components of the vectors can be treated as image channels.
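The stacking described above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation; it assumes the L flow fields have already been computed by some optical flow method:

```python
import numpy as np

def stack_optical_flow(flows, L):
    """Build the 2L-channel temporal-stream input by stacking the
    horizontal and vertical components of L consecutive flow fields.

    flows: list of L arrays of shape (H, W, 2), where flows[t][v, u]
           is the displacement of point (u, v) between frames t and t+1.
    Returns an (H, W, 2L) array: channels 2t and 2t+1 hold the
    horizontal and vertical components of the t-th flow field.
    """
    H, W, _ = flows[0].shape
    inp = np.zeros((H, W, 2 * L), dtype=np.float32)
    for t in range(L):
        inp[:, :, 2 * t] = flows[t][:, :, 0]      # horizontal component
        inp[:, :, 2 * t + 1] = flows[t][:, :, 1]  # vertical component
    return inp
```

Every pixel is sampled at the same location (u, v) in all L fields, which is what distinguishes plain stacking from the trajectory variant below.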
They also introduce trajectory stacking: for a stack of length L, instead of sampling the same point (u, v) in every field, the channels follow the point along its motion trajectory, recording the motion vector at each of the L frames.
In the paper's figure, the left side shows the displacement-vector sampling, while the right side shows the trajectory sampling.
Moreover, both methods can be applied bi-directionally.
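A sketch of the trajectory variant, assumed from the description above (not the authors' code; nearest-neighbour sampling and border clipping are simplifications of my own):

```python
import numpy as np

def trajectory_stack(flows, L):
    """Trajectory stacking: each pixel's 2L channels record the flow
    sampled along the point's motion trajectory, not at a fixed (u, v).

    flows: list of L displacement fields of shape (H, W, 2).
    """
    H, W, _ = flows[0].shape
    inp = np.zeros((H, W, 2 * L), dtype=np.float32)
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)  # current trajectory positions
    for t in range(L):
        # Sample the t-th flow field at the current (rounded) positions.
        yi = np.clip(np.round(ys), 0, H - 1).astype(int)
        xi = np.clip(np.round(xs), 0, W - 1).astype(int)
        d = flows[t][yi, xi]
        inp[:, :, 2 * t] = d[:, :, 0]
        inp[:, :, 2 * t + 1] = d[:, :, 1]
        # Advance each trajectory by the sampled displacement.
        xs += d[:, :, 0]
        ys += d[:, :, 1]
    return inp
```

With all-zero flow the trajectories stay put and the two variants coincide; they differ as soon as points actually move between frames.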

For the fusion, they tried several methods, including averaging the softmax scores and training a linear SVM on stacked scores; they also compare against the "slow fusion" architecture from prior work.
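The simplest fusion option can be sketched in a few lines; the weight below is illustrative (the paper's averaging corresponds to equal weights), and this is a sketch rather than the authors' code:

```python
import numpy as np

def late_fusion(spatial_scores, temporal_scores, w=0.5):
    """Late fusion of the two streams by a weighted average of their
    per-class scores; returns the predicted class index.

    spatial_scores, temporal_scores: 1-D arrays of per-class scores
    (e.g. softmax outputs), one entry per action class.
    """
    fused = w * spatial_scores + (1.0 - w) * temporal_scores
    return int(np.argmax(fused))
```

Because fusion happens only at the score level, each stream can be trained and even pre-trained independently, which is the appeal of the late-fusion design.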

Experiments:

There are two video datasets: UCF-101 and HMDB-51.
UCF-101 contains 13K videos (180 frames per video on average) annotated into 101 classes; HMDB-51 contains 6.8K videos annotated into 51 classes.
They set L = 10, pre-train on ILSVRC, and use multi-task learning on the temporal stream (training it on both datasets simultaneously).
