Wednesday, May 11, 2016

[ammai] DeepFace: Closing the Gap to Human-Level Performance in Face Verification

Title: DeepFace: Closing the Gap to Human-Level Performance in Face Verification

Authors: Taigman et al.



Novelties:

This paper introduces two new techniques for face recognition:
1. 3D model-based alignment
2. large capacity feedforward model

Contributions:

This paper introduces DeepFace, a face recognition system that approaches human-level accuracy.

Technical Summary:

A modern face recognition pipeline has four stages: detect => align => represent => classify.
This paper significantly improves two of these stages.

1. Face Alignment
They perform face alignment with a system based on 3D models and fiducial points.
They first detect 6 fiducial points on the 2D crop, shown as (a) in the following figure.
They manually place 67 anchor points on the 3D shape (b) so that the detected fiducial points can be linked to their 3D references. Frontalization (g) then produces a frontalized crop of the face.
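The core of the 2D alignment step is fitting a transform that maps the detected fiducial points onto fixed template locations. A minimal sketch, assuming a least-squares 2D similarity fit (scale, rotation, translation) on toy points; the paper's actual pipeline is more involved and iterates the detection:

```python
import numpy as np

def fit_similarity(src, dst):
    # Solve for [a, b, tx, ty] in a least-squares sense, where
    #   x' = a*x - b*y + tx
    #   y' = b*x + a*y + ty
    n = len(src)
    A = np.zeros((2 * n, 4))
    A[0::2, 0] = src[:, 0]; A[0::2, 1] = -src[:, 1]; A[0::2, 2] = 1.0
    A[1::2, 0] = src[:, 1]; A[1::2, 1] = src[:, 0];  A[1::2, 3] = 1.0
    a, b, tx, ty = np.linalg.lstsq(A, dst.reshape(-1), rcond=None)[0]
    return np.array([[a, -b, tx], [b, a, ty]])

# Toy fiducial points: the "detected" points are an exactly scaled and
# shifted copy of the template, so the fit recovers scale 2, shift (5, 5).
detected = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
template = np.array([[5.0, 5.0], [25.0, 5.0], [5.0, 25.0]])
M = fit_similarity(detected, template)
aligned = (M @ np.hstack([detected, np.ones((3, 1))]).T).T  # maps onto template
```

A similarity transform like this can compensate for in-plane rotation and scale but not out-of-plane pose, which is why the paper goes on to the 3D frontalization step.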

2. Representation
The architecture is illustrated in the following figure.
They train a DNN on a multi-class face recognition task.
The input to C1 is a 3D-aligned, RGB face image of size 152x152, shown at the leftmost part of the figure. It is followed by a max-pooling layer M2 and a convolutional layer C3. M2 makes the network more robust to local translations. Note that there are no pooling layers after the later convolutional layers, because the authors want to preserve precise position information.
Next come three locally connected layers, L4, L5, and L6, which allow different regions of the face to learn different local statistics.
The last part is two fully connected layers, F7 and F8, followed by a softmax that classifies the input image.
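The spatial sizes of the early feature maps can be checked with simple arithmetic. A minimal sketch of the output-size bookkeeping for the first three layers, assuming the 11x11 / 3x3-stride-2 / 9x9 filter sizes from the paper's description and ceil-rounded pooling:

```python
def conv_out(size, kernel, stride=1, pad=0):
    # Output spatial size of a convolution / locally connected layer.
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel, stride):
    # Max-pooling output size with ceil rounding (keeps border columns).
    return -(-(size - kernel) // stride) + 1

s = 152                  # 3D-aligned 152x152 RGB input
s = conv_out(s, 11)      # C1: 11x11 convolution -> 142x142
s = pool_out(s, 3, 2)    # M2: 3x3 max-pooling, stride 2 -> 71x71
s = conv_out(s, 9)       # C3: 9x9 convolution -> 63x63
```

The locally connected layers L4 to L6 use the same output-size formula as convolutions; they differ only in that their filter weights are not shared across spatial positions.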

They also test an end-to-end metric learning approach, a Siamese network, in which two input images are fed through two copies of the network described above.
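The Siamese verification head can be sketched as a learned weighted L1 distance between the two top-layer descriptors, mapped through a logistic unit. A minimal NumPy sketch; the 4096-d descriptor, the uniform initial weights, and the bias term are illustrative assumptions:

```python
import numpy as np

def siamese_distance(f1, f2, alpha):
    # Weighted L1 distance between the two face descriptors; the weights
    # alpha would be learned jointly with the shared network parameters.
    return float(np.sum(alpha * np.abs(f1 - f2)))

def same_person_prob(f1, f2, alpha, bias=0.0):
    # Logistic output: small distance -> high probability of "same person".
    return 1.0 / (1.0 + np.exp(siamese_distance(f1, f2, alpha) - bias))

rng = np.random.default_rng(0)
f = rng.random(4096)                    # descriptor of one face image
alpha = np.ones(4096) / 4096            # placeholder uniform weights
p_same = same_person_prob(f, f, alpha)  # identical inputs -> distance 0
```

Because both images pass through the same shared weights, the network learns a representation in which this distance is small for the same identity and large otherwise.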


Experiments:

There are three datasets: SFC, LFW, and YTF.

The SFC dataset is the Social Face Classification dataset. It contains 4030 people with 800 to 1200 faces each.
They leave 5% of each person's faces out for testing. They train three networks on subsets of different sizes, with 1.5K, 3K, and 4K persons. The error only slightly increases, from 7.0% to 8.7%, which means the method scales well to a large number of classes.
They then train with 10%, 20%, and 50% of the total data. The error rate keeps decreasing, so the network still gains information from more data rather than overfitting early.
Finally, they remove layers from the network to verify that its depth is necessary.

The LFW dataset is the Labeled Faces in the Wild dataset. It contains 13,233 photos of 5749 celebrities, and the benchmark consists of 6000 face pairs.
There are three performance measures:
1. The Restricted Protocol: There is only same and not same labels in training.
2. The Unrestricted Protocol: There are additional training pairs accessible in training.
3. The Unsupervised Setting: No training whatsoever is performed on LFW images.
They use the Siamese network structure to learn a verification metric. The results show that DeepFace advances the state of the art.
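Besides the Siamese metric, the paper also reports a weighted chi-squared similarity between pairs of DeepFace descriptors on LFW. A minimal sketch; the toy 4-d vectors and uniform weights are made up for illustration (in the paper the weights are learned, e.g. with a linear SVM):

```python
import numpy as np

def chi2_similarity(f1, f2, w, eps=1e-8):
    # Weighted chi-squared distance. The descriptors are non-negative
    # (the top layer uses ReLU), so the denominator is well behaved.
    return float(np.sum(w * (f1 - f2) ** 2 / (f1 + f2 + eps)))

a = np.array([0.2, 0.0, 0.5, 0.3])
b = np.array([0.1, 0.4, 0.5, 0.0])
w = np.ones(4)
d = chi2_similarity(a, b, w)  # larger -> less likely the same person
```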

The YTF dataset is the YouTube Faces dataset. It contains 3425 YouTube videos. It can be seen as a video-focused counterpart of LFW.

