Hsuan-ting Chen AMMAI Note: [ammai] Text Understanding from Scratch

Title: Text Understanding from Scratch

Authors: Xiang Zhang, Yann LeCun

Novelties:

This paper applies deep learning methods originally developed for images (convolutional networks) to text understanding.

Contributions:

This paper introduces a character-based CNN for text classification from scratch.

The method can classify text in different languages, including web posts.

The method needs no knowledge of the syntactic or semantic structure of a language.

Technical Summary:

The main idea is to treat a sentence as an image.

The key module is a standard CNN: a temporal convolutional module with 1-D kernels.

It uses strides, max-pooling, and ReLUs, all of which are common in image tasks.
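To make the temporal (1-D) convolution concrete, here is a minimal pure-Python sketch of the three building blocks named above. The function names, the list-of-lists data layout, and the toy numbers are assumptions made for illustration; they are not taken from the paper's implementation.

```python
def temporal_conv(frames, kernels, stride=1):
    """1-D (temporal) convolution without padding.

    frames:  list of l time steps, each a list of m input features
    kernels: list of n kernels, each a list of k taps, each tap a list of m weights
    returns: list of output time steps, each a list of n features
    (No kernel flip, i.e. cross-correlation, as in most deep learning libraries.)
    """
    k = len(kernels[0])
    out = []
    for start in range(0, len(frames) - k + 1, stride):
        window = frames[start:start + k]
        out.append([
            sum(w * x for tap, frame in zip(kernel, window) for w, x in zip(tap, frame))
            for kernel in kernels
        ])
    return out


def relu(frames):
    """Element-wise max(0, x)."""
    return [[max(0.0, v) for v in frame] for frame in frames]


def max_pool(frames, size):
    """Non-overlapping max-pooling along the time axis, per feature."""
    n = len(frames[0])
    return [
        [max(frame[c] for frame in frames[i:i + size]) for c in range(n)]
        for i in range(0, len(frames) - size + 1, size)
    ]


# Toy input: 6 time steps, 1 feature each; two kernels of width 2.
frames = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
kernels = [[[1.0], [-1.0]], [[0.5], [0.5]]]
print(max_pool(relu(temporal_conv(frames, kernels)), 2))
```

The only difference from an image CNN is that the kernel slides along one axis (character position) instead of two.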

1. Quantization

To treat a sentence as an image, they convert it into a 1-by-l image with m channels, where l is the number of characters and m is the alphabet size (70 in this paper). That is, each character is quantized into an m-dimensional vector with 1-of-m encoding.

They quantize the characters in backward order, which is similar to the LSTM method.
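The quantization step can be sketched as follows. The alphabet below is a hypothetical 37-symbol stand-in (not the paper's alphabet), and the choices to lowercase the input, to map unknown characters to all-zero vectors, and to pad short inputs with all-zero vectors are assumptions of this sketch.

```python
# Hypothetical toy alphabet for illustration only (m = 37).
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "


def quantize(text, length, alphabet=ALPHABET):
    """Return `length` one-hot vectors of size m = len(alphabet).

    The first `length` characters are kept and then read in backward order,
    so the last kept character comes first in the output.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    m = len(alphabet)
    frames = []
    for ch in reversed(text[:length].lower()):
        vec = [0] * m
        if ch in index:          # assumption: unknown characters stay all-zero
            vec[index[ch]] = 1
        frames.append(vec)
    # Assumption: inputs shorter than `length` are padded with all-zero vectors.
    frames.extend([0] * m for _ in range(length - len(frames)))
    return frames


q = quantize("ab", 4)
print(len(q), len(q[0]))        # number of frames, number of channels
print(q[0].index(1), q[1].index(1))  # 'b' comes first, then 'a'
```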

2. Model Design

They design two networks, each with 6 convolutional layers and 3 fully-connected layers. The difference between them is the number of hidden units and the frame size: one uses a frame size of 1024 and the other 256.

The model is shown above.
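As a rough check on the model size, the sketch below traces how the sequence length shrinks through the 6 convolutional layers. The kernel and pooling sizes are not stated in this note; the values used here are recalled from the paper's layer table and should be treated as an assumption, as should stride-1 convolutions without padding and non-overlapping pooling.

```python
# Assumed (kernel size, pooling size or None) for each of the 6 conv layers.
CONV_LAYERS = [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]


def output_length(input_length, layers=CONV_LAYERS):
    """Sequence length after all conv/pool layers."""
    length = input_length
    for kernel, pool in layers:
        length = length - kernel + 1   # stride-1 convolution, no padding
        if pool is not None:
            length //= pool            # non-overlapping max-pooling
    return length


l6 = output_length(1014)
for frame_size in (1024, 256):
    # Number of features handed to the first fully-connected layer.
    print(frame_size, l6, l6 * frame_size)
```

Under these assumptions a 1014-character input is reduced to 34 positions, so the first fully-connected layer sees 34 × 1024 = 34816 inputs in the larger network and 34 × 256 = 8704 in the smaller one.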

3. Data Augmentation

They use a thesaurus for data augmentation: words are replaced with their synonyms, and how many words get replaced is governed by a geometric distribution. The synonyms of each word are ranked, and the one to use is picked with a probability based on another geometric distribution.
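A minimal sketch of this kind of thesaurus augmentation, assuming that both the number of replaced words and the rank of the chosen synonym are drawn from geometric distributions. The toy thesaurus, the parameter names `p` and `q`, their default values, and the helper names are all hypothetical.

```python
import random

# Hypothetical toy thesaurus; synonyms are listed from most to least similar.
THESAURUS = {
    "good": ["great", "fine", "decent"],
    "movie": ["film", "picture"],
}


def geometric(rng, p):
    """Sample k >= 0 with P[k] proportional to p**k."""
    k = 0
    while rng.random() < p:
        k += 1
    return k


def augment(words, thesaurus=THESAURUS, p=0.5, q=0.5, rng=None):
    """Return a copy of `words` with some words swapped for synonyms."""
    rng = rng or random.Random()
    candidates = [i for i, w in enumerate(words) if w in thesaurus]
    # How many words to replace, capped at the number of replaceable words.
    r = min(geometric(rng, p), len(candidates))
    out = list(words)
    for i in rng.sample(candidates, r):
        synonyms = thesaurus[words[i]]
        # Which synonym to use; higher-ranked synonyms are more likely.
        s = min(geometric(rng, q), len(synonyms) - 1)
        out[i] = synonyms[s]
    return out


print(augment(["a", "good", "movie"], rng=random.Random(0)))
```

Because small values are the most likely under a geometric distribution, most augmented samples stay close to the original text and prefer the closest synonyms.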

Experiments:

1. DBpedia Ontology Classification

They pick 14 non-overlapping classes (company, educational institution, artist, ...) and choose 40000 training samples and 5000 testing samples from each class. They concatenate the title and the contents, and truncate the result to 1014 characters. They test both models, with and without the thesaurus augmentation, and compare against bag-of-words (BoW) and word2vec (w2v) baselines.

2. Amazon Review Sentiment Analysis

The Amazon review dataset contains 34,686,770 reviews from 6,643,669 users.

Each review is labeled with the score the original user gave, from 1 to 5.

They run both a full test and a polarity test on this dataset.

The full test classifies each review into one of the 5 scores. The polarity test labels reviews as 'positive' (score 4 to 5) or 'negative' (score 1 to 2), so the classifier only has to decide whether a review is positive or not.
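The two labeling schemes can be written down as a small sketch. The function names are hypothetical, and returning `None` for score 3 (so such reviews can be dropped from the polarity test) is an assumption that follows from the ranges given above.

```python
def full_label(score):
    """Full test: the label is the 1-to-5 score itself."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError(f"score must be 1..5, got {score!r}")
    return score


def polarity_label(score):
    """Polarity test: 4-5 is positive, 1-2 is negative."""
    if score in (4, 5):
        return "positive"
    if score in (1, 2):
        return "negative"
    return None  # assumption: score 3 belongs to neither class


reviews = [("Loved it", 5), ("Broke in a week", 1), ("It is okay", 3)]
polarity_set = [(text, polarity_label(s)) for text, s in reviews
                if polarity_label(s) is not None]
print(polarity_set)
```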

3. Sogou News Categorization

It contains Chinese news with 5 categories.

They use pypinyin and jieba to transliterate the Chinese characters into Pinyin (Latin alphabet), and then classify the news in the same way as in the experiments above.

4. Others

They also ran experiments on Yahoo! Answers and on English news.

Wednesday, May 18, 2016
