Title: Text Understanding from Scratch
Author: Xiang Zhang, Yann LeCun
Novelties:
This paper applies image-style deep learning methods to text understanding.
Contributions:
This paper introduces a character-level CNN that learns text classification from scratch.
The method can classify text in different languages, as well as web posts.
The method needs no knowledge of the syntactic or semantic structure of a language.
Technical Summary:
The main idea is to treat a sentence as an image.
The key module is a standard CNN: a temporal convolutional module with 1-D kernels.
It uses strides, max-pooling, and ReLUs, all of which are common in image tasks.
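As a rough illustration of these building blocks, here is a minimal pure-Python sketch of a 1-D temporal convolution followed by ReLU and non-overlapping max-pooling. The function names (`temporal_conv`, `relu`, `max_pool`), the toy input, and the toy kernel weights are assumptions made for illustration; none of them come from the paper.

```python
def temporal_conv(x, kernel, stride=1):
    """Valid 1-D convolution (cross-correlation) over the time axis.

    x:      list of time steps, each a list of m channel values
    kernel: list of k time steps, each a list of m weights (one output frame)
    """
    k = len(kernel)
    out = []
    for start in range(0, len(x) - k + 1, stride):
        total = 0.0
        for dt in range(k):
            for c, w in enumerate(kernel[dt]):
                total += w * x[start + dt][c]
        out.append(total)
    return out


def relu(values):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in values]


def max_pool(values, size):
    """Non-overlapping max-pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


# Toy input: 6 time steps with 2 channels each (hypothetical data).
x = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]]
# Toy kernel of width 2 over 2 channels (hypothetical weights).
kernel = [[1.0, -1.0], [-1.0, 1.0]]

features = max_pool(relu(temporal_conv(x, kernel)), size=2)
```

The kernel slides only along the time axis, which is what makes the convolution "temporal": every channel of a time step is consumed at once, just as an image convolution consumes all color channels of a pixel.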
1. Quantization
To treat a sentence as an image, they convert it into a 1-by-l image with m channels, where m is the alphabet size (70 in this paper). That is, each character is quantized into an m-dimensional vector with 1-of-m encoding.
They quantize the characters in backward order, similar to what is done in LSTM-based methods.
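The quantization step could be sketched as follows. This is only an illustration under stated assumptions: the alphabet below is a 41-character stand-in rather than the paper's 70-character one, and the handling of unknown characters (all-zero vector), lowercasing, and zero-padding to a fixed length are choices made here, not details given in this summary.

```python
import string

# Stand-in alphabet (assumption): 26 letters + 10 digits + 5 symbols = 41.
ALPHABET = string.ascii_lowercase + string.digits + " .,!?"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def quantize(text, length):
    """Return `length` one-hot vectors of size m = len(ALPHABET).

    The text is truncated to `length` characters and read backward,
    so the last kept character becomes the first vector.
    """
    m = len(ALPHABET)
    frames = []
    for ch in reversed(text.lower()[:length]):
        vec = [0] * m
        idx = CHAR_INDEX.get(ch)
        if idx is not None:      # unknown characters stay all-zero (assumption)
            vec[idx] = 1
        frames.append(vec)
    # Pad with all-zero vectors up to the fixed length (assumption).
    frames.extend([0] * m for _ in range(length - len(frames)))
    return frames


encoded = quantize("Hi!", length=5)
```

Here `encoded` is a 5-by-41 grid: the first three rows are the one-hot vectors for "!", "i", and "h" (in that order), and the last two rows are padding.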
2. Model Design
They design two networks, each with 6 convolutional layers and 3 fully-connected layers. The difference between them is the number of hidden units and the frame size: the large network uses a frame size of 1024 and the small one 256.
The model is shown above.
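To get a feel for how the 6 convolutional layers shrink the input before the fully-connected layers, the frame length can be traced with a few lines of arithmetic. The kernel widths and pooling sizes below are assumptions (this summary does not list them), and `CONV_LAYERS` and `output_length` are hypothetical names; the sketch also assumes that 1024 and 256 are the frame sizes of the two networks.

```python
# (kernel width, pooling size or None) for each convolutional layer.
# Assumed values: this summary does not state them.
CONV_LAYERS = [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]


def output_length(input_length, layers=CONV_LAYERS):
    """Trace the temporal length through the convolution/pooling stack."""
    length = input_length
    for kernel, pool in layers:
        length = length - kernel + 1     # valid convolution, stride 1
        if pool is not None:
            length = length // pool      # non-overlapping max-pooling
    return length


frames_out = output_length(1014)

# Size of the flattened input to the first fully-connected layer,
# assuming frame sizes of 1024 (large) and 256 (small).
fc_inputs = {"large": frames_out * 1024, "small": frames_out * 256}
```

Under these assumptions, a 1014-character input leaves 34 frames after the last pooling layer, so the first fully-connected layer sees 34 x 1024 values in the large network and 34 x 256 in the small one.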
3. Data Augmentation
They use a thesaurus for data augmentation. That is, each word is replaced by one of its synonyms with a probability based on a geometric distribution. The synonyms of a word are ranked, and the one to use is picked with a probability based on another geometric distribution.
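One possible reading of this augmentation scheme is sketched below. It is an interpretation, not the authors' code: the toy `THESAURUS`, the helper `sample_geometric`, the default values of `p` and `q`, and the choice to draw the number of replaced words (rather than flipping a coin per word) are all assumptions.

```python
import random

# Hypothetical toy thesaurus: synonyms ranked from most to least common.
THESAURUS = {
    "good": ["fine", "decent", "solid"],
    "movie": ["film", "picture"],
}


def sample_geometric(p, rng, upper):
    """Draw k in [0, upper]: P[k] = (1 - p) * p**k for k < upper,
    with the remaining probability mass placed on `upper`."""
    k = 0
    while k < upper and rng.random() < p:
        k += 1
    return k


def augment(words, p=0.5, q=0.5, rng=None):
    """Return a copy of `words` with some words swapped for synonyms."""
    rng = rng or random.Random()
    candidates = [i for i, w in enumerate(words) if w in THESAURUS]
    # First geometric draw: how many replaceable words to change.
    r = sample_geometric(p, rng, len(candidates))
    out = list(words)
    for i in rng.sample(candidates, r):
        synonyms = THESAURUS[words[i]]
        # Second geometric draw: which ranked synonym to use.
        s = sample_geometric(q, rng, len(synonyms) - 1)
        out[i] = synonyms[s]
    return out


augmented = augment(["a", "good", "movie"], rng=random.Random(0))
```

Because both draws are geometric, small changes are the most likely outcome: few words are replaced, and top-ranked synonyms are preferred over rare ones.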
Experiments:
1. DBpedia Ontology Classification
They pick 14 non-overlapping classes (company, educational institution, artist, ...) and choose 40000 training samples and 5000 testing samples for each class. They concatenate the title and the contents, and truncate the result to 1014 characters. They test different models, with and without the thesaurus, and compare them with BoW and w2v.
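The input preparation described here amounts to a concatenation and a slice. In the sketch below, `build_input` is a hypothetical helper; joining with a single space and keeping the first 1014 characters (rather than the last) are assumptions, since the summary does not specify either.

```python
MAX_CHARS = 1014  # input length stated for the DBpedia experiment


def build_input(title, content, max_chars=MAX_CHARS):
    """Concatenate title and content, then keep the first max_chars characters."""
    # The single-space separator is an assumption.
    return (title + " " + content)[:max_chars]


# Hypothetical sample with an over-long body.
sample = build_input("Example Corp", "x" * 2000)
```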
2. Amazon Review Sentiment Analysis
The Amazon review dataset contains 34686770 reviews from 6643669 users.
The reviews are labeled with the score given by the original user, from 1 to 5.
They run both a full test and a polarity test on the dataset.
The full test classifies a review into one of the 5 scores. The polarity test labels reviews as 'positive' (score 4 to 5) or 'negative' (score 1 to 2), so the classifier only has to decide whether a review is positive or not.
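The two labeling schemes can be written down as a small mapping. The function names are hypothetical, and dropping score-3 reviews from the polarity set is an assumption inferred from the score ranges above, not something this summary states.

```python
def full_label(score):
    """Five-class target: the star rating itself (1 to 5)."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError(f"unexpected score: {score}")
    return score


def polarity_label(score):
    """Binary target: 'negative' for 1-2, 'positive' for 4-5, None for 3."""
    if score in (1, 2):
        return "negative"
    if score in (4, 5):
        return "positive"
    if score == 3:
        return None  # assumption: neutral reviews are left out
    raise ValueError(f"unexpected score: {score}")


# Hypothetical reviews as (score, text) pairs.
reviews = [(5, "great"), (3, "okay"), (1, "awful")]
polarity_set = [(text, polarity_label(score))
                for score, text in reviews
                if polarity_label(score) is not None]
```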
3. Sogou News Categorization
It contains Chinese news with 5 categories.
They use pypinyin and jieba to convert the Chinese characters into pinyin, a romanization written in the Latin alphabet, and then classify the news in the same way as in the experiments above.
4. Others
They also tried Yahoo! Answers and English News.

