Multimed Tools Appl https://doi.org/10.1007/s11042-018-5748-4
A novel approach to generate a large scale of supervised data for short text sentiment analysis

Xiao Sun1 · Jiajin He1
Received: 25 December 2017 / Revised: 19 January 2018 / Accepted: 1 February 2018 © Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract Owing to the complexity of language and semantic structure and the relative scarcity of labeled data and context information, sentiment analysis is regarded as a challenging task in Natural Language Processing, especially in short-text processing. Deep learning models need large-scale training data to overcome data sparseness and over-fitting, so we propose multi-granularity, text-oriented data augmentation technologies that generate large-scale artificial data for model training, and compare them with the Generative Adversarial Network (GAN). In this paper, a novel hybrid neural network architecture (LSCNN) is proposed together with our data augmentation technology; it outperforms many single neural network models. The proposed data augmentation method enhances the generalization ability of the proposed model. Experimental results show that the proposed data augmentation method in combination with the neural network model achieves strong performance on sentiment analysis and short text classification without any handcrafted features. The approach was validated on a Chinese online comment dataset and a Chinese news headline corpus, where it outperforms many state-of-the-art models. The evidence shows that the proposed data augmentation technology lets deep learning obtain a more accurate distribution representation from the data, which improves the generalization of the extracted features. The combination of the data augmentation technology and the LSCNN fusion model is well suited to short text sentiment analysis, especially on small-scale corpora.

Keywords Data-driven feature learning · Data augmentation · Short text sentiment analysis · Model architectural designs · Large-scale artificial data
Xiao Sun
[email protected]

1 School of Computer and Information, Hefei University of Technology, No. 193 TunXi Road, BaoHe District, Hefei, China
1 Introduction

Sentiment analysis [18] is commonly used to detect users' perception of a product and users' sentiment for chat robots [23]. Effective sentiment analysis can mine a user's subjective feelings about a product, and a seller can adjust its service in a timely manner according to those feelings. Recently, neural network-based sentiment analysis models have become popular. A large amount of data is the foundation for training effective models; however, very deep neural networks may over-fit. To address over-fitting, we propose data augmentation for the neural network models. Many neural network-based architectures have achieved considerable success in Natural Language Processing (NLP), such as Recurrent Neural Networks (RNNs) [10, 16], Convolutional Neural Networks (CNNs) [11], and Long Short-Term Memory (LSTM) [7]; however, these efforts were adversely affected by the lack of large-scale training data. We propose a data augmentation method, a technique widely used in image processing [14, 27] and sound processing [24, 33, 37, 39], to generate a larger scale of data for pre-training and training neural networks for sentiment classification and short text classification. The proposed data augmentation technology has been applied to neural network-based models such as Convolutional Neural Networks [11] and Long Short-Term Memory [6, 7], as well as a BOW-based SVM model [29]. We show that the proposed neural network models with data augmentation outperform models without it and the BOW-based model. Data-driven representation learning via deep learning models is effective in many application fields, such as text classification, chatbots, speech recognition [35, 36], action recognition, and image processing [34, 38]. The crucial contributions are as follows:

(1) We propose multi-granularity, text-oriented data augmentation technologies to automatically manufacture artificial data, overcoming the problem of data sparseness in NLP; the method is more stable than GAN-based generation.
(2) We first propose a data augmentation-oriented hybrid neural network model called LSCNN and successfully apply it to sentiment analysis, obtaining clear improvements and enhanced generalization ability.
(3) The proposed LSCNN model is almost fully automatic and free of manual features and other resources.
The remainder of the paper is organized as follows. Section 2 presents related work. Section 3 introduces the proposed data augmentation technologies. Section 4 describes the proposed model. Section 5 reports the experiments and evaluation results with and without the data augmentation method. The conclusion and future work are provided in the final Section 6.
2 Related work

Text sentiment analysis [15, 18] and short text classification are challenging tasks in natural language processing. There are two typical types of task in text classification. The first is aspect-level sentiment classification [13, 17, 28]. The second is classifying the input text, such as a document, sentence, or paragraph, into predefined categories. Deep neural networks based on word embeddings have recently demonstrated remarkable results for text sentiment analysis, such as CNNs [11] and LSTMs [7]. Wang [31] combined regional CNNs and LSTMs for dimensional sentiment analysis. Zhang [41] presented a rationale-augmented Convolutional Neural Network (RA-CNN) model for text classification that jointly exploits labels on documents and their compositions. Ruder [21] exploited a CNN-based approach
for multilingual aspect-based sentiment analysis. Kalchbrenner [9] utilized a CNN for sentence modeling and classification. [26] employed deep neural networks with convolutional extension features for Chinese microblog sentiment analysis. Zhou [42] proposed a hybrid model that combines CNNs with LSTMs for sentiment classification and question classification. However, the availability of sufficient and diverse labeled data is crucial for neural network training. Krizhevsky [14] showed that data augmentation is among the most efficient and easiest methods for image classification. Salamon [24] applied a deep neural network and data augmentation to sound classification. Zhang [40] used ConvNets for text processing from character-level inputs and used a thesaurus for data augmentation on various datasets, none of them Chinese. Wang [30] classified images on ImageNet and GoogleNet with rotated data augmentation. Fawzi [3] employed a trust-region optimization strategy for data augmentation in an image classification task. In the past, in order to obtain more data for training, many researchers chose crowdsourcing, web crawling, and manual annotation, which is time-consuming work. Recently, a trend has emerged of tackling this problem via natural language generation models such as the Generative Adversarial Network (GAN) [4, 5, 32] and the Variational Auto-Encoder (VAE) [12, 19], which are fully unsupervised or semi-supervised data generation approaches. In this paper, we also compare the proposed method with GANs.
3 Data augmentation

Data augmentation has been widely and successfully applied to image classification [14, 27]. The most convenient and efficient way to avoid over-fitting when training a neural network model is to automatically enlarge the dataset using data augmentation technologies. There are many data augmentation methods for image data, such as rotation/reflection, flipping, shifting, changing scale, zooming, changing color or contrast, and introducing noise. Motivated by that great success in image classification, our approach applies data augmentation technologies to text sentiment analysis. In this paper, we first propose a multi-granularity, text-oriented data augmentation method, including word-level, phrase-level, and sentence-level data augmentation. We exploit some special text-oriented operations, leveraging the characteristics of text, as follows.
3.1 Word-level data augmentation

Words are the smallest ideographic units in Chinese. We first consider generating variants at the word level, because it is the easiest way to generate many meaningful variant expressions. The three parts of word-level data augmentation processing are as follows.
3.1.1 Thesaurus substitution

For text, it is unsound to directly apply data augmentation technologies identical to those used for images, because the order of phrases and words carries significant semantic and syntactic information, and erroneous operations may lose semantic information. The best way to build a higher-quality and larger dataset would be to create sentences by hand, but this is time-consuming given the large size of existing datasets. Given this constraint, a natural way to generate data is via a thesaurus.
A sentence consists of words and phrases; thus, which word is substituted may matter most. Considering the different roles of different words, in this work we use the Stanford Parser1 to parse the sentence, then replace specific words or phrases (adverbs, adjectives, nouns, etc.) with their synonyms.
3.1.2 Word2Vec substitution

In order to enhance the diversity of the synonym dictionary of Section 3.1.1, word2vec is introduced to enrich the thesaurus. We first pre-train 300-dimensional word vectors covering 20137 words, forming a set named w2v. Afterwards, we search the set w2v for the words most similar in the semantic dimension to enrich and rank the substitution words. In this work, we set the size of the nearest-word set generated by word2vec to 5. Formally, this is expressed by (1) and (2):

Sim(w1, w2) = Σ_{i=1}^{n} (w1i · w2i) / (|w1||w2|) + √(Σ_{i∈[1,n]} (w1i − w2i)²)   (1)
SimSet(w) = KMax(Sim(w, w2)), w2 ∈ w2v   (2)

where Sim(w1, w2) is a function giving the similarity between w1 and w2, and SimSet(w) denotes the collection of the top k nearest words in the set w2v. In this research, for thesaurus substitution (Section 3.1.1) and word2vec substitution (Section 3.1.2), we use multiple mechanisms to generate diverse data, divided into two main schemes:

(1) selecting a fixed number (e.g., 3) of candidate words randomly, for different quantities (1–3) of synonym or word2vec substitutions;
(2) selecting a variable number (1–5) of candidate words randomly, for a specific number (e.g., 2) of synonyms or word2vec nearest words to be substituted.

We then explore which mechanism is most efficient as the word-level data augmentation method.
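To make the substitution step concrete, the following is a minimal Python sketch of the word2vec-based candidate lookup and random substitution, assuming a pre-trained gensim KeyedVectors model; the model path and the candidate-position selection are illustrative placeholders rather than the paper's exact implementation.

import random
from gensim.models import KeyedVectors

# Hypothetical path to the pre-trained 300-dimensional vectors described above.
w2v = KeyedVectors.load_word2vec_format("w2v_300d.bin", binary=True)

def sim_set(word, k=5):
    # Eq. (2): the k nearest words to `word` in the embedding space.
    if word not in w2v:
        return []
    return [w for w, _ in w2v.most_similar(word, topn=k)]

def substitute(words, candidate_idx, n_subs=2):
    # Randomly replace n_subs of the POS-filtered candidate positions
    # with one of their nearest neighbors (mechanism 2 above).
    out = list(words)
    for i in random.sample(candidate_idx, min(n_subs, len(candidate_idx))):
        neighbors = sim_set(out[i])
        if neighbors:
            out[i] = random.choice(neighbors)
    return out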
3.1.3 Translation

Starting from the characteristics of text, we use horizontal translation instead of directly borrowing the augmentation methods of image processing. In this work, we translate sentences by adding meaningless words on the left or right in the horizontal direction: original samples are randomly translated by 1–10 words, with the emptied positions padded with meaningless words. Given a text or sentence S(w1, w2...wn), we randomly translate it by k (k ∈ [1, 10]) positions to S'(</s> × k, w1, w2...wn), where '× k' means that k meaningless words are added at the front of the sentence or text. The proposed translation does not damage the semantics of the sentence.
3.1.4 Insertion

Some words in a sentence may carry no useful information for sentiment analysis. An efficient, generalized model should ignore useless words or phrases and focus on the sentiment-related information. In this work, some meaningless
1 http://nlp.stanford.edu/software/lex-parser.shtml#Download
words or phrases are introduced as noise into the sample sentences. The positions and the number of meaningless words to insert are chosen automatically and randomly. Given a text or sentence S(w1, w2...wn), we randomly insert m (m ∈ [1, 10]) meaningless words to obtain S'(w1, </s>, w2...</s>, wn), where '</s>' denotes a noise word inserted into the sentence or text. The proposed word-level data augmentation operations, substitution, translation, and insertion, run automatically and randomly; random processing introduces more textual diversity. The details are described in Algorithm 1.
Algorithm 1 Data augmentation for word-level
Require: text: String; thesaurus: HashMap; word2vec: Word2Vec

function MAIN(text, thesaurus, dropWords, dropDic)
    POS ← {ADV, ADJ, NOUN}
    result ← ∅
    words ← filterByPos(PosTag(text), POS, thesaurus)
    permutation ← getPermutation(words, dropWords)
    for each word in words do
        synonyms.add(thesaurus.get(word))
        synonyms.add(word2vec.getNeighbors(word))
    end for
    combination ← getCombination(synonyms, dropDic)
    for i ← 0 to permutation.length do
        temp ← text.replace(permutation[i], combination[i])
        result.add(temp)
    end for
    TRANSLATE(text, flips, result)
    INSERT(text, inserts, result)
    return result
end function

function TRANSLATE(sentence, num, list)
    for k ← 1 to num do
        temp ← '</s>' × k + sentence
        list.add(temp)
    end for
    return list
end function

function INSERT(sentence, num, list)
    for i ← 1 to num do
        words ← sentence.split()
        words.insert(random(words.size), '</s>')
        list.add(array2Str(words))
    end for
    return list
end function
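As a concrete rendering of the TRANSLATE and INSERT steps of Algorithm 1, a minimal Python sketch might look as follows; '</s>' stands for the meaningless padding token, and the random ranges follow Sections 3.1.3 and 3.1.4.

import random

PAD = "</s>"

def translate(words, k=None):
    # Shift the sentence right by k in [1, 10] positions, padding with PAD tokens.
    k = k or random.randint(1, 10)
    return [PAD] * k + list(words)

def insert_noise(words, m=None):
    # Insert m in [1, 10] PAD tokens at random positions.
    m = m or random.randint(1, 10)
    out = list(words)
    for _ in range(m):
        out.insert(random.randrange(len(out) + 1), PAD)
    return out

# Example: both operations leave the original word order intact.
print(translate("I checked into this hotel".split(), k=2))
print(insert_noise("the ground is very clean".split(), m=3))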
3.2 Phrase-level data augmentation

Single words can struggle to express complex semantics and may even cause ambiguity, while Chinese phrases are more expressive and vivid. To concentrate on short text sentiment, this paper focuses mainly on attribute-head and adverbial phrases. More detailed explanations and examples are provided below.
3.2.1 Adverbial phrase substitution

Adverbial phrases are used with high frequency in contemporary Chinese. An adverbial phrase usually contains an adverbial and a center word: the adverbial modifies and restricts a verb or an adjective, and the adjective or verb is called the center word. Adjectives and their adverbials usually carry more sentiment factors, which have a great effect on sentiment analysis. We scan the input text or sentences S(p1, p2...pn) and automatically detect the adverbial phrases. To avoid generating too much data in this way, we substitute all the adverbial phrases in the text with phrases from the thesaurus and word2vec. For example, p2 may contain an adverbial word wadv modifying the center word wadj; in that case we replace both the adverbial and the center word with their most similar words to generate an entirely new text S1(p1, p2'...pn).
3.2.2 Attribute-head phrase fuzzy processing

Another frequent grammatical structure in Chinese is the attribute-head phrase. This type of phrase always consists of an attributive and a noun center. For sentiment analysis, the modifiers of the noun center often carry the more significant meaning: different modifiers may lead to different sentiment tendencies, whereas the noun center itself has little special significance for the sentiment analysis task. In view of this, this paper puts forward a fuzzy processing method for the noun-center words while substituting the attributive words with similar words to build novel training samples. Suppose we have a source sentence or text S(Ahp1, Ahp2...Ahpn), where Ahpi contains an attributive word modifying the noun-center word, i.e., Ahpi = {attributivei, nouni}. We randomly replace the word attributivei with Sim(attributivei) and apply fuzzy processing to nouni, replacing it with the meaningless word </s>. Finally, the new text is S({attributive1, noun1}*, {attributive2, noun2}*...{attributiven, nounn}*).
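A hedged sketch of this fuzzy processing, reusing the sim_set lookup from the Section 3.1.2 sketch; the (attributive, noun) pairs are assumed to come from the parser, which is not shown here.

def fuzz_attribute_head(phrase_pairs):
    # phrase_pairs: list of (attributive, noun) tuples detected by the parser.
    out = []
    for attributive, _noun in phrase_pairs:
        neighbors = sim_set(attributive)          # similar words for the attributive
        new_attr = neighbors[0] if neighbors else attributive
        out.append((new_attr, "</s>"))            # noun center blurred to a meaningless token
    return out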
3.3 Sentence-level data augmentation

3.3.1 Non-emotion sentence deletion

A text often contains multiple sentences, each of which is a complete semantic unit. Many sentences in a text only play a modifying role and contain few sentiment factors, so it is important to focus on the sentences rich in sentiment factors. In view of this, randomly detecting and deleting sentences with no sentiment tendency is worthwhile. In this paper, we first examine the original text and identify its sentences without sentiment tendency. Two methods are chosen for this preprocessing: one uses a sentiment dictionary as a filter, whereas the other automatically detects the key sentiment sentences with a machine-learning
Table 1 Several samples generated from multi-granularity data augmentation methods

Mechanism     Sentence
source        I checked into this hotel on Friday, comfortable lights, the ground is very clean, very warm
word          I checked into this hotel on Friday, comfortable lights, the ground is very neat, very cozy
adv-phrase    I checked into this hotel on Friday, lamplight is easy, the floor is very neat, very warm
attr-phrase   I checked into this hotel on Friday, comfortable </s>, the </s> is very clean, very warm
sentence      Comfortable lights, the ground is very clean, very warm
method; we choose an SVM as the sentiment detection classifier in this paper. Finally, we synthesize the two schemes to obtain the final voting result:

Tag(s) = SVM(s) + SenDic(s), s ∈ text   (3)
Suppose we have a text with multiple sentences T = {s1, s2, s3...sn}; after sentiment detection on its sentences, we obtain T* = {sens1, nons2, sens3...nonsn}, where sensi denotes a sentence with sentimental tendency and nonsi a non-emotional sentence. We build artificial data by randomly deleting a non-sentiment sentence, finally obtaining T* = {sens1, sens3...nonsn}.
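A minimal sketch of the voting filter in Eq. (3) and the random deletion step; svm_predict and sentiment_dict stand in for the trained SVM and the sentiment dictionary, neither of which is specified in the paper.

import random

def tag(sentence, svm_predict, sentiment_dict):
    # Vote: both detectors return 1 for sentiment-bearing, 0 otherwise (Eq. 3).
    dict_hit = int(any(w in sentiment_dict for w in sentence.split()))
    return svm_predict(sentence) + dict_hit      # 0, 1, or 2 votes

def delete_non_sentiment(sentences, svm_predict, sentiment_dict):
    # Randomly delete one sentence that received no sentiment votes.
    non_sent = [i for i, s in enumerate(sentences)
                if tag(s, svm_predict, sentiment_dict) == 0]
    if not non_sent:
        return list(sentences)
    drop = random.choice(non_sent)
    return [s for i, s in enumerate(sentences) if i != drop]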
3.4 Generative adversarial network for discrete data

In order to generate large-scale training data and learn more of the distribution representation of the supervised data, automatically producing a large amount of artificial data is worthwhile. Inspired by GANs [5] and by data augmentation for image classification, we attempt to use this framework for automatic artificial data generation. GANs are extremely hard to train, and computing direct gradients on discrete data is difficult because generated discrete samples are not differentiable. Following pioneering work, this paper makes some changes based on SeqGAN [32] with policy gradient. During training, many similar samples are generated, a common problem for GANs in discrete data generation known as mode collapse. In this paper, we compare GANs with the proposed multi-granularity data augmentation method. Several examples in Table 1 illustrate the effect of the multi-granularity data augmentation.
4 Model presentation

The overall framework of the proposed LSTM-CNN model is shown in Fig. 1. The model is fed with two channel inputs; the two channels are combined at the feature level and co-trained to obtain the final feature representation of the input sentences. First, we pre-train the word vectors via the Google word2vec2 toolkit. For each input sample sentence S(w1, w2...wm), we use IK Analyzer3 to conduct word segmentation and stitch the word vectors Vk(vk1, vk2...vkn) into a vector matrix M(v1, v2...vm) ∈ R^{m×n}. M is fed into the Convolutional Neural Network channel, which is discussed in Section 4.1 in more
2 https://code.google.com/archive/p/word2vec/
3 http://code.google.com/p/ik-analyzer/downloads/list
Fig. 1 Overall framework of the LSCNN hybrid neural network model: word2vec embeddings of the input sentence feed a convolution-and-pooling channel (window sizes 2, 3, 4), while an embedding layer feeds a gated LSTM channel with mean-pooling; the two channels are fused through a full connection layer and softmax
detail. In addition, sentence vectors composed of the raw word index of each word in the sentence are fed into the Long Short-Term Memory channel, introduced in detail below. Finally, we train the two channels in parallel and combine the outputs of the convolutional and long short-term channels.
4.1 Convolutional channels (CCs)

Convolutional Neural Networks (CNNs) are widely applied to NLP problems [2, 11]; Kim [11] successfully applied CNNs to sentence sentiment classification. We exploit a variant of [2] as the CNN channel of our hybrid model to extract local features with the proposed data augmentation method. The convolutional channel takes a fixed-length text (padded to n) vector matrix M as input, represented as a concatenation of word vectors Vk(vk1, vk2...vkn), where Vk denotes the n-dimensional word vector of the k-th word in the text. The convolutional layer is fed with the input feature matrix and extracts the local context of the sample by sliding convolution windows of different sizes over the feature matrix. We use convolution filters of different sizes and merge the results to obtain more contextual information, as shown in Fig. 1. A new feature map C(c1, c2...cn−h+1) is generated by filters with different weights w ∈ R^{h×n}. The proposed CCs extract context information via distinct convolutional window sizes (such as 2, 3, 4), and the extracted features are fused by concatenation as follows:

Ck,i = f(w · Xi:i+h + b)   (4)

Ck = [Ck,1, Ck,2 ... Ck,n−h+1]   (5)

C = ∪_{k=2}^{4} Ck   (6)

In our work, h is the window size, b is a bias term, and f is a nonlinear function; we choose ReLU as the activation function. We concatenate Ck as in (5). A max-pooling layer is stacked on top to extract the most important features and condense the vectors to a specific length by taking the maximum value of each convolutional result:

pooling(Ck,i) = max(Ck,i)   (7)
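A minimal Keras sketch of the convolutional channel under the reported settings (window sizes 2, 3, 4, ReLU, max pooling over time, 200 filters from Section 5.2); the sequence length and vocabulary size are placeholders, and a modern Keras functional API is assumed rather than the paper's exact code.

from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, concatenate

def conv_channel(seq_len=100, vocab=20137, dim=300):
    # Returns the channel's input tensor and its fused pooled features,
    # so the hybrid model can reuse them later.
    inp = Input(shape=(seq_len,))
    emb = Embedding(vocab, dim)(inp)       # pre-trained word2vec weights would be loaded here
    pooled = []
    for h in (2, 3, 4):                    # distinct convolution window sizes
        c = Conv1D(200, h, activation="relu")(emb)   # Eq. (4): c = f(w · x + b)
        pooled.append(GlobalMaxPooling1D()(c))       # Eq. (7): max over time
    return inp, concatenate(pooled)                  # Eqs. (5)-(6): concatenation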
4.2 Long-short term channels (LCs)

Long Short-Term Memory [7] is a variant of the RNN. RNN models may suffer from both exploding and vanishing gradients [20], which makes it difficult to extract longer context information; Hochreiter and Schmidhuber [7] successfully overcame this problem with the LSTM model. RNNs are suitable for representing sequential data and extracting long-distance dependency features, as in text or sentence classification. In this work, we choose an LSTM channel as the global feature extractor. The input layer consists of the raw random word index of each word. The RNN computes its hidden state ht by combining the previous time step's hidden state ht−1 and the input xt. Formally, we have

ht = f(Win · xt + Wr · ht−1 + bh)   (8)
LSTM differs from traditional RNNs in that it is more suitable for capturing long-term dependencies, thanks to gating mechanisms that decide what proportion of the previous state the LSTM units forget or hold and how much information extracted from the current input is memorized. More concretely, LSTM-based channels consist of four gate components: an input gate i(t), a memory cell c(t), a forget gate f(t), and an output gate o(t). First, the forget gate computes the forget layer; when the result approaches 1, more information is retained at the next time step t:

f(t) = σ(wxf · xt + whf · ht−1 + bf)   (9)
Secondly, we compute the gate for the current input with the sigmoid activation function, and combine the previous memory state and current input via the tanh function, formally expressed as follows:

i(t) = σ(wxi · xt + whi · ht−1 + bi)   (10)

c(t) = tanh(wxc · xt + whc · ht−1 + bc)   (11)

where i(t) denotes the input gate and c(t) denotes the candidate state of the memory cell at time t. Given the input gate activation i(t), the forget activation f(t), and the candidate memory value, we can naturally update the memory cell state at time t:

ct = ct−1 ⊗ f(t) + i(t) ⊗ c(t)   (12)
Finally, the output gate and the hidden state are calculated from the updated memory cell value as follows:

o(t) = σ(wxo · xt + who · ht−1 + bo)   (13)

ht = tanh(ct) ⊗ o(t)   (14)

where ⊗ denotes element-wise multiplication and σ denotes the sigmoid activation function.
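For clarity, a worked numpy version of a single LSTM step following Eqs. (9)-(14); the weight layout (one input matrix and one recurrent matrix per gate) is illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps gate name -> (W_x, W_h); b maps gate name -> bias vector.
    f = sigmoid(W["f"][0] @ x_t + W["f"][1] @ h_prev + b["f"])        # Eq. (9)
    i = sigmoid(W["i"][0] @ x_t + W["i"][1] @ h_prev + b["i"])        # Eq. (10)
    c_cand = np.tanh(W["c"][0] @ x_t + W["c"][1] @ h_prev + b["c"])   # Eq. (11)
    c_t = c_prev * f + i * c_cand                                     # Eq. (12)
    o = sigmoid(W["o"][0] @ x_t + W["o"][1] @ h_prev + b["o"])        # Eq. (13)
    h_t = np.tanh(c_t) * o                                            # Eq. (14)
    return h_t, c_t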
4.3 Features fusion

CNNs have been shown to be suitable for extracting local context features, while LSTMs naturally obtain global context features: CNNs extract semantic information from pre-trained word vectors, whereas LSTMs can learn longer-dependency context information. To fuse the advantages of the two channels, we let the features fuse naturally by co-training the two channels, updating both via loss feedback, and finally concatenating their outputs as follows:

Out(LSCNN) = Out(CNN) + Out(LSTM)   (15)
Afterwards, dropout [25] and a dense layer are used to condense the output to a specific length (the number of text sentiment classes). Softmax is an extension of the logistic model to multi-class classification; a softmax layer is stacked on the dense layer to obtain the text sentiment distribution:

p(y = k) = exp(tk) / Σ_{i=1}^{j} exp(ti)   (16)

where p(y = k) denotes the predicted probability of sentiment class k, and j is the number of sentiment classes. In this work, the cross-entropy error between the predicted probability distribution over the classes and the true label is used as the loss function for training and optimization:

L = − Σ_k tk log P(y = k)   (17)

where tk is 1 for the target class and 0 for all other classes.
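Putting the pieces together, a minimal Keras sketch of the LSCNN fusion (Eq. (15) and Fig. 1), reusing conv_channel from the Section 4.1 sketch; the 128-unit LSTM and dropout of 0.5 follow Section 5.2, while the sequence length and vocabulary size remain placeholders.

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, Dropout, concatenate

def build_lscnn(seq_len=100, vocab=20137, dim=300, classes=2):
    cnn_in, cnn_out = conv_channel(seq_len, vocab, dim)   # Section 4.1 sketch
    lstm_in = Input(shape=(seq_len,))
    lstm_out = LSTM(128, dropout=0.5)(Embedding(vocab, dim)(lstm_in))
    fused = concatenate([cnn_out, lstm_out])              # Eq. (15): feature-level fusion
    fused = Dropout(0.5)(fused)
    probs = Dense(classes, activation="softmax")(fused)   # Eq. (16) via the softmax layer
    return Model(inputs=[cnn_in, lstm_in], outputs=probs)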
5 Experiments

In this section, we introduce the experimental settings in detail and list experimental results for text sentiment analysis on the publicly available hotel online review dataset Tan.4 The dataset is relatively small, which makes it well suited to validating the proposed data augmentation method. We explore the best word-level data augmentation mechanism and provide experimental results with multi-granularity data augmentation. Furthermore, we evaluate the data augmentation method on the NLPCC dataset.
5.1 Dataset and result We validate the reliability of the data augmentation design by selecting the most comprehensive Chinese dataset Tan,5 which is well annotated with positive or negative annotations
4 http://www.datatang.com/data/11970
5 http://www.datatang.com/data/11970
Table 2 Statistics of the original dataset (Tan's)

        Pos               Neg
        Train    Test     Train    Test
Tans    1650     350      1650     350
and is also balanced. To further ensure the correctness of the proposed method, we also validated the model on the Chinese news headline corpus6 published at NLPCC 2017. In this work, we mainly demonstrate the improvement brought by the proposed data augmentation together with the proposed hybrid model.
5.1.1 Online evaluation dataset

Table 2 shows the original, well-annotated dataset. It is a hotel online-evaluation sentiment polarity classification dataset consisting of 2000 positive and 2000 negative items; for each polarity, 1650 cases were randomly selected as training samples and 350 were used as test data.
5.1.2 Multi-granularity artificial data (Tan's)

In order to explore the most effective data augmentation method, we also propose a multi-granularity augmentation mechanism, which combines the most effective word-level augmentation method with the other granularities of the text, namely phrases and sentences. Table 3 lists the distribution of all the datasets under the different augmentation levels. In the table, WP denotes the artificial data from word- and phrase-level augmentation, and WPS denotes the data from the word-level, phrase-level, and sentence-level mechanisms combined. Several studies have reported experiments on this public dataset. In [22], a sentiment dictionary and word vectors were used as features for sentiment classification. In [1], experiments were carried out on this task based on parsing. In [8], an SVM and sentiment dictionaries were applied to sentiment analysis on the dataset. DA+NN denotes our data augmentation with a neural network model, Parsing signifies syntactic analysis, ED denotes a sentiment dictionary, and Word2Vec represents word vectors. Some experimental results are listed in Table 4.
5.1.3 Chinese news headline corpus (NLPCC)

The Chinese news headline corpus was collected from several Chinese news websites, such as toutiao.com and sina.com, and published at the Conference on Natural Language Processing and Chinese Computing (NLPCC 2017). There are 18 categories in total. The data statistics are shown in Table 5.
5.1.4 Multi-granularity artificial data (NLPCC)

In order to validate the effectiveness of the proposed multi-granularity data augmentation method, we use the same data augmentation methods on all datasets except for the sentence-level

6 https://github.com/JerrikEph/nlpcc
Table 3 Statistics of automatically constructed datasets via multi-granularity data augmentation

Ds[Multi]   Word             Sentence       Phrase        WP               WPS              GAN
            Pos     Neg      Pos     Neg    Pos    Neg    Pos     Neg      Pos     Neg      Pos    Neg
            14781   15080    2383    2546   4532   4852   18043   19319    18956   20264    1650   1650
Table 4 Performance comparison with other methods on the Tan's task

Method         Precision   Recall   Fscore
ED+Word2Vec    0.8440      0.8580   0.8420
Parsing        0.8026      0.7977   0.8008
EDs+SVM        0.8551      –        –
DA+LSCNN       0.9065      0.9055   0.9054

Table 5 Statistics of Chinese news headline corpus (NLPCC)

Category   Train    Dev     Test
total      156000   36000   36000
avg        8667     2000    2000

Table 6 Statistics of automatically constructed NLPCC data via multi-granularity data augmentation

Ds[Multi]   Word                 Phrase              WP
            Total      Avg       Total     Avg       Total      Avg
            1828134    101563    409363    22742     2237497    124305
Table 7 Performance comparison with baseline methods on NLPCC task
Method
Micro P
Micro R
Micro F
NBoW
0.760
0.747
0.7497
CNN
0.769
0.763
0.764
LSTM
0.791
0.783
0.784
DA+LSCNN
0.798
0.793
0.797
mechanism, owing to its sentiment-oriented design. Table 6 lists the statistics of all the datasets under the different augmentation levels. Some baseline deep learning models were implemented by the authors of the dataset, such as neural bag-of-words (NBoW), convolutional neural networks (CNN) [11], and long short-term memory networks (LSTM) [7]. Some experimental results are listed in Table 7.
5.2 Experimental settings

In this paper, we evaluate our data augmentation method on the publicly available sentiment classification dataset (Tan's) and the Chinese news headline corpus (NLPCC 2017). Detailed statistics of the datasets were summarized in Section 5.1. Word segmentation and POS tagging were performed with IK Analyzer and the Stanford Parser. We use three metrics, macro precision, macro recall, and macro F1 (as in (18)), to measure the classification performance:

FScore = 2 · precision · recall / (precision + recall)   (18)
We pre-train the 300-dimensional word vectors on the largest augmented dataset, which covers all the words in the datasets we used. Sentence sequences are fed to the CNN channels by concatenating the word embeddings. We used 200 filters with window sizes 2, 3, and 4, a dropout rate of 0.5, and l2 regularization of 0.01. We set the dimensions of the hidden state and the cell state in the LSTM channels to 128 and used a dropout rate of 0.5. We used SGD to update the model parameters during training and cross-entropy to evaluate the training error, validated on a random 10 percent of the training set during training, and evaluated the model parameters on the test set. We built the system with the Keras (Theano backend) framework and trained the deep learning models on a Titan 1080 GPU to accelerate training. We constructed large datasets of different scales with the proposed data augmentation method; however, there is no previous work for direct comparison, so we evaluated the proposed method by comparing the performance improvement on several comprehensive models for short text classification. In addition, we also trained models on the dataset generated by Generative Adversarial Networks (GANs).
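Under these settings, the training loop might be wired up as in the following sketch, reusing build_lscnn from the Section 4.3 sketch; the toy data, batch size, and epoch count are illustrative assumptions, while SGD, cross-entropy, and the 10 percent validation split follow the text.

import numpy as np
from keras.utils import to_categorical

# Toy stand-ins for the prepared word-index sequences and one-hot labels.
X = np.random.randint(0, 20137, size=(1000, 100))
y = to_categorical(np.random.randint(0, 2, 1000), 2)

model = build_lscnn()  # Section 4.3 sketch; both channels take the index sequence
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit([X, X], y, batch_size=64, epochs=18, validation_split=0.1)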
5.2.1 Experiments on Tan’s evaluation datasets We firstly conducted the experiments on the original dataset and the automatically artificial datasets to explore the best word-level augmentation mechanism with the first three model. Then we conducted the experiments on a multi-granularity artificial dataset with all the above models.
5.3 Multi-granularity data augmentation

First, we identify the best word-level augmentation mechanism as one part of the multi-granularity data augmentation method. We then explore the systematic data augmentation method by introducing the multi-granularity mechanism, which combines word-level, phrase-level, and sentence-level data augmentation. In this section we list all the experiments on the datasets obtained with the proposed data augmentation method.
Table 8 Comparison of CNN model with various data augmentation methods on Tan's task

Mechanism        Precision   Recall   Fscore
Original         0.8671      0.8667   0.8666
Word-level       0.8826      0.8823   0.8820
Phrase-level     0.8816      0.8811   0.8811
Sentence-level   0.8985      0.8979   0.8978
WP               0.8765      0.8765   0.8765
WPS              0.8870      0.8857   0.8856
GANs             0.8781      0.8757   0.8756
5.4 GAN for text generation

We use the original dataset to generate, via adversarial training, samples that match its distribution, obtaining 1650 positive and 1650 negative samples. In this section we list and compare all the experiments with the proposed data augmentation methods. As can be seen in Table 8, we pre-trained the CNN model with the different data augmentation methods, where WP denotes the combination of the word-level and phrase-level augmentation methods and WPS represents the combination of WP and the sentence-level method. The performance of CNNs improved to some extent under every augmentation mechanism. The sentence-level method achieved the most significant gains on every metric relative to the phrase- and word-level methods. This is because CNNs, fed with word vectors as processing units, mainly capture local context information; the contribution of individual word vectors is partly discarded by max-pooling, whereas sentence-level operations remain visible. As shown in Table 9, we pre-trained the LSTM model with the different data augmentation methods. The performance of LSTM improved on the various metrics under every augmentation mechanism. In particular, the phrase-level method had a greater effect, to varying degrees, than the word and sentence levels. These results demonstrate that LSTMs specialize in extracting global context information, focusing on the global picture by fully utilizing long-time-dependency context. The data generated by GANs brought no obvious improvement to the LSTM model. In view of the results shown in Tables 8 and 9, we propose a hybrid neural network model combining CNN and LSTM, named LSCNN, which can both
Table 9 Comparison of LSTM model with various data augmentation methods on Tan's task

Mechanism        Precision   Recall   Fscore
Original         0.8674      0.8674   0.8674
Word-level       0.8901      0.8903   0.8901
Phrase-level     0.8985      0.8978   0.8979
Sentence-level   0.8984      0.8979   0.8978
WP               0.8910      0.8902   0.8902
WPS              0.8857      0.8864   0.8857
GANs             0.8742      0.878    0.878
Table 10 Comparison of various data augmentation methods using the LSCNN model on Tan's task

Mechanism        Precision   Recall   Fscore
Original         0.8764      0.8763   0.8760
Word-level       0.8879      0.8857   0.8855
Phrase-level     0.8859      0.8841   0.8840
Sentence-level   0.8984      0.8979   0.8978
WP               0.9030      0.9024   0.9024
WPS              0.9065      0.9055   0.9054
GANs             0.8845      0.8841   0.884
extract global information via its long short-term channel and local context information via its convolution channel. As shown in Table 10, we pre-trained the LSCNN model with the different data augmentation methods; all results for the model are listed. First, our model improves under every augmentation method. Second, the fusion model LSCNN obtains the best performance with the data generated by the WPS augmentation method. This indicates that our fusion neural network model can extract more local context information through the CNN channel from the phrase-level and word-level artificial data, while obtaining more global context information through the LSTM channel from the sentence-level artificial data. Table 11 lists the best results of the different models, where LSCNN+DA denotes the proposed hybrid model with data augmentation for short-text sentiment classification.
5.4.1 Experiments on NLPCC datasets

In order to further verify the validity of the data augmentation methods, we chose the recent NLPCC classification task for experiments, applying multiple models and the various augmentation mechanisms to this dataset. Table 12 shows the different augmentation mechanisms applied to the LSTM model. As shown in the table, LSTM with the various augmentation mechanisms sees positive effects, to a greater or lesser degree, on the various metrics. Interestingly, the word-level mechanism obtains more improvement than the phrase-level one, mainly because the proposed phrase-level mechanism is sentiment-oriented. WP also brings some improvement over word-level alone. Table 13 shows the different augmentation methods applied to the CNN model on the NLPCC 2017 corpus, where WP denotes the data from the combined word-level and phrase-level mechanisms. As can be seen from the table, some minor

Table 11 Comparison of different models on Tan's task

Model       Precision   Recall   Fscore
SVM         0.8649      0.8636   0.8634
CNN         0.8671      0.8667   0.8666
LSTM        0.8674      0.8674   0.8674
LSCNN       0.8764      0.8763   0.8760
LSCNN+DA    0.9065      0.9055   0.9054
Table 12 Comparison of LSTM model with various data augmentation methods on NLPCC task

Mechanism      Precision   Recall   Fscore
Original       0.760       0.747    0.7494
Word-level     0.773       0.764    0.766
Phrase-level   0.763       0.7482   0.750
WP             0.776       0.768    0.767
improvements appear for CNNs; the word-level mechanism clearly obtains the best performance. Table 14 shows the different data augmentation methods applied to our LSCNN model on the NLPCC classification corpus, with all results for the model listed above. Interestingly, as can be seen from Table 14, all the data augmentation methods bring positive improvements to different extents, especially WP (word- and phrase-level artificial data); the LSCNN model can learn more about the latent data distribution thanks to its more complex neural network structure.
5.5 Comparison and analysis

5.5.1 Effect for models with augmentation

On the multi-granularity data shown in Tables 8, 9, and 10, the proposed LSCNN model achieved a precision of 0.9065, a significant improvement over the baseline model. Almost all the models are positively influenced by the different granularities of artificial data; we found every experimental model to be positively affected by almost every augmentation method. This shows that the proposed augmentation mechanisms increase the quantity of data without harming the quality of the existing data, which leads to positive gains for all the popular models tested, and suggests the method could be adopted in other fields with other popular statistical models.
5.5.2 Different gains between models

Table 11 demonstrates that the LSCNN model with the data augmentation method clearly and significantly outperformed all baseline methods and other models. However, as shown in Tables 8 and 9, the improvement obtained by the LSTM and CNN models is relatively limited, which is attributable to the slight changes in the artificial data. CNN mainly extracts local context features that depend on the window size, and the changed local context of individual words introduced by the augmentation algorithm may be discarded by max-pooling. LSTM can learn long-term dependence by being fed sequence data;

Table 13 Comparison of CNN model with various data augmentation methods on NLPCC task

Mechanism      Precision   Recall   Fscore
Original       0.769       0.763    0.764
Word-level     0.779       0.772    0.771
Phrase-level   0.772       0.769    0.771
WP             0.776       0.769    0.771
Table 14 Comparison of various data augmentation methods using the LSCNN model on NLPCC task

Mechanism      Precision   Recall   Fscore
Original       0.782       0.779    0.778
Word-level     0.793       0.786    0.789
Phrase-level   0.776       0.772    0.774
WP             0.798       0.793    0.797
however, local information is ignored. The LSCNN model can both learn additional global context information through its LSTM channel and obtain local context information through its CNN channel, because the model is sensitive to data changes down to a single word. It follows that the proposed augmentation algorithm with LSCNN can achieve improved performance.
5.5.3 Effect of different augmentation mechanisms

As shown in Tables 8, 9, and 10, experimental results for six diverse augmentation training sets are listed. The WP and WPS augmentation training sets outperformed the others on almost every model. This indicates that the augmentation mechanism and the granularity of data augmentation play an important role. It also points to future work: fusing the best multi-granularity data augmentation mechanisms into the fusion neural network model might help solve further problems in natural language processing.
5.5.4 Performance improvements for LSCNN

Figures 2 and 3 show the validation accuracy at each training epoch. The red curve denotes the LSCNN model trained on the augmented set, and the green one the LSCNN model trained on the original data. It is clear that the LSCNN model with augmented data easily achieves significantly better performance than with the original training data alone. The augmentation mechanism helps the proposed fusion neural network model learn additional context information, both global and local, and reach its best performance faster. This may be because the features extracted from the augmented data carry statistical distributions more appropriate for learning.
Fig. 2 Single-granularity validation accuracy of LSCNN vs. training epoch (data augment vs. normal)
Fig. 3 Multi-granularity validation accuracy of LSCNN vs. training epoch (data augment vs. normal)
5.5.5 Performance on non-sentiment analysis data

As shown in Tables 12, 13, and 14, we validated the proposed data augmentation method and fusion neural network model on the NLPCC 2017 news headline classification datasets. As the tables show, all the data augmentation mechanisms contribute positively to the final results, and the proposed fusion neural network model achieves the best performance. Because the proposed data augmentation mechanisms are sentiment-oriented, the improvement on news headlines is small compared to the large improvements on the sentiment analysis datasets. A suitable data augmentation mechanism is useful for neural network training.
5.5.6 Limitations of GAN for text generation

GANs have recently attracted wide attention in the field of image generation; however, they struggle on discrete data processing tasks. In this paper, we utilized SeqGAN to generate some training data, pre-trained the model with it, and fine-tuned the model on the original dataset. As shown in Tables 8, 9, and 10, GAN cannot outperform the proposed multi-granularity data augmentation method, owing to the limitations of adversarial training and the difficulty of computing direct gradients on discrete data. More directly, many similar, safe samples are generated with high frequency; generating a variety of samples is important when producing artificial data with a GAN.
6 Conclusion

In this work, we proposed a novel multi-granularity data augmentation method to generate sufficient, large-scale data for data-driven representation learning via deep learning models, and successfully applied it to sentiment analysis and short text classification. In order to learn a distribution representation, we designed a novel hybrid neural network model named LSCNN, which can learn both global and local context information. To our knowledge, the proposed multi-granularity mechanisms have not previously been used for data augmentation in short text sentiment analysis. We also compared the proposed framework with GANs on the task of review sentiment analysis. As can be seen from Table 11, the
proposed augmentation method with LSCNN can dramatically improve performance over other neural network and BOW-based models. In addition, an effective data augmentation mechanism can accelerate the LSCNN model toward superior validation performance and mitigate over-fitting. Some future challenges remain. First, data balance is crucial for text classification tasks. Second, unsupervised pre-training of language models on an augmented training set may benefit the neural network model. Finally, a more effective augmentation method might be obtained by fusing it into a model capable of overcoming NLP problems, such as improved GAN schemes for sentence generation.

Acknowledgment The work is supported by the Natural Science Foundation of Anhui Province (1508085QF119) and the State Key Program of the National Natural Science Foundation of China (61432004, 71571058, 61461045). This work was partially supported by China Postdoctoral Science Foundation funded projects (No. 2015M580532 and No. 2017T100447). This research has been partially supported by the National Natural Science Foundation of China under Grant No. 61472117.
References

1. Chen H (2013) Classification of commodity evaluation based on parsing. Shanghai Jiao Tong University
2. Collobert R et al (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12(1):2493–2537
3. Fawzi A et al (2016) Adaptive data augmentation for image classification. In: IEEE international conference on image processing, IEEE, pp 3688–3692
4. Glover J (2016) Modeling documents with generative adversarial networks. arXiv:1612.09122
5. Goodfellow IJ, Pouget-Abadie J, Mirza M et al (2014) Generative adversarial networks. Adv Neural Inf Proces Syst 3:2672–2680
6. Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw 18(5–6):602–610
7. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
8. Hua L (2014) Study on Chinese text sentiment classification. Chongqing University
9. Kalchbrenner N, Grefenstette E, Blunsom P (2014) A convolutional neural network for modelling sentences. Eprint Arxiv
10. Karpathy A, Johnson J, Fei-Fei L (2015) Visualizing and understanding recurrent networks. arXiv:1506.02078
11. Kim Y (2014) Convolutional neural networks for sentence classification. Eprint Arxiv
12. Kingma DP, Welling M (2013) Auto-encoding variational bayes. arXiv:1312.6114
13. Kiritchenko S et al (2014) NRC-Canada-2014: detecting aspects and sentiment in customer reviews. In: International workshop on semantic evaluation, pp 437–442
14. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: International conference on neural information processing systems, Curran Associates Inc., pp 1097–1105
15. Le QV, Mikolov T (2014) Distributed representations of sentences and documents. Computer Science 4:1188–1196
16. Lipton ZC, Berkowitz J, Elkan C (2015) A critical review of recurrent neural networks for sequence learning. arXiv:1506.00019
17. Pang B, Lee L (2005) Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In: Meeting of the Association for Computational Linguistics, pp 115–124
18. Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1–2):1–135
19. Rezende DJ, Mohamed S, Wierstra D (2014) Stochastic backpropagation and approximate inference in deep generative models. Eprint Arxiv, pp 1278–1286
20. Rosario B, Hearst MA (2004) Classifying semantic relations in bioscience text. In: Meeting of the Association for Computational Linguistics, 21–26 July 2004, Barcelona, Spain, pp 430–437
21. Ruder S, Ghaffari P, Breslin JG (2016) Insight-1 at SemEval-2016 task 5: deep learning for multilingual aspect-based sentiment analysis. arXiv:1609.02748
22. Xiang R, Sun M (2016) Sentiment analysis of Chinese sentences based on word embedding and syntax tree. Computer and Modernization 8:27–31
23. Russell EWB (2015) Real-time topic and sentiment analysis in human-robot conversation. Dissertations & Theses – Gradworks
24. Salamon J, Bello J (2016) Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process Lett 99:1–1
25. Srivastava N et al (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
26. Sun X, Li C, Ren F (2016) Sentiment analysis for Chinese microblog based on deep neural networks with convolutional extension features. Neurocomputing 210:227–236
27. Sun X, Pan D, Ren F (2016) Facial expression recognition using ROI-KNN deep convolutional neural networks. Automation Journal 42(6):883–891
28. Tang D, Qin B, Liu T (2016) Aspect level sentiment classification with deep memory network. arXiv:1605.08900
29. Vapnik VN (1999) An overview of statistical learning theory. IEEE Trans Neural Netw 10(10):988–999
30. Wang KK (2015) Image classification with pyramid representation and rotated data augmentation on Torch 7. https://hgpu.org/?p=13858
31. Wang J et al (2016) Dimensional sentiment analysis using a regional CNN-LSTM model. In: Meeting of the Association for Computational Linguistics, pp 225–230
32. Yu L, Zhang W, Wang J et al (2017) SeqGAN: sequence generative adversarial nets with policy gradient. In: AAAI, pp 2852–2858
33. Zhang L, Han Y, Yang Y, Song M, Yan S, Tian Q (2013) Discovering discriminative graphlets for aerial image categories recognition. IEEE T-IP 22(12):5071–5084
34. Zhang L, Gao Y, Hong C, Feng Y, Zhu J, Cai D (2014) Feature correlation hypergraph: exploiting high-order potentials for multimodal recognition. IEEE T-CYB 44(8):1408–1419
35. Zhang L, Gao Y, Ji R, Dai Q, Li X (2014) Actively learning human gaze shifting paths for photo cropping. IEEE T-IP 23(5):2235–2245
36. Zhang L, Song M, Yang Y, Zhao Q, Zhao C, Sebe N (2014) Weakly supervised photo cropping. IEEE T-MM 16(1):94–107
37. Zhang L, Hong R, Gao Y, Ji R, Dai Q, Li X (2016) Image categorization by learning a propagated graphlet path. IEEE T-NNLS 27(3):674–685
38. Zhang L, Li X, Nie L, Yan Y, Zimmermann R (2016) Semantic photo retargeting under noisy image labels. ACM TOMCCAP 12(3):37
39. Zhang L, Wang M, Hong R, Yin B-C, Li X (2016) Large-scale aerial image categorization using a multitask topological codebook. IEEE T-CYB 46(2):535–545
40. Zhang X, Lecun Y (2015) Text understanding from scratch. arXiv:1502.01710
41. Zhang Y, Marshall I, Wallace BC (2016) Rationale-augmented convolutional neural networks for text classification. In: EMNLP 2016, p 795
42. Zhou C et al (2015) A C-LSTM neural network for text classification. Computer Science 1(4):39–44
Xiao Sun was born in 1980. He received the M.E. degree in 2004 from the Department of Computer Science and Engineering at Dalian University of Technology, and received a double doctorate from Dalian University of Technology (2010), China, and the University of Tokushima (2009), Japan. He is now an associate professor in the Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machines at Hefei University of Technology. His research interests include Affective Computing, Natural Language Processing, Machine Learning, and Human-Machine Interaction.
Jiajin He was born in 1992. He received his Bachelor's degree in 2015 from the School of Science, Anhui University of Science and Technology, Huainan, China. He is currently studying for a Master's degree at the School of Computer and Information, Hefei University of Technology. His research interests include Natural Language Processing, Sentiment Analysis, and Neural Networks.