Multimed Tools Appl DOI 10.1007/s11042-017-5236-2
Weakly-supervised image captioning based on rich contextual information

Hai-Tao Zheng1 · Zhe Wang1 · Ningning Ma1 · Jinyuan Chen1 · Xi Xiao2 · Arun Kumar Sangaiah3
Received: 6 August 2017 / Revised: 5 September 2017 / Accepted: 15 September 2017
© Springer Science+Business Media, LLC 2017
Abstract Automatic generation of an image description is a challenging task that attracts broad attention in artificial intelligence. Inspired by methods from computer vision and natural language processing, different approaches have been proposed to solve the problem. However, the captions generated by existing approaches lack sufficient contextual information to describe the corresponding images completely, because the labeled captions in the training set only describe images at a basic level and lack contextual annotations. In this paper, we propose a Weakly-supervised Image Captioning Approach (WICA) to generate captions containing rich contextual information, without complete annotations of the contextual information in datasets. We utilize encoder-decoder neural networks to extract basic captioning features and leverage object detection networks to identify contextual features. We then encode the two levels of features with a phrase-based language model in order to generate captions with rich contextual information. Comprehensive experimental results reveal that the proposed model outperforms the existing baselines in terms of the richness and reasonability of contextual information for image captioning.

Keywords Image captioning · Weakly-supervised learning · Rich contextual information · Encoder-decoder neural networks · Object detection · Phrase-based language model
* Zhe Wang
[email protected]
1 Tsinghua-Southampton Web Science Laboratory, Graduate School at Shenzhen, Tsinghua University, Shenzhen, Guangdong, China
2 Graduate School at Shenzhen, Tsinghua University, Shenzhen, Guangdong, China
3 School of Computing Science and Engineering, VIT University, Vellore, Tamil Nadu 632014, India
1 Introduction

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. This task is harder than image classification or sentence generation because the computer needs to correctly recognize the different objects in an image as well as the interactions among them, and then generate a relevant and reasonable caption [24]. Inspired by recent advances in deep learning, different approaches have been proposed to generate reasonable captions [12, 13, 19, 28, 30]. However, the labeled captions in existing training datasets only contain the basic information in images and lack complete annotations of the contextual information, namely the object features. Models trained on such a training set focus only on the basic information, so the generated captions describe the image at a basic level but omit other important objects. In addition, too little contextual information prevents the models from generating correct captions, and syntax errors sometimes occur. An example is shown in Fig. 1. The generated caption, which we call the basic caption, is "A boat in the water in the water." First, this caption does not describe the two people on the boat; second, the caption is redundant, repeating "in the water".

In this paper, we propose a Weakly-supervised Image Captioning Approach (WICA) to generate captions with rich contextual information. "Weakly-supervised" means that we utilize datasets that are incomplete at the contextual level to generate captions with rich contextual information. First, we employ encoder-decoder neural networks to extract basic captioning features from images; the basic captioning features are defined to describe the image at a basic level. Second, in order to enrich the contextual information in captions, we develop an object detection model to correctly detect the object features in the image and extract contextual information. Third, a phrase-based language model is built to generate rich and correct captions by combining the basic captioning features with the contextual information. Finally, we define the reasonable degree and the rich degree to rank the captions at the syntactic level and the contextual level, respectively.

The contributions of the proposed work are as follows:

1. We propose a weakly-supervised image captioning approach to generate rich captions from training images with incomplete annotations of contextual information.
2. We extract two levels of features from images: the basic captioning features extracted by encoder-decoder neural networks and the contextual features extracted by an object detection model.
Fig. 1 A basic caption with syntax errors that lacks sufficient contextual information
3. A phrase-based language model is constructed to encode the two levels of features and generate captions with rich contextual information.

The paper is organized as follows. Section 2 discusses related work on image caption generation. Section 3 presents the framework of WICA, including the basic captioning feature generation model, the object feature detection model and the phrase-based language model. Section 4 describes our experimental setup and the results on the MSCOCO [18] dataset. Section 5 concludes with the highlights of the proposed work and future work.
2 Related works

In this section, we provide the relevant background on image caption generation. There are three types of methods in this area.
Methods based on basic image features Traditional methods for image caption generation leverage different properties of the image, such as objects and colors. Li et al. [17] detected all the objects and recognized the relationships between them, then transferred the objects and relationships into phrases to generate a final description. Farhadi et al. [6] detected the different elements in an image and built a triplet to represent them; the model used templates to generate appropriate captions from the triplets. Kulkarni et al. [14] leveraged a detection graph to represent the whole image and transformed the graph into a corresponding caption using a template-based method. Besides, there are methods based on other features and relationships [2, 5, 15, 16, 20], and methods from other areas are also helpful for the image captioning field [25, 29]. These approaches can only describe images at a basic level, and the results are rough; the template-based captions are also rigid.

Methods based on co-embedding vector space Another type of method generates captions for an image by ranking the captions of the most similar images in a candidate image set. These methods build a co-embedding vector space containing all the images and the corresponding captions of the candidate image set. To generate captions for a query image, they retrieve the most similar image in the vector space and pick its captions as the captions of the query image [8, 10, 21]. Neural networks have also been used to build the co-embedded vector space [26], and Karpathy et al. [11] proposed a more powerful method that broke images and captions into pieces and phrases, so as to use neural networks to build a vector space of image pieces and phrases. Although these approaches use neural networks to relate images and sentences, the caption generation process remains primitive: they cannot generate captions for objects or images that do not appear in the vector space, and the generated descriptions are not satisfactory.

Methods based on deep learning Some methods introduce neural networks to generate captions. Inspired by the development of sequence-to-sequence models in machine translation, different methods have been proposed for the image captioning task [3, 4, 27]. Similar to machine translation, which translates source sentences into target sentences, these methods translate images into sentences. Kiros et al. [12] proposed a method that built a multimodal log-bilinear model to transform image features into captions. Another deep
learning model in [13] was designed to perform both ranking and caption generation. Mao et al. [19] replaced the feedforward neural network with a Recurrent Neural Network (RNN) to generate captions. Vinyals et al. [28] also modified the sequence-to-sequence model, replacing the encoder RNN with a deep Convolutional Neural Network (CNN), and presented an end-to-end system for the problem. Xu et al. [30] introduced the attention framework, successful in machine translation, into image caption generation and improved the performance of the sequence-to-sequence model. The methods of Vinyals et al. and Xu et al. work well, but they neither take the richness of contextual information into consideration nor reduce the syntax errors in captions.
3 Weakly-supervised image captioning approach

3.1 Framework

In this section, we describe the weakly-supervised image captioning approach in detail. WICA consists of three processes, outlined as follows (see Fig. 2). First, we conduct the basic captioning feature extraction process: we use encoder-decoder neural networks to extract the basic captioning features from images. Second, we develop the object feature extraction process: to enrich captions with contextual information, we build an object detection model to detect important objects in the images and extract the contextual objects to obtain the contextual information. Third, the rich caption generation process is conducted on the two levels of features: a phrase-based language model is built to generate captions with rich contextual information by combining them. We define two degrees: the reasonable degree to rank captions at the syntactic level and the rich degree to rank captions at the contextual level.
3.2 Basic captioning feature extraction

To extract the basic captioning features, we use the sequence-to-sequence model. In the original sequence-to-sequence model used in machine translation, an RNN encodes the source sentence and represents it as a fixed-length vector. The vector is used as the initial hidden state of a
Fig. 2 The framework of Weakly-supervised Image Captioning Approach
decoder RNN that generates the target sentence. Inspired by the encoder-decoder framework, the sequence-to-sequence model was adapted to replace the encoder RNN with a CNN for generating image captions [28]. The sequence-to-sequence model trains the CNN on an image classification dataset and uses the last hidden layer as input to the RNN decoder. During training, the sequence-to-sequence model maximizes the probability of the caption given the image:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I,S)\in(M,T)} \log P(S \mid I; \theta) \qquad (1)$$

$$\log P(S \mid I) = \sum_{t=0}^{N} \log P(S_t \mid I, S_0, \ldots, S_{t-1}) \qquad (2)$$

where θ denotes the parameters of the sequence-to-sequence model, M is the set of training images, I is an image in M, S is the corresponding transcription of I, T is the set of all transcriptions, and S_0, …, S_N are the words of sentence S. At training time, each image I and its transcription S are fed into the model, and the model optimizes the sum of the log probabilities in Eq. (2) over the training set with stochastic gradient descent.

It is effective to compute P(S_t | I, S_0, …, S_{t−1}) with a Long Short-Term Memory (LSTM) network, in which the varying number of words seen so far in the sentence is expressed by a fixed-length hidden state p_t. The state p_t is updated after each input x_t:

$$p_{t+1} = f(p_t, x_t) \qquad (3)$$

where f is the LSTM function. LSTM can prevent vanishing and exploding gradients, which is the crucial challenge in training RNNs. Following the definition of P(S_t | I, S_0, …, S_{t−1}), the LSTM model is trained to predict the words of the caption. At training time, the inputs of the LSTM model are the high-level features of the image and all the words preceding each word. All LSTM cells are trained simultaneously and share the same parameters; at time t, the LSTM cell takes the output p_{t−1} of the cell at time t − 1 as input. If I is an image and S = (S_0, …, S_N) is a caption describing it, the training process is:

$$x_{-1} = \mathrm{CNN}(I) \qquad (4)$$

$$x_t = W_e S_t, \quad t \in \{0, \ldots, N-1\} \qquad (5)$$

$$p_{t+1} = \mathrm{LSTM}(p_t, x_t), \quad t \in \{0, \ldots, N-1\} \qquad (6)$$

The image I is input at t = −1, and each output p_t is the input at time t + 1. W_e is the word embedding. The loss function is:

$$L(I, S) = -\sum_{t=1}^{N} \log p_t \qquad (7)$$
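To make the training step concrete, the following is a minimal sketch of Eqs. (1)-(7) in Keras on TensorFlow, the platform used for our implementation; the dimensions, layer choices and plain SGD optimizer are illustrative assumptions, not details taken from the paper.

```python
import tensorflow as tf

# Minimal sketch of the CNN-encoder / LSTM-decoder training setup (Eqs. 1-7).
# Assumes a pretrained CNN whose last hidden layer gives `feat_dim` features;
# all dimensions below are illustrative.
vocab_size, embed_dim, hidden_dim, feat_dim = 10000, 512, 512, 2048

image_feats = tf.keras.Input(shape=(feat_dim,))               # CNN(I), Eq. (4)
prev_words = tf.keras.Input(shape=(None,), dtype="int32")     # S_0 ... S_{N-1}

x_img = tf.keras.layers.Dense(embed_dim)(image_feats)         # project image features
x_txt = tf.keras.layers.Embedding(vocab_size, embed_dim)(prev_words)  # x_t = W_e S_t, Eq. (5)
inputs = tf.keras.layers.Concatenate(axis=1)(
    [tf.keras.layers.Reshape((1, embed_dim))(x_img), x_txt])  # image fed at t = -1
states = tf.keras.layers.LSTM(hidden_dim, return_sequences=True)(inputs)  # Eq. (6)
logits = tf.keras.layers.Dense(vocab_size)(states[:, 1:, :])  # predict S_1 ... S_N

model = tf.keras.Model([image_feats, prev_words], logits)
# Sparse cross-entropy summed over time is -sum_t log P(S_t | I, S_0..S_{t-1}), Eq. (7)
model.compile(optimizer="sgd",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```

Training then amounts to calling `model.fit` on (image features, shifted caption) pairs, which realizes the stochastic gradient descent optimization of Eq. (1).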
After training, the sequence-to-sequence model can generate the basic caption. Because the training set is incomplete, the model is unable to extract enough object features from images, and the basic caption describes the image without rich contextual information. The basic caption therefore needs to be integrated with the contextual information in the image. In the next section, we introduce how we use the object detection model to extract the contextual information without manual labeling of images.
3.3 Object feature extraction

In this section, we introduce object feature extraction. To enrich the contextual information in the captions without manual intervention, we need to obtain important object features automatically. In this work, we choose a state-of-the-art method, Faster-RCNN [7, 9, 23], to detect objects in images and extract the contextual information. Faster-RCNN takes an image as input and outputs a set of rectangular object proposals, each with an objectness score. To generate object proposals, Faster-RCNN slides a small network over the convolutional feature map to detect all the object proposals in the image. Each sliding window is mapped to a lower-dimensional vector, which is fed into two sibling fully-connected layers: a box-regression layer and a box-classification layer. A box is a candidate region for an object proposal. The box-regression layer has four outputs encoding the coordinates of a box, and the box-classification layer outputs an objectness score that estimates the probability of object versus not-object for each box. With these two layers, the object proposals are extracted from images.

In WICA, we extend the Faster-RCNN model to take an image as input and generate the phrases of the recognized objects. We propose to directly maximize the probability of the correct classes given the object proposals in the image:

$$\lambda^{*} = \arg\max_{\lambda} \sum_{(O,C)} \log P(C \mid O; \lambda) \qquad (8)$$

where λ denotes the parameters of the object detection model, O are the object proposals, and C are the phrases of the object classes to which the proposals belong. We set 20 object classes, the most common in the training images: plane, bike, bird, boat, bottle, bus, car, chair, cow, table, cat, dog, horse, motorbike, person, plant, sheep, sofa, train, and television. After extracting the object proposals from an input image, we classify them into these 20 object classes. If an object proposal can be classified into one of the 20 classes, the model records the phrase of the corresponding class as contextual information. For example, taking Fig. 3 as input, we recognize the important objects in it. The basic captioning feature generated by the sequence-to-sequence model is "A dog is lying on the street.", which describes neither the bike nor the people. After the object detection process, "a bike" and "a group of people" are both identified and recorded. The object features are extracted and regarded as the contextual information. Next, we build a phrase-based language model to combine the basic captioning features with the contextual information; the language model is also used to fix the syntax errors in the generated captions.
Fig. 3 The process of object feature extraction
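As an illustration of how detector output becomes contextual information, the sketch below filters proposals by the 20 classes and counts them; the `(class_name, score)` input format and the 0.8 score threshold are our own assumptions standing in for the Faster-RCNN output, not values reported in the paper.

```python
# Hedged sketch of Sec. 3.3: keep proposals classified into one of the 20
# object classes and record the class phrases as contextual information.
OBJECT_CLASSES = {
    "plane", "bike", "bird", "boat", "bottle", "bus", "car", "chair", "cow",
    "table", "cat", "dog", "horse", "motorbike", "person", "plant", "sheep",
    "sofa", "train", "television",
}

def contextual_phrases(detections, score_threshold=0.8):
    """Count confident detections per class, e.g. {"person": 2, "boat": 1}."""
    counts = {}
    for class_name, score in detections:
        if class_name in OBJECT_CLASSES and score >= score_threshold:
            counts[class_name] = counts.get(class_name, 0) + 1
    return counts

# Two confident "person" boxes and one "boat" box would later be phrased as
# "two people" and "a boat" for the language model.
print(contextual_phrases([("person", 0.97), ("person", 0.91),
                          ("boat", 0.88), ("chair", 0.42)]))
```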
3.4 Rich caption generation

After the two processes above, we have the basic captioning features and the contextual information. We build a phrase-based language model to generate the new captions. First of all, we define phrase attributes and grammatical rules to make the generation process reasonable. We classify phrases into the following attributes: "NP" (noun phrase), "VP" (verb phrase) and "PP" (preposition phrase); in addition, "." denotes the end of a sentence. During generation, we break the basic captioning features and the contextual information into phrases and classify them into these attributes. We also impose grammatical rules that must be followed during caption generation: "NP" may be followed by "VP", "PP" or "."; "VP" may be followed by "NP", "PP" or "."; "PP" may be followed by "NP". In the phrase-based language model, the probability of generating a sentence S is given by:

$$P(S) = P(T_1, T_2, \ldots, T_l) = \prod_{i=1}^{l} P(T_i \mid T_1, \ldots, T_{i-1}) \qquad (9)$$
In Eq. (9), l is the length of sentence S, and T_1, …, T_l are the "NP", "VP" and "PP" phrases of the sentence. P(T_i | T_1, …, T_{i−1}) is the probability of phrase T_i appearing at position i given T_1, …, T_{i−1}, so the caption probability P(S) is given by the product of all the phrase probabilities. In practice, whether a phrase should appear at a certain position is largely decided by the several phrases before it: in general, the greater the distance between two phrases, the lower the relevance between them. Following the (k−1)-order Markov property, we therefore approximate Eq. (9) with a k-phrase form:

$$P(S) = P(T_1, T_2, \ldots, T_l) \approx \prod_{i=1}^{l} P(T_i \mid T_{i-k+1}, T_{i-k+2}, \ldots, T_{i-1}) \qquad (10)$$

In Eq. (10), the phrase at position i is decided by the k−1 phrases before it, and P(T_i | T_{i−k+1}, T_{i−k+2}, …, T_{i−1}) is the prior probability of the k phrases. The best caption candidate is the sentence S that maximizes the likelihood of Eq. (10) over all possible sentence lengths. To calculate the prior probabilities P(T_i | T_{i−k+1}, T_{i−k+2}, …, T_{i−1}), we choose the labeled captions of the training set as the corpus. All captions in the corpus are broken into phrases and classified into "NP", "VP" and "PP", and we count the occurrence frequencies of the k-phrase tuples (T_{i−k+1}, T_{i−k+2}, …, T_i) that follow the grammatical rules.

After obtaining the basic captioning features and the contextual information of an image, the phrase-based language model uses them to generate captions with rich contextual information by calculating the caption probability P(S). We classify the basic captioning features and the contextual information into "NP", "VP" and "PP", then generate captions by depth-first search following Eq. (10) and the grammatical rules. The process is shown in Fig. 4.

Fig. 4 The sentence-encoding process following the grammatical rules in the phrase-based language model

We repeat the generation process until the end of a sentence is reached, and each phrase appears at most once. For example, take k = 3, let the basic captioning feature extracted by the sequence-to-sequence model be "A boat in the water.", and let the contextual information be "boat, person, person". We break them into phrases and drop the repetitive ones, obtaining "a boat" (NP), "in" (PP), "the water" (NP), "two people" (NP) and "." (.). Following the grammatical rules, we then generate the new captions, for example:

P(Two people on a boat in the water.) = P(a boat | two people, on) · P(in | on, a boat) · P(the water | a boat, in) · P(. | in, the water)
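To make the search concrete, here is a minimal sketch of the depth-first generation under Eq. (10) with k = 3 and the grammatical rules; the toy `prior` table stands in for the k-phrase frequencies counted from the corpus, and candidates shorter than k phrases are simply rejected.

```python
from math import prod

# Grammatical rules of Sec. 3.4: which attribute may follow which.
FOLLOWS = {"NP": {"VP", "PP", "."}, "VP": {"NP", "PP", "."}, "PP": {"NP"}}

def candidates(phrases, prior, k=3):
    """Depth-first search over (text, attribute) phrases, each used at most
    once; complete sentences are scored with Eq. (10) via the prior table."""
    results = []

    def dfs(seq, remaining):
        last_attr = seq[-1][1] if seq else None
        for i, (text, attr) in enumerate(remaining):
            if last_attr is not None and attr not in FOLLOWS[last_attr]:
                continue
            nxt = seq + [(text, attr)]
            if attr == ".":                          # end of sentence: score it
                windows = [tuple(p[0] for p in nxt[j:j + k])
                           for j in range(len(nxt) - k + 1)]
                probs = [prior.get(w, 0.0) for w in windows]
                if windows and all(probs):
                    results.append((prod(probs), " ".join(p[0] for p in nxt)))
            else:
                dfs(nxt, remaining[:i] + remaining[i + 1:])

    dfs([], phrases)
    return sorted(results, reverse=True)

# The paper's example: phrases from the basic caption plus detected objects.
phrases = [("two people", "NP"), ("on", "PP"), ("a boat", "NP"),
           ("in", "PP"), ("the water", "NP"), (".", ".")]
prior = {("two people", "on", "a boat"): 0.4, ("on", "a boat", "in"): 0.5,
         ("a boat", "in", "the water"): 0.6, ("in", "the water", "."): 0.7}
print(candidates(phrases, prior))  # -> [(0.084, 'two people on a boat in the water .')]
```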
For each image, the generation process produces a certain number of candidate sentences. To rank the captions, we design the reasonable degree Q(S) and the rich degree R(S). The reasonable degree Q(S) of a sentence S measures the caption at the syntactic level:

$$Q(S) = \sqrt[n]{P(S)} \qquad (11)$$

where n is the number of k-phrase tuples in S; taking the n-th root normalizes over different lengths of S. Although all the phrases are extracted from the corresponding image, some sentences might be grammatically correct yet have low richness at the contextual level. For instance, the two captions "A man is watching TV and a dog is sleeping next to him." and "A man is watching TV." are both reasonable, but the second one misses important objects of the image. To measure captions at the contextual level, we calculate the rich degree R(S) of each sentence S:

$$R(S) = \mathrm{sigmoid}\left(\frac{NP(S) + VP(S) + PP(S)}{10}\right) \qquad (12)$$

where NP(S), VP(S) and PP(S) are the numbers of NP, VP and PP phrases in sentence S; the sigmoid function normalizes the sum. We then introduce Score(S) to rank each sentence S:

$$Score(S) = \frac{1}{2}\left(\alpha \cdot Q(S) + R(S)\right) \qquad (13)$$

As Q(S) and R(S) have different orders of magnitude, we use the parameter α to normalize Q(S) so that Q(S), R(S) and Score(S) remain comparable. After obtaining the candidate sentences, we calculate Score(S) for all of them and pick the N best as the final results.
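A direct transcription of Eqs. (11)-(13) as a ranking helper follows; the default α = 5000 anticipates the setting used in the experiments of Section 4.2, and the caller is assumed to supply the caption probability and the phrase counts.

```python
import math

def score(p_sentence, n_kphrases, n_np, n_vp, n_pp, alpha=5000):
    """Rank one candidate sentence S by Eqs. (11)-(13)."""
    q = p_sentence ** (1.0 / n_kphrases)                        # Q(S), Eq. (11)
    r = 1.0 / (1.0 + math.exp(-(n_np + n_vp + n_pp) / 10.0))    # R(S), Eq. (12)
    return 0.5 * (alpha * q + r)                                # Score(S), Eq. (13)
```

Because P(S) is a product of many small probabilities, Q(S) is typically tiny in practice, which is why the scaling factor α is needed to keep the two terms comparable.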
4 Experiment

In this section, we introduce our experimental setup and results. We conduct comprehensive experiments to validate the reliability and effectiveness of WICA for caption generation. The
sequence-to-sequence model and the object detection model are fully implemented on the TensorFlow platform [1]. The phrase-based language model is implemented in Java 1.8.
4.1 Experimental setup

In this section, we describe the experimental setup, including the dataset, data preprocessing, performance measurements and baseline methods.
Dataset We choose the MSCOCO dataset for our experiments. MSCOCO is the most commonly used dataset in the image captioning field. It contains over 100,000 images, and each image has 5 corresponding reference captions written by humans. Most of the reference captions describe images only at a basic level and do not contain enough annotations of the contextual information in the images, so MSCOCO can be used to validate the performance of our weakly-supervised image captioning approach.

Data preprocessing Before WICA can generate captions with rich contextual information, we have to perform certain preprocessing procedures. First of all, we calculate the prior probabilities P(T_i | T_{i−k+1}, T_{i−k+2}, …, T_{i−1}). We select the reference captions of the training images in the MSCOCO dataset as our corpus. To achieve the best performance, we run experiments with different k (= 2, 3, 4). We break the reference captions into "NP", "VP", "PP" and ".", and eliminate infrequent phrases in order to limit the number of phrases and generate more reasonable captions. We then count the occurrence frequency of each k-phrase tuple (T_{i−k+1}, T_{i−k+2}, …, T_i) as the prior probability. We use 416,000 reference captions in total; the total number of k-phrase tuples (k = 2, 3, 4) is shown in Fig. 5.

Fig. 5 The total number of k phrases

Performance measurement To estimate the quality of the generated captions, we use several metrics. We compute the BLEU score, the metric most commonly used in the image description literature so far [22]; BLEU is a form of n-gram precision between generated and reference sentences. To measure the improvement in reasonability and richness, we calculate the two degrees Q(S) and R(S) defined in Section 3.4. Since the BLEU score and the two degrees cannot capture the relevance between captions and images, we also introduce human evaluation to show the improvement of WICA on relevance.

Baseline methods To estimate the performance of WICA, we randomly choose 1000 images from the MSCOCO test set and generate captions using four methods: 1) sequence-to-
sequence model [28]; 2) sequence-to-sequence model + phrase-based language model; 3) Faster-RCNN model [7, 9, 23] + phrase-based language model; 4) sequence-to-sequence model + Faster-RCNN model + phrase-based language model (WICA). All the experiments are based on these three baseline methods and WICA.
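For reference, the BLEU-1 through BLEU-4 scores used below can be computed, for instance, with NLTK's implementation; the authors do not specify their scoring script, so this is only an illustrative stand-in with made-up tokenized sentences.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["two", "people", "on", "a", "boat", "in", "the", "water"]]
candidate = ["a", "boat", "in", "the", "water"]
smooth = SmoothingFunction().method1  # avoid zero scores on short sentences

for n in range(1, 5):                 # BLEU-1 .. BLEU-4
    weights = tuple(1.0 / n for _ in range(n))
    print(f"BLEU-{n}:", sentence_bleu(references, candidate,
                                      weights=weights,
                                      smoothing_function=smooth))
```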
4.2 Experimental results

Our experiments contain two parts. The first part tunes the parameter k of the phrase-based language model to optimize the performance of WICA; the second part validates the improvement brought by WICA.

To choose the best k for the phrase-based language model, we generate captions with different k (= 2, 3, 4) and compute the BLEU scores. The results, reported from BLEU-1 to BLEU-4, are shown in Table 1. For comparison, we also compute the BLEU scores of the captions of the sequence-to-sequence model under the same conditions. The results in Table 1 show that WICA (k = 2) and WICA (k = 3) achieve better performance than the sequence-to-sequence model, and that the best k is 3. The BLEU scores of WICA (k = 3) are 36.3 (BLEU-1), 20.1 (BLEU-2), 12.6 (BLEU-3) and 8.4 (BLEU-4). When k = 3, the phrase at a given position is decided by the two phrases before it, so the constraint from the other phrases is appropriate. When k = 2, the phrase at a given position is decided by only the one phrase before it; the results have large randomness and might be only loosely related to the corresponding image. When k = 4, the phrase at a given position is decided by the three phrases before it; the constraint is too strong, and many candidate sentences have a low probability. In conclusion, we choose k = 3 as the parameter of the phrase-based language model.

Although the performance of WICA is higher than that of the sequence-to-sequence model, the BLEU metric cannot validate the improvement in reasonability and richness of the captions or the relevance between images and captions. We therefore design experiments to measure the improvements in reasonability, richness and relevance. First of all, we count the total number of NP, VP and PP phrases in the captions of complex images to measure the improvement in contextual information. We define an image as "complex" if WICA can detect contextual information that is absent from the basic captioning feature of the image. The results, shown in Fig. 6, indicate that the captions generated by WICA contain the most contextual information among the four methods. The percentages of total phrase counts in captions generated by WICA are: 17.3% (number ≥ 9), 37% (number ≥ 8), 68.6% (number ≥ 7), 89.8% (number ≥ 6), 98.5% (number ≥ 5) and 100% (number ≥ 4). The reason is that WICA extracts both the basic captioning features and the object features, with the sequence-to-sequence model and the Faster-RCNN model respectively, so the contextual information is richer than in captions generated by the sequence-to-sequence model alone. Seq2seq + phrase-based model is slightly lower than the seq2seq model at the contextual level because some redundant contextual information is dropped
Table 1 The BLEU scores of the seq2seq model and WICA (k = 2, 3, 4)

              BLEU-1   BLEU-2   BLEU-3   BLEU-4
seq2seq         30.9     17.1     10.6      7.1
WICA (k = 2)    35.3     20.3     12.2      8.2
WICA (k = 3)    36.3     20.1     12.6      8.4
WICA (k = 4)    25.5     15.3     10.1      7.6
Fig. 6 The improvement of contextual information in captions
while the phrase-based language model fixes the syntax errors. The Faster-RCNN + phrase-based model is the worst one because its captions do not contain the basic captioning features. In conclusion, WICA can correctly extract more contextual information from images.

To validate the improvement in reasonability and richness, we calculate the averages of the reasonable degree Q(S), the rich degree R(S) and Score(S) (α = 5000) over all the captions. As Q(S) and R(S) have different orders of magnitude, we set α = 5000 to make Q(S) and R(S) comparable. The results are shown in Table 2: WICA obtains the highest Q(S) (0.441), R(S) (0.664) and Score(S) (0.553) with the help of the Faster-RCNN model and the phrase-based language model. The improvement in richness is not very large because many images contain only a simple scene, which limits the gain in contextual information and in R(S). We also calculate the averages of Q(S), R(S) and Score(S) (α = 5000) over the captions of complex images. The results, shown in Table 3, exhibit a more obvious improvement: WICA again obtains the highest Q(S) (0.425), R(S) (0.671) and Score(S) (0.548) on complex images. The phrase-based language model can improve the
Table 2 The average of Q(S), R(S) and Score(S)

                                               α·Q(S)   R(S)    Score(S)
seq2seq                                         0.178   0.662    0.420
seq2seq + phrase-based                          0.434   0.651    0.543
seq2seq + Faster-RCNN + phrase-based (WICA)     0.441   0.664    0.553
Table 3 The average of Q(S), R(S) and Score(S) in complex images

                                               α·Q(S)   R(S)    Score(S)
seq2seq                                         0.159   0.659    0.409
seq2seq + phrase-based                          0.414   0.651    0.533
seq2seq + Faster-RCNN + phrase-based (WICA)     0.425   0.671    0.548
reasonability, as comparing the Q(S) of the seq2seq model with that of the seq2seq + phrase-based model shows; the reason is that the phrase-based language model fixes the syntax errors. The richness R(S) of the captions becomes slightly lower because redundant contextual information is dropped while the syntax errors are being fixed. The Faster-RCNN model can improve Q(S) and R(S), as comparing the seq2seq + phrase-based model with WICA shows, because correct and rich contextual information enhances the relevance of phrases and makes sentences reasonable. In conclusion, captions generated by WICA have higher reasonability and richer contextual information; an appropriate amount of object features enriches captions with contextual information without making them redundant.

To measure the improvement in relevance between captions and images, we also design a human evaluation experiment. We compare the captions generated by WICA with the captions generated by the seq2seq model and the seq2seq + phrase-based model for the chosen images. We do not take captions generated by the Faster-RCNN + phrase-based model into account because, without the basic captioning feature, these captions cannot describe images correctly. Human evaluators are asked to give each caption a score on a scale from 1 to 5 for relevance and usefulness given the chosen image. Figure 7 shows the results of the human evaluation. We can see that WICA (seq2seq + Faster-RCNN + phrase-based model) achieves the
Fig. 7 The results of human evaluation
highest relevance compared with the other two methods. The percentages of scores for WICA are: 21.7% (score = 5), 58.8% (score ≥ 4), 89.6% (score ≥ 3), 99.6% (score ≥ 2) and 100% (score ≥ 1). The reason is that the Faster-RCNN model extracts object features from images and enriches the contextual information; captions with rich contextual information describe the images in more detail, improving the relevance between images and captions. In addition, the phrase-based language model fixes the syntax errors, which also makes the captions read more naturally to human evaluators and improves the scores. Both the seq2seq model and the seq2seq + phrase-based model lack sufficient contextual information, while the phrase-based language model reduces the syntax errors; therefore, the seq2seq + phrase-based model performs better than the seq2seq model. In conclusion, the captions generated by WICA have richer contextual information and higher relevance to images, and the syntax errors in captions are also reduced by WICA. Examples of rated images can be seen in Fig. 8(a)-(f); all of them have richer contextual information and no syntax errors. The results also show that the more complex the image scene is, the better WICA performs; if an image contains only a single scene, the improvement in contextual information is limited. In addition, although WICA focuses on enriching the contextual information of captions, it can also correct unreasonable descriptions in some conditions: since rich captions are generated from the prior probabilities extracted from the language corpus, an unreasonable description yields a very low caption probability P(S), so WICA can drop that caption and build a more reasonable one.
Fig. 8 Examples of rich captions generated from WICA
5 Conclusion and future work

In this paper, we have presented a weakly-supervised image captioning approach that generates captions with rich contextual information even though the training set is incomplete with respect to the contextual features of images. The sequence-to-sequence model is utilized to extract the basic captioning features, and an object detection method is developed to extract the object features. A phrase-based language model is built to generate captions with rich contextual information and fix syntax errors using the basic captioning features and object features. We also define the reasonable degree and the rich degree to rank captions at the syntactic and contextual levels. Extensive experiments on the MSCOCO dataset show significant improvements of the proposed method in terms of reasonability, richness and relevance compared with several baseline methods. We believe the proposed method will play an important role in automatic image captioning. In the future, we will try to extend the sequence-to-sequence framework to improve caption generation without object detection. In addition, we will attempt to leverage unsupervised data to generate high-quality captions.

Acknowledgements This research is supported by National Natural Science Foundation of China (Grant No. 61375054), Natural Science Foundation of Guangdong Province (Grant No. 2014A030313745), Basic Scientific Research Program of Shenzhen City (Grant No. JCYJ20160331184440545), and Cross Fund of Graduate School at Shenzhen, Tsinghua University (Grant No. JC20140001).
References
1. Abadi M, Agarwal A, Barham P et al (2016) TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467
2. Aker A, Gaizauskas R (2010) Generating image descriptions using dependency relational patterns. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp 1250–1258
3. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473
4. Cho K, Van Merriënboer B, Gulcehre C et al (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP
5. Elliott D, Keller F (2013) Image description using visual dependency representations. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp 1292–1302
6. Farhadi A, Hejrati M, Sadeghi M et al (2010) Every picture tells a story: Generating sentences from images. Computer Vision–ECCV, pp 15–29
7. Girshick R (2015) Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision, pp 1440–1448
8. Gong Y, Wang L, Hodosh M et al (2014) Improving image-sentence embeddings using large weakly annotated photo collections. ECCV (4), pp 529–545
9. He K, Zhang X, Ren S et al (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37(9):1904–1916
10. Hodosh M, Young P, Hockenmaier J (2013) Framing image description as a ranking task: Data, models and evaluation metrics. J Artif Intell Res 47:853–899
11. Karpathy A, Joulin A, Li FFF (2014) Deep fragment embeddings for bidirectional image sentence mapping. Advances in Neural Information Processing Systems, pp 1889–1897
12. Kiros R, Salakhutdinov R, Zemel R (2014) Multimodal neural language models. Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp 595–603
13. Kiros R, Salakhutdinov R, Zemel RS (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539
14. Kulkarni G, Premraj V, Ordonez V et al (2013) Babytalk: Understanding and generating simple image descriptions. IEEE Trans Pattern Anal Mach Intell 35(12):2891–2903
15. Kuznetsova P, Ordonez V, Berg AC et al (2012) Collective generation of natural image descriptions. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, Volume 1, pp 359–368
16. Kuznetsova P, Ordonez V, Berg TL et al (2014) TREETALK: Composition and compression of trees for image descriptions. TACL 2(10):351–362
17. Li S, Kulkarni G, Berg TL et al (2011) Composing simple image descriptions using web-scale n-grams. Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pp 220–228
18. Lin TY, Maire M, Belongie S et al (2014) Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312
19. Mao J, Xu W, Yang Y et al (2014) Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632
20. Mitchell M, Han X, Dodge J et al (2012) Midge: Generating image descriptions from computer vision detections. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp 747–756
21. Ordonez V, Kulkarni G, Berg TL (2011) Im2text: Describing images using 1 million captioned photographs. Advances in Neural Information Processing Systems, pp 1143–1151
22. Papineni K, Roukos S, Ward T et al (2002) BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
23. Ren S, He K, Girshick R et al (2017) Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149
24. Russakovsky O, Deng J, Su H et al (2014) ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575
25. Shi J, Wu J, Paul A et al (2014) Change detection in synthetic aperture radar images based on fuzzy active contour models and genetic algorithms. Mathematical Problems in Engineering, 2014
26. Socher R, Karpathy A, Le QV et al (2014) Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics 2:207–218
27. Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, pp 3104–3112
28. Vinyals O, Toshev A, Bengio S et al (2015) Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3156–3164
29. Wu J, Paul A, Xing Y et al (2010) Morphological dilation image coding with context weights prediction. Signal Process Image Commun 25(10):717–728
30. Xu K, Ba J, Kiros R et al (2015) Show, attend and tell: Neural image caption generation with visual attention. International Conference on Machine Learning, pp 2048–2057
Hai-Tao Zheng received his bachelor's degree from the Department of Computer Science, Sun Yat-Sen University, in 2001, his master's degree from the same department in 2004, and his Ph.D. degree in Medical Informatics from Seoul National University in 2009. His research interests include artificial intelligence, the semantic web, information retrieval, machine learning, and medical informatics. He has published more than 30 papers, including 10 SCI journal papers.
Zhe Wang is a graduate student at the Graduate School at Shenzhen, Tsinghua University. He received his B.S. degree from Shanghai Jiao Tong University in 2015. His research interests focus on artificial intelligence and machine learning.
Ningning Ma is a graduate student at the Graduate School at Shenzhen, Tsinghua University. He received his B.S. degree from Nankai University in 2015. His research interests focus on artificial intelligence and machine learning.
Jinyuan Chen is a doctoral student supervised by Dr. Hai-Tao Zheng. He received his master's degree in computer engineering from the Graduate School at Shenzhen, Tsinghua University, in 2013. His research interests include artificial intelligence, information retrieval and machine learning.
Xi Xiao is an associate professor at the Graduate School at Shenzhen, Tsinghua University. He received his Ph.D. degree in 2011 from the State Key Laboratory of Information Security, Graduate University of Chinese Academy of Sciences. His research interests focus on information security and computer networks.
Arun Kumar Sangaiah received his Doctor of Philosophy (Ph.D.) degree in Computer Science and Engineering from VIT University, Vellore, India. He is presently working as an Associate Professor in the School of Computer Science and Engineering, VIT University, India. His areas of interest include software engineering, computational intelligence, wireless networks, bio-informatics, and embedded systems.