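Video titling and question answering are two important tasks for high-level understanding of visual data. To address these two tasks, we propose a large-scale dataset and present several models on it in this work. A good video title compactly describes the most salient event and captures the viewer's attention. In contrast, video captioning tends to produce sentences that describe the video as a whole. Although automatically generating video titles is a very useful task, it has received far less attention than video captioning. We address video title generation for the first time, with two methods that extend state-of-the-art video captioners to this new task. First, we use a highlight detector to make the video captioner sensitive to highlights; our approach allows a single model to be trained jointly for title generation and for temporal localization of video highlights. Second, we introduce high sentence diversity into the video captioner so that the generated titles are also diverse and catchy. This means that a large number of sentences is needed to learn the sentence structure of titles. We therefore propose a novel sentence augmentation method that trains the captioner with sentence-only examples that have no corresponding videos. For the video question answering task, we propose a deep model that answers free-form natural language questions about the content of a video. We automatically collect a large number of freely available videos and their descriptions from the web, so that a large number of candidate question-answer (QA) pairs are generated automatically without manual annotation. We then use these candidate QA pairs to train several video-based QA methods extended from MN, VQA, SA, and SS. To handle imperfect candidate QA pairs, we propose a self-paced learning procedure that iteratively identifies them and mitigates their effect on training. To demonstrate our ideas, we collected the large-scale Video Titles in the Wild (VTW) dataset of 18,100 automatically crawled user-generated videos and titles. We then use an automatic QA generator to generate many QA pairs for training, and collect human-generated QA pairs from Amazon Mechanical Turk. On VTW, our methods consistently improve title prediction accuracy and achieve the best performance in both automatic and human evaluation; our sentence augmentation method also outperforms the baselines on the M-VAD dataset. Finally, the results show that our self-paced learning procedure is effective and that the extended SS model outperforms various baseline models.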
Video titling and question answering are two important tasks toward high-level visual data understanding. To address these two tasks, we propose a large-scale dataset and demonstrate several models on this dataset. A great video title describes the most salient event compactly and captures the viewer's attention. In contrast, video captioning tends to generate sentences that describe the video as a whole. Although generating a video title automatically is a very useful task, it has received much less attention than video captioning. We address video title generation for the first time by proposing two methods that extend state-of-the-art video captioners to this new task. First, we make video captioners highlight-sensitive by priming them with a highlight detector. Our framework allows a model to be trained jointly for title generation and video highlight localization. Second, we induce high sentence diversity in video captioners, so that the generated titles are also diverse and catchy. This means that a large number of sentences may be required to learn the sentence structure of titles. Hence, we propose a novel sentence augmentation method that trains a captioner with additional sentence-only examples that come without corresponding videos.
For the video question answering task, we propose to learn a deep model that answers a free-form natural language question about the contents of a video. We build a program that automatically harvests a large number of videos and descriptions freely available online. A large number of candidate QA pairs are then generated automatically from the descriptions rather than annotated manually. Next, we use these candidate QA pairs to train a number of video-based QA methods extended from MN, VQA, SA, and SS. To handle imperfect candidate QA pairs, we propose a self-paced learning procedure that iteratively identifies them and mitigates their effect on training.
To demonstrate our ideas, we collected a large-scale Video Titles in the Wild (VTW) dataset of 18,100 automatically crawled user-generated videos and titles. We then use an automatic QA generator to generate a large number of QA pairs for training, and collect manually generated QA pairs from Amazon Mechanical Turk. On VTW, our methods consistently improve title prediction accuracy and achieve the best performance in both automatic and human evaluation. Our sentence augmentation method also outperforms the baselines on the M-VAD dataset. Finally, the video question answering results show that our self-paced learning procedure is effective and that the extended SS model outperforms various baselines.
Abstract (in Chinese)
Abstract
1 Introduction
1.1 Motivation
1.2 Problem Description
1.3 Main Contribution
2 Related Work
2.1 Video Captioning
2.2 Video Highlight Detection
2.3 Video Captioning Datasets
2.4 Image-QA
2.5 Question Generation
2.6 Video-QA
3 Dataset Collection
3.1 Video Titling Dataset
3.1.1 Collection of Curated UGVs
3.1.2 Dataset Comparison
3.2 Video Question Answering Dataset
3.2.1 Question Generation (QG)
3.2.2 Questions and Answers Analysis
4 Method
4.1 From Caption to Title
4.2 Video Captioning
4.3 Highlight Sensitive Captioning
4.4 Sentence Augmentation
4.5 Mitigating the Effect of Non-perfect QA Pairs
4.6 Extended Methods
5 Experiment
5.1 Video Titling
5.1.1 Implementation of Highlight Detector
5.1.2 Baseline Methods
5.1.3 Implementation of S2VT and SA
5.1.4 Results
5.2 Video Question Answering
5.2.1 Implementation Details
5.2.2 Training Details
5.2.3 Evaluation Metrics
5.2.4 Results
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
