
Detailed Record

Author (Chinese): 曾國豪
Author (English): Zeng, Kuo-Hao
Title (Chinese): 影片標題產生與問答
Title (English): Video Titling and Question-Answering
Advisor (Chinese): 孫民
Advisor (English): Sun, Min
Committee members (Chinese): 陳冠文, 林嘉文, 陳縕儂
Committee members (English): Chen, Kuan-Wen; Lin, Chia-Wen; Chen, Yun-Nung
Degree: Master
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 103061614
Year of publication (ROC era): 106 (2017)
Graduation academic year: 105 (2016–2017)
Language: English
Number of pages: 51
Keywords (Chinese): 電腦視覺, 深度學習, 遞迴式神經網路, 影片標題, 問答
Keywords (English): CV, DL, RNN, Video Title, Question-Answering
影片標題和問答是高階視覺數據理解的兩個重要任務。為了解決這兩個任務,我們提出了一個大規模的數據集,並在這個工作中展示了對於這個數據集的幾個模型。一個好的影片標題緊密地描述了最突出的事件,並捕獲觀眾的注意力。相反的,影片字幕產生傾向於產生描述整個影片的句子。雖然自動產生影片標題是非常有用的任務,但它相對於影片字幕處理的較少。我們首次提出用兩種方法將最優秀的影片標題產生器擴展到這項新任務來解決影片標題生成的問題。首先,我們利用精彩片段偵測器讓影片標題產生器敏感於精彩片段,我們的方法能夠訓練一個模型讓它能夠允許同時處理影片標題產生以及影片精彩片段的時間。第二,我們引入高多樣性的句子在影片標題產生器中,使得所產生的標題也是多樣化和引人入勝的。這意味著我們需要大量的句子來學習標題的句子結構。因此,我們提出一種新穎的句子增加方法來訓練標題產生器,利用的是只有句子而沒有相應的影片例子。另一方面,對於影片問答任務,我們提出一個深的模型來回答對於影片上下文的自由形式自然語言問題,我們自動的從網路上收集大量的免費影片以及其描述,因此,大量的問答配對候選就自動的產生而不需要人工標註。接著,我們使用這些問答配對候選來訓練多個由MN、VQA、SA以及SS延伸的影片為主的問答方法,為了要處理非完美的問答配對候選,我們提出了一個自主學習的學習程序迭代地識別它們並減輕其對培訓的影響,為了展示我們的想法,我們收集了18100部的野外大型影片字幕(VTW)數據集,自動抓取用戶生成的影片和標題。我們接著利用一個自動的問答生成器來生成多個問答配對來訓練並從Amazon Mechanical Turk上收集人為產生的問答配對。在VTW上,我們的方法能持續的提高標題預測精度,並實現了自動化的最佳性能和人類評價,我們的句子增加方法也勝過M-VAD數據集的基準。最後,結果顯示我們的自學習程序是有效的,而擴展SS模型也優於各種基準模型。
Video titling and question answering are two important tasks toward high-level visual data understanding. To address these two tasks, we propose a large-scale dataset and demonstrate several models on it in this work. A great video title describes the most salient event compactly and captures the viewer's attention. In contrast, video captioning tends to generate sentences that describe the video as a whole. Although generating a video title automatically is a very useful task, it has received far less attention than video captioning. We address video title generation for the first time by proposing two methods that extend state-of-the-art video captioners to this new task. First, we make video captioners highlight-sensitive by priming them with a highlight detector; our framework allows a single model to be trained jointly for title generation and video highlight localization. Second, we induce high sentence diversity in video captioners, so that the generated titles are also diverse and catchy. This means that a large number of sentences may be required to learn the sentence structure of titles. Hence, we propose a novel sentence augmentation method to train a captioner with additional sentence-only examples that come without corresponding videos. For the video question-answering task, we propose to learn a deep model that answers free-form natural-language questions about the content of a video. We build a program that automatically harvests a large number of videos and descriptions freely available online. A large number of candidate QA pairs are then generated automatically from the descriptions rather than annotated manually. Next, we use these candidate QA pairs to train a number of video-based QA methods extended from MN, VQA, SA, and SS. To handle the non-perfect candidate QA pairs, we propose a self-paced learning procedure that iteratively identifies them and mitigates their effect on training. To demonstrate our ideas, we collected the large-scale Video Titles in the Wild (VTW) dataset of 18,100 automatically crawled user-generated videos and titles. We then use an automatic QA generator to produce a large number of QA pairs for training, and collect manually generated QA pairs from Amazon Mechanical Turk. On VTW, our methods consistently improve title prediction accuracy and achieve the best performance in both automatic and human evaluation. Our sentence augmentation method also outperforms the baselines on the M-VAD dataset. Finally, the video question-answering results show that our self-paced learning procedure is effective and that the extended SS model outperforms various baselines.
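The abstract only summarizes the sentence augmentation idea (training the captioner with sentence-only examples that have no paired video); the mechanism itself is not spelled out in this record. The sketch below is therefore speculative: it shows one common way such augmentation can be wired up, where sentence-only batches are conditioned on a learned placeholder instead of real video features, so only the language-modelling side of the captioner receives gradient signal. All names here (TitleDecoder, placeholder, the dimensions) are illustrative and are not taken from the thesis.

```python
# Hypothetical sentence-augmentation sketch (not the thesis's exact model):
# batches that mix (video, title) pairs with sentence-only examples whose
# visual context is a learned placeholder embedding.
import torch
import torch.nn as nn

VOCAB, DIM, FEAT = 5000, 256, 2048

class TitleDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.proj_video = nn.Linear(FEAT, DIM)             # projects pooled video features
        self.placeholder = nn.Parameter(torch.zeros(DIM))  # stands in for a missing video
        self.rnn = nn.LSTM(DIM, DIM, batch_first=True)
        self.out = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, video_feat=None):
        # tokens: (B, T) word ids; video_feat: (B, FEAT), or None for sentence-only data
        B = tokens.size(0)
        if video_feat is None:
            ctx = self.placeholder.expand(B, -1)
        else:
            ctx = self.proj_video(video_feat)
        x = self.embed(tokens) + ctx.unsqueeze(1)           # condition every step on the context
        h, _ = self.rnn(x)
        return self.out(h)                                  # (B, T, VOCAB) next-word logits

model = TitleDecoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One paired batch and one sentence-only batch (random stand-ins for real data).
tokens = torch.randint(0, VOCAB, (4, 12))
feats = torch.randn(4, FEAT)
for video in (feats, None):                                 # None == augmentation batch
    logits = model(tokens[:, :-1], video)
    loss = loss_fn(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```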
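The self-paced procedure for non-perfect QA pairs is likewise only described at a high level here. The following is a minimal sketch under the usual self-paced formulation: train on examples whose current loss falls below a threshold that is gradually relaxed, so that high-loss (likely noisy) candidate pairs are excluded from early training. A toy logistic-regression objective with synthetic, partly corrupted labels stands in for the actual video-QA models, so nothing below reflects the thesis's exact update rules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for automatically generated QA pairs: 20% carry corrupted labels.
X = rng.normal(size=(1000, 8))
w_true = rng.normal(size=8)
y = (X @ w_true > 0).astype(float)
noisy = rng.random(1000) < 0.2
y[noisy] = 1.0 - y[noisy]

def per_example_loss(w):
    """Logistic loss of every candidate pair under the current model."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -(y * np.log(p + 1e-9) + (1.0 - y) * np.log(1.0 - p + 1e-9)), p

w = np.zeros(8)
lam = np.inf                          # warm-up: admit every candidate pair
for epoch in range(40):
    loss, p = per_example_loss(w)
    keep = loss < lam                 # low-loss pairs are treated as reliable
    grad = X[keep].T @ (p[keep] - y[keep]) / keep.sum()
    w -= 0.5 * grad                   # gradient step on the selected subset only
    if epoch == 4:                    # after warm-up, start excluding hard (likely noisy) pairs
        lam = np.quantile(loss, 0.7)
    elif epoch > 4:
        lam *= 1.05                   # gradually re-admit harder examples
print("noisy pairs still admitted:", int((keep & noisy).sum()), "of", int(noisy.sum()))
```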
摘要 (Chinese abstract) ii
Abstract iv
1 Introduction 1
1.1 Motivation 1
1.2 Problem Description 3
1.3 Main Contribution 4
2 Related Work 7
2.1 Video Captioning 7
2.2 Video Highlight Detection 8
2.3 Video Captioning Datasets 9
2.4 Image-QA 10
2.5 Question Generation 10
2.6 Video-QA 11
3 Dataset Collection 12
3.1 Video Titling Dataset 13
3.1.1 Collection of Curated UGVs 13
3.1.2 Dataset Comparison 14
3.2 Video Question Answering Dataset 16
3.2.1 Question Generation (QG) 17
3.2.2 Questions and Answers Analysis 18
4 Method 21
4.1 From Caption to Title 21
4.2 Video Captioning 21
4.3 Highlight-Sensitive Captioning 23
4.4 Sentence Augmentation 24
4.5 Mitigating the Effect of Non-perfect QA Pairs 26
4.6 Extended Methods 27
5 Experiment 29
5.1 Video Titling 29
5.1.1 Implementation of Highlight Detector 32
5.1.2 Baseline Methods 33
5.1.3 Implementation of S2VT and SA 34
5.1.4 Results 35
5.2 Video Question Answering 39
5.2.1 Implementation Details 39
5.2.2 Training Details 41
5.2.3 Evaluation Metrics 41
5.2.4 Results 42
6 Conclusion and Future Work 45
6.1 Conclusion 45
6.2 Future Work 45
References 47
 
 
 
 