[1] Y. Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751.
[2] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
[3] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, 1693–1701.
[4] Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual storytelling. In the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016).
[5] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 740–755. Springer.
[6] L. Yu, M. Bansal, and T. Berg. 2017. Hierarchically attentive RNN for album summarization and storytelling. In EMNLP.
[7] Q. Huang, Z. Gan, A. Celikyilmaz, D. Wu, J. Wang, and X. He. 2019. Hierarchically structured reinforcement learning for topically coherent visual story generation. In AAAI.
[8] X. Wang, W. Chen, Y.-F. Wang, and W. Y. Wang. 2018. No metrics are perfect: Adversarial reward learning for visual storytelling. In ACL.
[9] Cesc Chunseong Park and Gunhee Kim. 2015. Expressing an image stream with a sequence of natural sentences. In Advances in Neural Information Processing Systems, 73–81.
[10] Yunjae Jung, Dahun Kim, Sanghyun Woo, Kyungsu Kim, Sungjin Kim, and In So Kweon. 2020. Hide-and-Tell: Learning to bridge photo streams for visual storytelling. In AAAI.
[11] Junjie Hu, Yu Cheng, Zhe Gan, Jingjing Liu, Jianfeng Gao, and Graham Neubig. 2020. What makes a good story? Designing composite rewards for visual storytelling. In AAAI.
[12] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.
[13] Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, and William Yang Wang. 2018. Video captioning via hierarchical reinforcement learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[14] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems.
[16] L. Yu, W. Zhang, J. Wang, and Y. Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI.
[17] Chen Chen, Shuai Mu, Wanpeng Xiao, Zexiong Ye, Liesi Wu, and Qi Ju. 2019. Improving image captioning with conditional generative adversarial nets. In AAAI.
[18] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In CVPR.
[19] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3156–3164.
[20] S. Yan, F. Wu, J. S. Smith, W. Lu, and B. Zhang. 2018. Image captioning using adversarial networks and reinforcement learning. In 2018 24th International Conference on Pattern Recognition (ICPR), 248–253. doi: 10.1109/ICPR.2018.8545049.
[21] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.