Author (Chinese): 陳增鴻
Author (English): Chen, Tseng-Hung
Title (Chinese): 基於對抗式訓練生成跨域影像描述
Title (English): Generating Cross-domain Visual Description via Adversarial Learning
Advisor (Chinese): 孫民
Advisor (English): Sun, Min
Committee Members (Chinese): 林嘉文, 陳縕儂, 陳冠文
Committee Members (English): Lin, Chia-Wen; Chen, Yun-Nung; Chen, Kuan-Wen
Degree: Master's
University: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 104061544
Publication Year (ROC calendar): 106
Graduation Academic Year: 105
Language: English
Number of Pages: 43
Keywords (Chinese): 深度學習, 圖像字幕生成, 遷移學習, 電腦視覺, 對抗式訓練, 增強學習
Keywords (English): Deep Learning, Image Captioning, Transfer Learning, Computer Vision, Adversarial Training, Reinforcement Learning
Record statistics:
  • Recommendations: 0
  • Views: 557
  • Rating: *****
  • Downloads: 39
  • Bookmarks: 0
A good image captioner must be learned from a large amount of training data, i.e., a dataset of paired images and sentences (e.g., MSCOCO). However, transferring to a different target domain with no paired training data (referred to as cross-domain image captioning) remains largely unexplored. We propose an adversarial training procedure that leverages unpaired data in the target domain. Two critic networks guide the captioner: the domain critic and the multi-modal critic. The domain critic assesses whether a generated sentence is indistinguishable from sentences in the target domain; the multi-modal critic assesses whether an image and its generated sentence form a valid pair. During training, the critics and the captioner act as adversaries: the captioner aims to generate indistinguishable sentences, while the critics try to distinguish them. The captioner receives rewards from the critics and improves through policy gradient updates. At inference, we further propose a critic-based planning method that selects high-quality sentences without additional supervision. For evaluation, we use MSCOCO as the source domain and four other datasets (CUB-200-2011, Oxford-102, TGIF, and Flickr30k) as target domains. Our method performs consistently well on all datasets. Applying critic-based planning at inference further improves the overall performance on CUB-200 and Oxford-102. In addition, we extend our method to video captioning and observe improvements on adaptation between large-scale video captioning datasets such as MSR-VTT, M-VAD, and MPII-MD.
Impressive image captioning results are achieved in domains with plenty of training image and sentence pairs (e.g., MSCOCO). However, transferring to a target domain with significant domain shifts but no paired training data (referred to as cross-domain image captioning) remains largely unexplored. We propose a novel adversarial training procedure to leverage unpaired data in the target domain. Two critic networks are introduced to guide the captioner, namely the domain critic and the multi-modal critic. The domain critic assesses whether the generated sentences are indistinguishable from sentences in the target domain. The multi-modal critic assesses whether an image and its generated sentence are a valid pair. During training, the critics and the captioner act as adversaries: the captioner aims to generate indistinguishable sentences, whereas the critics aim to distinguish them. The assessment improves the captioner through policy gradient updates. During inference, we further propose a novel critic-based planning method to select high-quality sentences without additional supervision (e.g., tags). To evaluate, we use MSCOCO as the source domain and four other datasets (CUB-200-2011, Oxford-102, TGIF, and Flickr30k) as the target domains. Our method consistently performs well on all datasets. Utilizing the learned critic during inference further boosts the overall performance in CUB-200 and Oxford-102. Furthermore, we extend our method to the task of video captioning. We observe improvements for the adaptation between large-scale video captioning datasets such as MSR-VTT, M-VAD, and MPII-MD.
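To make the procedure described in the abstract concrete, below is a minimal PyTorch-style sketch of one adversarial round: the two critics are trained to separate real target-domain sentences (and valid source-domain image-sentence pairs) from generated captions, and the captioner is then updated with a REINFORCE-style policy gradient that uses the critics' scores as its reward. All module sizes, the LSTM critic architectures, the 0.5/0.5 reward mixing, the toy random data, and every name (Captioner, Critic, etc.) are illustrative assumptions for this sketch; the thesis's actual network designs, reward shaping, and training schedule may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMB, HID, IMG_FEAT, MAX_LEN = 1000, 128, 256, 2048, 16

class Captioner(nn.Module):
    """LSTM policy network: generates a caption conditioned on an image feature."""
    def __init__(self):
        super().__init__()
        self.img_proj = nn.Linear(IMG_FEAT, HID)
        self.embed = nn.Embedding(VOCAB, EMB)
        self.cell = nn.LSTMCell(EMB, HID)
        self.out = nn.Linear(HID, VOCAB)

    def sample(self, img):
        """Sample a caption; return token ids and per-step log-probabilities."""
        h = torch.tanh(self.img_proj(img))
        c = torch.zeros_like(h)
        tok = torch.zeros(img.size(0), dtype=torch.long)  # assume id 0 is <bos>
        ids, logps = [], []
        for _ in range(MAX_LEN):
            h, c = self.cell(self.embed(tok), (h, c))
            dist = torch.distributions.Categorical(logits=self.out(h))
            tok = dist.sample()
            ids.append(tok)
            logps.append(dist.log_prob(tok))
        return torch.stack(ids, 1), torch.stack(logps, 1)

class Critic(nn.Module):
    """Sentence-only critic (domain) or image-conditioned critic (multi-modal)."""
    def __init__(self, use_image=False):
        super().__init__()
        self.use_image = use_image
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)
        self.score = nn.Linear(HID + (IMG_FEAT if use_image else 0), 1)

    def forward(self, ids, img=None):
        _, (h, _) = self.rnn(self.embed(ids))
        feat = torch.cat([h[-1], img], 1) if self.use_image else h[-1]
        return torch.sigmoid(self.score(feat)).squeeze(1)  # P("real" / "valid pair")

# Toy data standing in for the real datasets: paired source-domain examples
# and unpaired target-domain images and sentences.
B = 4
src_img, src_cap = torch.randn(B, IMG_FEAT), torch.randint(1, VOCAB, (B, MAX_LEN))
tgt_img, tgt_sent = torch.randn(B, IMG_FEAT), torch.randint(1, VOCAB, (B, MAX_LEN))

captioner, dom_critic, mm_critic = Captioner(), Critic(False), Critic(True)
opt_cap = torch.optim.Adam(captioner.parameters(), lr=1e-4)
opt_cri = torch.optim.Adam(
    list(dom_critic.parameters()) + list(mm_critic.parameters()), lr=1e-4)

ones, zeros = torch.ones(B), torch.zeros(B)
gen_ids, gen_logps = captioner.sample(tgt_img)

# 1) Critic step: target-domain sentences and paired source examples are "real";
#    generated captions (and their image pairings) are "fake".
critic_loss = (F.binary_cross_entropy(dom_critic(tgt_sent), ones)
               + F.binary_cross_entropy(dom_critic(gen_ids.detach()), zeros)
               + F.binary_cross_entropy(mm_critic(src_cap, src_img), ones)
               + F.binary_cross_entropy(mm_critic(gen_ids.detach(), tgt_img), zeros))
opt_cri.zero_grad()
critic_loss.backward()
opt_cri.step()

# 2) Captioner step: the critics' scores serve as the sentence-level reward, and
#    REINFORCE pushes the captioner toward captions that fool both critics;
#    the batch mean acts as a simple baseline.
with torch.no_grad():
    reward = 0.5 * dom_critic(gen_ids) + 0.5 * mm_critic(gen_ids, tgt_img)
advantage = (reward - reward.mean()).unsqueeze(1)
pg_loss = -(advantage * gen_logps).sum(1).mean()
opt_cap.zero_grad()
pg_loss.backward()
opt_cap.step()
```

The critic-based planning mentioned for inference can be read against the same sketch: the learned critics are reused as scorers, so candidate captions (for example, from beam search) would be ranked by calls analogous to `dom_critic`/`mm_critic` above and the highest-scoring sentence kept, without any extra supervision. How the thesis combines or schedules those scores at inference is not reproduced here.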
Declaration
Acknowledgements
Abstract (Chinese)
Abstract
1 Introduction
1.1 Motivation and Problem Description
1.2 Main Contribution
1.3 Related Work
1.4 Thesis Structure
2 Preliminaries
2.1 Recurrent Models
2.1.1 Recurrent Neural Network (RNN)
2.1.2 Long Short-Term Memory (LSTM)
2.2 Training Algorithms for Sequence Prediction
2.2.1 Cross-Entropy Training
2.2.2 Exposure Bias
2.2.3 Scheduled Sampling
2.2.4 Professor Forcing
2.2.5 Reinforcement Learning
2.3 Evaluation
2.3.1 Human Judgments
2.3.2 Automatic Evaluation
3 Cross-domain Image Captioning
3.1 Captioner as an Agent
3.2 Critics
3.2.1 Domain Critic
3.2.2 Multi-modal Critic
3.3 Adversarial Training
3.4 Critic-based Planning
4 Experiments
4.1 Introduction
4.2 Implementation Details
4.3 Experimental Results
4.3.1 Baseline
4.3.2 Datasets
4.3.3 Sentence-level Distribution
4.3.4 Critic-based Planning
4.4 Ablation Study
4.5 Design Choices for Critics
4.6 Extension to Video Domain
4.7 In-domain Captioning
5 Conclusion
References
 
 
 
 