
Detailed Record

Author (Chinese): 張馨云
Author (English): Chang, Hsin Yun
Thesis Title (Chinese): 基於精簡文字說明之相片摘要技術
Thesis Title (English): Photo Album Summarization Based on Concise Captioning
Advisor (Chinese): 林嘉文
Advisor (English): Lin, Chia Wen
Committee Members: 張正尚, 孫民, 莊永裕
Degree: Master
University: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 103061547
Year of Publication (ROC calendar): 106
Academic Year of Graduation: 105
Language: English
Number of Pages: 34
Keywords (Chinese): 相簿標籤, 相簿文字描述
Keywords (English): Album tags, Album descriptions
In this thesis, we propose a method that automatically generates textual descriptions for a photo album according to its class. Starting from the images of each class in a database, we first use a caption model to produce per-class text data and analyze how important each word is to its class; then, given the class of a user's album, we generate and select the descriptions relevant to that album.
Our method consists of three main parts. In the first step, we extract the image and text information of every photo in the album and use it to build the album information. The image part includes time information, face information, and image features, while the text information is the preliminary caption the model produces for each photo. Using the time information, we compute the time differences between photos, determine the event split points from the distribution of these gaps, and let a trained model vote on the class of each time event in the album. With the image-related information, we apply our lab's previous album-organization approach to obtain a personalized album summary from the viewer's point of view.
In the second step, we obtain keywords for the key photos of the album. Exploiting the fact that photos within the same time event are related to each other, we take the captions of the photos neighboring a key photo in the same event as supplementary descriptions, and then weight the resulting text by the event class to obtain the keywords. The class weights are trained in advance: using the Google search engine and Flickr albums, we retrieve images and albums with the keywords of each class to build a database, and from the text analysis of this database we derive our modified tf-idf weights.
Since our goal is to generate descriptions that better match the overall content of the album, in the third step we raise the generation probability of each keyword during description generation, use the keywords to filter out the final album description, and can further use this description to compute album tags.
In the experiments, we use subjective evaluation: users view the whole album and judge whether the textual description describes it well. The experiments show that, compared with other methods, our album descriptions better match viewers' impressions of the album, and the generated keywords indeed help in selecting the textual descriptions.
In this thesis, we present a system that automatically generates textual descriptions for photo albums. We first build a per-class text dataset by running a caption generation model on class-specific image collections, which lets us measure how important each word is to each class. Given these word weights and an input album, the proposed method generates candidate descriptions and selects those that contain the album's keywords.
Our method consists of three parts. The first part extracts image and text information from the album. The image information includes timestamps, face data, and image features, while the text information is the initial caption produced by the generation model for each photo. Using the timestamps, we split the album into time events according to the distribution of time gaps between photos, and a pre-trained model votes on the class of each event. Based on the image information, we obtain a viewer-dependent, personalized album summary using our previous work on album organization.
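As a rough illustration of the time-gap segmentation above, the minimal Python sketch below groups photos into events whenever the gap between consecutive timestamps exceeds a fixed threshold. The two-hour threshold is a hypothetical value chosen for the example, whereas the thesis derives the split points from the distribution of time gaps, and the event-class voting and viewer-dependent summarization are not reproduced here.

```python
from datetime import datetime, timedelta

def split_into_events(timestamps, gap_threshold=timedelta(hours=2)):
    """Group photo timestamps into time events: start a new event whenever the
    gap between two consecutive photos exceeds gap_threshold (assumed fixed here)."""
    timestamps = sorted(timestamps)
    events, current = [], [timestamps[0]]
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > gap_threshold:
            events.append(current)
            current = []
        current.append(curr)
    events.append(current)
    return events

# Example: a long gap after the second photo splits the album into two events.
photos = [datetime(2017, 5, 1, 10, 0), datetime(2017, 5, 1, 10, 20),
          datetime(2017, 5, 1, 18, 0), datetime(2017, 5, 1, 18, 5)]
print([len(event) for event in split_into_events(photos)])  # -> [2, 2]
```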
In the second part, we extract keywords for each key photo from the text information. Since photos within the same time event are related, the captions of neighboring photos in the event supplement the key photo's caption, and keywords are obtained from a sum weighted by the event class. The class weights are modified tf-idf weights pre-computed from a text dataset, which consists of captions generated for images collected from Google Image search and Flickr albums using per-class keyword queries.
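The sketch below shows one plausible form of the class-level tf-idf weighting and keyword selection described above. The `class_corpora` structure, the `pick_keywords` helper, and the plain tf * idf formula are illustrative assumptions; the thesis uses a modified tf-idf computed from captions of images retrieved from Google Image search and Flickr, which is not reproduced here.

```python
import math
from collections import Counter

def class_tfidf(class_corpora):
    """class_corpora: {class_name: list of caption words for that class}.
    Returns {class_name: {word: tf * idf}}, treating each class as one document."""
    num_classes = len(class_corpora)
    doc_freq = Counter()
    for words in class_corpora.values():
        doc_freq.update(set(words))
    weights = {}
    for cls, words in class_corpora.items():
        tf = Counter(words)
        total = sum(tf.values())
        weights[cls] = {w: (count / total) * math.log(num_classes / doc_freq[w])
                        for w, count in tf.items()}
    return weights

def pick_keywords(candidate_words, event_class, weights, top_k=5):
    """Score the caption words gathered within one time event by the weight of
    the event's class and keep the top_k as keywords."""
    scored = [(w, weights[event_class].get(w, 0.0)) for w in set(candidate_words)]
    return [w for w, _ in sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]]
```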
In the last part, because we want descriptions that are more consistent with the album as a whole, we increase the generation probabilities of the keywords during caption generation and use the keywords to select the final album description.
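The following minimal sketch illustrates the idea of boosting keyword probabilities during generation and then selecting the final description by keyword coverage. It assumes a decoder that exposes per-step word log-probabilities (e.g. a NeuralTalk2-style caption model); the `boost` value and the coverage-based selection rule are hypothetical simplifications, not the exact criterion used in the thesis.

```python
def boost_keyword_logprobs(logprobs, vocab, keywords, boost=2.0):
    """Add a constant bonus to the log-probability of every album keyword before
    the decoder picks the next word (applied at each generation step)."""
    keyword_ids = {i for i, w in enumerate(vocab) if w in keywords}
    return [lp + boost if i in keyword_ids else lp for i, lp in enumerate(logprobs)]

def select_description(candidate_captions, keywords):
    """Among the generated candidate captions, keep the one covering the most
    album keywords as the final album description."""
    keyword_set = set(keywords)
    return max(candidate_captions,
               key=lambda c: len(set(c.lower().split()) & keyword_set))

# Example usage with toy candidates and keywords.
captions = ["a group of people at the beach", "a man riding a wave on a surfboard"]
print(select_description(captions, ["beach", "people"]))  # -> first caption
```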
For evaluation, we designed a questionnaire in which users view the whole album and judge whether each description fits it. Compared with other methods, the descriptions produced by our method better match viewers' impressions of the album, and the extracted keywords indeed help to select captions effectively.
Abstract (Chinese) 2
Abstract 3
Content 4
Chapter 1 Introduction 5
1.1 Research Background and Motivation 5
1.2 Research Objective 6
1.3 Thesis Organization 7
Chapter 2 Related Work 8
2.1 Traditional Image Caption Method 8
2.2 Image Caption in Neural Networks 9
2.3 NeuralTalk2 9
Chapter 3 Proposed Method 11
3.1 Overview of Proposed Method 11
3.2 Image Preprocessing 12
3.3 Album Information 13
3.4 Keywords 15
3.5 Caption Model 18
3.6 Generation Results 20
Chapter 4 Experiments and Discussions 22
4.1 Data Collection 22
4.2 Generation Results 23
4.3 Subjective Assessment 25
Chapter 5 Conclusion 30
Reference 31
(The full text of this thesis is not authorized for public access.)