
Detailed Record

Author (Chinese): 劉璨
Author (English): Liu, Can
Thesis Title (Chinese): 基於語言影像預訓練模型引導的多語句文本圖像生成
Thesis Title (English): Multi-sentence Text to Image Generation Guided by a Language-Image Pretraining Model
Advisor (Chinese): 蘇豐文
Advisor (English): Soo, Von-Wun
Committee Members (Chinese): 陳鴻文、郭柏志、陳素燕
Committee Members (English): Chen, Hown-Wen; Kuo, Po-Chih; Chen, Su-Yen
Degree: Master's
University: National Tsing Hua University
Department: Institute of Information Systems and Applications
Student ID: 108065468
Year of Publication (ROC calendar): 111 (2022)
Academic Year of Graduation: 110
Language: English
Number of Pages: 58
Keywords (Chinese): 深度學習、文本生成圖像
Keywords (English): deep learning, text to image
Abstract (Chinese): 文字生成圖像在深度學習中,是一個具有挑戰性,且潛在應用非常廣泛的領域。
有別於以前的研究多數侷限於利用單一句子生成圖像,我們提出了利用多個句子甚至段落產生一個圖像的方法。在限制單一場景情況下,多個句子能夠從多個面向更完整的描述一個圖像。在本論文研究中,我們採用了向量量化-對抗生成網路(VQGAN)與OpenAI提出的語言影像預訓練模型(CLIP)的架構,通過多語句處理和文本摘要,實現了多語句的高清圖像生成。這項技術可運用在插畫生成、藝術輔助創作等領域,另外我們也討論了語言影像預訓練模型引導生成圖像的缺點和未來展望。
我們的研究方法在主客觀評估下較過去的模型富有彈性,相對於單語句模型能夠產生更有完整度與語義匹配度的影像,且針對現實場景或者抽象場景組合都可以產生多樣的結果。
Abstract (English): Text-to-image generation is a challenging field of deep learning with a wide range of potential applications. Unlike previous work, which is largely limited to generating an image from a single sentence, we propose a method that generates an image from multiple sentences or even a paragraph. When restricted to a single scene, multiple sentences can describe an image more completely and from more aspects. In this thesis, we adopt VQGAN together with CLIP, the language-image pretraining model proposed by OpenAI, and apply multi-sentence processing and text summarization to achieve high-resolution image generation from multiple sentences. The technique can be applied to illustration generation and computer-assisted art creation. We also discuss the inherent shortcomings of CLIP-guided image generation and prospects for future work.
Under both subjective and objective evaluation, our method is more flexible than previous models. Compared with single-sentence models, it produces images with greater completeness and better semantic matching, and it yields diverse results for both realistic scenes and combinations of abstract scenes.
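The abstract only names the components of the pipeline (CLIP text and image encoders, a VQGAN image generator, per-sentence weighting). The following is a minimal sketch, not the thesis's code, of how CLIP-guided optimization against several weighted sentences could look. It assumes OpenAI's clip package (github.com/openai/CLIP); vqgan_decode is a hypothetical placeholder for a VQGAN decoder such as the taming-transformers model, and the sentence weights are illustrative.

    # Hedged sketch: CLIP-guided optimization of a latent image toward
    # multiple weighted sentences. Not the thesis's actual implementation.
    import torch
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _preprocess = clip.load("ViT-B/32", device=device)

    def multi_sentence_loss(image, sentences, weights):
        """Weighted CLIP similarity loss between one image and several sentences.

        image: tensor of shape (1, 3, 224, 224), normalized with CLIP's mean/std.
        """
        tokens = clip.tokenize(sentences).to(device)
        with torch.no_grad():
            text_feat = model.encode_text(tokens)
            text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        sims = (img_feat @ text_feat.T).squeeze(0)   # cosine similarity per sentence
        w = torch.tensor(weights, device=device)
        return -(w * sims).sum() / w.sum()           # maximize weighted similarity

    # Hypothetical usage: optimize a VQGAN latent z so that the decoded image
    # matches all sentences; vqgan_decode is assumed, not a real library call.
    # z = torch.randn(1, 256, 16, 16, device=device, requires_grad=True)
    # opt = torch.optim.Adam([z], lr=0.1)
    # for step in range(300):
    #     img = vqgan_decode(z)                      # -> (1, 3, 224, 224)
    #     loss = multi_sentence_loss(img, ["a red barn in a field",
    #                                      "storm clouds over the barn"], [1.0, 0.7])
    #     opt.zero_grad(); loss.backward(); opt.step()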
Table of Contents:
Abstract (Chinese) I
Abstract II
Acknowledgements (Chinese) III
Contents IV
List of Figures VI
List of Tables VIII
1 Introduction 1
2 Related Work 7
2.1 GAN Based Methods 7
2.1.1 DC-GAN 7
2.1.2 Stack-GAN 9
2.1.3 Attn-GAN 10
2.1.4 DM-GAN 11
2.1.5 MSH-GAN 11
2.2 CLIP Guided Methods 12
2.2.1 DALL·E 13
2.2.2 Big Sleep 14
3 Methodology 17
3.1 VQGAN 17
3.2 CLIP 20
3.3 Dataset 21
3.4 Multi-Sentence Text to Image Generation 23
3.4.1 Text Summarization 24
3.4.2 Multi-sentence Encoding and Weights 25
3.5 Integration 26
3.5.1 Basic Pipeline 27
3.6 Some Produced Results 28
4 Experiments and Results 35
4.1 Experiment Data Preparation 35
4.2 Objective Evaluation Experiments 36
4.3 Subjective Evaluation Experiments 40
4.3.1 The Questionnaire 41
4.3.2 Subjective Evaluation Results 41
5 Conclusion and Discussion 45
References 47
A Supplementary Experimental Content 51
A.1 Questionnaire 51
(The full text of this thesis is not authorized for public release.)