Author (Chinese): 劉 璨
Author (English): Liu, Can
Thesis title (English): Multi-sentence Text to Image Generation Guided by a Language-Image Pretraining Model
Advisor (English): Soo, Von-Wun
Oral defense committee (English): Chen, Hown-Wen; Kuo, Po-Chih; Chen, Su-Yen
Keywords (English): deep learning; text to image
Text-to-image generation is a challenging field in deep learning with potentially wide-ranging applications. Previous work is largely limited to generating an image from a single sentence; we propose a method that generates an image from multiple sentences or even whole paragraphs. When the description is restricted to a single scene, multiple sentences can describe the image more completely and from several aspects. In this thesis, we adopt VQGAN and OpenAI's CLIP, and combine multi-sentence processing with text summarization to generate high-definition images from multiple sentences. The method can be applied to illustration generation and art-assisted creation. We also discuss the inherent shortcomings and future prospects of CLIP-guided image generation.
Under both subjective and objective evaluation, our method is more flexible than previous models. Compared with the single-sentence model, it produces images that are more complete and match the text semantics more closely, and it can produce varied results for real scenes as well as for combinations of abstract scenes.
Abstract (Chinese)
Abstract
Acknowledgements (Chinese)
Contents
List of Figures
List of Tables
1 Introduction
2 Related Work
2.1 GAN Based Methods
2.1.1 DC-GAN
2.1.2 Stack-GAN
2.1.3 Attn-GAN
2.1.4 DM-GAN
2.1.5 MSH-GAN
2.2 CLIP Guided Methods
2.2.1 DALL·E
2.2.2 Big Sleep
3 Methodology
3.1 VQGAN
3.2 CLIP
3.3 Dataset
3.4 Multi-Sentence Text to Image Generation
3.4.1 Text Summarization
3.4.2 Multi-sentence Encoding and Weights
3.5 Integration
3.5.1 Basic Pipeline
3.6 Some Produce Results
4 Experiments and Results
4.1 Experiment Data Preparation
4.2 Objective Evaluation Experiments
4.3 Subjective Evaluation Experiments
4.3.1 The Questionnaire
4.3.2 Subjective Evaluation Results
5 Conclusion and Discussion
References
A Supplementary Experimental Content
A.1 Questionnaire
