Author (Chinese): 羅文生
Author (English): Laleau, W. Olivier
Title (Chinese): SR2-REC:基於句子重解讀和樣式正則化的適應性指述理解技術
Title (English): SR2-REC: Sentence Reinterpretation and Style Regularization for Adaptable Referring Expression Comprehension
Advisors (Chinese): 林嘉文, 黃敬群
Advisors (English): Lin, Chia-Wen; Huang, Ching-Chun
Committee Members (Chinese): 李祈均, 林彥宇
Committee Members (English): Lee, Chi-Chun; Lin, Yen-Yu
Degree: Master's
University: National Tsing Hua University
Department: Institute of Communications Engineering
Student ID: 107064426
Year of Publication (ROC calendar): 110 (2021)
Graduation Academic Year: 109
Language: English
Number of Pages: 41
Keywords (Chinese): 參照表達理解, Transformers, Beamsearch, 條件語言生成, 語言風格適應
Keywords (English): Referring Expression Comprehension, Transformers, Beamsearch, conditional language generation, language style adaptation
Abstract: Referring Expression Comprehension (REC) is a visual-linguistic task that aims to identify an object in an image given a referring expression. Current state-of-the-art REC models treat the referring expression as a clean source; as a result, they fail to consider cases where the expression gives a poor description of the target object. Moreover, expressions can convey similar ideas in different communication styles, so REC models should have a way to adapt to different communication styles in order to attain the correct detection. In this paper, we propose the SR2-REC transformer, which takes a referring expression as input and outputs multiple interpretations (sentence reinterpretation) based on a target style (sentence style regularization); these interpretations can be fed to any REC model for target identification. For sentence style regularization, we use a scene graph parser to identify a unified target style, and we use the beam search decoding algorithm to generate multiple sentences. We have integrated our SR2-REC network with state-of-the-art REC models, including ViLBERT, VL-BERT, and MCN. The target identification accuracy, tested on RefCOCO, RefCOCO+, and RefCOCOg, shows the proposed sentence processing method's effectiveness even in domain transfer tasks.
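
To make the pipeline described in the abstract concrete, the sketch below shows one plausible way the inference stage could be wired together in Python. The interfaces sr2rec_model.parse_scene_graph, sr2rec_model.generate, and rec_model.ground are hypothetical placeholders rather than the authors' released API, and the confidence-based selection loop is only one possible realization of the multiple-output strategy listed in Section 3.6.3.

from typing import List

def sr2rec_infer(expression: str, image, sr2rec_model, rec_model, beam_size: int = 5):
    """Reinterpret a referring expression in a unified target style, then let a
    REC model ground every candidate and keep the most confident detection."""
    # Sentence style regularization: a scene graph parser extracts the structure
    # that defines the unified target style (hypothetical interface).
    target_style = sr2rec_model.parse_scene_graph(expression)

    # Sentence reinterpretation: beam search decoding produces several candidate
    # rewrites of the expression in the target style (hypothetical interface).
    candidates: List[str] = sr2rec_model.generate(
        expression,
        style=target_style,
        num_beams=beam_size,
        num_return_sequences=beam_size,
    )

    # Feed the original expression and every candidate to the downstream REC
    # model (e.g. a ViLBERT, VL-BERT, or MCN wrapper) and keep the box with the
    # highest grounding confidence.
    best_box, best_score = None, float("-inf")
    for sentence in [expression] + candidates:
        box, score = rec_model.ground(sentence, image)  # (x, y, w, h), confidence
        if score > best_score:
            best_box, best_score = box, score
    return best_box

A single-output variant (Section 3.6.4) would skip the selection loop and keep only the top-scoring beam from the generator.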
Contents
Acknowledgements
摘要 i
Abstract ii
1 Introduction 1
2 Related Works 5
2.1 Referring expression generation . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Referring expression comprehension . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Scene graph parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Beamsearch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Text style transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Method 9
3.1 Style classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Data augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3 Noising function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.4 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4.1 Embedding layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4.2 Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.4.3 Discriminator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5 Training Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5.1 Generator loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5.2 Discriminator loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.6 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.6.1 Beamsearch for sentence reinterpretation . . . . . . . . . . . . . . . . 22
3.6.2 Integration with existing REC models . . . . . . . . . . . . . . . . . . 22
3.6.3 Multiple output generation for REC . . . . . . . . . . . . . . . . . . . 23
3.6.4 Single output generation for REC . . . . . . . . . . . . . . . . . . . . 23
4 Experiments 25
4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1.1 Viref . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.2 RefCOCO datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.3 COCOcaptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.4 CopsRef . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.5 ILSVRC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3.1 Content retention evaluation for text generation . . . . . . . . . . . . . 28
4.3.2 REC detection evaluation . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3.3 Multiple output generation evaluation . . . . . . . . . . . . . . . . . . 29
4.3.4 Single output generation evaluation . . . . . . . . . . . . . . . . . . . 30
4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4.1 Detection Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4.2 Detection performance by IoU . . . . . . . . . . . . . . . . . . . . . . 33
4.4.3 Selection Algorithm performance . . . . . . . . . . . . . . . . . . . . 34
4.4.4 Comparison with State-of-the-Art . . . . . . . . . . . . . . . . . . . . 35
4.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.6 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5 Conclusion 41
References 43