
Detailed Record

Author (Chinese): 王德樂
Author (English): Wang, De-Le
Title (Chinese): 基於文本修正的目標指代理解及零樣本設計
Title (English): Query-Guided Referring Expression Comprehension with Zero-Shot Setting
Advisors (Chinese): 林嘉文、林彥宇
Advisors (English): Lin, Chia-Wen; Lin, Yen-Yu
Committee Members (Chinese): 陳駿丞、許志仲
Committee Members (English): Chen, Jun-Cheng; Hsu, Chih-Chung
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 107061470
Year of Publication (ROC calendar): 112 (2023)
Graduation Academic Year: 111
Language: English
Number of Pages: 34
Keywords (Chinese): 目標指代理解、稀疏圖、檢測框修正、零樣本學習
Keywords (English): Referring expression comprehension; Sparse graph; Bounding box regression; Zero-shot
Abstract (Chinese):
Deep learning research in single-modality perception, such as computer vision and natural language processing, has been advancing rapidly. How to fuse information across two different modalities has therefore become a topic of growing interest. Referring Expression Comprehension (REC) is a joint vision-language task that aims to extract features from both the visual and the linguistic modality, project them into a common space, and fuse them there. The task has broad applications in autonomous driving, smart homes, human-computer interaction, and many other scenarios.

Among recent studies, the mainstream approach to REC is based on graph neural networks. In such methods, however, the proposals generated by the RPN module often include many irrelevant and imprecise boxes. Most existing methods simply build a complete or dense graph over the proposals without filtering out the irrelevant ones, and once the proposals are predicted they lack a simple yet effective way to correct the errors of imprecise boxes. Meanwhile, real-world applications may require predicting novel categories never seen during training, which calls for zero-shot learning. To date, only one work has attempted zero-shot REC, and its results remain far from satisfactory.

This thesis proposes a novel sparse graph construction that exploits the spatial relations among proposals, improving graph-based REC models. To correct imprecise proposals, it further designs a simple yet effective bounding-box regression module guided by the textual query. With these improvements to the graph construction, the regression module, and the linguistic features, our model achieves leading performance on REC compared with existing state-of-the-art graph-based methods. Finally, this thesis presents the first graph-based method that addresses REC under the zero-shot setting; by incorporating a CLIP-based module, our model significantly outperforms the previous zero-shot REC results in our experiments. A small illustrative sketch of the sparse-graph idea follows.
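To make the sparse graph construction described above concrete, here is a minimal sketch of one plausible rule: connect each proposal to its k spatially nearest neighbours by box-centre distance. The function name, the k-nearest-neighbour criterion, and the default k are illustrative assumptions; the thesis's actual construction rule may differ.

```python
# Illustrative sketch only: build a sparse adjacency over region proposals
# from their spatial layout (k nearest neighbours by box centre).
import numpy as np

def build_sparse_graph(boxes: np.ndarray, k: int = 5) -> np.ndarray:
    """boxes: (N, 4) proposals as (x1, y1, x2, y2).
    Returns an (N, N) binary adjacency matrix linking each proposal
    to its k spatially closest neighbours."""
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2.0,
                        (boxes[:, 1] + boxes[:, 3]) / 2.0], axis=1)      # (N, 2)
    # Pairwise Euclidean distances between box centres.
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                    # exclude self-loops
    adj = np.zeros_like(dists, dtype=np.int64)
    k = min(k, len(boxes) - 1)
    for i, row in enumerate(dists):
        adj[i, np.argsort(row)[:k]] = 1                # keep the k closest proposals
    return adj
```

Unlike a complete graph, this keeps only a fixed number of edges per node, so GNN message passing is restricted to spatially related proposals.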
Abstract (English):
Nowadays, the advances of deep learning in single-modality perception, such as computer vision and natural language processing, have been remarkable, and combining information from two such modalities has become an important research topic. Referring Expression Comprehension (REC) is a vision-language task that aims to extract features from the two modalities and then project and fuse them in a common space. The task is widely applied in areas such as autonomous driving, smart homes, and human-computer interaction.

Among recent works, one of the most prevalent solutions to REC is based on graph neural networks. Nevertheless, graph-based REC methods rely on a region proposal network to generate candidate proposals, most of which are irrelevant or imprecise. Previous approaches model the relationships among proposals with a complete graph as the topological structure, and they also lack a simple yet effective way to correct the errors of imprecise proposals. Another challenge of REC is that real-world use requires predicting unseen categories, yet only one prior work addresses zero-shot REC, and its performance is far from satisfactory.

In this paper, we propose an innovative sparse graph construction that utilizes the spatial relations among proposals. To correct imprecise predicted bounding boxes, we design a regression module that refines the predictions under the guidance of the query. With the improved graph construction, the regression module, and linguistic attention, our model achieves favorable performance compared with existing advanced graph-based approaches. Moreover, by incorporating CLIP, ours is the first graph-based work to realize zero-shot REC, outperforming previous methods remarkably.
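The abstract does not detail how CLIP is integrated with the graph reasoning stage, so the following is only a minimal sketch, assuming the standard OpenAI CLIP package, of the generic step it enables for zero-shot REC: scoring proposal crops against the referring expression so that unseen categories can still be matched. The function and variable names here are illustrative, not the thesis's actual module.

```python
# Minimal sketch: CLIP-based zero-shot scoring of proposals against a query.
import torch
import clip                     # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def score_proposals(image: Image.Image, boxes, expression: str) -> torch.Tensor:
    """Cosine similarity between each proposal crop and the referring expression."""
    crops = torch.stack([preprocess(image.crop(tuple(b))) for b in boxes]).to(device)
    text = clip.tokenize([expression]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(crops)           # (N, D)
        txt_feat = model.encode_text(text)             # (1, D)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).squeeze(-1)         # one score per proposal

# Usage: take the highest-scoring proposal as the referred object, e.g.
# best_box = boxes[int(score_proposals(img, boxes, "the dog on the left").argmax())]
```

Because CLIP's image-text embedding space is learned from large-scale pretraining, such scores remain meaningful for object categories that never appear in the REC training set.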
Contents
1 Introduction
2 Related Work
  2.1 Graph-based REC
  2.2 Zero-shot REC
  2.3 CLIP
3 Proposed Method
  3.1 Feature Extraction
  3.2 Graph Construction
  3.3 Query-Guided Prediction
  3.4 GNN Reasoning
  3.5 Loss Function
  3.6 Zero-shot Extension
4 Experiments
  4.1 Datasets
  4.2 Implementation
  4.3 Quantitative Results on Graph Construction
  4.4 Quantitative Results on QGR Module
  4.5 Zero-shot Extension
  4.6 Ablation Studies
  4.7 Failure Analysis
5 Conclusion
References