Detailed Record

Author (Chinese): 柯景誠
Author (English): Ke, Jing-Cheng
Thesis Title (Chinese): 基於圖的指代表達理解及其推廣
Thesis Title (English): Referring Expression Comprehension in a Graph-based Perspective and Its Generalizations
Advisors (Chinese): 林嘉文、林彥宇
Advisors (English): Lin, Chia-Wen; Lin, Yen-Yu
Committee Members (Chinese): 陳駿丞、賴尚宏、李祈均、王聖智、徐繼聖
Committee Members (English): Chen, Jun-Cheng; Lai, Shang-Hong; Lee, Chi-Chun; Wang, Sheng-Jyh; Hsu, Gee-Sern
Degree: Doctoral
University: National Tsing Hua University
Department: Institute of Communications Engineering
Student ID: 108064872
Publication Year (ROC calendar): 113 (2024)
Graduation Academic Year (ROC calendar): 112
Language: English
Number of Pages: 125
Keywords (Chinese): 指代表達理解、圖方法、圖選擇濾波網絡、域自適應網絡、動態門約束、基於視覺引導的表達擴散模型
Keywords (English): Referring expression comprehension; graph-based methods; graph selective filtering network; domain adaptive network; dynamic gate constraint; vision-guided expression diffusion model
Usage statistics:
  • Recommendations: 0
  • Views: 8
  • Rating: *****
  • Downloads: 0
  • Bookmarks: 0
Abstract (Chinese): Referring Expression Comprehension (REC) is a language-to-vision matching task whose core idea is to find a common feature domain shared by textual expressions and visual objects. Because modeling the relationship between visual objects and textual expressions is complex, graph-based REC methods are widely used. However, existing graph-based REC methods suffer from three notable shortcomings. First, they do not fully exploit the useful information from visual objects and textual expressions during graph construction and reasoning, and therefore fail to accurately capture the latent relationships between them. Second, graph construction and reasoning incorporate many objects that are irrelevant to the expression, which introduces considerable noise. Third, the performance of existing graph-based methods depends heavily on the detector; if the detector fails to precisely localize the object that the expression describes, recognition errors result. Moreover, most existing REC methods focus only on recognizing visual objects whose categories are covered by the training data, which limits their generalization to seen categories. Meanwhile, REC datasets suffer from insufficient expressions. Although recently proposed Transformer-based methods have achieved great success on REC, they require large amounts of pre-training data and demand substantial computational resources. To address these issues, this dissertation develops four frameworks for the REC task. The first method proposes a graph selective filtering network (GSFN), which constructs an expression-guided filter to adaptively select relevant and important visual features from an object's feature map; the selected visual object features and the textual expression features are then jointly used for graph construction and reasoning. In extending standard REC to zero-shot REC, the second method proposes a domain adaptive network called CLIPREC, which integrates the Contrastive Language-Image Pre-training (CLIP) model. The proposed CLIPREC consists of a graph collaborative attention module with two directed graphs: one for the objects in an image and the other for their corresponding categorical labels. To further strengthen the relationship between the objects in an image and the textual expression, the third method introduces a plug-and-play module guided by sub-expressions, called the dynamic gate constraint (DGC), which adaptively disables expression-irrelevant visual objects and their connections in graph-based REC methods during reasoning; an expression-guided regression (EGR) strategy is further introduced to improve the location prediction of visual targets. Finally, we propose a novel vision-guided expression diffusion model (VIE-DM) for the REG task; the diverse image-text sample pairs generated by VIE-DM are used to assist REC model training. Extensive experimental results on the RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, RefClef, and Ref-Reasoning datasets show that our methods consistently improve existing REC methods and achieve state-of-the-art performance.
Abstract (English): Referring Expression Comprehension (REC) is a task of matching language to vision, with the core idea being to find a common feature domain between textual expressions and visual objects. Due to the complexity of modeling the relationship between visual objects and textual expressions, graph-based REC methods are widely used. However, existing graph-based REC methods have three obvious flaws. First, they fail to fully utilize the effective information from both visual objects and textual expressions for graph construction and inference, and thus cannot accurately capture the latent relationship between visual objects and textual expressions. Second, during graph construction and inference, a large number of expression-irrelevant targets are included, introducing significant noise. Third, the performance of existing graph-based methods depends heavily on the detector: if the detector fails to detect the target that the expression describes, identification errors result. Furthermore, most existing REC methods focus only on identifying visual objects whose categories are covered by the training data, limiting their generalization to seen categories. At the same time, REC datasets suffer from insufficient expressions. Although recently proposed Transformer-based methods have achieved great success on REC, they require a large amount of data for pre-training and have very high computational resource requirements. To address these issues, this thesis develops four frameworks for the REC task. The first method proposes a graph selective filtering network (GSFN) that constructs an expression-guided filter to adaptively select relevant and important visual features from the feature map of an object; the selected visual object features and the textual features of the expression are then jointly used for graph construction and reasoning. In extending standard REC to zero-shot REC, the second method proposes a domain adaptive network called CLIPREC, which integrates the Contrastive Language-Image Pre-training (CLIP) model for graph-based REC. The proposed CLIPREC is composed of a graph collaborative attention module with two directed graphs: one for the objects in an image and the other for their corresponding categorical labels. To further enhance the relationship between the objects in an image and the expression, the third method introduces a plug-and-adapt module guided by sub-expressions, called the dynamic gate constraint (DGC), which adaptively disables irrelevant proposals and their connections in graph-based REC methods during reasoning; an expression-guided regression (EGR) strategy is further introduced to improve the location prediction of visual targets. Finally, we propose a novel VIsion-guided Expression Diffusion Model (VIE-DM) for the REG task, which generates diverse synonymous expressions adhering to both the image and text contexts of the target object to assist REC model training. Extensive experimental results on the RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, RefClef, and Ref-Reasoning datasets demonstrate that our methods consistently improve existing REC methods and achieve state-of-the-art performance.
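Both abstracts describe, at a high level, expression-guided gating of detected object proposals before graph reasoning (the expression-guided filter in GSFN and the gate-based suppression in DGC). As a rough, self-contained illustration only, and not the thesis implementation, the following PyTorch-style sketch shows how a pooled sub-expression embedding might gate proposal nodes before a graph-attention update. The class name ExpressionGatedGraphLayer, the gate network, and all dimensions are hypothetical choices made solely for this sketch.

# Illustrative sketch only: expression-gated graph attention over object proposals.
# Not the thesis code; names, shapes, and the attention approximation are assumptions.
import torch
import torch.nn as nn

class ExpressionGatedGraphLayer(nn.Module):
    def __init__(self, vis_dim: int, txt_dim: int, hid_dim: int):
        super().__init__()
        self.node_proj = nn.Linear(vis_dim, hid_dim)   # project proposal (node) features
        self.txt_proj = nn.Linear(txt_dim, hid_dim)    # project pooled expression feature
        self.gate_mlp = nn.Sequential(                 # soft gate in [0, 1] per proposal
            nn.Linear(2 * hid_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, 1)
        )
        self.attn = nn.MultiheadAttention(hid_dim, num_heads=4, batch_first=True)

    def forward(self, proposals: torch.Tensor, expression: torch.Tensor) -> torch.Tensor:
        # proposals:  (B, N, vis_dim) detector proposal features (graph nodes)
        # expression: (B, txt_dim)    pooled expression / sub-expression feature
        nodes = self.node_proj(proposals)                         # (B, N, H)
        expr = self.txt_proj(expression).unsqueeze(1)             # (B, 1, H)
        expr_tiled = expr.expand(-1, nodes.size(1), -1)           # (B, N, H)
        # Gate each proposal by its relevance to the expression; near-zero gates
        # effectively disable irrelevant nodes and, implicitly, their edges.
        gates = torch.sigmoid(self.gate_mlp(torch.cat([nodes, expr_tiled], dim=-1)))
        gated_nodes = gates * nodes                               # (B, N, H)
        # One graph-reasoning step, approximated here by self-attention among gated nodes.
        updated, _ = self.attn(gated_nodes, gated_nodes, gated_nodes)
        return updated + gated_nodes                              # residual update

if __name__ == "__main__":
    layer = ExpressionGatedGraphLayer(vis_dim=2048, txt_dim=768, hid_dim=256)
    props = torch.randn(2, 36, 2048)   # e.g., 36 detector proposals per image
    expr = torch.randn(2, 768)         # pooled expression embedding
    print(layer(props, expr).shape)    # torch.Size([2, 36, 256])

In the methods summarized above, the gates are driven by parsed sub-expressions, operate on explicit directed graphs, and are combined with an expression-guided regression head; this sketch only conveys the general gating-before-reasoning idea.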
Acknowledgements
Abstract (Chinese)
Abstract
Contents
List of Figures
List of Tables
1 Introduction
1.1 Background and Motivation
1.1.1 Zero-Shot REC Using CLIP
1.1.2 Effective Attention Graph Construction
1.1.3 Dynamic Subgraph Reasoning
1.1.4 Vision-Guided Expression Diffusion
1.2 Contributions
2 Related Work
2.1 Graph Neural Network-based REC
2.2 Transformer-based REC
2.3 Multi-Step Reasoning
2.4 Zero-Shot Learning with the CLIP Model
2.5 Text-guided Object Segmentation and Detection
2.6 Referring Expression Generation (REG)
2.7 Diffusion Models for Sentence Generation
3 Zero-Shot REC Using CLIP
3.1 Overview
3.2 Proposed Method
3.2.1 Language Parser
3.2.2 Graph Collaborative Attention Module
3.2.3 Multi-Step Reasoning Module
3.2.4 Loss Function and Matching Module
3.3 Experiments
3.3.1 Datasets and Implementation Details
3.3.2 Comparisons with State-of-the-Art Methods for Zero-Shot REC
3.3.3 Comparisons with State-of-the-Art Methods for Standard REC
3.3.4 Ablation Studies
3.3.5 Qualitative Results of Zero-Shot REC
4 Effective Attention Graph Construction
4.1 Overview
4.2 Proposed Method
4.2.1 Expression Decomposition
4.2.2 Multi-Modal Attention Graph Construction
4.2.3 Noun-Oriented Reasoning with Multi-Modal Graph
4.2.4 Matching and Loss Functions
4.3 Experimental Results
4.3.1 Evaluation Datasets
4.3.2 Implementation Details
4.3.3 Comparisons with State-of-the-Art Methods
4.3.4 Ablation Studies
4.3.5 Qualitative Evaluation
5 Dynamic Subgraph Reasoning
5.1 Overview
5.2 Proposed Method
5.2.1 Language Parser
5.2.2 Bimodal Graph Attention Module
5.2.3 Reasoning with Dynamic Gate Constraints
5.2.4 Matching and Expression-guided Regression
5.3 Experimental Results
5.3.1 Datasets
5.3.2 Implementation Details
5.3.3 Evaluation Results
5.3.4 Ablation Studies
5.3.5 Qualitative Results
6 Vision-Guided Expression Diffusion
6.1 Overview
6.2 Proposed Method
6.2.1 Diffusion Models
6.2.2 Preliminaries
6.2.3 Forward and Reverse Processes for Expression Diffusion
6.2.4 Vision-Text Condition (VTC)
6.2.5 Training and Sampling
6.2.6 Dataset Augmentation for REC
6.3 Experimental Results
6.3.1 Performance Evaluation Results
6.3.2 Comparison between REG and Image Captioning
7 Conclusion
7.1 Limitations and Future Work
Bibliography