References

[1] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR (2018).
[2] Bansal, A., Sikka, K., Sharma, G., Chellappa, R., and Divakaran, A. Zero-shot object detection. In ECCV (2018).
[3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. NeurIPS (2020).
[4] Bucher, M., Vu, T.-H., Cord, M., and Pérez, P. Zero-shot semantic segmentation. NeurIPS (2019).
[5] Chen, L., Ma, W., Xiao, J., Zhang, H., and Chang, S.-F. Ref-NMS: Breaking proposal bottlenecks in two-stage referring expression grounding. In AAAI (2021).
[6] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[7] Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The PASCAL visual object classes (VOC) challenge. IJCV (2010).
[8] Fan, Q., Zhuo, W., Tang, C.-K., and Tai, Y.-W. Few-shot object detection with attention-RPN and multi-relation detector. In CVPR (2020).
[9] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR (2016).
[10] He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In ECCV (2016).
[11] Huang, Z., Xu, W., and Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
[12] Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML (2021).
[13] Kaur, P., Pannu, H. S., and Malhi, A. K. Comparative analysis on cross-modal information retrieval: A review. Computer Science Review (2021).
[14] Kazemzadeh, S., Ordonez, V., Matten, M., and Berg, T. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP (2014).
[15] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV (2017).
[16] Liao, Y., Liu, S., Li, G., Wang, F., Chen, Y., Qian, C., and Li, B. A real-time cross-modality correlation filtering method for referring expression comprehension. In CVPR (2020).
[17] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In ECCV (2014).
[18] Lu, J., Batra, D., Parikh, D., and Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS (2019).
[19] Luo, G., Zhou, Y., Sun, X., Cao, L., Wu, C., Deng, C., and Ji, R. Multi-task collaborative network for joint referring expression comprehension and segmentation. In CVPR (2020).
[20] Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A. L., and Murphy, K. Generation and comprehension of unambiguous object descriptions. In CVPR (2016).
[21] Miller, G. A. WordNet: A lexical database for English. Communications of the ACM (1995).
[22] Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In EMNLP (2014).
[23] Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., and Lazebnik, S. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV (2015).
[24] Qiao, Y., Deng, C., and Wu, Q. Referring expression comprehension: A survey of methods and datasets. TMM (2020).
[25] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In ICML (2021).
[26] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog (2019).
[27] Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. NeurIPS (2015).
[28] Rong, X., Yi, C., and Tian, Y. Unambiguous scene text segmentation with referring expression comprehension. TIP (2019).
[29] Sadhu, A., Chen, K., and Nevatia, R. Zero-shot grounding of objects from natural language queries. In ICCV (2019).
[30] Sak, H., Senior, A., and Beaufays, F. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128 (2014).
[31] Sariyildiz, M. B., Perez, J., and Larlus, D. Learning visual representations with caption annotations. In ECCV (2020).
[32] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks (2008).
[33] Schuster, S., Krishna, R., Chang, A., Fei-Fei, L., and Manning, C. D. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the Fourth Workshop on Vision and Language (2015).
[34] Shen, S., Li, C., Hu, X., Xie, Y., Yang, J., Zhang, P., Rohrbach, A., Gan, Z., Wang, L., Yuan, L., et al. K-LITE: Learning transferable visual models with external knowledge. arXiv preprint arXiv:2204.09222 (2022).
[35] Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., and Dai, J. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019).
[36] Wang, P., Wu, Q., Cao, J., Shen, C., Gao, L., and Hengel, A. v. d. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In CVPR (2019).
[37] Wang, W., Zheng, V. W., Yu, H., and Miao, C. A survey of zero-shot learning: Settings, methods, and applications. ACM TIST (2019).
[38] Yan, C., Li, L., Zhang, C., Liu, B., Zhang, Y., and Dai, Q. Cross-modality bridging and knowledge transferring for image understanding. TMM (2019).
[39] Yang, S., Li, G., and Yu, Y. Cross-modal relationship inference for grounding referring expressions. In CVPR (2019).
[40] Yang, S., Li, G., and Yu, Y. Dynamic graph attention for referring expression comprehension. In ICCV (2019).
[41] Yang, S., Li, G., and Yu, Y. Graph-structured referring expression reasoning in the wild. In CVPR (2020).
[42] Yelamarthi, S. K., Reddy, S. K., Mishra, A., and Mittal, A. A zero-shot framework for sketch based image retrieval. In ECCV (2018).
[43] Yu, L., Lin, Z., Shen, X., Yang, J., Lu, X., Bansal, M., and Berg, T. L. MAttNet: Modular attention network for referring expression comprehension. In CVPR (2018).