|
References [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, (Red Hook, NY, USA), p. 6000–6010, Curran Associates Inc., 2017. [2] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg, “Mattnet: Modular attention network for referring expression comprehension,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1307–1315, 2018. [3] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining taskagnostic visiolinguistic representations for visionandlanguage tasks,” Advances in Neural Information Processing Systems 32 (NIPS), vol. 32, 2019. [4] G. Luo, Y. Zhou, X. Sun, L. Cao, C. Wu, C. Deng, and R. Ji, “Multitask collaborative network for joint referring expression comprehension and segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10034– 10043, 2020. [5] S. Yang, G. Li, and Y. Yu, “Graphstructured referring expression reasoning in the wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9952–9961, 2020. [6] X. Liu, Z. Wang, J. Shao, X. Wang, and H. Li, “Improving referring expression grounding with crossmodal attentionguided erasing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1950–1959, 2019. [7] R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko, “Modeling relationships in referential expressions with compositional modular networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. [8] D. Liu, H. Zhang, F. Wu, and Z.J. Zha, “Learning to assemble neural module tree networks for visual grounding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4673–4682, 2019. [9] V. Cirik, T. BergKirkpatrick, and L.P. Morency, “Using syntax to ground referring expressions in natural images,” Vol. 32 No. 1 (2018): ThirtySecond AAAI Conference on Artificial Intelligence, 2018. [10] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “VLBERT: pretraining of generic visuallinguistic representations,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 2630, 2020, OpenReview.net, 2020. 43 [11] J. Liu, L. Wang, and M.H. Yang, “Referring expression generation and comprehension via attributes,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 4856–4864, 2017. [12] R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko, “Modeling relationships in referential expressions with compositional modular networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1115–1124, 2017. [13] S. Yang, G. Li, and Y. Yu, “Relationshipembedded representation learning for grounding referring expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020. [14] Y. Liao, S. Liu, G. Li, F. Wang, Y. Chen, C. Qian, and B. Li, “A realtime crossmodality correlation filtering method for referring expression comprehension,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10880– 10889, 2020. [15] A. Akula, S. Gella, Y. AlOnaizan, S.C. Zhu, and S. Reddy, “Words aren’t enough, their order matters: On the robustness of grounding visual referring expressions,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, (Online), pp. 6555–6565, Association for Computational Linguistics, July 2020. [16] S. Yang, G. Li, and Y. Yu, “Dynamic graph attention for referring expression comprehension,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4644–4653, 2019. [17] S. Schuster, R. Krishna, A. Chang, L. FeiFei, and C. D. Manning, “Generating semantically precise scene graphs from textual descriptions for improved image retrieval,” in Proceedings of the fourth workshop on vision and language, pp. 70–80, 2015. [18] C. Meister, R. Cotterell, and T. Vieira, “If beam search is the answer, what was the question?,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), (Online), pp. 2173–2185, Association for Computational Linguistics, Nov. 2020. [19] T. Karras, S. Laine, and T. Aila, “A stylebased generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019. [20] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. [21] R. He and J. McAuley, “Ups and downs: Modeling the visual evolution of fashion trends with oneclass collaborative filtering,” in proceedings of the 25th international conference on world wide web, pp. 507–517, 2016. [22] S. Rao and J. Tetreault, “Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), (New Orleans, Louisiana), pp. 129–140, Association for Computational Linguistics, June 2018. 44 [23] N. Dai, J. Liang, X. Qiu, and X.J. Huang, “Style transformer: Unpaired text style transfer without disentangled latent representation,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5997–6007, 2019. [24] J. Lee, Z. Xie, C. Wang, M. Drach, D. Jurafsky, and A. Ng, “Neural text style transfer via denoising and reranking,” in Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, (Minneapolis, Minnesota), pp. 74–81, Association for Computational Linguistics, June 2019. [25] Z. Xie, G. Genthial, S. Xie, A. Ng, and D. Jurafsky, “Noising and denoising natural language: Diverse backtranslation for grammar correction,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), (New Orleans, Louisiana), pp. 619–628, Association for Computational Linguistics, June 2018. [26] J. Devlin, M.W. Chang, K. Lee, and K. Toutanova, “Bert: Pretraining of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019. [27] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising sequencetosequence pretraining for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, (Online), pp. 7871–7880, Association for Computational Linguistics, July 2020. [28] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014. [29] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, “Referitgame: Referring to objects in photographs of natural scenes,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 787–798, 2014. [30] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy, “Generation and comprehension of unambiguous object descriptions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 11–20, 2016. [31] X. Chen, H. Fang, T.Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick, “Microsoft coco captions: Data collection and evaluation server,” arXiv eprints, pp. arXiv– 1504, 2015. [32] H. Anayurt, S. A. Ozyegin, U. Cetin, U. Aktas, and S. Kalkan, “Searching for ambiguous objects in videos using relational referring expressions,” in Proceedings of the British Machine Vision Conference (BMVC), 2019. [33] Z. Chen, P. Wang, L. Ma, K.Y. K. Wong, and Q. Wu, “Copsref: A new dataset and task on compositional referring expression comprehension,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10086–10095, 2020. 45 [34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. FeiFei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015. [35] K. Papineni, S. Roukos, T. Ward, and W.J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002. [36] N. Reimers and I. Gurevych, “SentenceBERT: Sentence embeddings using Siamese BERTnetworks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), (Hong Kong, China), pp. 3982–3992, Association for Computational Linguistics, Nov. 2019. |