[1] T.-H. Chen, Y.-H. Liao, C.-Y. Chuang, W.-T. Hsu, J. Fu, and M. Sun, “Show, adapt and tell: Adversarial training of cross-domain image captioner,” arXiv preprint arXiv:1705.00930, 2017.
[2] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[3] T.-H. Chen, K.-H. Zeng, W.-T. Hsu, and M. Sun, “Video captioning via sentence augmentation and spatio-temporal attention,” in Asian Conference on Computer Vision, pp. 269–286, Springer, Cham, 2016.
[4] A. Goyal, A. Lamb, Y. Zhang, S. Zhang, A. C. Courville, and Y. Bengio, “Professor forcing: A new algorithm for training recurrent networks,” in NIPS, 2016.
[5] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in CVPR, 2015.
[6] K.-H. Zeng, T.-H. Chen, J. C. Niebles, and M. Sun, “Title generation for user generated videos,” in ECCV, 2016.
[7] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in CVPR, 2015.
[8] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in CVPR, 2015.
[9] R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual-semantic embeddings with multimodal neural language models,” TACL, 2015.
[10] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014.
[11] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 Dataset,” tech. rep., 2011.
[12] L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell, “Deep compositional captioning: Describing novel object categories without paired training data,” in CVPR, 2016.
[13] S. Venugopalan, L. A. Hendricks, M. Rohrbach, R. J. Mooney, T. Darrell, and K. Saenko, “Captioning images with diverse objects,” CoRR, vol. abs/1606.07770, 2016.
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR, 2009.
[15] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “Guided open vocabulary image captioning with constrained beam search,” CoRR, vol. abs/1612.00576, 2016.
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, pp. 2672–2680, 2014.
[17] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in NIPS, 1999.
[18] S. Reed, Z. Akata, H. Lee, and B. Schiele, “Learning deep representations of fine-grained visual descriptions,” in CVPR, 2016.
[19] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in ICVGIP, 2008.
[20] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” TACL, 2014.
[21] Y. Li, Y. Song, L. Cao, J. Tetreault, L. Goldberg, A. Jaimes, and J. Luo, “A new dataset and benchmark on animated gif description,” in CVPR, 2016.
[22] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015.
[23] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence - video to text,” in ICCV, 2015.
[24] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, “Describing videos by exploiting temporal structure,” in ICCV, 2015.
[25] R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. Ikizler-Cinbis, F. Keller, A. Muscat, and B. Plank, “Automatic description generation from images: A survey of models, datasets, and evaluation measures,” 2016.
[26] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, “Sequence level training with recurrent neural networks,” in ICLR, 2016.
[27] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in NIPS, 2015.
[28] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio, “An actor-critic algorithm for sequence prediction,” in ICLR, 2017.
[29] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Optimization of image description metrics using policy gradient methods,” CoRR, vol. abs/1612.00370, 2016.
[30] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” CoRR, vol. abs/1612.00563, 2016.
[31] L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell, “Generating visual explanations,” in ECCV, 2016.
[32] L. Yu, W. Zhang, J. Wang, and Y. Yu, “Seqgan: Sequence generative adversarial nets with policy gradient,” in AAAI, 2017.
[33] H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, and M. Marchand, “Domain-adversarial neural networks,” in NIPS Workshop on Transfer and Multi-Task Learning: Theory meets Practice, 2014.
[34] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” JMLR, vol. 17, no. 59, pp. 1–35, 2016.
[35] J. Hoffman, D. Wang, F. Yu, and T. Darrell, “Fcns in the wild: Pixel-level adversarial and constraint-based adaptation,” CoRR, vol. abs/1612.02649, 2016.
[36] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, pp. 3104–3112, 2014.
[37] A. M. Rush, S. Chopra, and J. Weston, “A neural attention model for abstractive sentence summarization,” arXiv preprint arXiv:1509.00685, 2015.
[38] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation, vol. 1, no. 2, pp. 270–280, 1989.
[39] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48, ACM, 2009.
[40] F. Huszár, “How (not) to train your generative model: Scheduled sampling, likelihood, adversary?,” arXiv preprint arXiv:1511.05101, 2015.
[41] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, vol. 1. MIT Press, 1998.
[42] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[43] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318, Association for Computational Linguistics, 2002.
[44] A. Lavie and M. J. Denkowski, “The meteor metric for automatic evaluation of machine translation,” Machine Translation, vol. 23, no. 2, pp. 105–115, 2009.
[45] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575, 2015.
[46] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III, “Midge: Generating image descriptions from computer vision detections,” in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 747–756, Association for Computational Linguistics, 2012.
[47] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi, “Composing simple image descriptions using web-scale n-grams,” in Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pp. 220–228, Association for Computational Linguistics, 2011.
[48] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, vol. 8, Barcelona, Spain, 2004.
[49] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “Spice: Semantic propositional image caption evaluation,” in European Conference on Computer Vision, pp. 382–398, Springer, 2016.
[50] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick, “Microsoft coco captions: Data collection and evaluation server,” arXiv preprint arXiv:1504.00325, 2015.
[51] D. Klein and C. D. Manning, “Accurate unlexicalized parsing,” in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pp. 423–430, Association for Computational Linguistics, 2003.
[52] S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning, “Generating semantically precise scene graphs from textual descriptions for improved image retrieval,” in Workshop on Vision and Language (VL15), (Lisbon, Portugal), Association for Computational Linguistics, September 2015.
[53] Y. Kim, “Convolutional neural networks for sentence classification,” in EMNLP, 2014.
[54] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, “Character-aware neural language models,” in AAAI, 2016.
[55] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.
[56] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh, “Vqa: Visual question answering,” in ICCV, 2015.
[57] J. Xu, T. Mei, T. Yao, and Y. Rui, “Msr-vtt: A large video description dataset for bridging video and language,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5288–5296, 2016.
[58] A. Torabi, C. J. Pal, H. Larochelle, and A. C. Courville, “Using descriptive video services to create a large data source for video annotation research,” arXiv preprint arXiv:1503.01070, 2015.
[59] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, “A dataset for movie description,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3202–3212, 2015.
[60] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
[61] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
[62] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in CVPR, 2017.
[63] G. Coppersmith and E. Kelly, “Dynamic wordclouds and vennclouds for exploratory data analysis,” in Workshop on Interactive Language Learning, Visualization, and Interfaces, 2014.
[64] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” in Advances in neural information processing systems, pp. 3294–3302, 2015.
[65] L. Van Der Maaten, “Barnes-hut-sne,” arXiv preprint arXiv:1301.3342, 2013.
[66] A. Rohrbach, M. Rohrbach, and B. Schiele, “The long-short story of movie description,” in German Conference on Pattern Recognition, pp. 209–221, Springer, 2015.
[67] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, “Translating videos to natural language using deep recurrent neural networks,” arXiv preprint arXiv:1412.4729, 2014.
[68] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune, “Plug & play generative networks: Conditional iterative generation of images in latent space,” arXiv preprint arXiv:1612.00005, 2016.