[1] D. L. Chen and W. B. Dolan, “Collecting highly parallel data for paraphrase evaluation,” in ACL, 2011.
[2] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, “A dataset for movie description,” in CVPR, 2015.
[3] A. Torabi, C. J. Pal, H. Larochelle, and A. C. Courville, “Using descriptive video services to create a large data source for video annotation research,” arXiv:1503.01070, 2015.
[4] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence - video to text,” in ICCV, 2015.
[5] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, “Describing videos by exploiting temporal structure,” in ICCV, 2015.
[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in ECCV, 2014.
[7] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in CVPR, 2015.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
[9] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, “Translating videos to natural language using deep recurrent neural networks,” in NAACL, 2015.
[10] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in CVPR, 2015.
[11] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “VQA: Visual question answering,” in ICCV, 2015.
[12] M. Malinowski and M. Fritz, “A multi-world approach to question answering about real-world scenes based on uncertain input,” in NIPS, 2014.
[13] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler, “MovieQA: Understanding stories in movies through question-answering,” in CVPR, 2016.
[14] M. Heilman and N. A. Smith, “Good question! Statistical ranking for question generation,” in HLT, 2010.
[15] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele, “Translating video content to natural language descriptions,” in ICCV, 2013.
[16] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, “End-to-end memory networks,” in NIPS, 2015.
[17] P. Das, C. Xu, R. Doell, and J. Corso, “A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching,” in CVPR, 2013.
[18] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko, “YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition,” in ICCV, 2013.
[19] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama, “Generating natural-language video descriptions using text-mined knowledge,” in AAAI, 2013.
[20] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney, “Integrating language and vision to generate natural language descriptions of videos in the wild,” in COLING, 2014.
[21] A. Barbu, E. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, S. Narayanaswamy, D. Salvi, L. Schmidt, J. Shangguan, J. M. Siskind, J. Waggoner, S. Wang, J. Wei, Y. Yin, and Z. Zhang, “Video in sentences out,” in UAI, 2012.
[22] A. Kojima, T. Tamura, and K. Fukunaga, “Natural language description of human activities from video images based on concept hierarchy of actions,” IJCV, vol. 50, no. 2, pp. 171–184, Nov. 2002.
[23] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in CVPR, 2015.
[24] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015.
[25] J. Mao, X. Wei, Y. Yang, J. Wang, Z. Huang, and A. L. Yuille, “Learning like a child: Fast novel visual concept learning from sentence descriptions of images,” in ICCV, 2015.
[26] J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy, “Generation and comprehension of unambiguous object descriptions,” in CVPR, 2016.
[27] J. Johnson, A. Karpathy, and L. Fei-Fei, “DenseCap: Fully convolutional localization networks for dense captioning,” in CVPR, 2016.
[28] A. Rohrbach, M. Rohrbach, and B. Schiele, “The long-short story of movie description,” in GCPR, 2015.
[29] L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell, “Deep compositional captioning: Describing novel object categories without paired training data,” in CVPR, 2016.
[30] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in ICCV, 2013.
[31] R. Xu, C. Xiong, W. Chen, and J. J. Corso, “Jointly modeling deep video and compositional text to bridge vision and language in a unified framework,” in AAAI, 2015.
[32] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, “Jointly modeling embedding and translation to bridge video and language,” in CVPR, 2016.
[33] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, “Hierarchical recurrent neural encoder for video representation with application to captioning,” in CVPR, 2016.
[34] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu, “Video paragraph captioning using hierarchical recurrent neural networks,” in CVPR, 2016.
[35] D. Yow, B. Yeo, M. Yeung, and B. Liu, “Analysis and presentation of soccer highlights from digital video,” in ACCV, 1995.
[36] Y. Rui, A. Gupta, and A. Acero, “Automatically extracting highlights for TV baseball programs,” in ACM Multimedia, 2000.
[37] S. Nepal, U. Srinivasan, and G. Reynolds, “Automatic detection of goal segments in basketball videos,” in ACM Multimedia, 2001.
[38] J. Wang, C. Xu, E. Chng, and Q. Tian, “Sports highlight detection from keyword sequences using HMM,” in ICME, 2004.
[39] Z. Xiong, R. Radhakrishnan, A. Divakaran, and T. Huang, “Highlights extraction from sports video based on an audio-visual marker detection framework,” in ICME, 2005.
[40] M. Kolekar and S. Sengupta, “Event-importance based customized and automatic cricket highlight generation,” in ICME, 2006.
[41] A. Hanjalic, “Adaptive extraction of highlights from a sport video based on excitement modeling,” IEEE Transactions on Multimedia, 2005.
[42] H. Tang, V. Kwatra, M. Sargin, and U. Gargi, “Detecting highlights in sports videos: Cricket as a test case,” in ICME, 2011.
[43] M. Sun, A. Farhadi, and S. Seitz, “Ranking domain-specific highlights by analyzing edited videos,” in ECCV, 2014.
[44] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, “TVSum: Summarizing web videos using titles,” in CVPR, 2015.
[45] B. Zhao and E. P. Xing, “Quasi real-time summarization for consumer videos,” in CVPR, 2014.
[46] H. Yang, B. Wang, S. Lin, D. Wipf, M. Guo, and B. Guo, “Unsupervised extraction of video highlights via robust recurrent auto-encoders,” in ICCV, 2015.
[47] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele, “Coherent multi-sentence video description with variable level of detail,” in GCPR, 2014.
[48] J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description dataset for bridging video and language,” in CVPR, 2016.
[49] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, and T. Yeh, “VizWiz: Nearly real-time answers to visual questions,” in UIST, 2010.
[50] D. Geman, S. Geman, N. Hallonquist, and L. Younes, “Visual Turing test for computer vision systems,” PNAS, vol. 112, no. 12, pp. 3618–3623, 2015.
[51] M. Malinowski, M. Rohrbach, and M. Fritz, “Ask your neurons: A neural-based approach to answering questions about images,” in ICCV, 2015.
[52] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu, “Are you talking to a machine? Dataset and methods for multilingual image question answering,” in NIPS, 2015.
[53] H. Noh, P. H. Seo, and B. Han, “Image question answering using convolutional neural network with dynamic parameter prediction,” in CVPR, 2016.
[54] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Deep compositional question answering with neural module networks,” in CVPR, 2016.
[55] L. Ma, Z. Lu, and H. Li, “Learning to answer questions from image using convolutional neural network,” in AAAI, 2016.
[56] V. Rus and J. Lester, “Workshop on question generation,” 2009.
[57] V. Rus and A. C. Graesser, “The question generation shared task and evaluation challenge: Status report,” 2009.
[58] D. M. Gates, “Generating reading comprehension look-back strategy questions from expository texts,” Master’s thesis, Carnegie Mellon University, 2008.
[59] M. Ren, R. Kiros, and R. Zemel, “Exploring models and data for image question answering,” in NIPS, 2015.
[60] K. Tu, M. Meng, M. W. Lee, T. E. Choe, and S. C. Zhu, “Joint video and text parsing for understanding events and answering queries,” IEEE MultiMedia, 2014.
[61] L. Zhu, Z. Xu, Y. Yang, and A. G. Hauptmann, “Uncovering temporal context for video question and answering,” arXiv:1511.04670, 2015.
[62] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[63] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
[64] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in ICCV, 2015.
[65] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv:1301.3781, 2013.
[66] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
[67] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.
[68] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” in NIPS, 2015.