[1] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NIPS, 2015.
[2] Y. Xiang, A. Alahi, and S. Savarese, "Learning to track: Online multi-object tracking by decision making," in ICCV, pp. 4705–4713, 2015.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[4] H. Wang and C. Schmid, "Action recognition with improved trajectories," in ICCV, 2013.
[5] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 1997.
[6] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[7] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in CVPR, 2012.
[8] Google Inc., "Google self-driving car project monthly report," May 2015.
[9] National Highway Traffic Safety Administration, "2012 motor vehicle crashes: Overview," 2013.
[10] A. Jain, A. Singh, H. S. Koppula, S. Soh, and A. Saxena, "Recurrent neural networks for driver activity anticipation via sensory-fusion architecture," in ICRA, 2016.
[11] V. V. Valenzuela, R. D. Lins, and H. M. De Oliveira, "Application of enhanced-2D-CWT in topographic images for mapping landslide risk areas," in International Conference on Image Analysis and Recognition, pp. 380–388, Springer, 2013.
[12] S. M. Arietta, A. A. Efros, R. Ramamoorthi, and M. Agrawala, "City forensics: Using visual elements to predict non-visual city attributes," IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 2624–2633, 2014.
[13] A. Khosla, B. An, J. J. Lim, and A. Torralba, "Looking beyond the visible scene," in CVPR, pp. 3710–3717, 2014.
[14] M. S. Ryoo, "Human activity prediction: Early recognition of ongoing activities from streaming videos," in ICCV, 2011.
[15] M. Hoai and F. De la Torre, "Max-margin early event detectors," in CVPR, 2012.
[16] T. Lan, T.-C. Chen, and S. Savarese, "A hierarchical representation for future action prediction," in ECCV, 2014.
[17] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert, "Activity forecasting," in ECCV, 2012.
[18] J. Yuen and A. Torralba, "A data-driven approach for event prediction," in ECCV, 2010.
[19] J. Walker, A. Gupta, and M. Hebert, "Patch to the future: Unsupervised visual prediction," in CVPR, 2014.
[20] Z. Wang, M. Deisenroth, H. Ben Amor, D. Vogt, B. Schölkopf, and J. Peters, "Probabilistic modeling of human movements for intention inference," in RSS, 2012.
[21] H. S. Koppula and A. Saxena, "Anticipating human activities using object affordances for reactive robotic response," PAMI, vol. 38, no. 1, pp. 14–29, 2016.
[22] H. S. Koppula, A. Jain, and A. Saxena, "Anticipatory planning for human-robot teams," in ISER, 2014.
[23] J. Mainprice and D. Berenson, "Human-robot collaborative manipulation planning using early prediction of human motion," in IROS, 2013.
[24] H. Berndt, J. Emmert, and K. Dietmayer, "Continuous driver intention recognition with hidden Markov models," in Intelligent Transportation Systems, 2008.
[25] B. Frohlich, M. Enzweiler, and U. Franke, "Will this car change the lane? - Turn signal recognition in the frequency domain," in Intelligent Vehicles Symposium (IV), 2014.
[26] P. Kumar, M. Perrollaz, S. Lefèvre, and C. Laugier, "Learning-based approach for online lane change intention prediction," in Intelligent Vehicles Symposium (IV), 2013.
[27] M. Liebner, M. Baumann, F. Klanner, and C. Stiller, "Driver intent inference at urban intersections using the intelligent driver model," in Intelligent Vehicles Symposium (IV), 2012.
[28] B. Morris, A. Doshi, and M. Trivedi, "Lane change intent prediction for driver assistance: On-road design and evaluation," in Intelligent Vehicles Symposium (IV), 2011.
[29] A. Doshi, B. Morris, and M. Trivedi, "On-road prediction of driver's intent with multimodal sensory cues," IEEE Pervasive Computing, vol. 10, no. 3, pp. 22–34, 2011.
[30] M. M. Trivedi, T. Gandhi, and J. McCall, "Looking-in and looking-out of a vehicle: Computer-vision-based enhanced vehicle safety," IEEE Transactions on Intelligent Transportation Systems, vol. 8, no. 1, pp. 108–120, 2007.
[31] A. Jain, H. S. Koppula, B. Raghavan, S. Soh, and A. Saxena, "Car that knows before you do: Anticipating maneuvers via learning temporal driving models," in ICCV, 2015.
[32] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, "Describing videos by exploiting temporal structure," in ICCV, 2015.
[33] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," arXiv preprint arXiv:1502.03044, 2015.
[34] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, "Recurrent models of visual attention," in NIPS, 2014.
[35] J. Ba, V. Mnih, and K. Kavukcuoglu, "Multiple object recognition with visual attention," in ICLR, 2015.
[36] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, "Segmentation and recognition using structure from motion point clouds," in ECCV, 2008.
[37] B. Leibe, N. Cornelis, K. Cornelis, and L. Van Gool, "Dynamic 3D scene analysis from a moving vehicle," in CVPR, 2007.
[38] T. Scharwächter, M. Enzweiler, S. Roth, and U. Franke, "Efficient multi-cue scene segmentation," in GCPR, 2013.
[39] M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset," in CVPR Workshop on The Future of Datasets in Vision, 2015.
[40] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, pp. 580–587, 2014.
[41] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, vol. 1, pp. 886–893, 2005.
[42] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[43] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in ECCV, pp. 346–361, Springer, 2014.
[44] R. Girshick, "Fast R-CNN," in ICCV, pp. 1440–1448, 2015.
[45] W. Choi, "Near-online multi-target tracking with aggregated local flow descriptor," in ICCV, pp. 3029–3037, 2015.
[46] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in IJCAI, 1981.
[47] G. Farnebäck, "Two-frame motion estimation based on polynomial expansion," in Scandinavian Conference on Image Analysis (SCIA), pp. 363–370, 2003.
[48] H. W. Kuhn, "The Hungarian method for the assignment problem," 50 Years of Integer Programming 1958–2008, pp. 29–47, 2010.
[49] R. E. Kalman, "A new approach to linear filtering and prediction problems," Journal of Basic Engineering, 1960.
[50] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," arXiv preprint arXiv:1211.5063, 2012.
[51] P. J. Werbos, "Backpropagation through time: What it does and how to do it," Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[52] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[53] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV, pp. 740–755, Springer, 2014.
[54] C. Vondrick, D. Patterson, and D. Ramanan, "Efficiently scaling up crowdsourced video annotation," International Journal of Computer Vision, pp. 1–21, doi:10.1007/s11263-012-0564-1.
[55] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[56] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, pp. 770–778, 2016.
[57] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005.
[58] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "C3D: Generic features for video analysis," arXiv preprint arXiv:1412.0767, 2014.
[59] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[60] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.