Author (Chinese): 詹富翔
Author (English): Chan, Fu-Hsiang
Title (Chinese): 基於深度學習預測交通意外及事故
Title (English): Anticipating Accidents based on Deep Learning in Dashcam Videos
Advisor (Chinese): 孫民
Advisor (English): Sun, Min
Committee Members (Chinese): 陳煥宗、賴尚宏、王鈺強
Committee Members (English): Chen, Hwann-Tzong; Lai, Shang-Hong; Wang, Yu-Chiang
Degree: Master's
University: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 104061522
Publication Year (ROC calendar): 106 (2017)
Graduation Academic Year (ROC calendar): 105
Language: English
Number of Pages: 38
Keywords (Chinese): 電腦視覺、機器學習、深度學習
Keywords (English): Computer Vision, Machine Learning, Deep Learning
Abstract (Chinese): We propose a deep learning model, a Dynamic-Spatial-Attention (DSA) Recurrent Neural Network (RNN) (Fig. 1.1), to anticipate the moment an accident occurs in dashcam videos. The model learns (1) which objects in the scene at each time step are likely to be dangerous, and attends to them, and (2) how to reason over neighboring time steps to judge whether those dangerous objects may lead to an accident. Anticipating accidents is not as straightforward as anticipating driving maneuvers (e.g., lane changes or turns), because accidents happen suddenly and occur only rarely on the road. We therefore (1) apply a state-of-the-art object detector (Faster-RCNN [1]) and a tracking-by-detection algorithm (MDP [2]) to detect and track candidate objects, and (2) combine scene, object appearance, and trajectory information to anticipate the time of the accident. We collected 968 Taiwanese dashcam videos (Fig. 5.1) containing accidents of various types (e.g., motorbike hits motorbike, car hits motorbike, etc.); each video is annotated with the time of the accident and the categories of the objects involved, so these data can be used for supervised training and quantitative evaluation. Our model anticipates accidents 1.22 seconds before they occur with 80% recall and 46.92% precision, and achieves the highest mean average precision of 63.98%.
Abstract (English): We propose a Dynamic-Spatial-Attention (DSA) Recurrent Neural Network (RNN) for anticipating accidents in dashcam videos (Fig. 1.1). Our DSA-RNN learns to (1) distribute soft attention to candidate objects dynamically to gather subtle cues and (2) model the temporal dependencies of all cues to robustly anticipate an accident. Anticipating accidents is much less addressed than anticipating events such as changing lanes or making turns, since accidents are rarely observed and tend to happen suddenly in many different ways. To overcome these challenges, we (1) utilize a state-of-the-art object detector [1] and tracking-by-detection [2] to detect and track candidate objects, and (2) incorporate full-frame and object-based appearance and motion features in our model. We also harvest a diverse dataset of 968 dashcam accident videos from the web (Fig. 5.1). The dataset is unique in that every video contains an accident, covering many types (e.g., a motorbike hits a car, a car hits another car, etc.). We manually mark the time and location of each accident and use these annotations as supervision to train and evaluate our method. We show that our method anticipates accidents about 1.22 seconds before they occur with 80% recall and 46.92% precision. Most importantly, it achieves the highest mean average precision (63.98%), outperforming baselines without attention or an RNN.
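To make the pipeline described in the abstracts concrete, the following Python sketch illustrates, under stated assumptions, how soft attention over tracked candidate objects can be fused with a full-frame feature and passed through a recurrent update that emits a per-frame accident probability. This is not the thesis implementation: the feature dimensions, random parameters, and plain tanh recurrence are stand-ins (the thesis uses LSTM cells trained end-to-end with an anticipation loss) chosen only to show the data flow.

# Minimal sketch of the dynamic-spatial-attention idea (illustrative, not the authors' code).
import numpy as np

rng = np.random.default_rng(0)

D_obj, D_frame, D_hidden = 256, 256, 128   # assumed feature sizes

# Illustrative parameters; in the real model these are learned end-to-end.
W_att = rng.normal(scale=0.01, size=(D_hidden, D_obj))        # attention scoring
W_in  = rng.normal(scale=0.01, size=(D_hidden, D_obj + D_frame))
W_h   = rng.normal(scale=0.01, size=(D_hidden, D_hidden))
w_out = rng.normal(scale=0.01, size=(D_hidden,))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(h, frame_feat, obj_feats):
    """One time step: attend over candidate objects, fuse with the frame feature,
    update the hidden state, and output an accident probability."""
    scores = obj_feats @ W_att.T @ h          # (num_objects,) attention logits
    alpha = softmax(scores)                   # soft-attention weights over objects
    attended = alpha @ obj_feats              # attention-weighted object feature
    x = np.concatenate([attended, frame_feat])
    h_new = np.tanh(W_in @ x + W_h @ h)       # plain RNN cell (LSTM in the thesis)
    p_accident = sigmoid(w_out @ h_new)       # probability an accident will occur
    return h_new, p_accident, alpha

# Toy rollout over a 100-frame clip with 5 tracked candidate objects per frame.
h = np.zeros(D_hidden)
for t in range(100):
    frame_feat = rng.normal(size=D_frame)     # stand-in for a full-frame CNN feature
    obj_feats = rng.normal(size=(5, D_obj))   # stand-ins for per-object features
    h, p, alpha = step(h, frame_feat, obj_feats)

At test time, an alarm would be raised once the accident probability exceeds a threshold; the gap between that frame and the annotated accident frame gives the time-to-accident used in the reported recall and precision figures.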
Table of Contents:
1 Introduction
1.1 Motivation
1.2 Problem Description
1.3 Main Contribution
2 Related Work
2.1 Risk Assessment
2.2 Anticipation
2.3 RNN with Attention
2.4 Dashcam Video Dataset
3 Preliminaries
3.1 Object Detection
3.2 Multiple Object Tracking
3.3 Recurrent Neural Network
3.3.1 Standard RNN
3.3.2 Long Short-Term Memory Cells
3.3.3 Gated Recurrent Unit Cells
4 Our Method
4.1 Global RNN and Local Tracking RNN
4.2 Dynamic Spatial Attention
4.3 Combining Frame-level and Tracklet-level Features
4.4 Training Procedure
4.4.1 Anticipation Loss
5 Experiment
5.1 Dashcam Accident Dataset
5.2 Fine-tune Faster-RCNN
5.2.1 Object Detection Dataset
5.2.2 Detection Result
5.3 Implementation Details
5.3.1 Features
5.3.2 Candidate Objects
5.3.3 Model Learning
5.4 Evaluation Metric
5.5 Baseline Methods
5.6 Results
5.6.1 Average Time-to-Accident (ToA)
5.6.2 Typical Examples
6 Conclusion
References
[1] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in NIPS, 2015.
[2] Y. Xiang, A. Alahi, and S. Savarese, “Learning to track: Online multi-object tracking by decision making,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 4705–4713, 2015.
[3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
[4] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in ICCV, 2013.
[5] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, 1997.
[6] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[7] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in CVPR, 2012.
[8] Google Inc., “Google self-driving car project monthly report,” May 2015.
[9] National Highway Traffic Safety Administration, “2012 motor vehicle crashes: overview,” 2013.
[10] A. Jain, A. Singh, H. S. Koppula, S. Soh, and A. Saxena, “Recurrent neural networks for driver activity anticipation via sensory-fusion architecture,” in ICRA, 2016.
[11] V. V. Valenzuela, R. D. Lins, and H. M. De Oliveira, “Application of enhanced-2d-cwt in topographic images for mapping landslide risk areas,” in International Conference Image Analysis and Recognition, pp. 380–388, Springer, 2013.
[12] S. M. Arietta, A. A. Efros, R. Ramamoorthi, and M. Agrawala, “City forensics: Using visual elements to predict non-visual city attributes,” IEEE transactions on visualization and computer graphics, vol. 20, no. 12, pp. 2624–2633, 2014.
[13] A. Khosla, B. An An, J. J. Lim, and A. Torralba, “Looking beyond the visible scene,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3710–3717, 2014.
[14] M. S. Ryoo, “Human activity prediction: Early recognition of ongoing activities from streaming videos,” in ICCV, 2011.
[15] M. Hoai and F. De la Torre, “Max-margin early event detectors,” in CVPR, 2012.
[16] T. Lan, T.-C. Chen, and S. Savarese, “A hierarchical representation for future action prediction,” in ECCV, 2014.
[17] K. M. Kitani, B. D. Ziebart, J. A. D. Bagnell, and M. Hebert, “Activity forecasting,” in ECCV, 2012.
[18] J. Yuen and A. Torralba, “A data-driven approach for event prediction,” in ECCV, 2010.
[19] J. Walker, A. Gupta, and M. Hebert, “Patch to the future: Unsupervised visual prediction,” in CVPR, 2014.
[20] Z. Wang, M. Deisenroth, H. Ben Amor, D. Vogt, B. Schölkopf, and J. Peters, “Probabilistic modeling of human movements for intention inference,” in RSS, 2012.
[21] H. S. Koppula and A. Saxena, “Anticipating human activities using object affordances for reactive robotic response,” PAMI, vol. 38, no. 1, pp. 14–29, 2016.
[22] H. S. Koppula, A. Jain, and A. Saxena, “Anticipatory planning for human-robot teams,” in ISER, 2014.
[23] J. Mainprice and D. Berenson, “Human-robot collaborative manipulation planning using early prediction of human motion,” in IROS, 2013.
[24] H. Berndt, J. Emmert, and K. Dietmayer, “Continuous driver intention recognition with hidden markov models,” in Intelligent Transportation Systems, 2008.
[25] B. Frohlich, M. Enzweiler, and U. Franke, “Will this car change the lane? - turn signal recognition in the frequency domain,” in Intelligent Vehicles Symposium (IV), 2014.
[26] P. Kumar, M. Perrollaz, S. Lefèvre, and C. Laugier, “Learning-based approach for online lane change intention prediction,” in Intelligent Vehicles Symposium (IV), 2013.
[27] M. Liebner, M. Baumann, F. Klanner, and C. Stiller, “Driver intent inference at urban intersections using the intelligent driver model,” in Intelligent Vehicles Symposium (IV), 2012.
[28] B. Morris, A. Doshi, and M. Trivedi, “Lane change intent prediction for driver assistance: On-road design and evaluation,” in Intelligent Vehicles Symposium (IV), 2011.
[29] A. Doshi, B. Morris, and M. Trivedi, “On-road prediction of driver’s intent with multimodal sensory cues,” IEEE Pervasive Computing, vol. 10, no. 3, pp. 22–34, 2011.
[30] M. M. Trivedi, T. Gandhi, and J. McCall, “Looking-in and looking-out of a vehicle: Computer-vision-based enhanced vehicle safety,” IEEE Transactions on Intelligent Transportation Systems, vol. 8, no. 1, pp. 108–120, 2007.
[31] A. Jain, H. S. Koppula, B. Raghavan, S. Soh, and A. Saxena, “Car that knows before you do: Anticipating maneuvers via learning temporal driving models,” in ICCV, 2015.
[32] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, “Describing videos by exploiting temporal structure,” in ICCV, 2015.
[33] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” arXiv preprint arXiv:1502.03044, 2015.
[34] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention,” in NIPS, 2014.
[35] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” in ICLR, 2015.
[36] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmentation and recognition using structure from motion point clouds,” in ECCV, 2008.
[37] B. Leibe, N. Cornelis, K. Cornelis, and L. V. Gool, “Dynamic 3d scene analysis from a moving vehicle,” in CVPR, 2007.
[38] T. Scharwächter, M. Enzweiler, S. Roth, and U. Franke, “Efficient multi-cue scene segmentation,” in GCPR, 2013.
[39] M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset,” in CVPR Workshop on The Future of Datasets in Vision, 2015.
[40] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.
[41] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 886–893, IEEE, 2005.
[42] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
[43] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in European Conference on Computer Vision, pp. 346–361, Springer, 2014.
[44] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, 2015.
[45] W. Choi, “Near-online multi-target tracking with aggregated local flow descriptor,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 3029–3037, 2015.
[46] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in IJCAI, 1981.
[47] G. Farnebäck, “Two-frame motion estimation based on polynomial expansion,” Image analysis, pp. 363–370, 2003.
[48] H. W. Kuhn, “The hungarian method for the assignment problem,” 50 Years of Integer Programming 1958-2008, pp. 29–47, 2010.
[49] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
[50] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” arXiv preprint arXiv:1211.5063, 2012.
[51] P. J. Werbos, “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[52] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
[53] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision, pp. 740–755, Springer, 2014.
[54] C. Vondrick, D. Patterson, and D. Ramanan, “Efficiently scaling up crowdsourced video annotation,” International Journal of Computer Vision, pp. 1–21, doi:10.1007/s11263-012-0564-1.
[55] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[56] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[57] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
[58] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “C3D: generic features for video analysis,” arXiv preprint arXiv:1412.0767, 2014.
[59] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[60] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.