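Author (Chinese): 王庭偉
Author (English): Wang, Ting-Wei
Thesis title (Chinese): 基於Transformer模型的多模態行人穿越意圖預測任務
Thesis title (English): Multi-Modal Pedestrian Crossing Intention Prediction with Transformer-Based Model
Advisor (Chinese): 賴尚宏
Advisor (English): Lai, Shang-Hong
Committee members (Chinese): 許秋婷; 陳煥宗; 陳奕廷
Committee members (English): Hsu, Chiou-Ting; Chen, Hwann-Tzong; Chen, Yi-Ting
Degree: Master's
Institution: National Tsing Hua University
Department: Institute of Information Systems and Applications
Student ID: 109065509
Year of publication (ROC calendar): 112 (2023)
Academic year of graduation: 111
Language: English
Number of pages: 62
Keywords (translated from Chinese): pedestrian prediction; pedestrian protection; pedestrian behavior prediction; Advanced Driver Assistance Systems; automotive computer vision; pedestrian crossing intention; automotive deep learning
Keywords (English): pedestrian crossing intention prediction; pedestrian action prediction; pedestrian protection; Advanced Driver Assistance Systems; computer vision in automotive; deep learning in automotive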
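Excellent new technologies not only bring convenience to people but also improve personal safety and change the way people live. The spread of autonomous driving and advanced driver assistance systems has the potential to prevent thousands of car accidents and casualties. Among these capabilities, predicting and protecting pedestrians is an especially urgent development priority for such systems. Computer-vision-based prediction of pedestrian crossing intention or action can help these systems assess the risk posed by pedestrians in front of the vehicle ahead of time, and related research has continued to appear in recent years. However, from our observation of prior literature, we believe that much information worth exploring has not yet been properly used, and that richer information can help deep neural networks understand the complex and variable behavior and dynamics of pedestrians.
To address these issues, we propose a Transformer-based multi-modal framework for pedestrian crossing intention prediction. Our method draws on the Transformer's strong temporal modeling ability and on the correlations and complementary effects among multiple input modalities, allowing the model to perform stably on this kind of prediction task; the Transformer's parallelizability also improves the feasibility of running such systems in real time on vehicles. The method is also the first to represent and train on road environment information in a novel way, so that this information can be used more fully. We additionally use 3D pose data lifted from pedestrians' 2D poses, together with 3D head orientation information, to give the model a better grasp of the pedestrian's overall posture. Finally, our experimental results show that the proposed method achieves state-of-the-art accuracy on benchmark datasets.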
The popularity of autonomous driving and advanced driver assistance systems can potentially reduce thousands of car accidents and casualties. In particular, pedestrian prediction and protection is an urgent development priority for such systems. Prediction of pedestrians' intentions of crossing the road or their actions based on computer vision can help such systems to assess the risk of pedestrians in front of vehicles in advance. Relevant research has been continuously reported in recent years.
However, we believe that previous works have not fully exploited all the available information to make the prediction.
We propose a multi-modal pedestrian crossing intention prediction framework based on the Transformer model to address the above issues. Our method exploits the strong sequential modeling ability and the parallelization advantage of the Transformer, enabling the model to perform stably and smoothly in this task. We also represent traffic environment information in a novel way, allowing such information to be fully exploited. Moreover, we use lifted 3D human pose data and 3D head orientation data for pedestrians, allowing the model to understand pedestrian posture better. Finally, our experimental results show that the proposed system achieves state-of-the-art accuracy on benchmark datasets.
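As a rough illustration of the general idea in the abstract (several per-frame input modalities combined and passed through attention over time), the following is a minimal sketch only. It is not the thesis architecture: the early-fusion-by-concatenation step, the single attention head with identity projections, the mean pooling, and all function names and feature dimensions are assumptions made for this example.

```python
import math


def fuse_modalities(bbox_seq, pose_seq, head_seq):
    """Concatenate per-frame feature vectors (hypothetical early fusion)."""
    assert len(bbox_seq) == len(pose_seq) == len(head_seq)
    return [b + p + h for b, p, h in zip(bbox_seq, pose_seq, head_seq)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the inputs themselves
    (identity projections), so there are no learned parameters here.
    """
    d = len(seq[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in seq:
        # Similarity of this frame to every frame in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in seq]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all frames, one output vector per query frame.
        outputs.append(
            [sum(w[t] * seq[t][j] for t in range(len(seq))) for j in range(d)]
        )
    return outputs, weights


def crossing_probability(seq, classifier_weights, bias=0.0):
    """Attend over time, mean-pool, then apply a logistic output unit."""
    attended, _ = self_attention(seq)
    n_frames = len(attended)
    dim = len(attended[0])
    pooled = [sum(frame[j] for frame in attended) / n_frames for j in range(dim)]
    logit = sum(w * x for w, x in zip(classifier_weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # Toy inputs: 3 frames; 4-D box, 2-D pose summary, 3-D head orientation.
    bbox = [[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.35, 0.45], [0.12, 0.2, 0.4, 0.5]]
    pose = [[0.5, 0.6], [0.5, 0.7], [0.6, 0.7]]
    head = [[0.0, 0.1, 0.2], [0.1, 0.1, 0.2], [0.3, 0.1, 0.2]]
    fused = fuse_modalities(bbox, pose, head)
    print(crossing_probability(fused, [0.1] * len(fused[0])))
```

A real system would use learned query, key and value projections, multiple heads, positional encodings and trained classifier weights; the classifier weights above are placeholders chosen only so the example runs.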
1 Introduction
1.1 Motivation
1.2 Contributions
2 Related Work
2.1 Sequential Modeling
2.2 Exploration of Novel Inputs
2.3 The Rise of the Transformer Model
3 Proposed Method
3.1 Problem Statement
3.2 Module Architecture
3.2.1 Feature Pre-processing Module
3.2.2 Prediction Module
3.2.3 Loss Function
3.2.4 Metrics
4 Experiments
4.1 Experimental Setting
4.1.1 Datasets
4.1.2 Implementation Details
4.2 Evaluation of the Proposed Model
4.3 Ablation Study
4.3.1 Importance of Input Data
4.3.2 Comparison of Different Fusion Methods
4.3.3 Comparison of Traffic Awareness Feature Fusions
4.4 Qualitative Justification
4.4.1 Discussion of Failure Case and Future Directions
5 Conclusions
References
 
 
 
 