Author (Chinese): 費蓋德
Author (English): Gueter Josmy Faure
Title (Chinese): HIT: 全面交互式動作偵測深度學習網路
Title (English): Holistic Interaction Transformer Network for Spatio-Temporal Action Detection
Advisor (Chinese): 賴尚宏
Advisor (English): Lai, Shang-Hong
Committee Members (Chinese): 許秋婷, 林惠勇, 江振國
Committee Members (English): Hsu, Chiou-Ting; Lin, Huei-Yung; Chiang, Chen-Kuo
Degree: Master
University: National Tsing Hua University
Department: Computer Science
Student ID: 109062421
Year of Publication (ROC calendar): 111 (2022)
Graduation Academic Year: 110
Language: English
Number of Pages: 44
Keywords (Chinese): 交互式動作偵測
Keywords (English): interaction modeling
Abstract (Chinese):
In this thesis, we propose a novel model, the Holistic Interaction Transformer Network (HIT). HIT incorporates multiple kinds of interaction information, including person-to-person interactions, person-to-object interactions, and the actor's own body motion. By introducing easily overlooked yet critical cues such as hand movements and human pose, HIT achieves better results on action detection tasks than previous models.
The proposed HIT network is a bi-modal deep learning framework consisting of an RGB stream and a pose stream, which separately model person-to-person interactions, person-to-object interactions, and the hand motion of the target actor. Each stream contains an Intra-Modality Aggregation (IMA) module that selectively merges the aforementioned interaction units, condensing the information they carry while filtering out irrelevant features. A message-passing channel between the RGB and pose streams allows information to flow between the two modalities. After each stream finishes its modeling, we fuse the features produced by the two modalities with a novel Attentive Fusion Mechanism (AFM). Since both the RGB and pose streams focus on learning local actions at the current time step, we add a Temporal Interaction unit on top of the AFM to inject temporal information and help the model classify actions more accurately. We evaluate HIT on the UCF101-24 and J-HMDB datasets; the results show that this bi-modal architecture outperforms previous models, reaching 84.8% and 83.8% mAP on the two datasets, respectively.
Abstract (English):
Actions are about how we interact with the environment, including other people, objects, and ourselves. In this thesis, we propose a novel multi-modal Holistic Interaction Transformer Network (HIT) that leverages largely ignored but critical hand and pose information essential to most human actions.
The proposed HIT network is a comprehensive bi-modal framework comprising an RGB stream and a pose stream, each of which separately models person interaction, object interaction, and hand interaction. Within each stream, we include an Intra-Modality Aggregation module (IMA) that selectively merges individual interaction units, allowing a denser interaction representation while filtering out irrelevant features. A form of "message passing" is also used for continuous information sharing from the RGB to the pose stream. After each modality has completed its work, the resulting features are fused using our proposed Attentive Fusion Mechanism (AFM). Since each stream focuses mainly on learning local action patterns (with occasional glances at the memory), we add a Temporal Interaction unit on top of the AFM. This unit helps us extract relevant cues from the temporal context to better classify the occurring actions. We demonstrate the proposed model through extensive experiments on the UCF101-24 and J-HMDB datasets and show that it achieves state-of-the-art results on both. Additional experiments on two other challenging datasets further exhibit our model's generalization capabilities.
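The HIT implementation itself is not reproduced in this record; the following is a minimal PyTorch sketch of the bi-modal design the abstract describes, written under several stated assumptions: the interaction units are approximated with standard cross-attention (nn.MultiheadAttention), IMA is reduced to a concatenate-and-project step, the attentive fusion is approximated by a learned two-way gate, and all module names (InteractionUnit, Stream, HITSketch), feature dimensions, and the toy inputs are illustrative rather than taken from the thesis.

# Hypothetical sketch of the bi-modal HIT design described in the abstract.
# Module names, dimensions, and layer choices are illustrative assumptions.
import torch
import torch.nn as nn


class InteractionUnit(nn.Module):
    """Cross-attention from actor features (queries) to a context set
    (keys/values): other persons, detected objects, or hand regions."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, person: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(person, context, context)
        return self.norm(person + out)


class Stream(nn.Module):
    """One modality (RGB or pose): person/object/hand interaction units,
    then a simplified intra-modality aggregation (IMA) that merges the
    three interaction outputs into one representation."""

    def __init__(self, dim: int):
        super().__init__()
        self.person_unit = InteractionUnit(dim)
        self.object_unit = InteractionUnit(dim)
        self.hand_unit = InteractionUnit(dim)
        self.ima = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.LayerNorm(dim))

    def forward(self, person, others, objects, hands, message=None):
        if message is not None:  # message passed in from the other stream
            person = person + message
        p = self.person_unit(person, others)
        o = self.object_unit(p, objects)
        h = self.hand_unit(o, hands)
        return self.ima(torch.cat([p, o, h], dim=-1))


class HITSketch(nn.Module):
    """Bi-modal sketch: RGB stream -> (message) -> pose stream, a gated
    stand-in for attentive fusion, then a temporal interaction unit that
    attends over cached clip-level memory features before classification."""

    def __init__(self, dim: int = 256, num_classes: int = 24):
        super().__init__()
        self.rgb_stream = Stream(dim)
        self.pose_stream = Stream(dim)
        self.fusion_gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))
        self.temporal_unit = InteractionUnit(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, rgb_feats: dict, pose_feats: dict, memory: torch.Tensor):
        rgb = self.rgb_stream(**rgb_feats)
        pose = self.pose_stream(**pose_feats, message=rgb)  # RGB -> pose message passing
        weights = self.fusion_gate(torch.cat([rgb, pose], dim=-1))
        fused = weights[..., :1] * rgb + weights[..., 1:] * pose
        fused = self.temporal_unit(fused, memory)  # temporal context
        return self.classifier(fused)


# Toy usage with random actor/object/hand features (batch=2, 3 actors, dim=256).
if __name__ == "__main__":
    B, N, D = 2, 3, 256
    feats = lambda: dict(person=torch.randn(B, N, D), others=torch.randn(B, N, D),
                         objects=torch.randn(B, 5, D), hands=torch.randn(B, 2, D))
    model = HITSketch(dim=D, num_classes=24)  # UCF101-24 has 24 action classes
    logits = model(feats(), feats(), memory=torch.randn(B, 10, D))
    print(logits.shape)  # torch.Size([2, 3, 24]) -- per-actor action logits

The per-actor output shape reflects the detection setting: each detected person gets its own action scores, with the temporal memory providing context beyond the current keyframe.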
Table of Contents:
1 Introduction
  1.1 Problem Statement
  1.2 Motivation
  1.3 Contributions
  1.4 Thesis Organization
2 Related Work
  2.1 Video Classification
  2.2 Spatio-Temporal Action Detection
  2.3 Interaction Modeling
    2.3.1 Attention Mechanism for Interaction Modeling
    2.3.2 Multi-modal Action Detection
3 Proposed Method
  3.1 Entity Selection
  3.2 The RGB Branch
  3.3 The Pose Branch
  3.4 The Attentive Fusion Module (AFM)
  3.5 Temporal Interaction Unit
  3.6 Discussion: Intra-Modality Aggregation
  3.7 Loss Function
4 Experiments
  4.1 Datasets
  4.2 Implementation Details
    4.2.1 Person and Object Detector
    4.2.2 Keypoints Detection and Processing
    4.2.3 Backbone
    4.2.4 Training and Evaluation
  4.3 State-of-the-Art Comparison
  4.4 Ablation Study
    4.4.1 Sequence of Interaction Units
    4.4.2 Network Depth
    4.4.3 Attentive Fusion Module (AFM)
    4.4.4 Late vs. Early Fusion
    4.4.5 Intra-Modality Aggregator (IMA)
    4.4.6 Interaction Modeling Methods
    4.4.7 Importance of Each Modality and Hand Features
    4.4.8 Importance of Different Types of Interactions
  4.5 Qualitative Results
    4.5.1 Qualitative Results on Different Modules
    4.5.2 Comparative Qualitative Results
    4.5.3 Failure Cases
5 Conclusion
References