Author: Gueter Josmy Faure
Thesis Title (Chinese): HIT: A Holistic Interaction Deep Learning Network for Action Detection
Thesis Title (English): Holistic Interaction Transformer Network for Spatio-Temporal Action Detection
Advisor: Lai, Shang-Hong
Committee Members: Hsu, Chui-Ting
Lin, Huei-Yung
Chiang, Chen-Kuo
Keywords: interaction modeling
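In this thesis, we propose a novel model, the Holistic Interaction Transformer Network (HIT). HIT incorporates interaction information from several sources: interactions between people, interactions between people and objects, and the motion of the human body itself. By bringing in hand motion and body pose, cues that are easily overlooked yet critical, HIT achieves better predictions on action recognition tasks than previous models.
The proposed HIT network is a bi-modal deep learning framework comprising an RGB modality and a pose modality. Each modality separately models person-person interaction, person-object interaction, and the hand motion of the target person. Each modality contains an Intra-Modality Aggregation module (IMA), which selectively merges these interaction units, condensing the information they carry while filtering out irrelevant features. An information channel between the RGB and pose modalities also lets information pass between the two. After each modality completes its modeling, the features from the two modalities are fused by a novel Attentive Fusion Mechanism (AFM). Because both modalities focus on learning local actions at the current time step, we add a temporal interaction module on top of the AFM to bring in temporal information and help the model classify actions. We evaluate the HIT network on the UCF101-24 and J-HMDB datasets. The results show that this bi-modal architecture outperforms previous models, reaching 84.8% and 83.8% mAP on the two datasets, respectively.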
Actions are about how we interact with the environment, including other people, objects, and ourselves. In this thesis, we propose a novel multi-modal Holistic Interaction Transformer Network (HIT) that leverages largely ignored but critical hand and pose information essential to most human actions.
The proposed HIT network is a comprehensive bi-modal framework that comprises an RGB stream and a pose stream. Each stream separately models person interaction, object interaction, and hand interaction. Within each stream, an Intra-Modality Aggregation module (IMA) selectively merges the individual interaction units, which yields a denser interaction representation while filtering out irrelevant features. A form of "message passing" continuously shares information from the RGB stream to the pose stream. Once each modality has been processed, the resulting features are fused by our proposed Attentive Fusion Mechanism (AFM). Since each stream focuses mainly on learning local action patterns (with an occasional glance at the memory), we add a Temporal Interaction unit on top of the AFM. This unit extracts relevant cues from the temporal context to better classify the occurring actions. We evaluate the proposed model through extensive experiments on the UCF101-24 and J-HMDB datasets and show that it achieves state-of-the-art results on both. Additional experiments on two other challenging datasets further demonstrate the model's generalization capabilities.
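The idea of attention-based fusion of two streams can be sketched in a few lines. The abstract does not specify how the AFM is built, so this is not the thesis's AFM: it is a minimal, dependency-free illustration in which each stream's feature vector is scored against a query vector, the scores are softmax-normalised, and the output is the weighted sum. The names `attentive_fuse`, `softmax`, and `query` are hypothetical and chosen only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_fuse(rgb_feat, pose_feat, query):
    """Fuse two same-length feature vectors with attention weights.

    Hypothetical illustration, not the thesis's AFM: each stream is
    scored by its scaled dot product with `query`, the two scores are
    softmax-normalised, and the streams are combined by weighted sum.
    """
    if not (len(rgb_feat) == len(pose_feat) == len(query)):
        raise ValueError("all vectors must have the same length")
    scale = math.sqrt(len(query))
    scores = [
        sum(q * x for q, x in zip(query, feat)) / scale
        for feat in (rgb_feat, pose_feat)
    ]
    weights = softmax(scores)
    fused = [
        weights[0] * r + weights[1] * p
        for r, p in zip(rgb_feat, pose_feat)
    ]
    return fused, weights
```

With a zero query, both streams receive equal weight. A query aligned with one stream shifts nearly all of the weight to that stream, which is the selective behaviour that attention-based fusion is meant to provide.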
1 Introduction
1.1 Problem Statement
1.2 Motivation
1.3 Contributions
1.4 Thesis Organization
2 Related Work
2.1 Video Classification
2.2 Spatio-Temporal Action Detection
2.3 Interaction Modeling
2.3.1 Attention Mechanism for Interaction Modeling
2.3.2 Multi-modal Action Detection
3 Proposed Method
3.1 Entity Selection
3.2 The RGB Branch
3.3 The Pose Branch
3.4 The Attentive Fusion Module (AFM)
3.5 Temporal Interaction Unit
3.6 Discussion: Intra-Modality Aggregation
3.7 Loss Function
4 Experiments
4.1 Datasets
4.2 Implementation Details
4.2.1 Person and Object Detector
4.2.2 Keypoints Detection and Processing
4.2.3 Backbone
4.2.4 Training and Evaluation
4.3 State-of-the-Art Comparison
4.4 Ablation Study
4.4.1 Sequence of Interaction Units
4.4.2 Network Depth
4.4.3 Attentive Fusion Module (AFM)
4.4.4 Late vs. Early Fusion
4.4.5 Intra-Modality Aggregator (IMA)
4.4.6 Interaction Modeling methods
4.4.7 Importance of each Modality and Hand Features
4.4.8 Importance of Different Types of Interactions
4.5 Qualitative Results
4.5.1 Qualitative Results on Different Modules
4.5.2 Comparative Qualitative Results
4.5.3 Failure cases
5 Conclusion
References
