Author (English): Wu, Tz-Ying
Thesis title (English): Anticipating Human Intention and Monitoring Liquid Pouring via Multimodal Learning
Advisor (English): Sun, Min
Committee members (English): Chan, Liwei
Chiu, Wei-Chen
Keywords (English): deep learning; multimodal data; reinforcement learning
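Anticipating human intention by observing a person's actions has many applications. For example, a person who picks up a phone and a charger intends to charge the phone. By anticipating this intention, an intelligent system can guide the user to the nearest charging outlet. We propose a wrist-worn, motion-triggered sensing system for anticipating intentions in daily life; the system can stably observe the user's actions. Its core consists of a Recurrent Neural Network (RNN) and a Policy Network: the RNN encodes visual and motion information into features and uses them to anticipate intention, while the Policy Network triggers visual processing as rarely as possible to reduce the computation required. We train the whole network jointly with policy gradient and a cross-entropy loss. To evaluate the system, we collected the first daily-intention dataset, containing 2379 videos that cover 34 intentions composed of 164 distinct action sequences. Our method triggers only 29% of the visual computation on average across three users, yet reaches accuracies of 92.68%, 90.85%, and 97.56%, respectively. For the liquid pouring monitoring task, we learn to monitor whether a pour succeeds and study the interplay between monitoring and manipulation. Pouring liquid is a delicate manipulation task that requires continuously monitoring the state of the environment in order to adjust future actions and avoid spilling. Here we combine a wrist-worn camera and an inertial measurement unit (IMU) to infer the containers' states and their relative position and speed. We also capture the 3D trajectory of the hand during training and treat manipulation as a 3D trajectory forecasting task. Given many demonstrations of successful and failed pours, we propose a new method that learns to monitor liquid pouring, together with two auxiliary tasks, from synchronized video, IMU, and 3D trajectory data. First, we treat 3D trajectory forecasting as an auxiliary regression task, used to learn a representation suited to both monitoring and manipulation. Second, we use adversarial training so that the model can generate one-step-ahead trajectory predictions to augment the demonstration data. Finally, we treat the initial container state as another auxiliary classification task, used to learn a representation sensitive to container states. With these new components, our method achieves strong results on both monitoring and manipulation compared with a baseline without auxiliary tasks.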
Anticipating human intention by observing one's actions has many applications. For instance, picking up a cellphone and then a charger (actions) implies that one wants to charge the cellphone (intention). By anticipating the intention, an intelligent system can guide the user to the closest power outlet. We propose an on-wrist, motion-triggered sensing system for anticipating daily intentions, where the on-wrist sensors let us persistently observe one's actions. The core of the system is a novel Recurrent Neural Network (RNN) and Policy Network (PN): the RNN encodes visual and motion observations to anticipate intention, and the PN parsimoniously triggers the processing of visual observations to reduce the computation required. We jointly train the whole network using policy gradient and a cross-entropy loss. To evaluate the system, we collect the first daily "intention" dataset, consisting of 2379 videos with 34 intentions and 164 unique action sequences. Our method achieves 92.68%, 90.85%, and 97.56% accuracy on three users while processing only 29% of the visual observations on average. For monitoring liquid pouring, we aim to learn to monitor whether a pour succeeds or fails, and to study the interplay between monitoring and manipulation. Liquid pouring is a subtle manipulation task that requires continuously monitoring environmental states to adjust future actions and avoid spilling.
In this work, we combine a chest-mounted camera and a wrist-mounted IMU sensor to implicitly infer the containers' states and their relative position and speed (i.e., environmental states). We further capture the 3D hand trajectory during training and treat manipulation as a 3D trajectory forecasting task. Given many success and failure demonstrations of liquid pouring with synchronized video, IMU, and 3D trajectory, we propose a novel method for monitoring with auxiliary tasks. First, we treat 3D trajectory forecasting as an auxiliary "regression" task to learn a good representation for both monitoring and manipulation. Second, we let the model generate one-step future predictions to augment the demonstrations through an adversarial training procedure. Finally, we treat the initial container states as another auxiliary classification task to learn a representation sensitive to container states. With these novel components, we achieve state-of-the-art monitoring and manipulation performance compared with baseline methods without auxiliary tasks and/or demonstration generation.
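The two training objectives named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the reward shape (correctness minus a trigger cost), the `cost` weight, the loss weights `w_traj` and `w_state`, and all function names are assumptions made for illustration, and the adversarial demonstration-generation step is left out. The first function pairs a cross-entropy loss with a REINFORCE-style policy-gradient term for the motion-triggered intention system; the second sums a monitoring loss with the two auxiliary task losses for liquid pouring.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class under predicted probabilities."""
    return -math.log(probs[label])


def reinforce_loss(trigger_probs, actions, reward, baseline=0.0):
    """REINFORCE surrogate: -(reward - baseline) * sum_t log pi(a_t)."""
    log_pi = 0.0
    for p, a in zip(trigger_probs, actions):
        # p is the probability of triggering vision at step t; a is the sampled action.
        log_pi += math.log(p if a == 1 else 1.0 - p)
    return -(reward - baseline) * log_pi


def anticipation_loss(class_probs, label, trigger_probs, actions, cost=0.5):
    """Joint loss: cross-entropy plus a policy-gradient term (assumed reward shape)."""
    predicted = max(range(len(class_probs)), key=class_probs.__getitem__)
    correct = 1.0 if predicted == label else 0.0
    trigger_ratio = sum(actions) / len(actions)
    # Hypothetical reward: correct anticipation, penalised by how often vision ran.
    reward = correct - cost * trigger_ratio
    loss = cross_entropy(class_probs, label) + reinforce_loss(trigger_probs, actions, reward)
    return loss, trigger_ratio


def monitoring_loss(success_prob, success, traj_pred, traj_true,
                    state_probs, state_label, w_traj=1.0, w_state=1.0):
    """Multi-task loss: success/failure monitoring plus two auxiliary tasks."""
    # Main task: binary cross-entropy on whether the pour succeeds.
    monitor = -math.log(success_prob if success else 1.0 - success_prob)
    # Auxiliary regression: mean squared error over all 3D trajectory coordinates.
    sq_errors = [(p - t) ** 2
                 for pred, true in zip(traj_pred, traj_true)
                 for p, t in zip(pred, true)]
    trajectory = sum(sq_errors) / len(sq_errors)
    # Auxiliary classification: initial container state.
    state = cross_entropy(state_probs, state_label)
    return monitor + w_traj * trajectory + w_state * state
```

In a real system these scalars would be computed on network outputs and minimised with automatic differentiation; here plain numbers stand in for those outputs, so the arithmetic of each term is easy to check by hand.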
1 Introduction
1.1 Motivation
1.2 Problem Description
1.3 Main Contribution
2 Related Work
2.1 Activity Recognition
2.2 Anticipation
2.3 High-level Behavior Analysis
2.4 Recognition from Wearable Sensors
2.5 Environmental State Estimation
2.6 Robot Liquid Pouring
3 Human Intention Anticipation
3.1 Problem Formulation
3.2 Recurrent and Fusion Model
3.3 RL-based Policy Network
3.3.1 Derivation of Policy Loss
3.4 Learning Representations from Auxiliary Data
3.5 Implementation Details
4 Monitoring Liquid Pouring
4.1 Overview
4.1.1 Problem Formulation
4.1.2 Multimodal Data Fusion
4.2 Monitoring with Auxiliary Tasks
4.2.1 Forecasting 3D Trajectory
4.2.2 Initial Object State Classification
4.2.3 Monitoring Module
4.2.4 Implementation Details
5 Setting and Datasets
5.1 Human Intention Anticipation
5.1.1 Setting of On-wrist Sensors
5.1.2 Datasets
5.2 Monitoring Liquid Pouring
5.2.1 Dataset
6 Experiments
6.1 Human Intention Anticipation
6.1.1 Preliminary Experiments
6.1.2 Motion Triggered Intention Anticipation
6.1.3 Typical Examples
6.2 Monitoring Liquid Pouring
6.2.1 Metrics
6.2.2 Baseline
6.2.3 Cross Trial Experiment
6.2.4 Cross User Experiment
6.2.5 Ablation Study
7 Conclusion and Future Work
7.1 Conclusion
7.2 Future Work
References
