
Detailed Record

Author (Chinese): 巫姿瑩
Author (English): Wu, Tz-Ying
Title (Chinese): 基於多模態資料訓練人類意圖預測及液體傾倒監測
Title (English): Anticipating Human Intention and Monitoring Liquid Pouring via Multimodal Learning
Advisor (Chinese): 孫民
Advisor (English): Sun, Min
Oral Defense Committee (Chinese): 詹力韋、邱維辰
Oral Defense Committee (English): Chan, Liwei; Chiu, Wei-Chen
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Electrical Engineering
Student ID: 104061703
Year of Publication (ROC calendar): 107 (2018)
Academic Year of Graduation: 106
Language: English
Number of Pages: 49
Keywords (Chinese): 深度學習、多模態資料、增強式學習
Keywords (English): deep learning, multimodal data, reinforcement learning
Abstract (Chinese): Anticipating human intention by observing a person's actions has many applications; for example, picking up a cellphone and a charger implies that the person wants to charge the phone. By anticipating the intention, an intelligent system can guide the user to the nearest power outlet. We propose an on-wrist, motion-triggered sensing system for anticipating intentions in daily life; the on-wrist sensors allow the system to persistently observe the user's actions. The core of the system consists of a Recurrent Neural Network (RNN) and a Policy Network: the RNN encodes visual and motion observations into features and uses them to anticipate intentions, while the Policy Network triggers the processing of visual observations as sparingly as possible to reduce the required computation. We jointly train the whole network with a policy gradient method and a cross-entropy loss. To evaluate our system, we collected the first daily-intention dataset, consisting of 2379 videos covering 34 intentions composed of 164 distinct action sequences. Across three users, our method triggers only 29% of the visual computation on average while achieving 92.68%, 90.85%, and 97.56% accuracy, respectively. For the liquid pouring monitoring task, we learn to monitor whether a pouring process succeeds and study the interplay between monitoring and manipulation. Pouring liquid is a very subtle manipulation task that requires continuously monitoring environmental states to adjust future actions and prevent spilling. In this task, we combine a chest-mounted camera and a wrist-mounted inertial measurement unit (IMU) to infer the containers' states and their relative position and speed. We additionally capture the 3D hand trajectory during training and treat manipulation as a 3D trajectory forecasting task. Given many success and failure demonstrations of liquid pouring, we propose a novel method that learns, from the synchronized video, IMU, and 3D trajectory data, to monitor liquid pouring together with two auxiliary tasks. First, we treat 3D trajectory forecasting as an auxiliary regression task to learn a representation suitable for both monitoring and manipulation. Second, we use adversarial training so that the model can generate one-step future trajectory predictions to augment the demonstrations. Finally, we treat the initial container states as another auxiliary classification task to learn a representation sensitive to container states. With these novel components, our method achieves strong monitoring and manipulation performance compared with baseline methods without the auxiliary tasks.
Abstract (English): Anticipating human intention by observing one's actions has many applications. For instance, picking up a cellphone and then a charger (actions) implies that one wants to charge the cellphone (intention). By anticipating the intention, an intelligent system can guide the user to the closest power outlet. We propose an on-wrist, motion-triggered sensing system for anticipating daily intentions, where the on-wrist sensors help us persistently observe one's actions. The core of the system is a novel Recurrent Neural Network (RNN) and Policy Network (PN): the RNN encodes visual and motion observations to anticipate intention, while the PN parsimoniously triggers the processing of visual observations to reduce the computation requirement. We jointly train the whole network using policy gradient and cross-entropy loss. For evaluation, we collected the first daily "intention" dataset, consisting of 2379 videos with 34 intentions and 164 unique action sequences. Our method achieves 92.68%, 90.85%, and 97.56% accuracy on three users while processing only 29% of the visual observations on average.
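The following is a minimal sketch, not the thesis implementation, of the motion-triggered training idea described above: an RNN fuses motion features with visual features that are only computed when a policy head decides to trigger them, and training combines a cross-entropy loss on the intention label with a REINFORCE-style policy-gradient term that penalizes triggers. It assumes a PyTorch-style setup; all dimensions and names (MOTION_DIM, VISUAL_DIM, cost, etc.) are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

MOTION_DIM, VISUAL_DIM, HIDDEN, N_INTENTIONS = 6, 2048, 128, 34

class MotionTriggeredAnticipator(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(MOTION_DIM + VISUAL_DIM, HIDDEN)
        self.policy = nn.Linear(HIDDEN, 2)           # trigger vision or not
        self.classifier = nn.Linear(HIDDEN, N_INTENTIONS)

    def forward(self, motion_seq, visual_seq):
        # motion_seq: (T, B, MOTION_DIM); visual_seq: (T, B, VISUAL_DIM)
        T, B, _ = motion_seq.shape
        h = motion_seq.new_zeros(B, HIDDEN)
        last_visual = visual_seq.new_zeros(B, VISUAL_DIM)
        log_probs, triggers, logits = [], [], []
        for t in range(T):
            probs = F.softmax(self.policy(h), dim=-1)
            dist = torch.distributions.Categorical(probs)
            a = dist.sample()                        # 1 = process this frame
            log_probs.append(dist.log_prob(a))
            triggers.append(a.float())
            mask = a.float().unsqueeze(-1)
            # Reuse the last triggered visual feature when vision is skipped.
            last_visual = mask * visual_seq[t] + (1 - mask) * last_visual
            h = self.rnn(torch.cat([motion_seq[t], last_visual], dim=-1), h)
            logits.append(self.classifier(h))
        return torch.stack(logits), torch.stack(log_probs), torch.stack(triggers)

def loss_fn(logits, log_probs, triggers, labels, cost=0.01):
    # Cross-entropy on the intention prediction at every time step ...
    ce = F.cross_entropy(logits.flatten(0, 1), labels.repeat(logits.size(0)))
    # ... plus REINFORCE: reward correct final predictions, penalize triggers.
    with torch.no_grad():
        correct = (logits[-1].argmax(-1) == labels).float()
        reward = correct - cost * triggers.mean(0)
    pg = -(log_probs.mean(0) * reward).mean()
    return ce + pg

A model of this form can anticipate intentions at every time step while invoking the expensive visual encoder only on the frames the learned policy selects, which is what keeps the triggered fraction of visual computation low.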
For the liquid pouring task, we aim to learn to monitor whether a pouring attempt is a success or a failure and to study the interplay between monitoring and manipulating. Liquid pouring is a very subtle manipulation task that involves continuously monitoring environmental states to adjust future actions and avoid spilling. In this work, we combine a chest-mounted camera and a wrist-mounted IMU sensor to implicitly infer the containers' states and their relative position and speed (i.e., environmental states). We further capture the 3D hand trajectory during training and treat manipulation as a 3D trajectory forecasting task. Given many success and failure demonstrations of liquid pouring with synchronized video, IMU, and 3D trajectory data, we propose a novel method for monitoring with auxiliary tasks. First, we treat 3D trajectory forecasting as an auxiliary regression task to learn a representation suited to both monitoring and manipulating. Second, we allow the model to generate one-step future predictions to augment the demonstrations through an adversarial training procedure. Finally, we treat the initial container states as another auxiliary classification task to learn a representation sensitive to container states. With these novel components, we achieve state-of-the-art monitoring and manipulation performance compared with baseline methods without the auxiliary tasks and/or demonstration generation.
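Below is a minimal sketch, under assumed PyTorch conventions and illustrative dimensions, of the multimodal fusion-with-auxiliary-tasks idea described above: per-step visual and IMU features are fused and fed to an RNN, on top of which sit a success/failure monitoring head, a 3D trajectory-forecasting regression head, and an initial-container-state classification head; a small discriminator scores generated one-step trajectory forecasts for the adversarial term. It is not the thesis code.

import torch
import torch.nn as nn
import torch.nn.functional as F

VIS, IMU, HID, N_INIT_STATES = 2048, 6, 128, 4

class PouringMonitor(nn.Module):
    def __init__(self):
        super().__init__()
        self.fuse = nn.Linear(VIS + IMU, HID)
        self.rnn = nn.GRU(HID, HID)                   # (T, B, HID) -> (T, B, HID)
        self.monitor = nn.Linear(HID, 1)              # success / failure
        self.traj = nn.Linear(HID, 3)                 # next-step 3D hand position
        self.init_state = nn.Linear(HID, N_INIT_STATES)

    def forward(self, vis, imu):
        x = torch.relu(self.fuse(torch.cat([vis, imu], dim=-1)))
        h, _ = self.rnn(x)
        return self.monitor(h[-1]).squeeze(-1), self.traj(h), self.init_state(h[-1])

class TrajDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, step):                          # real vs. generated 3D step
        return self.net(step).squeeze(-1)

def generator_loss(model, disc, vis, imu, success, traj_gt, init_gt, w=(1.0, 1.0, 0.1)):
    mon, traj_pred, init_logits = model(vis, imu)
    l_mon = F.binary_cross_entropy_with_logits(mon, success)   # monitoring
    l_traj = F.mse_loss(traj_pred, traj_gt)                     # auxiliary regression
    l_init = F.cross_entropy(init_logits, init_gt)              # auxiliary classification
    # Adversarial term: generated one-step forecasts should look real.
    l_adv = F.binary_cross_entropy_with_logits(
        disc(traj_pred[-1]), torch.ones_like(mon))
    return l_mon + w[0] * l_traj + w[1] * l_init + w[2] * l_adv

In a full adversarial setup the discriminator would be updated in alternation, seeing ground-truth next steps as real and generated ones as fake; only the generator-side term is shown here.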
1 Introduction 1
1.1 Motivation 1
1.2 Problem Description 2
1.3 Main Contribution 3
2 Related Work 7
2.1 Activity Recognition 7
2.2 Anticipation 8
2.3 High-level Behavior Analysis 9
2.4 Recognition from Wearable Sensors 9
2.5 Environmental State Estimation 9
2.6 Robot Liquid Pouring 10
3 Human Intention Anticipation 11
3.1 Problem Formulation 11
3.2 Recurrent and Fusion Model 12
3.3 RL-based Policy Network 14
3.3.1 Derivation of Policy Loss 15
3.4 Learning Representations from Auxiliary Data 16
3.5 Implementation Details 18
4 Monitoring Liquid Pouring 19
4.1 Overview 19
4.1.1 Problem Formulation 19
4.1.2 Multimodal Data Fusion 21
4.2 Monitoring with Auxiliary Tasks 22
4.2.1 Forecasting 3D Trajectory 22
4.2.2 Initial Object State Classification 24
4.2.3 Monitoring Module 24
4.2.4 Implementation Details 25
5 Setting and Datasets 26
5.1 Human Intention Anticipation 26
5.1.1 Setting of On-wrist Sensors 26
5.1.2 Datasets 26
5.2 Monitoring Liquid Pouring 29
5.2.1 Dataset 29
6 Experiments 32
6.1 Human Intention Anticipation 32
6.1.1 Preliminary Experiments 32
6.1.2 Motion Triggered Intention Anticipation 34
6.1.3 Typical Examples 37
6.2 Monitoring Liquid Pouring 37
6.2.1 Metrics 38
6.2.2 Baseline 38
6.2.3 Cross Trial Experiment 39
6.2.4 Cross User Experiment 39
6.2.5 Ablation Study 41
7 Conclusion and Future Work 43
7.1 Conclusion 43
7.2 Future Work 44
References 45