Detailed Record

Author (Chinese): 簡廷安
Author (English): Chien, Ting-An
Title (Chinese): 基於多模態系統預測人類意圖與提示遺忘事件之持續學習
Title (English): A Multimodal System Towards Intention Anticipation and Missing Actions Reminder via Continuous Self-Learning
Advisor (Chinese): 孫民
Advisor (English): Sun, Min
Committee members (Chinese): 陳煥宗、陳祝嵩
Committee members (English): Chen, Hwann-Tzong; Chen, Chu-Song
Degree: Master's
Institution: National Tsing Hua University (國立清華大學)
Department: Department of Electrical Engineering (電機工程學系)
Student ID: 105061537
Year of publication (ROC calendar): 107 (2018)
Academic year of graduation: 106
Language: English
Number of pages: 39
Keywords (Chinese): 意圖預測、遺忘事件提醒、智慧家庭、融合感測器、連續學習
Keywords (English): Intention Anticipation, Missing Action Reminder, Smart Home, Sensor Fusion, Continuous Learning
Usage statistics:
  • Recommendations: 0
  • Views: 257
  • Rating: *****
  • Downloads: 5
  • Bookmarks: 0
Abstract (Chinese):
We envision that future smart homes will be full of Internet-of-Things devices (IoT-D) capable of logging and triggering events (e.g., brewing coffee and turning on the TV). We propose a multimodal system that anticipates user intentions (turning an IoT-D on) and reminds users of missing actions (turning an IoT-D off), and that, through continuous learning, enables IoT-D in future smart homes to be controlled automatically, for example turning on the TV automatically. Our system combines IoT-D states (e.g., whether the TV is on or off), wearable sensors (a wrist-mounted camera and accelerometer), and overhead 360$^{\circ}$ motion vectors, which respectively capture environmental conditions, human activity, and the user's position; all data streams are synchronized. The core contribution of this thesis is an end-to-end trainable hierarchical Recurrent Neural Network with late fusion that classifies human intention into long-term, mid-term, short-term, immediate, or none, and simultaneously determines whether a missing action has occurred. Without any human annotation, the system can continuously learn the user's behavior and adapt quickly from the automatically synchronized IoT-D states and other sensor data. In addition, training the model with our proposed countdown loss further improves intention anticipation accuracy, and our data augmentation algorithm allows the reminder function to be learned without any real missing-action data. We evaluate the system in an indoor smart home environment; besides achieving good accuracy, it also demonstrates the ability to adapt to changes in user behavior and to generalize across users.
Abstract (English):
We imagine a future smart home full of Internet-of-Things devices (IoT-D) with the capability to log and trigger events (e.g., brewing coffee and turning on the TV).
A novel multimodal system is proposed to anticipate user intentions (IoT-D ON states) and remind users of missing actions (IoT-D OFF states) via continuous learning, so that IoT-D state changes can be automated in the future smart home (e.g., turning on the TV automatically).
Our system fuses synchronized IoT-D states (e.g., TV ON or OFF), wearable sensors (a wrist-mounted camera and accelerometer), and 360$^{\circ}$ overhead motion vectors to incorporate environmental conditions, human affordance (activity and interaction with objects), and user position, respectively.
Our core contribution is an end-to-end trainable hierarchical Recurrent Neural Network (RNN) with late fusion that (1) classifies intention into none, long-term, mid-term, short-term, or immediate, and (2) detects missing actions (or none). The model adapts to the user's behavior automatically (without human annotation) through continuous learning on synchronized IoT-D states and sensor data. Moreover, we introduce a countdown loss for training anticipation, which further improves model performance. Our data augmentation procedure also lets the reminder be learned with zero real missing actions in the training data. Our system is evaluated in an in-house smart home environment and not only achieves reasonably good accuracy but also demonstrates the ability to adapt to user behavior variation and to generalize across users.
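To make the system described in the abstract concrete, below is a minimal PyTorch sketch of a late-fusion hierarchical recurrent model with the two output heads the thesis describes: a five-way intention-horizon classifier (none / long-term / mid-term / short-term / immediate) and a binary missing-action detector. This is not the authors' implementation: the names LateFusionAnticipator and countdown_weighted_loss, all feature dimensions, the GRU hidden size, and the linear countdown weighting are illustrative assumptions; only the three synchronized input streams, the GRU-based late sequential fusion, and the two heads follow the abstract and the table of contents.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LateFusionAnticipator(nn.Module):
    """Per-modality GRU encoders followed by a fusion GRU, with two output heads."""

    def __init__(self, iot_dim, wrist_dim, motion_dim, hidden_dim=128, num_horizons=5):
        super().__init__()
        # Lower level of the hierarchy: one GRU per synchronized modality.
        self.iot_rnn = nn.GRU(iot_dim, hidden_dim, batch_first=True)
        self.wrist_rnn = nn.GRU(wrist_dim, hidden_dim, batch_first=True)
        self.motion_rnn = nn.GRU(motion_dim, hidden_dim, batch_first=True)
        # Upper level: late fusion over the concatenated per-step hidden states.
        self.fusion_rnn = nn.GRU(3 * hidden_dim, hidden_dim, batch_first=True)
        # Head 1: intention horizon (none / long-term / mid-term / short-term / immediate).
        self.intention_head = nn.Linear(hidden_dim, num_horizons)
        # Head 2: missing action vs. none.
        self.missing_head = nn.Linear(hidden_dim, 2)

    def forward(self, iot, wrist, motion):
        # Each stream has shape (batch, time, feature_dim) and is time-synchronized.
        h_iot, _ = self.iot_rnn(iot)
        h_wrist, _ = self.wrist_rnn(wrist)
        h_motion, _ = self.motion_rnn(motion)
        fused, _ = self.fusion_rnn(torch.cat([h_iot, h_wrist, h_motion], dim=-1))
        # Per-step outputs, so intention can be anticipated online as data streams in.
        return self.intention_head(fused), self.missing_head(fused)


def countdown_weighted_loss(logits, horizon_labels, steps_to_event, max_steps=100.0):
    # One possible countdown-style weighting (an assumption, not the thesis' exact
    # formulation): per-step cross-entropy whose weight grows linearly as the
    # triggering IoT-D event approaches (steps_to_event == 0 at the event itself).
    ce = F.cross_entropy(logits.flatten(0, 1), horizon_labels.flatten(), reduction="none")
    weights = 1.0 - torch.clamp(steps_to_event.flatten().float() / max_steps, 0.0, 1.0)
    return (weights * ce).mean()


# Illustrative shapes only: a 12-dim IoT-D state vector, a 256-dim wrist-camera/
# accelerometer feature, a 64-dim 360-degree motion-vector feature, 30 time steps.
model = LateFusionAnticipator(iot_dim=12, wrist_dim=256, motion_dim=64)
intent_logits, missing_logits = model(torch.randn(2, 30, 12),
                                      torch.randn(2, 30, 256),
                                      torch.randn(2, 30, 64))   # (2, 30, 5), (2, 30, 2)

Reading the countdown loss as a per-step cross-entropy weighted more heavily near the actual IoT-D trigger is one plausible interpretation, not the thesis' exact formulation; likewise, the missing-action head would be trained on augmented sequences, since the abstract states that no real missing-action examples are used in training.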
Table of Contents:
Chinese Abstract (摘要) ii
Abstract iii
Acknowledgements (誌謝) iv
1 Introduction 1
2 Related Work 5
2.1 Intention Anticipation 5
2.2 Missing Action Reminder 6
2.3 Wearable Sensing System 6
3 Preliminaries 8
3.1 Recurrent Models 8
3.1.1 Recurrent Neural Networks (RNNs) 8
3.1.2 Gated Recurrent Units (GRU) 9
3.2 Fusing Multiple Sequential Modalities 11
3.2.1 Early Multimodal Fusion Model 12
3.2.2 Late Multimodal Fusion Model 12
3.2.3 Late Multimodal Sequential Fusion Model 13
4 Intention Anticipation and Missing Action Reminder 14
4.1 Task Description 14
4.2 Problem Formulation 15
4.3 Multimodal Recurrent Model 16
4.4 Implementation Details 20
5 Setting and Datasets 21
5.1 Settings 21
5.2 Dataset 22
6 Experiments 25
6.1 Evaluation Metrics 25
6.2 Ablation Study 27
6.2.1 Representation learning for 360° motion vectors 27
6.2.2 Multimodal fusion RNN 28
6.2.3 Simulation for Reminder 29
6.2.4 Countdown Loss for Anticipation 30
6.3 Cross User 31
6.4 Continuous Self-Learning 32
6.5 Typical Example 33
7 Conclusion and Future Work 35
7.1 Conclusion 35
7.2 Future Work 35
References 37