Detailed Record

Author (Chinese): 林奕汝
Author (English): Lin, I-Ju
Title (Chinese): 主動對抗式探索在基於觀察的模仿學習
Title (English): Behavior Cloning from Observation using Adversarial Active Exploration
Advisor (Chinese): 李濬屹
Advisor (English): Lee, Chun-Yi
Committee Members (Chinese): 郭柏志、程芙茵
Committee Members (English): Kuo, Po-Chih; Cherng, Fu-Yin
Degree: Master's
University: National Tsing Hua University
Department: Institute of Information Systems and Applications
Student ID: 106065514
Year of Publication (ROC): 111 (2022)
Academic Year of Graduation: 110
Language: English
Number of Pages: 33
Keywords (Chinese): 強化學習、行為模仿
Keywords (English): Behavior Cloning from Observation, Adversarial Active Exploration, Active Learning, Imitation Learning, Reinforcement Learning, Proximal Policy Optimization
Abstract (Chinese):

Imitation learning algorithms learn a wide range of sensorimotor control tasks by observing an expert. Standard behavior cloning can only imitate the action the expert demonstrates in each state, which requires the expert to record every action in detail. Behavior Cloning from Observation (BCO) removes this requirement for action data by using an Inverse Dynamics Model (IDM): given two consecutive states, the IDM predicts the action the expert actually took in the demonstration. However, the IDM may predict erroneous actions, and BCO then converges to a behavior pattern different from the expert's. When such a policy interacts with the environment, the data it collects barely cover the expert's demonstrations, so using them as the IDM's training data is unlikely to improve the IDM's accuracy. Learning thus falls into a vicious cycle: BCO cannot discover data that would improve the IDM, and the IDM cannot provide accurate actions for BCO.

To escape this vicious cycle, we propose Active Adversarial Exploration (AAE), which formulates the collection of IDM training data as a zero-sum game and encourages competition between the IDM and a data collection policy. The IDM aims to minimize its prediction error on the collected data, while AAE directs the policy to collect data that maximize the IDM's prediction error. Because high-error data improve the IDM's accuracy, the vicious cycle is broken. In addition, to ensure the collected data stay close to what BCO needs, we constrain the difference between the data collection policy and BCO's imitation policy. Experimental results demonstrate that AAE makes the IDM more accurate and enables BCO to behave more like the expert, running faster and farther in a variety of robot locomotion tasks.
Abstract (English):

Behavior Cloning from Observation (BCO) is an imitation learning algorithm that enables robots to accomplish many sensorimotor tasks by learning from experts' observations. BCO uses an Inverse Dynamics Model (IDM) to fill in the action label (e.g., hand movements) between each pair of consecutive observations (e.g., egocentric camera views) in an expert's demonstration. BCO trains an imitation policy with the expert's observations and the paired actions predicted by the IDM. The IDM, in turn, is trained on the data collected by the imitation policy.
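The loop described above can be summarized in a short sketch. The following Python/PyTorch snippet is illustrative only, assuming low-dimensional continuous observations and actions and mean-squared-error losses; the class and helper names (`IDM`, `bco_iteration`, `env_transitions`, `expert_obs_pairs`) are hypothetical and not taken from the thesis.

```python
import torch
import torch.nn as nn


class IDM(nn.Module):
    """Inverse dynamics model: predicts the action taken between two consecutive observations."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, next_obs):
        return self.net(torch.cat([obs, next_obs], dim=-1))


def bco_iteration(idm, policy, idm_opt, policy_opt, env_transitions, expert_obs_pairs):
    """One BCO iteration (sketch):
    1. fit the IDM on transitions collected by the imitation policy;
    2. use the IDM to fill in action labels for the expert's observation pairs;
    3. behavior-clone the imitation policy on the pseudo-labeled expert data."""
    obs, act, next_obs = env_transitions                  # (s_t, a_t, s_{t+1}) collected by the imitation policy
    idm_loss = ((idm(obs, next_obs) - act) ** 2).mean()   # supervised regression loss for the IDM
    idm_opt.zero_grad(); idm_loss.backward(); idm_opt.step()

    e_obs, e_next_obs = expert_obs_pairs                  # expert demonstration: observations only, no actions
    with torch.no_grad():
        pseudo_act = idm(e_obs, e_next_obs)               # IDM fills in the missing action labels
    bc_loss = ((policy(e_obs) - pseudo_act) ** 2).mean()  # behavior cloning on the pseudo-labeled data
    policy_opt.zero_grad(); bc_loss.backward(); policy_opt.step()
    return idm_loss.item(), bc_loss.item()
```

The key point, as the next paragraph explains, is that the IDM only ever sees transitions collected by the imitation policy itself, which is what allows the two models to trap each other in a vicious cycle.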

However, since the IDM could fill in erroneous actions, the imitation policy could converge to a suboptimal policy. The data collected by a suboptimal policy barely cover the expert's observations and are therefore unlikely to improve the accuracy of the actions filled in by the IDM. Moreover, the policy stops changing after convergence. Consequently, BCO gets stuck in this "vicious cycle", where the imitation policy cannot discover data for improving the IDM, and the IDM cannot fill in accurate actions for the imitation policy.

To escape this vicious cycle, we propose an Active Adversarial Exploration (AAE) strategy that formulates data collection as a zero-sum game in which the IDM and a data collection policy compete against each other. AAE directs the data collection policy to collect data that maximize the prediction errors of the IDM, while the IDM minimizes its prediction errors on the collected data. Since previous works have shown that high-error data can improve the accuracy of an IDM, such data enable the IDM to fill in more accurate actions for the imitation policy, and the vicious cycle is broken. Furthermore, to ensure the collected data stay close to the expert's observations, we constrain the data collection policy to remain close to the imitation policy. Our experimental results show that AAE learns a more accurate IDM and enables the robot to learn a better imitation policy from the expert's observations. The resulting policy allows the robot to run faster and farther in various locomotion tasks.
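As a rough illustration of this zero-sum formulation, the sketch below computes a per-transition reward for the data collection policy: high where the IDM mispredicts the action, and penalized where the policy's behavior deviates from the imitation policy. It continues the hypothetical setup from the earlier snippet; the exact reward shaping, the squared-error distance, and the trade-off coefficient `beta` are assumptions for illustration, not the thesis's precise objective.

```python
import torch


def aae_exploration_reward(idm, imitation_policy, obs, act, next_obs, beta=1.0):
    """Per-transition exploration reward for the data collection policy (sketch).

    The collection policy is rewarded where the IDM's action prediction is wrong
    (the maximizing side of the zero-sum game), while a regularization term keeps
    its behavior close to the current imitation policy so that the collected data
    stay relevant to BCO. `beta` is a hypothetical trade-off coefficient."""
    with torch.no_grad():
        idm_error = ((idm(obs, next_obs) - act) ** 2).mean(dim=-1)      # IDM prediction error per transition
        bc_penalty = ((imitation_policy(obs) - act) ** 2).mean(dim=-1)  # deviation from the imitation policy
    return idm_error - beta * bc_penalty
```

In such a setup, these rewards would be fed to an on-policy reinforcement learning algorithm such as Proximal Policy Optimization to update the data collection policy, while the IDM is trained in parallel to minimize its prediction error on the same transitions, i.e., the minimizing side of the game.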
Table of Contents:

Chinese Abstract  i
Abstract  ii
Contents  iii
1 Introduction  1
2 Background  4
  2.1 Sequential decision-making problem  4
  2.2 Imitation learning  6
    2.2.1 Behavior cloning (BC)  6
    2.2.2 Behavior cloning from observation (BCO)  7
  2.3 Reinforcement learning (RL)  9
3 Methodology  11
  3.1 Active adversarial exploration (AAE)  13
    3.1.1 MDP formulation  13
    3.1.2 Necessity of using RL  13
  3.2 Behavior cloning (BC) regularization  14
  3.3 Implementation  16
4 Experiments  17
  4.1 Experimental setup  17
    4.1.1 Environments  17
    4.1.2 Evaluation procedure  18
    4.1.3 Implementation  19
  4.2 Performance comparison  20
  4.3 Prediction errors comparison  22
  4.4 Ablation study  24
    4.4.1 BC regularization  24
    4.4.2 Periodically re-initializing πRL  24
5 Related Work  27
  5.1 Imitation from observation (IfO)  27
  5.2 Active learning  28
6 Conclusion  30
Bibliography  31
(The electronic full text of this thesis has not been authorized for public access.)