
Detailed Record

Author (Chinese): 黃瑞連
Author (English): Huang, Rui-Lian
Title (Chinese): 利用課程學習和額外獎勵改善多手臂運動規劃中的獎勵稀疏性
Title (English): Improving Reward Sparsity in Multi-Arm Motion Planning with Curriculum Learning and Bonus Rewards
Advisor (Chinese): 金仲達
Advisor (English): King, Chung-Ta
Committee Members (Chinese): 許秋婷、江振瑞
Committee Members (English): Hsu, Chiu-Ting; Jiang, Jehn-Ruey
Degree: Master's
University: National Tsing Hua University
Department: Department of Computer Science
Student ID: 109062669
Publication Year (ROC calendar): 111 (2022)
Graduation Academic Year: 111
Language: English
Number of Pages: 26
Keywords (Chinese): 多智能體系統、課程學習、強化學習、稀疏獎勵
Keywords (English): multi-agent system, curriculum learning, reinforcement learning, sparse reward

In an environment consisting of multiple robotic arms working collaboratively
in a shared workspace, the multi-arm motion planning problem is to schedule the
motions of the arms to reach their respective target poses without collision during
the movements. The problem is especially challenging for robotic arms with high
degrees of freedom (DoF). To cope with this complexity, recent learning-based research solves the problem with multi-agent reinforcement learning (MARL) using
sparse rewards, which impose minimal restrictions on exploring satisfactory motion plans. Although techniques such as expert demonstration and curriculum learning have been
employed to compensate for the sparsity in reward signals, the sampling efficiency
remains low and the training time is still too long. In this thesis, we propose to
provide bonus rewards for expert demonstrations and successful episodes. Using
different bonus rewards at different stages, our method not only improves sampling efficiency and reduces total training time, but also handles imperfect demonstrations.
Experiments show that, when compared with the baseline, the proposed method
can complete the training in less time without loss of task success rate, even if the
expert demonstrations are not optimal.
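
The bonus-reward relabeling described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation from the thesis: the Transition fields, the relabel_episode function, and the demo_bonus/success_bonus values are hypothetical names and constants, chosen only to show how transitions from expert demonstrations or from successful episodes could receive an extra reward on top of the sparse task reward before being stored in the replay buffer.

from dataclasses import dataclass, replace
from typing import List

@dataclass
class Transition:
    state: list
    action: list
    reward: float        # sparse task reward, e.g. nonzero only when the goal pose is reached
    next_state: list
    done: bool

def relabel_episode(episode: List[Transition],
                    is_demo: bool,
                    succeeded: bool,
                    demo_bonus: float = 1.0,      # hypothetical value
                    success_bonus: float = 0.5    # hypothetical value
                    ) -> List[Transition]:
    # Add a bonus to every transition of a demonstration episode, a smaller
    # bonus to every transition of a successful self-generated episode, and
    # nothing otherwise; the relabeled copy is what enters the replay buffer.
    bonus = demo_bonus if is_demo else (success_bonus if succeeded else 0.0)
    return [replace(t, reward=t.reward + bonus) for t in episode]

Using different bonus values at different training stages, as the abstract suggests, would let demonstration transitions dominate early exploration and let the agents' own successful episodes take over later, which is one way imperfect demonstrations could be tolerated.
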
Acknowledgements
摘要 (Chinese Abstract) i
Abstract ii
1 Introduction 1
2 Related Work 5
2.1 Multi-arm Motion Planning 5
2.1.1 Sampling-based Methods 5
2.1.2 Learning-based Methods 5
2.2 Handling Reward Sparsity 6
2.3 Decentralized MARL with Expert Demonstrations 7
3 Method 11
3.1 Problem Definition 11
3.2 Training Flow 12
3.3 Reward Relabeling 14
4 Experiments 17
4.1 Experimental Setup 17
4.2 Experimental Results 19
5 Conclusion 23
References 25