Author: Huang, Rui-Lian
Thesis title: Improving Reward Sparsity in Multi-Arm Motion Planning with Curriculum Learning and Bonus Rewards
Advisor: King, Chung-Ta
Oral examination committee: Hsu, Chiu-Ting; Jiang, Jehn-Ruey
Keywords: multi-agent system; curriculum learning; reinforcement learning; sparse reward
In an environment where multiple robotic arms work collaboratively in a shared workspace, the multi-arm motion planning problem is to schedule the motions of the arms so that each reaches its target pose without collisions during the movements. The problem is especially challenging for robotic arms with high degrees of freedom (DoF). To cope with this complexity, recent learning-based research solves the problem with multi-agent reinforcement learning (MARL) using sparse rewards, which place minimal restrictions on the exploration of satisfactory motion plans. Although techniques such as expert demonstrations and curriculum learning have been employed to compensate for the sparsity of the reward signal, sampling efficiency remains low and training time remains long. In this thesis, we propose providing bonus rewards for expert demonstrations and successful episodes. By using different bonus rewards at different stages, our method not only improves sampling efficiency and reduces total training time, but also handles imperfect demonstrations. Experiments show that, compared with the baseline, the proposed method completes training in less time without loss of task success rate, even when the expert demonstrations are not optimal.
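The abstract does not say how the bonus rewards are computed, so the following is only a minimal sketch of the general idea of adding stage-dependent bonuses to stored transitions. It is not the thesis's implementation. The names `Transition`, `relabel_episode`, `DEMO_BONUS`, and `SUCCESS_BONUS`, the three-stage split, and all numeric values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class Transition:
    reward: float      # sparse reward returned by the environment
    from_expert: bool  # True if the step comes from an expert demonstration


# Hypothetical per-stage bonuses (assumed values, not taken from the thesis).
# Here the demonstration bonus shrinks and the success bonus grows by stage.
DEMO_BONUS: Sequence[float] = (1.0, 0.5, 0.0)
SUCCESS_BONUS: Sequence[float] = (0.25, 0.5, 1.0)


def relabel_episode(
    episode: List[Transition], success: bool, stage: int
) -> List[Transition]:
    """Return a copy of `episode` with stage-dependent bonus rewards added."""
    relabeled = []
    for step in episode:
        bonus = 0.0
        if step.from_expert:
            bonus += DEMO_BONUS[stage]      # bonus for expert demonstration steps
        if success:
            bonus += SUCCESS_BONUS[stage]   # bonus for steps of a successful episode
        relabeled.append(replace(step, reward=step.reward + bonus))
    return relabeled
```

In this sketch, a successful expert episode at stage 0 receives both bonuses on every step, while a failed episode collected by the agent keeps its original sparse rewards. Lowering the demonstration bonus in later stages is one assumed way to reduce reliance on demonstrations that may be suboptimal.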
Abstract (in Chinese)
Abstract
1 Introduction
2 Related Work
2.1 Multi-arm Motion Planning
2.1.1 Sampling-based Methods
2.1.2 Learning-based Methods
2.2 Handling Reward Sparsity
2.3 Decentralized MARL with Expert Demonstrations
3 Method
3.1 Problem Definition
3.2 Training Flow
3.3 Reward Relabeling
4 Experiments
4.1 Experimental Setup
4.2 Experimental Results
5 Conclusion
References