Author (Chinese): 張晉睿
Author (English): Chang, Chin-Jui
Thesis Title (Chinese): 分層強化學習之計算成本意識網路
Thesis Title (English): Computational Cost-Aware Control Using Hierarchical Reinforcement Learning
Advisor (Chinese): 李濬屹
Advisor (English): Lee, Chun-Yi
Committee Members (Chinese): 周志遠、胡敏君
Committee Members (English): Chou, Jerry; Hu, Min-Chun
Degree: Master's
Institution: National Tsing Hua University
Department: Computer Science
Student ID: 105062571
Publication Year (ROC): 109 (2020)
Graduation Academic Year: 108
Language: English
Number of Pages: 35
Keywords (Chinese): 分層強化學習、成本、計算、運算
Keywords (English): hierarchical, reinforcement, cost, computational, control
Abstract (translated from Chinese):
Deep reinforcement learning has demonstrated outstanding performance on many decision-making and control tasks. More complex tasks require more sophisticated DRL policies, which in turn usually require larger deep neural networks. As a deep neural network grows, the computational cost of evaluating it increases rapidly, a consideration that cannot be ignored for energy-constrained mobile autonomous robots. In this thesis, we propose a computational cost-aware method to reduce the computation required by such complex tasks. The method is based on our observation that a complex control task can often be divided into multiple segments, which can be categorized as difficult segments and simple segments: difficult segments require a larger deep neural network, whereas simple segments can be handled by a smaller one. We realize this method with a hierarchical reinforcement learning architecture consisting of a master policy network and two sub-policy networks of different sizes. During training, the master policy takes the computational costs of the sub-policies into account. At regular intervals, the master policy selects, based on the observed current state, the sub-policy whose size is just sufficient for the current task segment, thereby reducing the computational cost of the entire task. We apply this method to a number of robotic control tasks to show that it can save a large portion of the overall resources across many tasks. In addition, we qualitatively analyze the behavior of the master policy in different tasks to verify whether its decisions match the goal of the proposed cost-aware method, and we further provide a series of ablation experiments to validate the necessity of each component of our design.
Abstract (English):
Deep reinforcement learning (DRL) has been demonstrated to provide promising results in a wide range of challenging decision making and control tasks. More challenging tasks typically require DRL policies with higher complexity, which usually comes with the use of larger deep neural network (DNN) models. However, as the model size increases, the required computational costs also grow dramatically, leading to non-negligible energy concerns for battery-limited mobile robots. In order to reduce the overall computational costs required for completing such tasks, in this thesis, we propose a cost-aware strategy based on the observation that a control task can usually be decomposed into segments that require different levels of control complexities. Segments requiring higher control complexities can be handled by a larger DNN, while those requiring lower control complexities can be handled by a smaller DNN. To realize this strategy, we propose a hierarchical RL (HRL) framework consisting of a master policy and two sub-policies of different sizes. The master policy is trained to take the costs of the sub-policies in terms of the number of floating-point operations (FLOPs) into consideration. It periodically selects a sub-policy that is sufficiently capable of handling the current task segment according to its observation of the environment while minimizing the overall cost of the entire task. In this work, we perform extensive experiments to demonstrate that the proposed cost-aware strategy is able to reduce the overall computational costs in a variety of robotic control tasks.
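For readers who want a concrete picture of the control loop the abstract describes, the sketch below shows one possible reading of it: a master policy that re-decides every few environment steps which of two differently sized sub-policies to run, while the FLOP cost of the chosen sub-policy is charged against the task reward through a coefficient λ (loosely matching the policy cost cω and coefficient λ listed in Section 5.2 of the table of contents). Everything in the sketch is an illustrative assumption rather than the thesis implementation: the linear stand-in policies, the decision interval, the toy environment, and the reward-minus-λ·FLOPs objective are placeholders, and the actual method trains its policies with DQN/SAC and HER as listed in the contents.

```python
# Minimal sketch of a cost-aware hierarchical control loop (illustrative only).
# All names (MasterPolicy, SubPolicy, DECISION_INTERVAL, LAMBDA, the FLOP
# estimates, and the toy environment) are hypothetical placeholders, not the
# thesis's actual implementation.
import numpy as np

LAMBDA = 1e-9            # assumed weight trading off task reward against FLOP cost
DECISION_INTERVAL = 10   # assumed number of low-level steps per master decision

class SubPolicy:
    """A stand-in low-level policy: a fixed two-layer linear controller of a given width."""
    def __init__(self, obs_dim, act_dim, hidden):
        self.w1 = np.random.randn(obs_dim, hidden) * 0.1
        self.w2 = np.random.randn(hidden, act_dim) * 0.1
        # Rough per-step cost: FLOPs of the two matrix multiplications.
        self.flops = 2 * obs_dim * hidden + 2 * hidden * act_dim

    def act(self, obs):
        return np.tanh(np.tanh(obs @ self.w1) @ self.w2)

class MasterPolicy:
    """Stand-in master: picks a sub-policy index from the current observation."""
    def __init__(self, obs_dim, n_subs):
        self.w = np.random.randn(obs_dim, n_subs) * 0.1

    def select(self, obs):
        return int(np.argmax(obs @ self.w))

def rollout(env_step, reset, master, subs, episode_len=200):
    """Run one episode; return the task return, the total FLOPs spent, and the
    cost-aware return (task return minus LAMBDA times the accumulated FLOPs)."""
    obs = reset()
    task_return, total_flops = 0.0, 0.0
    for t in range(episode_len):
        if t % DECISION_INTERVAL == 0:   # master re-decides periodically
            k = master.select(obs)
        sub = subs[k]
        obs, reward = env_step(sub.act(obs))
        task_return += reward
        total_flops += sub.flops         # charge the cost of the chosen sub-policy
    return task_return, total_flops, task_return - LAMBDA * total_flops

if __name__ == "__main__":
    obs_dim, act_dim = 8, 2
    subs = [SubPolicy(obs_dim, act_dim, hidden=16),    # small, cheap sub-policy
            SubPolicy(obs_dim, act_dim, hidden=256)]   # large, expensive sub-policy
    master = MasterPolicy(obs_dim, n_subs=len(subs))
    # Toy environment stubs standing in for a robotic control task.
    state = {"obs": np.zeros(obs_dim)}
    def reset():
        state["obs"] = np.random.randn(obs_dim)
        return state["obs"]
    def env_step(action):
        state["obs"] = 0.9 * state["obs"] + 0.1 * np.random.randn(obs_dim)
        return state["obs"], float(-np.linalg.norm(action))
    print(rollout(env_step, reset, master, subs))
```

In this reading, the master would be rewarded for routing easy task segments to the small network and reserving the large one for hard segments, since both the task reward and the λ-weighted FLOP cost enter its learning signal.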
Table of Contents
Abstract (Chinese) v
Abstract vii
1 Introduction 1
1.1 Motivation 1
1.2 Proposed Method 2
1.3 Thesis Organization 2
2 Related Work 3
2.1 Cost Efficient Deep Reinforcement Learning 3
2.2 Hierarchical Reinforcement Learning 4
3 Background 5
3.1 Reinforcement Learning 5
3.2 Hierarchical Reinforcement Learning 5
3.3 RL Algorithm 6
3.3.1 Deep Q-Learning (DQN) 6
3.3.2 Soft Actor-Critic (SAC) 7
3.4 Auxiliary Methods Used in Experiments 7
3.4.1 Hindsight Experience Replay (HER) 7
3.4.2 Boltzmann Exploration for DQN 8
4 Proposed Methodology 9
4.1 Problem Formulation 9
4.2 Overview of the Cost-Aware Hierarchical Framework 10
4.3 Cost-Aware Training 11
4.4 The Detailed Pseudo-Code of the Proposed Algorithm 11
5 Experimental Setup 13
5.1 Experimental Setup 13
5.1.1 Environments 13
5.1.2 Network Structure 13
5.1.3 Hyperparameters 15
5.2 Selection of the Policy Cost cω and the Coefficient λ 16
5.3 Computing Infrastructure 17
6 Experimental Results 19
6.1 Qualitative Analysis of the Proposed Methodology 19
6.2 Statistics of the Performance and Cost for the Proposed Methodology 23
6.2.1 Comparison of the Performance to Baselines 26
6.2.2 Analysis of the Baselines with More Data Samples 29
6.3 Ablation Study 29
6.3.1 Effectiveness of the Cost Term 29
6.3.2 Sub-Policies w/ and w/o Separated Replay Buffers 30
7 Conclusion 31
7.1 Conclusion 31
References 33
Electronic full text: restricted to internal (on-campus) access only.