[1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[3] "MuJoCo physics engine." [Online]. Available: http://www.mujoco.org/.
[4] H. Kociemba, "Two-Phase Algorithm Details." [Online]. Available: http://kociemba.org/math/imptwophase.htm.
[5] S. McAleer, F. Agostinelli, A. Shmakov, and P. Baldi, "Solving the Rubik's Cube Without Human Knowledge," arXiv preprint arXiv:1805.07470, 2018.
[6] A. Irpan, "Exploring boosted neural nets for rubiks cube solving," as of this writing, paper may be found at http://www.alexirpan.com/public/research/nips_2016.pdf, 2016.
[7] A. Karpathy, "Deep Reinforcement Learning: Pong from Pixels." [Online]. Available: http://karpathy.github.io/2016/05/31/rl/, 2016.
[8] H. van Hasselt, "UCL Course – 2016: Introduction to reinforcement learning." Retrieved January, 2016, from University College London Web site: https://hadovanhasselt.com/2016/01/12/ucl-course/.
[9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, ch. 1, pp. 5–6. MIT Press, Cambridge, 2nd ed., 1998.
[10] S. Levine, "CS 294: Deep Reinforcement Learning, Fall 2017: Sep 6: Policy gradients introduction." Retrieved August, 2017, from UC Berkeley Web site: http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_4_policy_gradient.pdf.
[11] H. van Hasselt, "UCL Course – 2016: Policy Gradient." Retrieved January, 2016, from University College London Web site: https://hadovanhasselt.com/2016/01/12/ucl-course/.
[12] "Monte Carlo Method." [Online]. Available: http://mathworld.wolfram.com/MonteCarloMethod.html.
[13] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," in Reinforcement Learning, pp. 5–32, Springer, 1992.
[14] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," arXiv preprint arXiv:1506.02438, 2015.
[15] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, pp. 1889–1897, 2015.
[16] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[17] D. Shah, "Activation Functions." [Online]. Available: https://towardsdatascience.com/activation-functions-in-neural-networks-58115cda9c96, 2016.
[18] "Inverse transform sampling." [Online]. Available: https://stephens999.github.io/fiveMinuteStats/inverse_transform_sampling.html.
[19] "Normal distribution." [Online]. Available: http://mathworld.wolfram.com/NormalDistribution.html.
[20] S. Levine, "CS 294: Deep Reinforcement Learning, Fall 2017: Sep 11: Actor-critic introduction." Retrieved August, 2017, from UC Berkeley Web site: http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_5_actor_critic_pdf.pdf.