[1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[3] "MuJoCo physics engine." [Online]. Available: http://www.mujoco.org/.
[4] H. Kociemba, "Two-Phase Algorithm Details." [Online]. Available: http://kociemba.org/math/imptwophase.htm.
[5] S. McAleer, F. Agostinelli, A. Shmakov, and P. Baldi, "Solving the Rubik's Cube Without Human Knowledge," arXiv preprint arXiv:1805.07470, 2018.
[6] A. Irpan, "Exploring boosted neural nets for rubiks cube solving," as of this writing, paper may be found at http://www.alexirpan.com/public/research/nips_2016.pdf, 2016.
[7] A. Karpathy, "Deep Reinforcement Learning: Pong from Pixels." [Online]. Available: http://karpathy.github.io/2016/05/31/rl/, 2016.
[8] H. van Hasselt, "UCL Course – 2016: Introduction to reinforcement learning." Retrieved January, 2016, from University College London Web site: https://hadovanhasselt.com/2016/01/12/ucl-course/.
[9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, ch. 1, pp. 5–6. MIT Press, Cambridge, 2nd ed., 1998.
[10] S. Levine, "CS 294: Deep Reinforcement Learning, Fall 2017: Sep 6: Policy gradients introduction." Retrieved August, 2017, from UC Berkeley Web site: http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_4_policy_gradient.pdf.
[11] H. van Hasselt, "UCL Course – 2016: Policy Gradient." Retrieved January, 2016, from University College London Web site: https://hadovanhasselt.com/2016/01/12/ucl-course/.
[12] "Monte Carlo Method." [Online]. Available: http://mathworld.wolfram.com/MonteCarloMethod.html.
[13] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," in Reinforcement Learning, pp. 5–32, Springer, 1992.
[14] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," arXiv preprint arXiv:1506.02438, 2015.
[15] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in International Conference on Machine Learning, pp. 1889–1897, 2015.
[16] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[17] D. Shah, "Activation Functions." [Online]. Available: https://towardsdatascience.com/activation-functions-in-neural-networks-58115cda9c96, 2016.
[18] "Inverse transform sampling." [Online]. Available: https://stephens999.github.io/fiveMinuteStats/inverse_transform_sampling.html.
[19] "Normal distribution." [Online]. Available: http://mathworld.wolfram.com/NormalDistribution.html.
[20] S. Levine, "CS 294: Deep Reinforcement Learning, Fall 2017: Sep 11: Actor-critic introduction." Retrieved August, 2017, from UC Berkeley Web site: http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_5_actor_critic_pdf.pdf.