[1] C. Colas, P. Fournier, M. Chetouani, O. Sigaud, and P.-Y. Oudeyer, “CURIOUS: Intrinsically motivated modular multi-goal reinforcement learning,” in International Conference on Machine Learning, pp. 1331–1340, 2019.
[2] A. Slivkins, “Introduction to multi-armed bandits,” Foundations and Trends® in Machine Learning, vol. 12, no. 1-2, pp. 1–286, 2019.
[3] S. Forestier, R. Portelas, Y. Mollard, and P.-Y. Oudeyer, “Intrinsically motivated goal exploration processes with automatic curriculum learning,” arXiv preprint arXiv:1708.02190, 2017.
[4] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “Deep reinforcement learning: A brief survey,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26–38, 2017.
[5] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv preprint arXiv:1606.01540, 2016.
[6] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033, IEEE, 2012.
[7] V. Kuleshov and D. Precup, “Algorithms for multi-armed bandit problems,” arXiv preprint arXiv:1402.6028, 2014.
[8] S. Agrawal and N. Goyal, “Analysis of Thompson sampling for the multi-armed bandit problem,” in Conference on Learning Theory, pp. 39–1, JMLR Workshop and Conference Proceedings, 2012.
[9] D. E. Koulouriotis and A. Xanthopoulos, “Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems,” Applied Mathematics and Computation, vol. 196, no. 2, pp. 913–922, 2008.
[10] A. Garivier and E. Moulines, “On upper-confidence bound policies for non-stationary bandit problems,” arXiv preprint arXiv:0805.3415, 2008.
[11] D. Thierens, “An adaptive pursuit strategy for allocating operator probabilities,” in 7th Conference on Genetic and Evolutionary Computation, pp. 1539–1546, 2005.
[12] C. Hartland, N. Baskiotis, S. Gelly, M. Sebag, and O. Teytaud, “Change point detection and meta-bandits for online learning in dynamic environments,” in CAp 2007: 9è Conférence francophone sur l’apprentissage automatique, pp. 237–250, 2007.
[13] D. V. Hinkley, “Inference about the change-point from cumulative sum tests,” Biometrika, vol. 58, no. 3, pp. 509–523, 1971.
[14] J. Mellor and J. Shapiro, “Thompson sampling in switching environments with Bayesian online change detection,” in Artificial Intelligence and Statistics, pp. 442–450, 2013.
[15] R. Allesiardo and R. Féraud, “EXP3 with drift detection for the switching bandit problem,” in 2015 IEEE International Conference on Data Science and Advanced Analytics, pp. 1–7, IEEE, 2015.
[16] T. Schaul, D. Horgan, K. Gregor, and D. Silver, “Universal value function approximators,” in International Conference on Machine Learning, pp. 1312–1320, PMLR, 2015.
[17] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[18] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.