Regret Bounds for Reinforcement Learning via Markov Chain Concentration

Research output: Contribution to journal › Article › Research › peer-review

Authors

Abstract

We give a simple optimistic algorithm for which it is easy to derive regret bounds of Õ(√(t_mix · S · A · T)) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t_mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.
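The bound stated in the abstract can be written explicitly. The following is a sketch of the standard non-episodic (average-reward) regret definition; the notation ρ* for the optimal average reward and r_t for the reward collected at step t is an assumption, not taken from the abstract itself:

```latex
% Regret after T steps against the optimal average reward \rho^*
% (standard non-episodic definition; notation assumed):
R_T \;=\; T\,\rho^{*} \;-\; \sum_{t=1}^{T} r_t ,
\qquad
R_T \;=\; \tilde{O}\!\left(\sqrt{t_{\mathrm{mix}}\, S\, A\, T}\right),
```

where S is the number of states, A the number of actions, and t_mix the mixing time parameter of the uniformly ergodic MDP, as listed in the abstract.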

Details

Original language: English
Pages (from-to): 115–128
Number of pages: 14
Journal: The Journal of Artificial Intelligence Research
Volume: 67.2020
Issue number: 1
DOIs
Publication status: Published - 23 Jan 2020