Regret Bounds for Reinforcement Learning via Markov Chain Concentration
Research output: Contribution to journal › Article › Research › peer-review
Abstract
We give a simple optimistic algorithm for which it is easy to derive regret bounds of Õ(√(t_mix · S · A · T)) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t_mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.
Details
| Original language | English |
|---|---|
| Pages (from-to) | 115-128 |
| Number of pages | 14 |
| Journal | The Journal of Artificial Intelligence Research |
| Volume | 67.2020 |
| Issue number | 1 |
| DOIs | |
| Publication status | Published - 23 Jan 2020 |