Regret Bounds for Reinforcement Learning via Markov Chain Concentration
Publikationen: Beitrag in Fachzeitschrift › Artikel › Forschung › (peer-reviewed)
Autoren
Organisationseinheiten
Abstract
We give a simple optimistic algorithm for which it is easy to derive regret bounds of Õ(√t mixSAT) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.
Details
Originalsprache | Englisch |
---|---|
Seiten (von - bis) | 115-128 |
Seitenumfang | 14 |
Fachzeitschrift | The journal of artificial intelligence research |
Jahrgang | 67.2020 |
Ausgabenummer | 1 |
DOIs | |
Status | Veröffentlicht - 23 Jan. 2020 |