Regret Bounds for Reinforcement Learning via Markov Chain Concentration

Publikationen: Beitrag in FachzeitschriftArtikelForschung(peer-reviewed)

Autoren

Abstract

We give a simple optimistic algorithm for which it is easy to derive regret bounds of Õ(√t mixSAT) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.

Details

OriginalspracheEnglisch
Seiten (von - bis)115-128
Seitenumfang14
Fachzeitschrift The journal of artificial intelligence research
Jahrgang67.2020
Ausgabenummer1
DOIs
StatusVeröffentlicht - 23 Jan. 2020