Regret Bounds for Reinforcement Learning via Markov Chain Concentration
Research output: Contribution to journal › Article › Research › peer-review
In: The Journal of Artificial Intelligence Research, Vol. 67, No. 1, 23.01.2020, p. 115-128.
RIS (suitable for import to EndNote)
TY - JOUR
T1 - Regret Bounds for Reinforcement Learning via Markov Chain Concentration
AU - Ortner, Ronald
N1 - Publisher Copyright: © 2020 AI Access Foundation. All rights reserved.
PY - 2020/1/23
Y1 - 2020/1/23
N2 - We give a simple optimistic algorithm for which it is easy to derive regret bounds of Õ(√(t_mix · S · A · T)) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t_mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.
AB - We give a simple optimistic algorithm for which it is easy to derive regret bounds of Õ(√(t_mix · S · A · T)) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t_mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.
UR - http://www.scopus.com/inward/record.url?scp=85078924062&partnerID=8YFLogxK
U2 - 10.1613/jair.1.11316
DO - 10.1613/jair.1.11316
M3 - Article
VL - 67
SP - 115
EP - 128
JO - The Journal of Artificial Intelligence Research
JF - The Journal of Artificial Intelligence Research
SN - 1076-9757
IS - 1
ER -
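Note on the bound quoted in the abstract: in the general, non-episodic setting, regret is usually measured against the optimal average reward of the MDP. The LaTeX sketch below is one standard reading of the abstract's statement, not taken verbatim from the paper; the symbols ρ* (optimal average reward) and r_t (reward collected at step t) are assumptions introduced here for illustration.

% Undiscounted regret after T steps (assumed standard definition; rho^* and r_t
% are introduced here for illustration and do not appear in the record above).
\[
  R_T \;=\; T\,\rho^{*} \;-\; \sum_{t=1}^{T} r_t,
  \qquad
  R_T \;=\; \tilde{O}\!\left(\sqrt{t_{\mathrm{mix}}\, S\, A\, T}\right),
\]
% where S is the number of states, A the number of actions, t_mix the mixing-time
% parameter of the uniformly ergodic MDP, and the tilde hides logarithmic factors.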