Regret Bounds for Reinforcement Learning via Markov Chain Concentration

Research output: Contribution to journalArticleResearchpeer-review

Standard

Regret Bounds for Reinforcement Learning via Markov Chain Concentration. / Ortner, Ronald.
In: The journal of artificial intelligence research, Vol. 67.2020, No. 1, 23.01.2020, p. 115-128.

Research output: Contribution to journalArticleResearchpeer-review

Vancouver

Bibtex - Download

@article{f61ee437511b4018b0247bd02b96db23,
title = "Regret Bounds for Reinforcement Learning via Markov Chain Concentration",
abstract = "We give a simple optimistic algorithm for which it is easy to derive regret bounds of {\~O}(√t mixSAT) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter. ",
author = "Ronald Ortner",
note = "Publisher Copyright: {\textcopyright} 2020 AI Access Foundation. All rights reserved.",
year = "2020",
month = jan,
day = "23",
doi = "https://doi.org/10.1613/jair.1.11316",
language = "English",
volume = "67.2020",
pages = "115--128",
journal = " The journal of artificial intelligence research",
issn = "1076-9757",
publisher = "Morgan Kaufmann Publishers, Inc.",
number = "1",

}

RIS (suitable for import to EndNote) - Download

TY - JOUR

T1 - Regret Bounds for Reinforcement Learning via Markov Chain Concentration

AU - Ortner, Ronald

N1 - Publisher Copyright: © 2020 AI Access Foundation. All rights reserved.

PY - 2020/1/23

Y1 - 2020/1/23

N2 - We give a simple optimistic algorithm for which it is easy to derive regret bounds of Õ(√t mixSAT) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.

AB - We give a simple optimistic algorithm for which it is easy to derive regret bounds of Õ(√t mixSAT) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.

UR - http://www.scopus.com/inward/record.url?scp=85078924062&partnerID=8YFLogxK

U2 - https://doi.org/10.1613/jair.1.11316

DO - https://doi.org/10.1613/jair.1.11316

M3 - Article

VL - 67.2020

SP - 115

EP - 128

JO - The journal of artificial intelligence research

JF - The journal of artificial intelligence research

SN - 1076-9757

IS - 1

ER -