Regret Bounds for Reinforcement Learning via Markov Chain Concentration

Ronald Ortner

doi:10.1613/jair.1.11316

Regret Bounds for Reinforcement Learning via Markov Chain Concentration

Research output: Contribution to journal › Article › Research › peer-review

Standard

Regret Bounds for Reinforcement Learning via Markov Chain Concentration. / Ortner, Ronald.
In: The journal of artificial intelligence research, Vol. 67.2020, No. 1, 23.01.2020, p. 115-128.

Research output: Contribution to journal › Article › Research › peer-review

Bibtex - Download

@article{f61ee437511b4018b0247bd02b96db23,

title = "Regret Bounds for Reinforcement Learning via Markov Chain Concentration",

abstract = "We give a simple optimistic algorithm for which it is easy to derive regret bounds of {\~O}(√t mixSAT) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter. ",

author = "Ronald Ortner",

year = "2020",

month = jan,

day = "23",

doi = "10.1613/jair.1.11316",

language = "English",

volume = "67.2020",

pages = "115--128",

journal = " The journal of artificial intelligence research",

issn = "1076-9757",

publisher = "Morgan Kaufmann Publishers, Inc.",

number = "1",

}

RIS (suitable for import to EndNote) - Download

TY - JOUR

T1 - Regret Bounds for Reinforcement Learning via Markov Chain Concentration

AU - Ortner, Ronald

PY - 2020/1/23

Y1 - 2020/1/23

N2 - We give a simple optimistic algorithm for which it is easy to derive regret bounds of Õ(√t mixSAT) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.

AB - We give a simple optimistic algorithm for which it is easy to derive regret bounds of Õ(√t mixSAT) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing time parameter t mix. These bounds are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters. They could only be improved by using an alternative mixing time parameter.

UR - http://www.scopus.com/inward/record.url?scp=85078924062&partnerID=8YFLogxK

U2 - 10.1613/jair.1.11316

DO - 10.1613/jair.1.11316

M3 - Article

VL - 67.2020

SP - 115

EP - 128

JO - The journal of artificial intelligence research

JF - The journal of artificial intelligence research

SN - 1076-9757

IS - 1

ER -

Research Portal

Regret Bounds for Reinforcement Learning via Markov Chain Concentration

Standard

Harvard

APA

Vancouver

Author

Bibtex - Download

RIS (suitable for import to EndNote) - Download

SFX

Research Portal

Regret Bounds for Reinforcement Learning via Markov Chain Concentration

Standard

Harvard

APA

Vancouver

Author

Bibtex - Download

RIS (suitable for import to EndNote) - Download

SFX

By the same authors

Adaptive Algorithms for Meta-Induction