TY - GEN
T1 - Safe Policy Improvement with an Estimated Baseline Policy
AU - Simão, Thiago D.
AU - Laroche, Romain
AU - Tachet des Combes, Rémi
N1 - Conference code: 19th
PY - 2020
Y1 - 2020
N2 - Previous work has shown the unreliability of existing algorithms in the batch Reinforcement Learning setting, and proposed the theoretically-grounded Safe Policy Improvement with Baseline Bootstrapping (SPIBB) fix: reproduce the baseline policy in the uncertain state-action pairs, in order to control the variance on the trained policy performance. However, in many real-world applications such as dialogue systems, pharmaceutical tests or crop management, data is collected under human supervision and the baseline remains unknown. In this paper, we apply SPIBB algorithms with a baseline estimate built from the data. We formally show safe policy improvement guarantees over the true baseline even without direct access to it. Our empirical experiments on finite and continuous-state tasks support the theoretical findings. They show little loss of performance in comparison with SPIBB when the baseline policy is given, and more importantly, drastically and significantly outperform competing algorithms both in safe policy improvement and in average performance.
UR - http://www.scopus.com/inward/record.url?scp=85096684694&partnerID=8YFLogxK
M3 - Conference contribution
SN - 9781450375184
T3 - AAMAS '20
SP - 1269
EP - 1277
BT - Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems
CY - Richland, SC
T2 - AAMAS 2020
Y2 - 9 May 2020 through 13 May 2020
ER -