Safety-constrained reinforcement learning with a distributional safety critic

Qisong Yang; Thiago D Simão; Simon H. Tindemans; Matthijs T.J. Spaan

doi:10.1007/s10994-022-06187-8

Qisong Yang^*, Thiago D Simão, Simon H. Tindemans, Matthijs T.J. Spaan

^*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Safety is critical to broadening the real-world use of reinforcement learning. Modeling the safety aspects using a safety-cost signal separate from the reward and bounding the expected safety-cost is becoming standard practice, since it avoids the problem of finding a good balance between safety and performance. However, it can be risky to set constraints only on the expectation, neglecting the tail of the distribution, which might have prohibitively large values. In this paper, we propose a method called Worst-Case Soft Actor Critic for safe RL that approximates the distribution of accumulated safety-costs to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety constraint, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can compute policies whose worst-case performance satisfies the constraints. We investigate two ways to estimate the safety-cost distribution, namely a Gaussian approximation and a quantile regression algorithm. The Gaussian approximation is simple and easy to implement but may underestimate the safety cost, whereas the quantile regression leads to more conservative behavior. The empirical analysis shows that the quantile regression method achieves excellent results in complex safety-constrained environments, showing good risk control.
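
To illustrate the risk measure mentioned in the abstract, the following is a minimal sketch of how an upper-tail conditional Value-at-Risk (CVaR) could be computed from the two kinds of estimate the abstract names. It is not the authors' implementation. The function names `gaussian_cvar` and `quantile_cvar`, the parameter `alpha` (the fraction of worst outcomes considered), and the example numbers are all assumptions made for illustration. The Gaussian case uses the standard closed form for the CVaR of a normal distribution; the quantile case simply averages the largest equally weighted quantile estimates.

```python
import math
from statistics import NormalDist


def gaussian_cvar(mean, std, alpha):
    """Upper-tail CVaR of a N(mean, std^2) cost: the expected cost over the
    worst `alpha` fraction of outcomes (alpha = 1 recovers the mean)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if alpha == 1.0:
        return mean
    standard = NormalDist()
    # Standardised Value-at-Risk: the (1 - alpha) quantile of N(0, 1).
    z = standard.inv_cdf(1.0 - alpha)
    return mean + std * standard.pdf(z) / alpha


def quantile_cvar(quantiles, alpha):
    """Approximate upper-tail CVaR from equally weighted quantile estimates
    by averaging the largest ceil(alpha * N) of them."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(quantiles)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[-k:]) / k


if __name__ == "__main__":
    # Hypothetical numbers: expected cost 10, standard deviation 2, bound 12.
    bound = 12.0
    expected = gaussian_cvar(10.0, 2.0, alpha=1.0)
    tail = gaussian_cvar(10.0, 2.0, alpha=0.1)
    print("expectation within bound:", expected <= bound)
    print("CVaR at alpha=0.1 within bound:", tail <= bound)
```

Under these hypothetical numbers the expected cost satisfies the bound while the tail measure does not, which is the situation the abstract warns about when constraints are placed only on the expectation.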

Original language	English
Pages (from-to)	859-887
Number of pages	29
Journal	Machine Learning
Volume	112
Issue number	3
DOIs	https://doi.org/10.1007/s10994-022-06187-8
Publication status	Published - 2022

Access to Document

10.1007/s10994-022-06187-8

Yang2022_Article_Safety-constrainedReinforcemen (Final published version, 2.62 MB, Licence: CC BY)

Cite this

@article{663135aa2fef43fc9e1754e69d144218,

title = "Safety-constrained reinforcement learning with a distributional safety critic",

abstract = "Safety is critical to broadening the real-world use of reinforcement learning. Modeling the safety aspects using a safety-cost signal separate from the reward and bounding the expected safety-cost is becoming standard practice, since it avoids the problem of finding a good balance between safety and performance. However, it can be risky to set constraints only on the expectation neglecting the tail of the distribution, which might have prohibitively large values. In this paper, we propose a method called Worst-Case Soft Actor Critic for safe RL that approximates the distribution of accumulated safety-costs to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety constraint, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can compute policies whose worst-case performance satisfies the constraints. We investigate two ways to estimate the safety-cost distribution, namely a Gaussian approximation and a quantile regression algorithm. On the one hand, the Gaussian approximation is simple and easy to implement, but may underestimate the safety cost, on the other hand, the quantile regression leads to a more conservative behavior. The empirical analysis shows that the quantile regression method achieves excellent results in complex safety-constrained environments, showing good risk control.",

author = "Yang, Qisong and Sim{\~a}o, {Thiago D} and Tindemans, {Simon H.} and Spaan, {Matthijs T.J.}",

year = "2022",

doi = "10.1007/s10994-022-06187-8",

language = "English",

volume = "112",

pages = "859--887",

journal = "Machine Learning",

issn = "0885-6125",

publisher = "Springer",

number = "3",

}

TY - JOUR

T1 - Safety-constrained reinforcement learning with a distributional safety critic

AU - Yang, Qisong

AU - Simão, Thiago D

AU - Tindemans, Simon H.

AU - Spaan, Matthijs T.J.

PY - 2022

Y1 - 2022

N2 - Safety is critical to broadening the real-world use of reinforcement learning. Modeling the safety aspects using a safety-cost signal separate from the reward and bounding the expected safety-cost is becoming standard practice, since it avoids the problem of finding a good balance between safety and performance. However, it can be risky to set constraints only on the expectation neglecting the tail of the distribution, which might have prohibitively large values. In this paper, we propose a method called Worst-Case Soft Actor Critic for safe RL that approximates the distribution of accumulated safety-costs to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety constraint, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can compute policies whose worst-case performance satisfies the constraints. We investigate two ways to estimate the safety-cost distribution, namely a Gaussian approximation and a quantile regression algorithm. On the one hand, the Gaussian approximation is simple and easy to implement, but may underestimate the safety cost, on the other hand, the quantile regression leads to a more conservative behavior. The empirical analysis shows that the quantile regression method achieves excellent results in complex safety-constrained environments, showing good risk control.

AB - Safety is critical to broadening the real-world use of reinforcement learning. Modeling the safety aspects using a safety-cost signal separate from the reward and bounding the expected safety-cost is becoming standard practice, since it avoids the problem of finding a good balance between safety and performance. However, it can be risky to set constraints only on the expectation neglecting the tail of the distribution, which might have prohibitively large values. In this paper, we propose a method called Worst-Case Soft Actor Critic for safe RL that approximates the distribution of accumulated safety-costs to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety constraint, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can compute policies whose worst-case performance satisfies the constraints. We investigate two ways to estimate the safety-cost distribution, namely a Gaussian approximation and a quantile regression algorithm. On the one hand, the Gaussian approximation is simple and easy to implement, but may underestimate the safety cost, on the other hand, the quantile regression leads to a more conservative behavior. The empirical analysis shows that the quantile regression method achieves excellent results in complex safety-constrained environments, showing good risk control.

UR - http://www.scopus.com/inward/record.url?scp=85132574112&partnerID=8YFLogxK

U2 - 10.1007/s10994-022-06187-8

DO - 10.1007/s10994-022-06187-8

M3 - Article

SN - 0885-6125

VL - 112

SP - 859

EP - 887

JO - Machine Learning

JF - Machine Learning

IS - 3

ER -
