TY - JOUR
T1 - Human-Feedback Shield Synthesis for Perceived Safety in Deep Reinforcement Learning
AU - Marta, Daniel
AU - Pek, Christian
AU - Melsion, Gaspar I.
AU - Tumova, Jana
AU - Leite, Iolanda
PY - 2022
Y1 - 2022
AB - Despite the successes of deep reinforcement learning (RL), it is still challenging to obtain safe policies. Formal verification approaches ensure safety at all times but usually overly restrict the agent's behaviors, since they assume adversarial behavior of the environment. Instead of assuming adversarial behavior, we suggest focusing on perceived safety, i.e., policies that avoid undesired behaviors while having a desired level of conservativeness. To obtain policies that are perceived as safe, we propose a shield synthesis framework with two distinct loops: (1) an inner loop that trains policies with a set of actions that is constrained by shields whose conservativeness is parameterized, and (2) an outer loop that presents example rollouts of the policy to humans and collects their feedback to update the parameters of the shields in the inner loop. We demonstrate our approach on an RL benchmark of lunar landing and a scenario in which a mobile robot navigates around humans. For the latter, we conducted two user studies to obtain policies that were perceived as safe. Our results indicate that our framework converges to policies that are perceived as safe, is robust against noisy feedback, and can query feedback for multiple policies at the same time.
KW - Human factors and human-in-the-loop
KW - Reinforcement learning
KW - Safety in HRI
UR - http://www.scopus.com/inward/record.url?scp=85121268554&partnerID=8YFLogxK
U2 - 10.1109/LRA.2021.3128237
DO - 10.1109/LRA.2021.3128237
M3 - Article
AN - SCOPUS:85121268554
SN - 2377-3766
VL - 7
SP - 406
EP - 413
JO - IEEE Robotics and Automation Letters
JF - IEEE Robotics and Automation Letters
IS - 1
ER -