Correct Me If I'm Wrong: Using Non-Experts to Repair Reinforcement Learning Policies

Sanne Van Waveren, Christian Pek, Jana Tumova, Iolanda Leite

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

7 Citations (Scopus)

Abstract

Reinforcement learning has shown great potential for learning sequential decision-making tasks. Yet, it is difficult to anticipate all possible real-world scenarios during training, causing robots to inevitably fail in the long run. Many of these failures are due to variations in the robot's environment. Usually, experts are called in to correct the robot's behavior; however, some of these failures do not necessarily require an expert to solve. In this work, we query non-experts online for help and explore 1) if/how non-experts can provide feedback to the robot after a failure and 2) how the robot can use this feedback to avoid such failures in the future by generating shields that restrict or correct its high-level actions. We demonstrate our approach on common daily scenarios of a simulated kitchen robot. The results indicate that non-experts can indeed understand and repair robot failures. Our generated shields accelerate learning and improve data efficiency during retraining.
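To make the idea of a shield over high-level actions concrete, the sketch below shows one minimal way such a filter could sit between a policy and action execution. It is an illustrative assumption, not the paper's implementation: the state representation, the rule format, and the names (ACTIONS, shielded_action, holding_cup) are hypothetical.

```python
# Illustrative sketch only: the shield/rule structure below is an assumption
# for exposition and is not taken from the paper's implementation.
import random

# High-level actions of a hypothetical kitchen robot.
ACTIONS = ["pick_up_cup", "pour_water", "place_on_tray", "open_fridge"]

# A "shield" derived from feedback after a failure: rules that forbid an
# action in states matching a condition (hypothetical representation).
shield = [
    # e.g. feedback after a spill: never pour while the cup is not held
    {"forbid": "pour_water", "when": lambda s: not s["holding_cup"]},
]

def shielded_action(state, policy_ranking):
    """Return the highest-ranked action that no shield rule forbids."""
    for action in policy_ranking:
        blocked = any(r["forbid"] == action and r["when"](state) for r in shield)
        if not blocked:
            return action
    return None  # every action is blocked; defer to a human

# Usage: a stand-in policy proposes actions in order of preference,
# and the shield overrides unsafe choices before execution.
state = {"holding_cup": False}
ranking = sorted(ACTIONS, key=lambda a: random.random())  # stub for an RL policy
print(shielded_action(state, ranking))
```

During retraining, restricting the action set in this way prunes known-bad choices, which is one intuition for why shields can speed up learning.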

Original language: English
Title of host publication: HRI 2022 - Proceedings of the 2022 ACM/IEEE International Conference on Human-Robot Interaction
Publisher: IEEE
Pages: 493-501
ISBN (Electronic): 9781538685549
DOIs
Publication status: Published - 2022
Externally published: Yes
Event: 17th Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI 2022 - Sapporo, Japan
Duration: 7 Mar 2022 - 10 Mar 2022

Conference

Conference: 17th Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI 2022
Country/Territory: Japan
City: Sapporo
Period: 7/03/22 - 10/03/22

Keywords

  • non-experts
  • policy repair
  • robot failure
  • shielded reinforcement learning
