The goal in behavior cloning is to extract meaningful information from expert demonstrations and reproduce the same behavior autonomously. However, the available data is unlikely to exhaustively cover the potential problem space. As a result, the quality of automated decision making is compromised unless there are elegant ways to handle out-of-distribution states that might occur due to unforeseen events in the environment. Our novel approach, RECO, uses only the available offline data to recover a behavior cloning agent from unknown states. Given expert trajectories, RECO learns both an imitation policy and a recovery policy. Our contribution is a method for learning this recovery policy, which steers the agent from unknown states back to the trajectories in the data set. While there is, by definition, no data available to learn the recovery policy, we exploit abstractions to generalize beyond the available data, thus overcoming this problem. In a tabular domain, we show that our method results in drastically fewer calls to a human supervisor without compromising solution quality, even with few trajectories provided by an expert. We further introduce a continuous adaptation of RECO and evaluate its potential in an experiment.
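The split between an imitation policy and a recovery policy can be illustrated with a minimal tabular sketch. This is not RECO's algorithm: the abstract does not specify how the abstractions are built, so the recovery rule below (step toward the nearest demonstrated state by Manhattan distance) is a stand-in assumption, and the grid coordinates, the `ACTIONS` table, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical 4-connected grid actions (dx, dy); not taken from the paper.
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def fit_imitation_policy(trajectories):
    """Tabular behavior cloning: majority expert action per visited state."""
    counts = defaultdict(Counter)
    for trajectory in trajectories:
        for state, action in trajectory:
            counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def recovery_action(state, known_states):
    """Assumed stand-in for a learned recovery policy:
    take the step that moves closest to the nearest demonstrated state."""
    target = min(known_states, key=lambda s: (manhattan(state, s), s))

    def after(action):
        dx, dy = ACTIONS[action]
        return (state[0] + dx, state[1] + dy)

    # sorted() makes tie-breaking between equally good actions deterministic.
    return min(sorted(ACTIONS), key=lambda a: manhattan(after(a), target))


def act(state, policy):
    """Imitate in states seen in the data; otherwise try to recover."""
    if state in policy:
        return policy[state], "imitation"
    return recovery_action(state, policy.keys()), "recovery"


demos = [[((0, 0), "right"), ((1, 0), "right"), ((2, 0), "down")]]
policy = fit_imitation_policy(demos)
known = act((1, 0), policy)    # state covered by the demonstrations
unknown = act((1, 2), policy)  # state outside the demonstrations
```

In this toy setting, the agent calls for no supervisor at all in the unknown state; it simply heads back toward the demonstrated trajectory. A distance-based rule like this only works where a meaningful metric over states is available, which is one reason a learned abstraction, as the abstract describes, would be needed in general.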
|Number of pages||10|
|Publication status||Published - 2020|
|Event||Offline Reinforcement Learning Workshop at Neural Information Processing Systems, 2020|
|Duration||12 Dec 2020 → 12 Dec 2020|
|Other||Virtual/online event due to COVID-19|