Abstract
In this study, we investigate the effects of conditioning Independent Q-Learners (IQL) not solely on their individual action-observation histories, but additionally on the sufficient plan-time statistic for Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs). In doing so, we attempt to address a key shortcoming of IQL, namely that it is likely to converge to a Nash equilibrium that can be arbitrarily poor. We identify a novel exploration strategy for IQL when it conditions on the sufficient statistic, and furthermore show that sub-optimal equilibria can be escaped consistently by sequencing the decision-making during learning. The main practical limitation is the exponential complexity of both the sufficient statistic and the decision rules.
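The core idea, conditioning each independent Q-learner on its own action-observation history together with a shared plan-time statistic and sequencing exploration across agents during learning, can be illustrated with a small tabular sketch. The example below is a hypothetical illustration, not the paper's implementation: in a two-step cooperative matrix game with full disclosure of past joint actions, the plan-time statistic degenerates to the joint-action history, each learner's Q-table is keyed on (own history, statistic), and only one agent explores at a time. All payoffs, hyperparameters, and names (`PAYOFF`, `greedy`, etc.) are illustrative assumptions.

```python
# Illustrative sketch: two Independent Q-Learners conditioning on
# (own action history, shared plan-time statistic), with sequenced exploration.
import random
from collections import defaultdict

# Hypothetical cooperative payoffs: (0, 0) is the good equilibrium,
# (1, 1) a poorer one, and miscoordination is punished.
PAYOFF = {(0, 0): 10.0, (1, 1): 5.0, (0, 1): -20.0, (1, 0): -20.0}
ACTIONS = (0, 1)
ALPHA, EPS, EPISODES, STEPS = 0.1, 0.2, 5000, 2


def greedy(q, key):
    """Greedy action for a (history, statistic) key in Q-table q."""
    return max(ACTIONS, key=lambda a: q[(key, a)])


q_tables = [defaultdict(float), defaultdict(float)]
random.seed(0)

for episode in range(EPISODES):
    own_hist = [(), ()]   # each agent's private action history
    stat = ()             # shared statistic; here the joint-action history
    for t in range(STEPS):
        actions = []
        for i in (0, 1):
            key = (own_hist[i], stat)
            # Sequenced exploration: only one agent may explore per step,
            # the other acts greedily w.r.t. its current Q-values.
            explore = (episode + t) % 2 == i and random.random() < EPS
            actions.append(random.choice(ACTIONS) if explore
                           else greedy(q_tables[i], key))
        reward = PAYOFF[tuple(actions)]
        next_stat = stat + (tuple(actions),)
        for i in (0, 1):
            key = (own_hist[i], stat)
            next_key = (own_hist[i] + (actions[i],), next_stat)
            target = reward
            if t + 1 < STEPS:  # bootstrap unless this was the final step
                target += max(q_tables[i][(next_key, a)] for a in ACTIONS)
            q_tables[i][(key, actions[i])] += ALPHA * (
                target - q_tables[i][(key, actions[i])])
            own_hist[i] = own_hist[i] + (actions[i],)
        stat = next_stat

# Inspect the greedy joint action at the empty history/statistic.
print([greedy(q_tables[i], ((), ())) for i in (0, 1)])
```

In this degenerate setting the statistic is small; the abstract's point is precisely that in general Dec-POMDPs both the statistic and the decision rules grow exponentially, which is what limits the approach in practice.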
| Original language | English |
|---|---|
| Title of host publication | BNAIC/BeneLearn 2020 |
| Editors | Lu Cao, Walter Kosters, Jefrey Lijffijt |
| Publisher | RU Leiden |
| Pages | 423-424 |
| Publication status | Published - 19 Nov 2020 |
| Event | BNAIC/BENELEARN 2020, Leiden, Netherlands, 19 Nov 2020 → 20 Nov 2020 |
Conference
| Conference | BNAIC/BENELEARN 2020 |
|---|---|
| Country/Territory | Netherlands |
| City | Leiden |
| Period | 19/11/20 → 20/11/20 |
Keywords
- Deep Reinforcement Learning
- Multi-Agent
- Partial Observability
- Decentralized Execution