Generalized off-policy actor-critic

Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson

Research output: Contribution to journal › Conference article › Scientific › peer-review

23 Citations (Scopus)

Abstract

We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to obtain an unbiased sample of this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in MuJoCo robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.
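As a rough guide to the interpolation the abstract describes, the following is a minimal sketch reconstructed from the abstract, not the paper's exact statement; the notation is assumed: d_mu and d_pi are the stationary state distributions of the behavior and target policies, P_pi is the target policy's state transition matrix, v_pi its value function, and \hat{\gamma} in [0, 1) is the extra discount parameter that controls the interpolation.

\[ d_{\hat\gamma}^{\top} = (1 - \hat\gamma)\, d_\mu^{\top} \left(I - \hat\gamma P_\pi\right)^{-1}, \qquad J_{\hat\gamma}(\pi) = \sum_s d_{\hat\gamma}(s)\, v_\pi(s). \]

Under this sketch, \hat\gamma = 0 gives d_{\hat\gamma} = d_\mu and recovers the excursion objective, while \hat\gamma \to 1 gives d_{\hat\gamma} \to d_\pi (for an ergodic chain), the distribution the target policy actually induces when deployed; intermediate values trade off between the two.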

Original language: English
Journal: Advances in Neural Information Processing Systems
Volume: 32
Publication status: Published - 2019
Externally published: Yes
Event: 33rd Annual Conference on Neural Information Processing Systems, NeurIPS 2019 - Vancouver, Canada
Duration: 8 Dec 2019 - 14 Dec 2019
