Fine-tuning deep RL with gradient-free optimization

Tim de Bruin; Jens Kober; Karl Tuyls; Robert Babuška

doi:10.1016/j.ifacol.2020.12.2240

Corresponding author: Tim de Bruin

Learning & Autonomous Control

Research output: Contribution to journal › Conference article › Scientific › peer-review

2 Citations (Scopus)

198 Downloads (Pure)

Abstract

Deep reinforcement learning makes it possible to train control policies that map high-dimensional observations to actions. These methods typically use gradient-based optimization techniques to enable relatively efficient learning, but are notoriously sensitive to hyperparameter choices and do not have good convergence properties. Gradient-free optimization methods, such as evolutionary strategies, can offer a more stable alternative but tend to be much less sample efficient. In this work we propose a combination, using the relative strengths of both. We start with a gradient-based initial training phase, which is used to quickly learn both a state representation and an initial policy. This phase is followed by a gradient-free optimization of only the final action selection parameters. This enables the policy to improve in a stable manner to a performance level not obtained by gradient-based optimization alone, using many fewer trials than methods using only gradient-free optimization. We demonstrate the effectiveness of the method on two Atari games, a continuous control benchmark and the CarRacing-v0 benchmark. On the latter we surpass the best previously reported score while using significantly fewer episodes.
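The second phase described in the abstract, gradient-free optimization of only the final action selection parameters on top of a frozen learned representation, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the optimizer, so a simple elite-averaging evolution strategy is swapped in, and the toy scoring function, the random "frozen" feature layer, and the names make_toy_problem and fine_tune_head are hypothetical stand-ins rather than the authors' setup. A real use would replace the score function with the return of one or more environment episodes.

```python
import numpy as np


def make_toy_problem(seed=0, obs_dim=8, feat_dim=16, n_obs=256):
    """Build a hypothetical stand-in for a pretrained policy and its task.

    Returns a score function over the head parameters (higher is better)
    and an imperfect initial head, mimicking the result of phase one.
    """
    rng = np.random.default_rng(seed)
    # Frozen feature layer: stands in for the representation learned
    # during the gradient-based phase. It is never updated below.
    w_frozen = rng.standard_normal((feat_dim, obs_dim)) / np.sqrt(obs_dim)
    obs = rng.standard_normal((n_obs, obs_dim))
    phi = np.tanh(obs @ w_frozen.T)  # features computed once, then fixed
    target_head = rng.standard_normal(feat_dim)

    def score(head):
        # Toy objective (assumption): negative squared error between the
        # actions of this head and those of a hidden target head.
        return -float(np.mean((phi @ head - phi @ target_head) ** 2))

    # "Pretrained" head: close to the target, but not optimal.
    head0 = target_head + 0.5 * rng.standard_normal(feat_dim)
    return score, head0


def fine_tune_head(score, head0, iters=100, pop=32, n_elite=8,
                   sigma=0.2, seed=1):
    """Gradient-free fine-tuning of the final-layer parameters only."""
    rng = np.random.default_rng(seed)
    mean = head0.copy()
    best, best_score = head0.copy(), score(head0)
    for _ in range(iters):
        # Sample candidate heads around the current search mean.
        candidates = mean + sigma * rng.standard_normal((pop, mean.size))
        scores = np.array([score(c) for c in candidates])
        # Move the mean to the average of the best-scoring candidates.
        elite = candidates[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
        # Track the best head seen so far; it never gets worse.
        top = int(np.argmax(scores))
        if scores[top] > best_score:
            best, best_score = candidates[top].copy(), float(scores[top])
    return best, best_score


if __name__ == "__main__":
    score, head0 = make_toy_problem()
    best, best_score = fine_tune_head(score, head0)
    print("initial score:", score(head0))
    print("fine-tuned score:", best_score)
```

Because only the small head vector is searched, each candidate is cheap to describe and the search space stays low-dimensional, which is the property the abstract relies on to keep the number of trials well below that of fully gradient-free training.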

Original language	English
Pages (from-to)	8049-8056
Journal	IFAC-PapersOnline
Volume	53
Issue number	2
DOIs	https://doi.org/10.1016/j.ifacol.2020.12.2240
Publication status	Published - 2020
Event	21st IFAC World Congress 2020, Berlin, Germany. Duration: 12 Jul 2020 → 17 Jul 2020

Keywords

Control
Deep learning
Neural networks
Optimization
Reinforcement learning

Access to Document

https://doi.org/10.1016/j.ifacol.2020.12.2240

1-s2.0-S2405896320329001-main: Final published version, 557 KB. Licence: CC BY-NC-ND

Cite this

@article{b85a88663a3649b0a54fcc0c700e842a,
  title = "Fine-tuning deep RL with gradient-free optimization",
  abstract = "Deep reinforcement learning makes it possible to train control policies that map high-dimensional observations to actions. These methods typically use gradient-based optimization techniques to enable relatively efficient learning, but are notoriously sensitive to hyperparameter choices and do not have good convergence properties. Gradient-free optimization methods, such as evolutionary strategies, can offer a more stable alternative but tend to be much less sample efficient. In this work we propose a combination, using the relative strengths of both. We start with a gradient-based initial training phase, which is used to quickly learn both a state representation and an initial policy. This phase is followed by a gradient-free optimization of only the final action selection parameters. This enables the policy to improve in a stable manner to a performance level not obtained by gradient-based optimization alone, using many fewer trials than methods using only gradient-free optimization. We demonstrate the effectiveness of the method on two Atari games, a continuous control benchmark and the CarRacing-v0 benchmark. On the latter we surpass the best previously reported score while using significantly fewer episodes.",
  keywords = "Control, Deep learning, Neural networks, Optimization, Reinforcement learning",
  author = "{de Bruin}, Tim and Jens Kober and Karl Tuyls and Robert Babu{\v s}ka",
  year = "2020",
  doi = "10.1016/j.ifacol.2020.12.2240",
  language = "English",
  volume = "53",
  pages = "8049--8056",
  journal = "IFAC-PapersOnline",
  issn = "1474-6670",
  publisher = "Elsevier",
  number = "2",
  note = "21st IFAC World Congress 2020 ; Conference date: 12-07-2020 Through 17-07-2020",
}

TY - JOUR
T1 - Fine-tuning deep RL with gradient-free optimization
AU - de Bruin, Tim
AU - Kober, Jens
AU - Tuyls, Karl
AU - Babuška, Robert
PY - 2020
Y1 - 2020
AB - Deep reinforcement learning makes it possible to train control policies that map high-dimensional observations to actions. These methods typically use gradient-based optimization techniques to enable relatively efficient learning, but are notoriously sensitive to hyperparameter choices and do not have good convergence properties. Gradient-free optimization methods, such as evolutionary strategies, can offer a more stable alternative but tend to be much less sample efficient. In this work we propose a combination, using the relative strengths of both. We start with a gradient-based initial training phase, which is used to quickly learn both a state representation and an initial policy. This phase is followed by a gradient-free optimization of only the final action selection parameters. This enables the policy to improve in a stable manner to a performance level not obtained by gradient-based optimization alone, using many fewer trials than methods using only gradient-free optimization. We demonstrate the effectiveness of the method on two Atari games, a continuous control benchmark and the CarRacing-v0 benchmark. On the latter we surpass the best previously reported score while using significantly fewer episodes.

KW - Control
KW - Deep learning
KW - Neural networks
KW - Optimization
KW - Reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85098817994&partnerID=8YFLogxK
U2 - 10.1016/j.ifacol.2020.12.2240
DO - 10.1016/j.ifacol.2020.12.2240
M3 - Conference article
AN - SCOPUS:85098817994
SN - 1474-6670
VL - 53
SP - 8049
EP - 8056
JO - IFAC-PapersOnline
JF - IFAC-PapersOnline
IS - 2
T2 - 21st IFAC World Congress 2020
Y2 - 12 July 2020 through 17 July 2020
ER -
