Symbolic method for deriving policy in reinforcement learning

Eduard Alibekov, Jiří Kubalík, Robert Babuska

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

10 Citations (Scopus)
79 Downloads (Pure)

Abstract

This paper addresses the problem of deriving a policy from the value function in the context of reinforcement learning in continuous state and input spaces. We propose a novel method based on genetic programming to construct a symbolic function, which serves as a proxy to the value function and from which a continuous policy is derived. The symbolic proxy function is constructed such that it maximizes the number of correct choices of the control input for a set of selected states. Maximization methods can then be used to derive a control policy that performs better than the policy derived from the original approximate value function. The method was experimentally evaluated on two control problems with continuous spaces, pendulum swing-up and magnetic manipulation, and compared to a standard policy derivation method using the value function approximation. The results show that the proposed method and its variants outperform the standard method.
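The abstract describes deriving a continuous policy by maximizing over the control input using a (symbolic) proxy of the value function. The following is a minimal sketch of that generic derivation step, not the paper's implementation: the dynamics f(x, u), reward r(x, u), discount gamma, the proxy V_hat, and the scalar bounded input are all assumed placeholders supplied by the user.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def derive_policy(V_hat, f, r, gamma, u_min, u_max):
    """Return a policy pi(x) that picks the scalar input u in [u_min, u_max]
    maximizing the one-step return r(x, u) + gamma * V_hat(f(x, u))."""
    def pi(x):
        # Negate because scipy minimizes; bounded search over the input range.
        res = minimize_scalar(
            lambda u: -(r(x, u) + gamma * V_hat(f(x, u))),
            bounds=(u_min, u_max),
            method="bounded",
        )
        return res.x
    return pi


if __name__ == "__main__":
    # Purely illustrative toy placeholders (not from the paper):
    V_hat = lambda x: -np.sum(np.square(x))           # stand-in for a symbolic proxy
    f = lambda x, u: x + 0.01 * np.array([x[1], u])   # toy pendulum-like state update
    r = lambda x, u: -np.sum(np.square(x)) - 0.01 * u ** 2
    pi = derive_policy(V_hat, f, r, gamma=0.95, u_min=-2.0, u_max=2.0)
    print(pi(np.array([0.5, 0.0])))
```

The same maximization can be run with the original value-function approximation or with the symbolic proxy; the paper's comparison concerns which of the two yields the better-performing policy.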
Original language: English
Title of host publication: Proceedings of the 2016 IEEE 55th Conference on Decision and Control (CDC)
Editors: Francesco Bullo, Christophe Prieur, Alessandro Giua
Place of publication: Piscataway, NJ, USA
Publisher: IEEE
Pages: 2789-2795
ISBN (Print): 978-1-5090-1837-6
DOIs
Publication status: Published - 2016
Event: 55th IEEE Conference on Decision and Control, CDC 2016 - Las Vegas, United States
Duration: 12 Dec 2016 – 14 Dec 2016

Conference

Conference: 55th IEEE Conference on Decision and Control, CDC 2016
Abbreviated title: CDC 2016
Country: United States
City: Las Vegas
Period: 12/12/16 – 14/12/16

Keywords

  • Genetic programming
  • Sociology
  • Statistics
  • Learning (artificial intelligence)
  • Standards
  • Cybernetics
  • Trajectory
