Reinforcement learning in continuous state and action spaces

IL Busoniu

Research output: Thesis, Dissertation (TU Delft)

Abstract

The framework of dynamic programming (DP) and reinforcement learning (RL) can be used to express important problems arising in a variety of fields, including automatic control, operations research, economics, and computer science. From the perspective of automatic control, DP/RL expresses an optimal control problem for general, nonlinear, and stochastic systems. Moreover, RL algorithms solve the problem without requiring prior knowledge about the process, and online RL algorithms do not even require data in advance; they start learning a solution to the problem at the moment when they are placed in the control loop. In the DP/RL framework, a controller measures the state of a process at each discrete time step and applies an action according to a control policy. As a result of this action, the process transitions to a new state, and a scalar reward signal is sent to the controller to indicate the quality of this transition. The controller measures the new state, and the whole cycle repeats. The goal is to find an optimal policy, i.e., a policy that maximizes the cumulative reward over the course of interaction (the return). DP algorithms search for an optimal policy using a model of the process dynamics and the reward function. RL algorithms do not require a model, but use data obtained from the process. Many DP and RL algorithms use value functions, which give the returns from every state (V-functions) or from every state-action pair (Q-functions). This thesis develops effective DP and RL techniques for control. Classical DP/RL algorithms represent value functions and policies exactly. However, most control problems have continuous state and action variables, in which case value functions and policies cannot be represented exactly, but have to be approximated.
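
The interaction cycle described above (measure the state, apply an action, receive a reward, repeat) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the scalar linear process in `step`, the quadratic reward, the two example policies, and the optional `discount` factor (set to 1.0 so that the return is the plain cumulative reward).

```python
from typing import Callable, Tuple


def step(state: float, action: float) -> Tuple[float, float]:
    """Hypothetical process: return (next_state, reward) for one transition."""
    next_state = 0.9 * state + action
    # Reward penalizes distance from the origin and control effort.
    reward = -(next_state ** 2) - 0.1 * action ** 2
    return next_state, reward


def run_episode(
    policy: Callable[[float], float],
    initial_state: float,
    num_steps: int,
    discount: float = 1.0,
) -> float:
    """Run the control loop and accumulate the (optionally discounted) return."""
    state = initial_state
    total_return = 0.0
    weight = 1.0
    for _ in range(num_steps):
        action = policy(state)               # controller applies an action
        state, reward = step(state, action)  # process moves to a new state
        total_return += weight * reward      # reward rates the transition
        weight *= discount
    return total_return


def proportional_policy(state: float) -> float:
    """Hypothetical policy: push the state toward zero."""
    return -0.5 * state


def zero_policy(state: float) -> float:
    """Hypothetical baseline policy: never act."""
    return 0.0
```

Comparing `run_episode(proportional_policy, 1.0, 20)` with `run_episode(zero_policy, 1.0, 20)` shows how the return ranks policies: under these assumed dynamics, the proportional policy obtains the higher return.
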
Three categories of algorithms for approximate DP/RL can be identified, according to the path they take to search for an (approximately) optimal policy: approximate value iteration, approximate policy iteration, and approximate policy search. Algorithms for approximate value iteration search for an approximation of the optimal value function, and use it to compute an approximately optimal policy. Algorithms for approximate policy iteration iteratively improve policies. In each iteration, an approximate value function of the current policy is found, which is then used to compute a new, improved policy. Algorithms for approximate policy search parameterize the policy and optimize its parameters directly, without using a value function…
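
As a rough illustration of the first category, the sketch below runs model-based Q-iteration with a very simple approximator: the continuous state is mapped to the nearest point of a fixed grid and the action set is discretized. This is only one possible instance under stated assumptions, not the method developed in the thesis; the process in `model`, the grids `STATES` and `ACTIONS`, the discount factor `GAMMA`, and all function names are hypothetical.

```python
import numpy as np

GAMMA = 0.95                            # assumed discount factor
STATES = np.linspace(-1.0, 1.0, 21)     # grid over the continuous state space
ACTIONS = np.linspace(-0.5, 0.5, 11)    # discretized action set


def model(x, u):
    """Hypothetical deterministic process model and reward function."""
    x_next = np.clip(0.9 * x + u, -1.0, 1.0)
    reward = -(x_next ** 2) - 0.1 * u ** 2
    return x_next, reward


def nearest_index(x):
    """Index of the grid point closest to state x (nearest-neighbor approximation)."""
    return int(np.argmin(np.abs(STATES - x)))


def approximate_q_iteration(num_iterations=200, tol=1e-8):
    """Iterate the Bellman optimality update on the grid-based Q-function."""
    q = np.zeros((len(STATES), len(ACTIONS)))
    for _ in range(num_iterations):
        q_new = np.empty_like(q)
        for i, x in enumerate(STATES):
            for j, u in enumerate(ACTIONS):
                x_next, r = model(x, u)
                # Back up the best approximate value of the next state.
                q_new[i, j] = r + GAMMA * q[nearest_index(x_next)].max()
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new
    return q


def greedy_action(q, x):
    """Approximately optimal policy: greedy in the approximate Q-function."""
    return float(ACTIONS[np.argmax(q[nearest_index(x)])])
```

After `q = approximate_q_iteration()`, calling `greedy_action(q, x)` gives the approximately optimal action for any state `x` in the assumed range. A finer grid, or a different approximator altogether, trades computation for accuracy.
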
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Delft University of Technology
Supervisors/Advisors
  • Babuska, R., Supervisor
  • De Schutter, B.H.K., Advisor
Award date: 13 Jan 2009
Print ISBNs: 978-90-9023754-1
Publication status: Published - 2009
