
User Tutorial 3. Actor-Critic on the Rain-car environment


Prerequisites

  • We strongly recommend doing Tutorial #2 before this one.

Objectives

Using the same environment, we will set up an Actor-Critic agent and try to make it learn to control the car. The main difference between Actor-Critic agents and Q-Learning agents is that the output of an Actor-Critic agent is continuous, whereas the output of Q-Learning and similar agents is a discretization of the continuous action space (see the short sketch after the list below). We will review the following concepts:

  • Noise functions
  • Initializing the policy learned by an actor using a controller
  • Using two sets of weights (Freeze-Target-Function) for improved stability in the learning process
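
As a rough illustration of the continuous-output distinction mentioned above (purely illustrative Python, not SimionZoo code; the discretization and the policy shown are made up for the example):

```python
import numpy as np

# Purely illustrative, not SimionZoo code: a Q-Learning agent picks the best of a
# fixed set of discretized accelerations, while an actor outputs the acceleration
# directly as a continuous value.

DISCRETE_ACCELERATIONS = np.linspace(-1.0, 1.0, 5)      # hypothetical discretization

def q_learning_action(q_values):
    """Pick the discretized acceleration with the highest Q-value."""
    return DISCRETE_ACCELERATIONS[np.argmax(q_values)]

def actor_action(policy, state):
    """The actor maps the state directly to a continuous acceleration."""
    return float(policy(state))

print(q_learning_action(np.array([0.1, 0.5, 0.2, 0.0, -0.3])))    # -> -0.5
print(actor_action(lambda s: 0.37 * s[0], np.array([0.8, 0.0])))  # -> 0.296 (approx.)
```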

Tutorial

Actor-Critic agents consist of two elements (the sketch after this list shows how they interact):

  • The actor: learns a policy pi(s) based on the feedback of the critic. In this tutorial, we will use CACLA.
  • The critic: estimates the value of the current policy, V(s), as a function of the state. At every time-step, the critic sends a feedback value to the actor assessing the quality of the last action selected. In this tutorial, we will use TD(lambda), one of the most popular value-function learning algorithms.
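
In rough pseudocode, the interaction between the two elements looks something like the sketch below (a minimal illustration; `env`, `actor`, `critic` and `noise` are hypothetical objects, not SimionZoo classes):

```python
# Minimal sketch of the actor-critic interaction loop (not SimionZoo code;
# `env`, `actor`, `critic` and `noise` are hypothetical objects with the shown methods).

def run_episode(env, actor, critic, noise, episode_length):
    state = env.reset()
    for _ in range(episode_length):
        # The actor proposes an action and exploration noise is added on top
        action = actor.select_action(state) + noise.sample()
        next_state, reward, done = env.step(action)

        # The critic evaluates the outcome and feeds a scalar back to the actor
        td_error = critic.update(state, reward, next_state)   # TD(lambda) update
        actor.update(state, action, td_error)                 # e.g. CACLA update

        state = next_state
        if done:
            break
```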

Design the experiment

Run Badger and create a new experiment in the Editor tab. We will leave the default Log parameters and select Rain-car as the World.

Experiment

We will use the same experiment parameters as in the previous tutorial:

  • Num-Episodes = 100
  • Eval-Freq = 10
  • Episode-Length = 60

SimGod

We will start by ignoring the following parameters: Target-Function-Update-Freq, Gamma, Freeze-Target-Function and Use-Importance-Weights. We will revisit them later.

We will then set the parameters of the State-Feature-Map similarly to what we did in the previous tutorial, but, instead of using a Discrete-Grid, we will use a Gaussian-RBF-Grid with two input variables (position-deviation and acceleration) and Num-Features = 20. This means that each of the two learned functions (the policy pi(s) and the value function V(s)) will be represented with 20*20 = 400 features. The more features, the more accurate the approximated function can be.
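
To give an idea of what a Gaussian-RBF-Grid computes, each of the 20*20 = 400 features is the product of two Gaussians centered on a grid point. The sketch below is a simplified illustration, not the exact feature map SimionZoo implements; the variable ranges and Gaussian widths are placeholders:

```python
import numpy as np

# Simplified sketch of a 2-D Gaussian RBF grid feature map (20 x 20 = 400 features).
# The variable ranges and the Gaussian width are illustrative placeholders,
# not the values SimionZoo actually uses.

NUM_FEATURES = 20
pos_centers = np.linspace(-1.0, 1.0, NUM_FEATURES)    # position-deviation grid centers
acc_centers = np.linspace(-1.0, 1.0, NUM_FEATURES)    # acceleration grid centers
width = pos_centers[1] - pos_centers[0]               # one grid cell as the Gaussian width

def rbf_features(position_deviation, acceleration):
    """Return a 400-dimensional feature vector for the given state variables."""
    pos_act = np.exp(-((position_deviation - pos_centers) ** 2) / (2 * width ** 2))
    acc_act = np.exp(-((acceleration - acc_centers) ** 2) / (2 * width ** 2))
    features = np.outer(pos_act, acc_act).ravel()      # 20 * 20 = 400 activations
    return features / features.sum()                   # normalize so they sum to 1

phi = rbf_features(0.3, -0.1)
print(phi.shape)   # (400,)
```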

Since both pi(s) and V(s) are functions of the state and not of the action, we can disable the Action-Feature-Map.

We will use Experience-Replay with the default parameters.

Simions

We will select Actor-Critic as the type of agent and disable Base-Controller. We will revisit this decision later and explain its purpose.

Actor

Set Policy-Learner = CACLA with Alpha = (Constant) 0.1, Policy = Deterministic-Policy-Gaussian-Noise and Output-Action = acceleration. This will make the actor learn a deterministic policy with a learning gain of 0.1. The initial value of the policy will be 0 by default, which means the car will not move on its own. For the agent to explore and improve this initial policy, a noise signal (Exploration-Noise) is added to the output of the policy. There are three different noise signals in SimionZoo and we suggest using an Ornstein-Uhlenbeck process. We will leave the default parameters for now and return to them after we finish setting the rest of the experiment parameters.
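
For reference, the core idea of CACLA is that the actor's weights are moved toward the explored action only when the critic's feedback (the TD error) is positive. The sketch below shows this update rule for a linear function approximator; it is an illustration under those assumptions, not SimionZoo's implementation:

```python
import numpy as np

# Minimal sketch of a CACLA actor update with a linear function approximator.
# `features` would come from the state feature map (e.g. the 400 RBF features);
# this illustrates the update rule only, not SimionZoo's implementation.

class CaclaActor:
    def __init__(self, num_features, alpha=0.1):
        self.weights = np.zeros(num_features)    # initial policy outputs 0 everywhere
        self.alpha = alpha                       # the learning gain set above

    def select_action(self, features):
        return float(self.weights @ features)    # deterministic policy pi(s)

    def update(self, features, taken_action, td_error):
        # CACLA: only learn from actions that turned out better than expected
        if td_error > 0.0:
            prediction = self.select_action(features)
            self.weights += self.alpha * (taken_action - prediction) * features
```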

Critic

Select Critic = TD-Lambda without eligibility traces and Alpha = (Constant) 0.001. The critic's gain should be at least an order of magnitude lower than the policy learner's gain. In general, we will always set the initial value of the value function to 0.0.
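
As a sketch of what the critic computes at each step, here is a TD update without eligibility traces for a linear value function (illustrative only; the Gamma value is a placeholder and the code is not SimionZoo's implementation):

```python
import numpy as np

# Minimal sketch of a TD critic without eligibility traces, using a linear
# value function V(s) = w . phi(s). Illustrative only, not SimionZoo code.

class TDCritic:
    def __init__(self, num_features, alpha=0.001, gamma=0.9):
        self.weights = np.zeros(num_features)    # initial value function = 0.0
        self.alpha = alpha                       # an order of magnitude below the actor's gain
        self.gamma = gamma                       # discount factor (Gamma in SimGod)

    def value(self, features):
        return float(self.weights @ features)

    def update(self, features, reward, next_features):
        # TD error: how much better or worse the transition was than predicted
        td_error = reward + self.gamma * self.value(next_features) - self.value(features)
        self.weights += self.alpha * td_error * features
        return td_error                          # the feedback passed to the actor
```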

Adjust the exploration noise-signal

Next, we will go back to the Exploration-Noise parameters and adjust them visually. To see the effect of the noise signal on the agent's exploration, we will temporarily disable evaluation episodes (Episode/Eval-Freq = 0). Back in Exploration-Noise: Mu is the mean value of the noise signal, so we will usually set it to zero so that the signal moves around zero. Sigma determines the volatility of the signal (how far it strays from the mean value) and Theta the rate at which the noise tends to return to the mean. Let's start by setting Sigma = 0.1 and Theta = 1.0 without scaling the output (Scale = (Constant) 1).
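
If it helps to picture what these parameters do, a discretized Ornstein-Uhlenbeck process can be sketched as follows (a simplified illustration of Mu, Sigma, Theta and Scale, not the exact implementation in SimionZoo; the time-step dt is a placeholder):

```python
import numpy as np

# Discretized Ornstein-Uhlenbeck process, as a simplified illustration of the
# Mu / Sigma / Theta / Scale parameters (not the exact SimionZoo implementation).

class OrnsteinUhlenbeckNoise:
    def __init__(self, mu=0.0, sigma=0.1, theta=1.0, scale=1.0, dt=0.01):
        self.mu, self.sigma, self.theta, self.scale, self.dt = mu, sigma, theta, scale, dt
        self.x = mu

    def sample(self):
        # Theta pulls the signal back towards Mu, Sigma controls its volatility
        self.x += self.theta * (self.mu - self.x) * self.dt \
                  + self.sigma * np.sqrt(self.dt) * np.random.randn()
        return self.scale * self.x

noise = OrnsteinUhlenbeckNoise()
samples = [noise.sample() for _ in range(100)]   # plot these to see the signal's shape
```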

Now, run the experiment locally to see the behavior of the agent.
