User Tutorial 3. Actor-Critic on the Rain-car environment

Prerequisites

  • We strongly recommend doing Tutorial #2 before this one.

Objectives

Using the same environment, we will set up an Actor-Critic agent and try to make it learn to control the car. The main difference between Actor-Critic agents and Q-Learning agents is that the output of an Actor-Critic agent is continuous, whereas Q-Learning and similar agents select actions from a discretization of the continuous action space. We will review the following concepts:

  • Noise functions
  • Time-dependent parameters (schedules)

Tutorial

Actor-Critic agents consist of two elements:

  • The actor: learns a policy pi(s) based on the feedback of the critic. In this tutorial, we will use CACLA
  • The critic: estimates the value of the current policy, V(s), as a function of the state. Every time-step, the critic sends a feedback value to the actor assessing the quality of the last action selected. In this tutorial, we will use TD(lambda), one of the most widely used value-function learning algorithms (a minimal sketch of how the two interact follows this list)
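
For illustration, the following minimal Python sketch shows what happens on every time-step, assuming hypothetical `env`, `actor` and `critic` objects; it is not SimionZoo's actual code:

```python
def actor_critic_step(env, actor, critic, state, gamma=0.9):
    """One time-step of the actor-critic interaction (illustrative sketch)."""
    action = actor.select_action(state)       # pi(s) plus exploration noise
    next_state, reward = env.step(action)     # apply the action to the world

    # The critic assesses the transition with the TD error: a positive value
    # means the action turned out better than the critic expected.
    td_error = reward + gamma * critic.value(next_state) - critic.value(state)

    critic.update(state, td_error)             # improve the estimate of V(s)
    actor.update(state, action, td_error)      # the actor uses the TD error as feedback
    return next_state
```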

Design the experiment

Run Badger and create a new experiment in the Editor tab. We will leave the default Log parameters and select Rain-car as the World.

Experiment

We will configure the experiment with these parameters:

  • Num-Episodes = 500 (we can expect continuous-action methods to need more time to learn than discrete ones)
  • Eval-Freq = 10
  • Episode-Length = 60

SimGod

We will ignore the following parameters: Target-Function-Update-Freq, Freeze-Target-Function and Use-Importance-Weights, and leave the default value for Gamma = 0.9.

We will then set the parameters of the State-Feature-Map similarly to what we did in the previous tutorial but, instead of using a Discrete-Grid, we will use a Gaussian-RBF-Grid with two input variables (position-deviation and acceleration) and Num-Features = 20. This means that the two learned functions (the policy pi(s) and the value function V(s)) will each be represented with 20*20 = 400 features. The more features, the more accurate the approximated function can be.
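
As a rough idea of what the Gaussian-RBF-Grid computes, the Python sketch below builds a 20x20 grid of Gaussian features over the two input variables; the value ranges and the normalization are assumptions for illustration, not SimionZoo's exact implementation:

```python
import numpy as np

def gaussian_rbf_features(pos_dev, accel, num_features=20,
                          pos_range=(-1.0, 1.0), accel_range=(-1.0, 1.0)):
    """Illustrative 20x20 Gaussian RBF grid; the ranges are made-up placeholders."""
    # One Gaussian centre per grid cell: 20 * 20 = 400 centres in total
    pos_centres = np.linspace(pos_range[0], pos_range[1], num_features)
    accel_centres = np.linspace(accel_range[0], accel_range[1], num_features)
    width_p = pos_centres[1] - pos_centres[0]
    width_a = accel_centres[1] - accel_centres[0]

    cp, ca = np.meshgrid(pos_centres, accel_centres)
    activations = np.exp(-((pos_dev - cp) / width_p) ** 2
                         - ((accel - ca) / width_a) ** 2)
    return activations.ravel() / activations.sum()   # 400 normalized features
```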

Since both pi(s) and V(s) are functions of the state and not of the action, we can disable Action-Feature-Map.

We will use Experience-Replay with the default parameters.

Simions

We will select Actor-Critic as the type of agent and disable Base-Controller.

Actor

Set Policy-Learner = CACLA with Alpha = (Constant) 0.001, Policy = Deterministic-Policy-Gaussian-Noise and Output-Action = acceleration. This will make the actor learn a deterministic policy using a constant learning gain of 0.001. The initial value of the policy is 0 by default, which means that, initially, the acceleration output by the agent will be null and the car will not move. For the agent to learn and improve the initial policy, a noise signal (Exploration-Noise) is added to the output of the policy. There are three different noise signals in SimionZoo; we recommend an Ornstein-Uhlenbeck process because it produces time-correlated samples. We will leave the default parameters for now and come back to these settings after we finish configuring the rest of the experiment.
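
For reference, this is a minimal sketch of CACLA's actor update over the linear features described above; the class and method names are hypothetical and the real implementation may differ:

```python
import numpy as np

class CaclaActor:
    """Illustrative CACLA actor with a linear deterministic policy."""

    def __init__(self, num_features=400, alpha=0.001):
        self.weights = np.zeros(num_features)    # initial policy: pi(s) = 0
        self.alpha = alpha                       # constant learning gain

    def pi(self, features):
        return float(self.weights @ features)    # deterministic policy output

    def select_action(self, features, noise):
        return self.pi(features) + noise         # add exploration noise

    def update(self, features, action_taken, td_error):
        # CACLA only updates when the critic reports a positive TD error,
        # i.e. the explored action was better than expected; the policy is
        # then moved towards the action that was actually taken.
        if td_error > 0.0:
            self.weights += self.alpha * (action_taken - self.pi(features)) * features
```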

Critic

Select Critic = TD-Lambda without eligibility traces (i.e., lambda = 0) and Alpha = (Constant) 0.001. The critic's gain should be at least an order of magnitude higher than the policy learner's gain: since the actor shapes its policy according to the critic's feedback, the critic must learn the value of the policy faster than the actor updates it. We will leave the initial value of the function at its default, 0.0.
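
A minimal sketch of the critic's update, again over linear features and with lambda = 0 (no eligibility traces), is shown below; it is not SimionZoo's code, and the default gain used here is just one of the values we will try later:

```python
import numpy as np

class TDCritic:
    """Illustrative TD(0) critic with a linear value function."""

    def __init__(self, num_features=400, alpha=0.01, gamma=0.9):
        self.weights = np.zeros(num_features)   # initial value function: V(s) = 0
        self.alpha, self.gamma = alpha, gamma

    def value(self, features):
        return float(self.weights @ features)

    def update(self, features, next_features, reward):
        # TD error: how much better (or worse) the transition was than expected
        td_error = reward + self.gamma * self.value(next_features) - self.value(features)
        self.weights += self.alpha * td_error * features
        return td_error   # passed to the actor as feedback
```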

Adjust the exploration noise-signal

Next, we will return to the Exploration-Noise parameters and adjust them visually. To see the effect of the noise signal on the agent's exploration, we will temporarily disable evaluation episodes (Episode/Eval-Freq = 0). The Exploration-Noise parameters are:

  • Mu is the average value of the output; we will usually set it to zero so that the noise signal moves around zero
  • Sigma determines the volatility of the signal (how far it strays from the mean value)
  • Theta is the rate at which the noise tends to return to the mean value. Let's start by setting Sigma = 0.1 and Theta = 1.0
  • Scale allows us to scale the output of the noise by a constant or a time-dependent function. We will start without scaling the output: Scale = (Constant) 1

There is no absolute rule as to how to adjust the noise, but we should use enough exploration to move the car forward and back, covering most of the state-action space, so that the agent can learn the task. If we run the experiment locally, we can see that the car moves, so we will accept these noise settings.
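
For reference, a discrete-time Ornstein-Uhlenbeck process can be sketched as follows; the time-step dt and the exact discretization are assumptions, not necessarily what SimionZoo uses internally:

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Illustrative Ornstein-Uhlenbeck exploration noise."""

    def __init__(self, mu=0.0, sigma=0.1, theta=1.0, dt=0.01):
        self.mu, self.sigma, self.theta, self.dt = mu, sigma, theta, dt
        self.x = mu   # start at the mean value

    def sample(self, scale=1.0):
        # Each sample is pulled back towards Mu at rate Theta and perturbed by
        # Gaussian noise of volatility Sigma, so consecutive samples are
        # correlated in time and the exploration signal is smooth.
        self.x += self.theta * (self.mu - self.x) * self.dt \
                  + self.sigma * np.sqrt(self.dt) * np.random.randn()
        return scale * self.x
```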

Using schedules for time-dependent values

Usually, we want a higher degree of exploration at the beginning of an experiment, when the agent has no knowledge of the domain. As the agent gains knowledge of the task, we usually want the degree of exploration to decay. To do this, we will use the Scale parameter. Instead of using a constant value, we will set Scale = Simple-Linear-Decay with Initial-Value = 1.0 and End-Value = 0.0. This will multiply the noise signal by the initial value (1.0) at the beginning of the experiment, and the multiplying factor will decay linearly until it reaches the final value (0.0) when the experiment ends. The schedule is applied during the training episodes; no exploration is done in evaluation episodes.
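
As a sketch of what such a schedule computes (the function name mirrors the Badger parameter; the implementation is an assumption):

```python
def simple_linear_decay(initial_value, end_value, current_episode, num_episodes):
    """Illustrative linear decay from initial_value to end_value over training."""
    progress = min(current_episode / num_episodes, 1.0)   # 0.0 -> 1.0
    return initial_value + (end_value - initial_value) * progress

# Half-way through training the noise scale has decayed to 0.5:
# simple_linear_decay(1.0, 0.0, 250, 500) -> 0.5
```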

Fork parameters to increase chances of learning

In this experiment, we will fork three parameters:

  • Actor/Alpha = (Constant) 0.001, 0.0005, 0.0001
  • Actor/Scale(Schedule = Simple-Linear-Decay)/Initial-Value = 1, 0.1, 0.01
  • Critic/Alpha = (Constant) 0.1, 0.05, 0.01

Run the experiment

After clicking Launch, the 27 experimental units (3 forks with 3 values each = 3*3*3 = 27) will be executed.
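
The number of experimental units is simply the Cartesian product of the forked values, as this short sketch illustrates:

```python
from itertools import product

actor_alphas = [0.001, 0.0005, 0.0001]
noise_initial_scales = [1.0, 0.1, 0.01]
critic_alphas = [0.1, 0.05, 0.01]

# Every combination of the three forked parameters is one experimental unit
experimental_units = list(product(actor_alphas, noise_initial_scales, critic_alphas))
print(len(experimental_units))   # 27
```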

Analyse the results

Once all the experimental units have finished, we can switch to the Reports tab. Select these variables:

  • position-deviation (Last Evaluation)
  • acceleration (Last Evaluation)
  • reward (Last Evaluation / Evaluation Averages)

Now, we will enable Limit Tracks to see only the tracks that performed best. In its parameter section, we will select the three tracks (experimental units) with the highest average reward in the evaluation episodes:

  • Maximum Number = 3
  • Track selection = Max / reward / Last evaluation

This query will create a report with these images:

  • Evaluation average rewards
  • Last episode - position deviation
  • Last episode - acceleration
  • Last episode - reward

The track in green is clearly the only one which learned how to reach the goal position and stay within the area where a positive reward is obtained.
