User Tutorial 3. Actor-Critic on the rain-car environment
- We strongly recommend doing Tutorial #2 before this one.
Using the same environment, we will set up an Actor-Critic agent and try to make it learn to control the car. The main difference between Actor-Critic agents and Q-Learning agents is that the output of an Actor-Critic agent is continuous, whereas Q-Learning and similar agents select actions from a discretization of the continuous action space. We will review the following concepts:
- Noise functions
- Initializing the policy learned by an actor using a controller
- Using two sets of weights (Freeze-Target-Function) for improved stability in the learning process
Actor-Critic agents consist of two elements (their interaction is sketched after this list):
- The actor: learns a policy pi(s) based on the feedback of the critic. In this tutorial, we will use CACLA
- The critic: estimates the value of the current policy V(s) as a function of the state. At every time step, the critic sends the actor a feedback value assessing the quality of the last action selected. In this tutorial, we will use TD(lambda), one of the most popular value-function learning algorithms
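To make the interaction between the two elements concrete, here is a minimal sketch of one actor-critic time step. The interfaces (env_step, pi, v and the two update callbacks) and the discount factor value are illustrative assumptions, not SimionZoo's API.

```python
def actor_critic_step(env_step, pi, v, update_actor, update_critic,
                      state, noise, gamma=0.9):
    """One actor-critic time step (illustrative interfaces, not SimionZoo's API).

    env_step(action) -> (next_state, reward, done)
    pi(state)        -> continuous action proposed by the actor
    v(state)         -> critic's current estimate of the state's value
    """
    # The actor proposes a continuous action; exploration noise is added on top.
    action = pi(state) + noise

    next_state, reward, done = env_step(action)

    # The TD error measures how much better (or worse) the outcome was
    # than the critic's current estimate V(s).
    target = reward if done else reward + gamma * v(next_state)
    td_error = target - v(state)

    # The critic improves its estimate of V(s) from the TD error ...
    update_critic(state, td_error)
    # ... and the same TD error is the feedback signal the actor learns from.
    update_actor(state, action, td_error)

    return next_state, done
```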
Run Badger and create a new experiment in the Editor tab. We will leave the default Log parameters and select Rain-car as the World.
We will use the same experiment parameters as in the previous tutorial:
Num-Episodes = 100
Eval-Freq = 10
Episode-Length = 60
We will start by ignoring the following parameters: Target-Function-Update-Freq, Gamma, Freeze-Target-Function and Use-Importance-Weights. We will revisit these parameters later.
We will then set the parameters of the State-Feature-Map similarly to what we did in the previous tutorial but, instead of using a Discrete-Grid, we will use a Gaussian-RBF-Grid with two input variables, position-deviation and acceleration, and Num-Features = 20. This means that each of the two learned functions (the policy pi(s) and the value function V(s)) will be represented with 20*20 = 400 features. The more features, the more accurately the functions can be approximated.
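As a rough illustration of what a Gaussian RBF grid computes, the sketch below evaluates a 20x20 grid of Gaussian radial basis functions over two state variables assumed to be normalized to [0, 1]. The grid layout, the width value and the normalization of the activations are assumptions made for illustration; SimionZoo builds these features internally from the State-Feature-Map settings.

```python
import numpy as np

def gaussian_rbf_features(s1, s2, num_features=20, width=0.05):
    """Return a num_features*num_features feature vector for a 2-D state.

    s1, s2 are assumed to be normalized to [0, 1] (e.g. position-deviation
    and acceleration after scaling). The centers form a regular grid; the
    width controls how much neighboring features overlap.
    """
    centers = np.linspace(0.0, 1.0, num_features)
    c1, c2 = np.meshgrid(centers, centers)      # 20x20 grid of centers
    dist2 = (s1 - c1) ** 2 + (s2 - c2) ** 2     # squared distance to each center
    phi = np.exp(-dist2 / (2.0 * width))        # Gaussian activation per center
    return (phi / phi.sum()).ravel()            # normalized 400-dimensional vector
```

Both pi(s) and V(s) are then learned as linear combinations of these 400 features, one weight per feature.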
Since both pi(s) and V(s) are functions of the state and not the action, we may disable Action-Feature-Map.
We will use Experience-Replay with the default parameters.
We will select Actor-Critic as the type of agent and disable Base-Controller. We will later revisit this decision and explain its goal.
Set Policy-Learner = CACLA with Alpha = (Constant) 0.1, Policy = Deterministic-Policy-Gaussian-Noise and Output-Action = acceleration. This will make the actor learn a deterministic policy using gain 0.1. The initial value of the policy will be 0 by default, which will not make the car move on its own. For the agent to learn and improve this initial policy, a noise signal (Exploration-Noise) is added to the output of the policy. There are three different noise signals in SimionZoo, and we suggest using an Ornstein-Uhlenbeck process. We will leave the default parameters for now and return to them after setting the rest of the experiment parameters.
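To give an idea of what CACLA does with the critic's feedback under these settings, here is a minimal sketch of its actor update assuming a linear deterministic policy over the state features. The parameterization and names are assumptions for illustration; the key point is that CACLA only moves the policy toward the executed (noisy) action when the TD error is positive.

```python
import numpy as np

def cacla_update(w_pi, phi, action_taken, td_error, alpha=0.1):
    """One CACLA actor update for a linear policy pi(s) = w_pi . phi(s).

    phi          -- feature vector of the state (e.g. the 400 RBF features)
    action_taken -- the noisy action that was actually executed
    td_error     -- feedback value received from the critic for this step
    alpha        -- the actor's learning gain (Alpha = 0.1 above)
    """
    pi_s = np.dot(w_pi, phi)
    # CACLA only updates when the critic judged the outcome better than expected:
    if td_error > 0.0:
        # Move the policy output toward the action that produced the improvement.
        w_pi = w_pi + alpha * (action_taken - pi_s) * phi
    return w_pi
```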
Select Critic = TD-Lambda without eligibility traces and Alpha = (Constant) 0.001. The critic's gain should be at least an order of magnitude lower than the policy-learning gain. In general, we will always set the initial value of the function to 0.0.
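For reference, here is a minimal sketch of a linear TD(lambda) critic update; with eligibility traces disabled (lambda = 0), as in this tutorial, it reduces to plain TD(0). The linear parameterization, the names and the discount factor value are illustrative assumptions.

```python
import numpy as np

def td_lambda_update(w_v, trace, phi, phi_next, reward, done,
                     alpha=0.001, gamma=0.9, lam=0.0):
    """One TD(lambda) update for a linear value function V(s) = w_v . phi(s).

    With lam = 0 (eligibility traces disabled) the trace is just the current
    feature vector and the update is plain TD(0).
    """
    v_s = np.dot(w_v, phi)
    v_next = 0.0 if done else np.dot(w_v, phi_next)
    td_error = reward + gamma * v_next - v_s   # how wrong the current estimate was

    trace = gamma * lam * trace + phi          # accumulate the eligibility trace
    w_v = w_v + alpha * td_error * trace       # move V(s) toward the TD target
    return w_v, trace, td_error
```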
Next, we will go back to the Exploration-Noise parameters and adjust them visually. To see the effect of the noise signal on the exploration of the agent, we will temporarily disable evaluation episodes (Episode/Eval-Freq = 0). Among the Exploration-Noise parameters, Mu is the mean value of the output, so we will usually set it to zero so that the noise signal oscillates around zero. Sigma determines the volatility of the signal (how far it strays from the mean value) and Theta the rate at which the noise tends to return to the mean. Let's start by setting Sigma = 0.1 and Theta = 1.0 without scaling the output (Scale = (Constant) 1).
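To build some intuition for how Mu, Sigma and Theta shape the exploration signal, here is a minimal sketch of a discrete-time Ornstein-Uhlenbeck process; the time step and other implementation details are assumptions and need not match SimionZoo's internals.

```python
import numpy as np

def ou_noise(mu=0.0, sigma=0.1, theta=1.0, dt=0.01, steps=1000):
    """Generate a sample path of an Ornstein-Uhlenbeck noise signal.

    mu    -- mean the signal is pulled back to (0 here, so it oscillates around 0)
    sigma -- volatility: how far the signal tends to stray from mu
    theta -- mean-reversion rate: how quickly it is pulled back toward mu
    """
    x = np.zeros(steps)
    for t in range(1, steps):
        dx = theta * (mu - x[t - 1]) * dt + sigma * np.sqrt(dt) * np.random.randn()
        x[t] = x[t - 1] + dx
    return x
```

Larger Sigma values produce wider excursions, while larger Theta values pull the signal back toward Mu more quickly.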
Now, run the experiment locally to see the behavior of the agent.