This repository contains the code to train optimal policies from random trajectories using offline reinforcement learning.
For a brief overview see the offline-rl-summary document.
Install GNU make: https://www.gnu.org/software/make/.
Make sure that the default Python interpreter is Python >= 3.10.

Set up the environment with `make setup`. Inspect the available CLI commands with `make help`.
The algorithm used in this repository is Implicit Q-Learning (IQL) from the paper *Offline Reinforcement Learning with Implicit Q-Learning*, which jointly trains a value function and a Q-function using the following objectives:

$$L_V(\psi) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\left[L_2^\tau\big(Q_\theta(s,a) - V_\psi(s)\big)\right]$$

where $L_2^\tau(u) = |\tau - \mathbb{1}(u < 0)|\,u^2$ is the expectile regression loss with expectile $\tau \in (0, 1)$, and

$$L_Q(\theta) = \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\left[\big(r(s,a) + \gamma V_\psi(s') - Q_\theta(s,a)\big)^2\right]$$

where $\gamma$ is the discount factor and $\mathcal{D}$ is the offline dataset. The policy is then extracted with advantage-weighted regression, maximizing $\mathbb{E}_{(s,a) \sim \mathcal{D}}\left[\exp\big(\beta\,(Q_\theta(s,a) - V_\psi(s))\big)\log \pi_\phi(a \mid s)\right]$.
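For illustration, a minimal PyTorch sketch of these objectives follows. The network classes, batch layout, and the `policy.log_prob` interface are assumptions, not this repository's actual API.

```python
import torch
import torch.nn.functional as F

def expectile_loss(diff: torch.Tensor, tau: float) -> torch.Tensor:
    # L_2^tau(u) = |tau - 1(u < 0)| * u^2
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def iql_losses(batch, q_net, target_q_net, v_net, policy, tau, beta, gamma):
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset

    # Value loss: regress V towards an expectile of the (frozen) target Q-values.
    with torch.no_grad():
        q_target = target_q_net(s, a)
    v = v_net(s)
    v_loss = expectile_loss(q_target - v, tau)

    # Q loss: standard TD target built from the value network.
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * v_net(s_next)
    q_loss = F.mse_loss(q_net(s, a), td_target)

    # Policy loss: advantage-weighted regression on dataset actions.
    with torch.no_grad():
        advantage = q_target - v_net(s)
        weight = torch.clamp(torch.exp(beta * advantage), max=100.0)
    log_prob = policy.log_prob(s, a)  # assumed interface of the policy module
    pi_loss = -(weight * log_prob).mean()

    return v_loss, q_loss, pi_loss
```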
The data is collected from the PointMaze environment; the open variant is an arena with only perimeter walls. The agent uses either uniform random sampling or a PD controller (fetched from here) to follow a path of waypoints generated with QIteration until it reaches the goal. The task is continuing, which means that when the agent reaches the goal, the environment generates a new random goal without resetting the agent's location. The reward function is sparse, returning 1 when the goal is reached and 0 otherwise. To add variance to the collected paths, random noise is added to the actions taken by the agent.
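A minimal sketch of the uniform-random collection loop is shown below. The env id follows Gymnasium-Robotics; the dataset layout and output file name are illustrative, not this repository's actual format.

```python
import gymnasium as gym
import gymnasium_robotics  # noqa: F401 -- importing registers the PointMaze environments
import numpy as np

# Uniform-random data collection in the continuing PointMaze task.
env = gym.make("PointMaze_Open-v3", continuing_task=True, max_episode_steps=500)

dataset = {"observations": [], "actions": [], "rewards": [], "terminations": []}
obs, info = env.reset(seed=0)

for _ in range(100_000):
    action = env.action_space.sample()  # uniform random action
    next_obs, reward, terminated, truncated, info = env.step(action)

    dataset["observations"].append(obs["observation"])
    dataset["actions"].append(action)
    dataset["rewards"].append(reward)
    dataset["terminations"].append(terminated)

    obs = next_obs
    if terminated or truncated:
        obs, info = env.reset()

np.savez("pointmaze_open_random_100k.npz", **{k: np.asarray(v) for k, v in dataset.items()})
```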
Env | 100k | 1M | PD Controller |
---|---|---|---|
Open | ![]() | ![]() | ![]() |
U | ![]() | ![]() | ![]() |
Medium | ![]() | ![]() | ![]() |
Large | ![]() | ![]() | ![]() |
The hyperparameters for the following results were obtained by running a hyperparameter sweep of the IQL algorithm on PointMaze_Open-v3 (episode length = 500) with the 100k uniform random dataset. The results are averaged over 10 evaluation episodes with consistent initialization.
Algorithm | Dataset Size | Dataset Sampling | Eval Reward |
---|---|---|---|
Uniform Random | - | - | 0.1 |
IQL | 100k | Uniform Random | 4.7 |
IQL | 1M | Uniform Random | 8.0 |
IQL | 10M | Uniform Random | 8.1 |
IQL | 1M | PD Controller | 8.6 |
We can see that the uniform random baseline achieves a cumulative reward of 0.1, while the IQL algorithm achieves a maximum reward of 8.1 with 10M steps, 8.0 with 1M steps, and 4.7 with 100k steps, all sampled with the uniform random baseline. Thus the algorithm learns an approximately optimal policy from (very) suboptimal trajectories. We can also see that the PD controller dataset achieves a maximum reward of 8.6 with 1M steps, which is better than uniform random sampling.
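A hedged sketch of how such an evaluation could look follows; the `policy.act` interface and the seeding scheme are assumptions and may differ from the repository's CLI.

```python
import numpy as np

# Cumulative reward averaged over 10 episodes with a fixed seed per episode
# index ("consistent initialization").
def evaluate(env, policy, n_episodes=10, max_steps=500):
    returns = []
    for episode in range(n_episodes):
        obs, info = env.reset(seed=episode)  # same seeds across runs
        total = 0.0
        for _ in range(max_steps):
            action = policy.act(obs)
            obs, reward, terminated, truncated, info = env.step(action)
            total += reward
            if terminated or truncated:
                break
        returns.append(total)
    return float(np.mean(returns))
```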
Uniform Random | IQL 100k | IQL 1M |
---|---|---|
![]() | ![]() | ![]() |
Next we reused the same hyperparameters but increased the difficulty of the environment by changing the maze to a U-maze, effectively introducing some non-boundary hard constraints.
Algorithm | Dataset Size | Dataset Sampling | Eval Reward |
---|---|---|---|
Uniform Random | - | - | 0.2 |
IQL | 100k | Uniform Random | 4.2 |
IQL | 1M | Uniform Random | 6.3 |
IQL | 1M | PD Controller | 8.0 |
Uniform Random | IQL 100k | IQL 1M |
---|---|---|
![]() | ![]() | ![]() |
Again we reused the same hyperparameters but increased the difficulty of the environment by changing the maze to a medium size.
Algorithm | Dataset Size | Dataset Sampling | Eval Reward |
---|---|---|---|
Uniform Random | - | - | 0.1 |
IQL | 100k | Uniform Random | 0.1 |
IQL | 1M | Uniform Random | 2.3 |
IQL | 10M | Uniform Random | 2.4 |
IQL | 1M | PD Controller | 2.7 |
We can see that the maximum reward dropped significantly, to 2.4 with 10M steps, 2.3 with 1M, and 0.1 with 100k steps. Still, with the larger datasets the agent is able to find a path to the goal.
When closely inspecting the renderings we can see that the agent sometimes struggles with the walls and gets stuck in local minima. This is likely because the agent struggles with the long-term planning required by the sparse reward function.
Uniform Random | IQL 100k | IQL 1M |
---|---|---|
![]() | ![]() | ![]() |
To push the limits of the algorithm we increased the difficulty of the environment by changing the maze to a large size. We reused the same hyperparameters as before.
Algorithm | Dataset Size | Dataset Sampling | Eval Reward |
---|---|---|---|
Uniform Random | - | - | 0.0 |
IQL | 100k | Uniform Random | 0.2 |
IQL | 1M | Uniform Random | 0.4 |
IQL | 10M | Uniform Random | 0.4 |
IQL | 1M | PD Controller | 0.1 |
The results are rather poor: the best reward of 0.4 indicates that the agent only finds a path to the goal in about 4 of the 10 evaluation episodes.
Now the issues with the agent are very obvious in all cases: the agent is unable to find a path to the goal most of the time and gets stuck in local minima.
Uniform Random | IQL 100k | IQL 1M |
---|---|---|
![]() | ![]() | ![]() |
As we saw, on some occasions the agent gets stuck in local minima due to the discontinuities in the environment. In the figure below, the agent's policy is plotted over the state space for a fixed goal. The agent can reach the target destination from almost every state in the maze except the upper right corner, where it would have needed to go around the corner.
This is most likely because the value function is approximated with neural networks, which are function approximators and therefore interpolate. The network interpolates the value from the other side of the wall, even though the state space is discontinuous in that region. This leads to a high advantage for suboptimal and invalid actions. Because the dataset is collected with uniform random sampling, such suboptimal actions are actually present in the dataset, leading the IQL policy objective to weight the suboptimal action higher than the optimal one.
This issue might also be related to the problem of reward attribution: in simpler environments, where the trajectories are generally shorter, obstacle avoidance does work, even though the dataset was also generated from random trajectories.
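A minimal sketch of how such a policy visualization could be produced is given below. The observation layout, maze bounds, and the `policy` interface are assumptions for illustration, not this repository's actual API.

```python
import numpy as np
import matplotlib.pyplot as plt
import torch

# Evaluate the trained policy on a grid of (x, y) positions for a fixed goal
# and plot the resulting actions as a quiver plot.
def plot_policy_over_states(policy, goal, xlim=(-2.0, 2.0), ylim=(-2.0, 2.0), n=25):
    xs = np.linspace(*xlim, n)
    ys = np.linspace(*ylim, n)
    X, Y = np.meshgrid(xs, ys)

    U = np.zeros_like(X)
    V = np.zeros_like(Y)
    for i in range(n):
        for j in range(n):
            # Assumed observation layout: position, zero velocity, fixed goal.
            obs = np.array([X[i, j], Y[i, j], 0.0, 0.0, goal[0], goal[1]], dtype=np.float32)
            with torch.no_grad():
                action = policy(torch.from_numpy(obs)).numpy()
            U[i, j], V[i, j] = action[0], action[1]

    plt.quiver(X, Y, U, V)
    plt.scatter([goal[0]], [goal[1]], marker="*", s=200)
    plt.title("Policy actions over the state space (fixed goal)")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()
```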