# Offline RL from random trajectories

This repository contains the code to train optimal policies from random trajectories using offline reinforcement learning.

For a brief overview see the offline-rl-summary document.

## Setup

Install GNU make: https://www.gnu.org/software/make/.

Make sure that the default Python interpreter is Python >= 3.10.

Set up the environment with

`make setup`

Inspect the CLI further with

`make help`

## Algorithm

The algorithm used in this repository is Implicit Q-Learning (IQL) from the paper Offline Reinforcement Learning with Implicit Q-Learning, which jointly trains a value function and a Q-function using the following objectives

$$L_V(\psi) = \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ L_\tau^2\left(Q_{\hat{\theta}}(s, a) - V_\psi(s)\right) \right]$$

$$L_Q(\theta) = \mathbb{E}_{(s,a,s') \sim \mathcal{D}} \left[ \left( r(s, a) + \gamma V_\psi(s') - Q_\theta(s, a) \right)^2 \right]$$
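Here, following the IQL paper, $L_\tau^2(u) = |\tau - \mathbb{1}(u < 0)|\,u^2$ is the expectile loss and $Q_{\hat{\theta}}$ denotes a target Q-network.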

$V_\psi(s)$ thereby estimates the $\tau$-expectile of the Q-values at a state, which for $\tau$ close to 1 approaches the maximum Q-value over the actions that are in the support of the dataset distribution. At the same time, IQL trains a policy using an advantage-weighted behaviour cloning objective

$$L_\pi(\phi) = \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ e^{\beta (Q_{\hat{\theta}}(s, a) - V_\psi(s))} \log \pi_\phi(a \mid s) \right]$$

This objective is maximized with respect to $\phi$, and $\beta=0$ recovers the pure behaviour cloning policy. The method is implemented using TorchRL and Minari.
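As a minimal, self-contained sketch of these three objectives in plain PyTorch (independent of the TorchRL implementation actually used in this repo; `q_net`, `q_target`, `v_net`, and `policy.log_prob` are placeholder modules/interfaces, and the `tau`/`beta` values are illustrative):

```python
import torch

def expectile_loss(diff, tau=0.7):
    # L_tau^2(u) = |tau - 1(u < 0)| * u^2
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def iql_losses(q_net, q_target, v_net, policy, batch, tau=0.7, beta=3.0, gamma=0.99):
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset

    # Value loss: expectile regression of V towards the target Q-values.
    q_sa = q_target(s, a).detach()
    v_s = v_net(s)
    value_loss = expectile_loss(q_sa - v_s, tau)

    # Q loss: one-step TD error using V(s') as the bootstrap target.
    target = r + gamma * (1.0 - done) * v_net(s_next).detach()
    q_loss = ((q_net(s, a) - target) ** 2).mean()

    # Policy loss: advantage-weighted behaviour cloning,
    # with the exponential weights clipped for numerical stability.
    adv = q_sa - v_s.detach()
    weights = torch.clamp(torch.exp(beta * adv), max=100.0)
    policy_loss = -(weights * policy.log_prob(s, a)).mean()  # maximize the weighted log-likelihood

    return value_loss, q_loss, policy_loss
```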

## Datasets

The data is collected from the PointMaze environments; the Open variant is an open arena with only perimeter walls. The agent either samples actions uniformly at random or uses a PD controller (fetched from here) to follow a path of waypoints generated with QIteration until it reaches the goal. The task is continuing, which means that when the agent reaches the goal, the environment samples a new random goal without resetting the agent's position. The reward function is sparse, returning 1 when the goal is reached and 0 otherwise. To add variance to the collected paths, random noise is added to the actions taken by the agent.
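For illustration, uniform-random data collection on the Open maze could look roughly like the following. This is a sketch using the gymnasium-robotics environment IDs and keyword arguments; the repository's actual collection pipeline goes through Minari and may differ:

```python
import gymnasium as gym
import gymnasium_robotics  # registers the PointMaze_* envs (recent gymnasium: gym.register_envs(gymnasium_robotics))

env = gym.make(
    "PointMaze_Open-v3",
    continuing_task=True,   # a new goal is sampled when the current one is reached
    reward_type="sparse",   # reward 1 on goal, 0 otherwise
    max_episode_steps=500,
)

transitions = []
obs, info = env.reset(seed=0)
for _ in range(100_000):                 # the "100k" dataset
    action = env.action_space.sample()   # uniform random policy
    next_obs, reward, terminated, truncated, info = env.step(action)
    transitions.append((obs, action, reward, next_obs, terminated))
    obs = next_obs
    if terminated or truncated:
        obs, info = env.reset()
```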

### Renderings

| Env    | 100k | 1M   | PD Controller |
|--------|------|------|---------------|
| Open   | Demo | Demo | Demo          |
| U      | Demo | Demo | Demo          |
| Medium | Demo | Demo | Demo          |
| Large  | Demo | Demo | Demo          |

## Results

The hyperparameters for the following results were obtained by running a hyperparameter sweep of the IQL algorithm on PointMaze_Open-v3 (episode=500) with the 100k uniform random dataset. The results are averaged over 10 episodes with consistent initialization.

### PointMaze_Open-v3 (episode=500)

| Algorithm      | Dataset Size | Dataset Sampling | Eval Reward |
|----------------|--------------|------------------|-------------|
| Uniform Random | -            | -                | 0.1         |
| IQL            | 100k         | Uniform Random   | 4.7         |
| IQL            | 1M           | Uniform Random   | 8.0         |
| IQL            | 10M          | Uniform Random   | 8.1         |
| IQL            | 1M           | PD Controller    | 8.6         |

We can see that the uniform random baseline achieves a cumulative reward of 0.1, while IQL achieves a max reward of 8.1 with 10M steps, 8.0 with 1M, and 4.7 with 100k, where the datasets were sampled using the uniform random baseline. Thus the algorithm learns an approximately optimal policy from (very) suboptimal trajectories. We can also see that the PD controller dataset achieves a max reward of 8.6 with 1M steps, which is better than uniform random sampling.
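For context, "Eval Reward" is the cumulative sparse reward per evaluation episode, averaged over 10 episodes with fixed seeds. A hedged sketch of such an evaluation loop (the `policy` callable and env handling are placeholders, not the repo's exact code):

```python
import numpy as np

def evaluate(env, policy, n_episodes=10, seed=0):
    """Average cumulative (sparse) reward over n_episodes with fixed seeds."""
    returns = []
    for ep in range(n_episodes):
        obs, info = env.reset(seed=seed + ep)  # consistent initialization
        done, ep_return = False, 0.0
        while not done:
            action = policy(obs)               # placeholder: deterministic policy mean
            obs, reward, terminated, truncated, info = env.step(action)
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
    return float(np.mean(returns))
```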

#### Renderings

| Uniform Random | IQL 100k | IQL 1M |
|----------------|----------|--------|
| Demo           | Demo     | Demo   |

### PointMaze_UMaze-v3 (episode=500)

Now we reused the same hyperparameters but increased the difficulty of the environment by changing the maze to a U-maze, which effectively introduces non-boundary hard constraints (interior walls).

| Algorithm      | Dataset Size | Dataset Sampling | Eval Reward |
|----------------|--------------|------------------|-------------|
| Uniform Random | -            | -                | 0.2         |
| IQL            | 100k         | Uniform Random   | 4.2         |
| IQL            | 1M           | Uniform Random   | 6.3         |
| IQL            | 1M           | PD Controller    | 8.0         |

#### Renderings

| Uniform Random | IQL 100k | IQL 1M |
|----------------|----------|--------|
| Demo           | Demo     | Demo   |

### PointMaze_Medium-v3 (episode=500)

Again we reused the same hyperparameters but increased the difficulty of the environment by switching to the medium-sized maze.

| Algorithm      | Dataset Size | Dataset Sampling | Eval Reward |
|----------------|--------------|------------------|-------------|
| Uniform Random | -            | -                | 0.1         |
| IQL            | 100k         | Uniform Random   | 0.1         |
| IQL            | 1M           | Uniform Random   | 2.3         |
| IQL            | 10M          | Uniform Random   | 2.4         |
| IQL            | 1M           | PD Controller    | 2.7         |

We can see that the max reward dropped significantly to 2.4 with 10M steps, 2.3 with 1M, and 0.1 with 100k steps. Still, the agent is able to find a path to the goal.

#### Renderings

When closely inspecting the renderings, we can see that the agent sometimes struggles with the walls and gets stuck in local minima. This is likely because the agent struggles with the long-term planning that the sparse reward function requires.

| Uniform Random | IQL 100k | IQL 1M |
|----------------|----------|--------|
| Demo           | Demo     | Demo   |

### PointMaze_Large-v3 (episode=500)

To push the limits of the algorithm we increased the difficulty of the environment by changing the maze to a large size. We reused the same hyperparameters as before.

| Algorithm      | Dataset Size | Dataset Sampling | Eval Reward |
|----------------|--------------|------------------|-------------|
| Uniform Random | -            | -                | 0.0         |
| IQL            | 100k         | Uniform Random   | 0.2         |
| IQL            | 1M           | Uniform Random   | 0.4         |
| IQL            | 10M          | Uniform Random   | 0.4         |
| IQL            | 1M           | PD Controller    | 0.1         |

The results are rather poor: eval rewards below 1 indicate that the agent is only able to find a path to the goal in at most 4 out of 10 cases.

#### Renderings

Now the issues with the agent are very obvious: in most cases it is not able to find a path to the goal and gets stuck in local minima.

| Uniform Random | IQL 100k | IQL 1M |
|----------------|----------|--------|
| Demo           | Demo     | Demo   |

## Issues with discontinuities

As we could see, on some occasions the agent gets stuck in local minima due to the discontinuities in the environment. In the figure below, the agent's policy is plotted over the state space for a fixed goal. The agent can reach the target destination from almost every state in the maze except the upper right corner, from where it would have needed to go around the corner.

*(Figure: the learned policy plotted over the state space for a fixed goal.)*
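Such a policy-field plot can be produced roughly as follows. This is a sketch: the `policy` callable and the assumed observation layout (position, velocity, desired goal) are placeholders for the trained model's actual interface.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_policy_field(policy, goal, x_range=(-2.0, 2.0), y_range=(-2.0, 2.0), n=25):
    """Plot the policy's action as an arrow at each (x, y) position, with velocity = 0."""
    xs = np.linspace(*x_range, n)
    ys = np.linspace(*y_range, n)
    X, Y = np.meshgrid(xs, ys)
    U, V = np.zeros_like(X), np.zeros_like(Y)
    for i in range(n):
        for j in range(n):
            # Assumed observation layout: [x, y, vx, vy, goal_x, goal_y].
            obs = np.array([X[i, j], Y[i, j], 0.0, 0.0, goal[0], goal[1]], dtype=np.float32)
            action = policy(obs)  # placeholder: returns a 2-D force/acceleration
            U[i, j], V[i, j] = action[0], action[1]
    plt.quiver(X, Y, U, V)
    plt.scatter([goal[0]], [goal[1]], marker="*", s=200)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.title("Policy actions over the state space for a fixed goal")
    plt.show()
```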

This is most likely due to the fact that the value function is approximated with neural networks, which are smooth function approximators and therefore interpolate. The network thus interpolates the value from the other side of the wall, even though the state space is discontinuous there. This leads to a high advantage for suboptimal and invalid actions. Because the dataset is collected with uniform random sampling, such suboptimal actions actually appear in the dataset, leading the IQL policy objective to weight them more heavily than the optimal action.


This issue might also be related to the problem of reward attribution (credit assignment): in simpler environments, where the trajectories are generally shorter, obstacle avoidance does work, even though the dataset there was also generated from random trajectories.
