
Go-Explore, exploration in Montezuma's Revenge

Filippo Ansalone, Reinforcement Learning 2023

Reimplementation of Go-Explore (https://arxiv.org/abs/1901.10995).
Environment: Montezuma's Revenge (https://www.gymlibrary.dev/environments/atari/montezuma_revenge/)
Task: first build trajectories by randomly exploring a deterministic environment, then learn a policy by imitating the best trajectory.
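The random exploration works because the Atari emulator state can be saved and restored exactly. A minimal check of that primitive (the env id, the old-style Gym step API, and the clone_state/restore_state calls on the unwrapped Atari env are assumptions about the Gym version in use):

import gym

env = gym.make("MontezumaRevengeDeterministic-v4")  # assumed env id
env.reset()
snapshot = env.unwrapped.clone_state()        # save the ALE emulator state
obs1, _, _, _ = env.step(0)                   # take a NOOP
env.unwrapped.restore_state(snapshot)         # rewind to the snapshot
obs2, _, _, _ = env.step(0)
assert (obs1 == obs2).all()                   # same state + action -> same result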

Installation

To install the needed packages:

pip install -r requirements.txt

Phase 1

To run phase 1 of the algorithm, execute:

python main.py --phase1
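For intuition, phase 1 follows the paper's go-then-explore loop: pick a cell from the archive, restore its emulator state, explore randomly, and archive any new cells reached. A minimal sketch, not this repo's exact code (the env id, the cell downsampling parameters, and the archive layout are assumptions):

import random
import cv2
import gym

env = gym.make("MontezumaRevengeDeterministic-v4")   # assumed env id

def cell_of(obs):
    # Downsample the frame to a coarse "cell" key, as in the paper
    # (11x8 pixels, 8 intensity levels; the exact parameters are assumptions).
    gray = cv2.cvtColor(obs, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (11, 8))
    return (small // 32).tobytes()

obs = env.reset()
archive = {cell_of(obs): {"state": env.unwrapped.clone_state(), "traj": []}}

for _ in range(10000):
    key = random.choice(list(archive))               # uniform pick (--sameprob variant)
    entry = archive[key]
    env.unwrapped.restore_state(entry["state"])      # "go": return to the stored cell
    traj = list(entry["traj"])
    for _ in range(100):                             # "explore": random actions
        action = env.action_space.sample()
        obs, reward, done, _ = env.step(action)
        traj.append(action)
        if done:
            break
        k = cell_of(obs)
        if k not in archive:                         # keep the first trajectory reaching a new cell
            archive[k] = {"state": env.unwrapped.clone_state(), "traj": list(traj)}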

By default, the algorithm samples cells and scores them as described in the paper. To sample cells with uniform probability instead, specify:

python main.py --phase1 --sameprob
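For reference, the default (non --sameprob) selection weights cells by how rarely they have been visited and chosen. A hedged sketch, where the coefficients follow the paper's count subscore and the bookkeeping fields (visits, chosen, chosen_since_new) are assumptions about this repo:

import random

def subscore(count, w=0.3, p=0.5, eps1=0.001, eps2=0.00001):
    # The paper's count subscore: rarely-used cells get higher weight.
    return w * (1.0 / (count + eps1)) ** p + eps2

def sample_cell(archive):
    # visits / chosen / chosen_since_new are assumed per-cell counters.
    keys = list(archive)
    weights = [subscore(archive[k]["visits"])
               + subscore(archive[k]["chosen"])
               + subscore(archive[k]["chosen_since_new"]) for k in keys]
    return random.choices(keys, weights=weights, k=1)[0]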

This process will produce the best trajectory found for each checkpoint (a point where a new reward is collected), saved as files named best_trajectory_<rew>_<dist>.npy.
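Each saved trajectory is a plain NumPy array, so it can be inspected directly; a quick check (the filename below is hypothetical):

import numpy as np

traj = np.load("best_trajectory_400_120.npy")  # hypothetical filename
print(traj.shape, traj.dtype)                  # e.g. one action per timestep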

Test phase 1

To test a trajectory produced by phase 1:

python main.py --test1 --trajectory <string>

You can optionally render the environment:

python main.py --test1 --trajectory <string> --render
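Conceptually, the test replays the stored actions in the same deterministic environment; a minimal sketch (the env id, the old-style Gym step API, and the filename are assumptions):

import numpy as np
import gym

env = gym.make("MontezumaRevengeDeterministic-v4")  # assumed env id
actions = np.load("best_trajectory_400_120.npy")    # hypothetical filename

obs = env.reset()
total = 0.0
for action in actions:
    obs, reward, done, _ = env.step(int(action))
    total += reward
    env.render()                                    # corresponds to the --render flag
    if done:
        break
print("trajectory return:", total)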

Phase 2

To run phase 2:

python main.py --phase2 --trajectory <string>

You can optionally add the following arguments:

python main.py --phase2 --startpoint <int> --maxtimesteps <int> --patience <int>

trajectory: filename of the trajectory to imitate
startpoint: the point in the trajectory from which robustification starts; if not specified, it defaults to the length of the trajectory minus 10
maxtimesteps: the maximum number of timesteps for each robustification iteration
patience: if set to -1, the behavior remains unchanged; if set to a positive integer, the maximum number of iterations without any improvement at a given point of the trajectory. When this limit is reached, the algorithm is repeated from the previous point of the trajectory (see the sketch below).

The process will terminate once the policy performs well from the initial state of the environment. As the paper notes, phase 2 is not guaranteed to converge to a solution, let alone an optimal one, for a given trajectory.
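For intuition, here is a hedged sketch of the backward robustification loop described above; train_policy is a hypothetical stand-in for the actual training step, and the success test and bookkeeping are assumptions, not this repo's exact logic:

def robustify(env, train_policy, trajectory, startpoint, max_timesteps, patience):
    point = startpoint                             # defaults to len(trajectory) - 10
    stalled = 0
    while point >= 0:                              # point 0 = initial state of the env
        # Replay the first `point` demo actions to place the agent on the trajectory.
        obs = env.reset()
        for a in trajectory[:point]:
            obs, _, _, _ = env.step(int(a))
        if train_policy(obs, max_timesteps):       # True once the policy succeeds from here
            point -= 1                             # success: move the start point backwards
            stalled = 0
        else:
            stalled += 1
            if patience > 0 and stalled >= patience:
                point += 1                         # no progress: retry from the previous point
                stalled = 0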

Test phase 2

To test a policy produced by phase 2:

python main.py --test2

If you renamed the produced policy:

python main.py --test2 --policy <string>

To start the simulation from a certain point in a trajectory, specify:

python main.py --test2 --trajectory <string> --startpoint <int>

You can optionally add the same --render flag as for test1.
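Conceptually, testing from a start point replays the demonstration up to that point and then hands control to the policy; a minimal sketch (the env id, the filename, the start point value, and the policy stand-in are assumptions):

import numpy as np
import gym

env = gym.make("MontezumaRevengeDeterministic-v4")  # assumed env id
actions = np.load("best_trajectory_400_120.npy")    # hypothetical filename
startpoint = 100                                    # hypothetical value

policy = lambda obs: env.action_space.sample()      # stand-in for the trained policy

obs = env.reset()
for a in actions[:startpoint]:                      # walk along the demonstration
    obs, _, _, _ = env.step(int(a))

done = False
while not done:                                     # hand control to the policy
    obs, reward, done, _ = env.step(policy(obs))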
