# PPOC

PPOC is a reinforcement learning framework that combines the strengths of Proximal Policy Optimization (PPO) and the Option-Critic (OC) architecture. This implementation is inspired by the paper "Learnings Options End-to-End for Continuous Action Tasks" by Klissarov et al.; however, the implementation itself follows the paper "Accelerating Task Generalisation Using Multi-Level Hierarchical Options" by Cannon and Simsek.
## Features

- **Proximal Policy Optimization**: Utilizes PPO for stable and efficient policy updates.
- **Option-Critic Architecture**: Integrates OC to enable dynamic option selection and termination.
- **Stability**: OC architectures typically suffer from option collapse. In this implementation PPO is applied to both levels of the hierarchy (see the sketch below); it has been tested with 25 options and no option collapse was observed.
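To make the stability point concrete, here is a minimal, illustrative sketch (not the repository's actual code) of applying the PPO clipped surrogate objective at both levels of the hierarchy, i.e. to the policy over options as well as to the intra-option action policy. All function and tensor names below are assumptions for illustration only.

```python
# Illustrative sketch: the PPO clipped surrogate loss applied at both
# hierarchy levels (option policy and action policy). Names are hypothetical
# and do not come from methods/OC_PPO.py.
import torch


def ppo_clip_loss(logratio: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_coef: float = 0.1) -> torch.Tensor:
    """Standard PPO clipped surrogate loss for one policy level."""
    ratio = logratio.exp()
    unclipped = -advantages * ratio
    clipped = -advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    return torch.max(unclipped, clipped).mean()


def hierarchical_policy_loss(action_logratio: torch.Tensor,
                             option_logratio: torch.Tensor,
                             advantages: torch.Tensor,
                             clip_coef: float = 0.1) -> torch.Tensor:
    # The same clipped objective constrains both the option policy and the
    # action policy, which is the mechanism the Stability note above relies on.
    action_loss = ppo_clip_loss(action_logratio, advantages, clip_coef)
    option_loss = ppo_clip_loss(option_logratio, advantages, clip_coef)
    return action_loss + option_loss


if __name__ == "__main__":
    n = 512
    loss = hierarchical_policy_loss(torch.randn(n) * 0.05,
                                    torch.randn(n) * 0.05,
                                    torch.randn(n))
    print(loss.item())
```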
## Installation

To get started with PPOC, clone the repository and install the required dependencies:

```bash
git clone https://github.com/x4nnon/PPOC.git
cd PPOC
pip install -r requirements.txt
```
## Usage
The main script for running PPOC is `methods/OC_PPO.py`. It can be executed with various arguments defined in the `Args` class. Below is an example of how to run the script:

```bash
python3 methods/OC_PPO.py --env_id="procgen:procgen-fruitbot-v0" --total_timesteps=20000000 --num_envs=32
```
### Arguments

- `exp_name`: The name of the experiment, used for logging and tracking purposes. Default: `os.path.basename(__file__)[: -len(".py")]`
- `seed`: An integer seed for random number generation, to ensure reproducibility of results. Default: `0`
- `torch_deterministic`: A boolean flag to enable deterministic behavior in PyTorch operations. Default: `False`
- `cuda`: A boolean flag to enable CUDA for GPU acceleration. Set to `True` to use the GPU if available. Default: `True`
- `track`: A boolean flag to enable tracking of experiments using tools like Weights & Biases. Default: `True`
- `wandb_project_name`: The name of the Weights & Biases project for logging experiment data. Default: `"fracos_StarPilot_A_QuickTest"`
- `wandb_entity`: The Weights & Biases entity (user or team) under which the project is logged. Default: `"tpcannon"`
- `env_id`: The identifier for the environment to be used, e.g., "procgen-bigfish". Default: `"procgen-bigfish"`
- `total_timesteps`: The total number of timesteps to run training for. Default: `100000`
- `learning_rate`: The learning rate for the optimizer. Default: `5e-4`
- `num_envs`: The number of parallel environments to run. Default: `8`
- `num_steps`: The number of steps to run in each environment per update. Default: `256`
- `anneal_lr`: A boolean flag to enable learning rate annealing over time. Default: `True`
- `gamma`: The discount factor for future rewards. Default: `0.999`
- `num_minibatches`: The number of minibatches to split the data into for each update. Default: `4`
- `update_epochs`: The number of epochs over which to update the policy and value networks. Default: `2`
- `report_epoch`: The number of steps after which to report evaluation metrics. Default: `81920`
- `anneal_ent`: A boolean flag to enable annealing of the entropy coefficient. Default: `True`
- `ent_coef_action`: The coefficient for the entropy term in the action policy loss. Default: `0.01`
- `ent_coef_option`: The coefficient for the entropy term in the option policy loss. Default: `0.01`
- `clip_coef`: The coefficient for clipping the policy gradient. Default: `0.1`
- `clip_vloss`: A boolean flag to enable clipping of the value loss. Default: `False`
- `vf_coef`: The coefficient for the value function loss. Default: `0.5`
- `norm_adv`: A boolean flag to normalize advantages. Always set to `True`. Default: `True`
- `max_grad_norm`: The maximum norm for gradient clipping. Default: `0.1`
- `batch_size`: The size of the batch for updates, calculated as `num_envs * num_steps` (see the sketch after this list). Default: `0` (calculated at runtime)
- `minibatch_size`: The size of each minibatch, calculated as `batch_size // num_minibatches`. Default: `0` (calculated at runtime)
- `num_iterations`: The number of iterations to run, calculated as `total_timesteps // batch_size`. Default: `0` (calculated at runtime)
- `max_ep_length`: The maximum length of an episode. Default: `990`
- `debug`: A boolean flag to enable debug mode. Default: `False`
- `proc_start`: The starting level for procedurally generated environments. Default: `1`
- `start_ood_level`: The starting level for out-of-distribution evaluation. Default: `420`
- `proc_num_levels`: The number of levels for procedurally generated environments. Default: `32`
- `proc_sequential`: A boolean flag to enable sequential levels in procedurally generated environments. Default: `False`
- `max_eval_ep_len`: The maximum length of an evaluation episode. Default: `1001`
- `easy`: A boolean flag to enable easy mode for environments. Default: `1`
- `eval_repeats`: The number of times to repeat evaluations. Default: `1`
- `use_monochrome`: A boolean flag to use monochrome assets in environments. Default: `0`
- `eval_interval`: The interval at which to perform evaluations. Default: `100000`
- `eval_specific_envs`: The number of specific environments to evaluate. Default: `32`
- `eval_batch_size`: The batch size for evaluations. Default: `32`
- `gae_lambda`: The lambda parameter for Generalized Advantage Estimation. Default: `0.95`
- `warmup`: A boolean flag to enable warmup mode. If this is off, you will need a trained model. Default: `1`
- `num_options`: The number of options available to the agent. Default: `25`
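The `batch_size`, `minibatch_size`, and `num_iterations` fields are filled in at runtime from the other arguments. The snippet below is an illustrative, trimmed-down sketch (not the actual `Args` class from `methods/OC_PPO.py`, which contains all of the fields listed above) showing how those values follow from the defaults.

```python
# Illustrative sketch only: a trimmed Args-style dataclass showing how the
# runtime-computed fields relate to the documented defaults. "ArgsSketch" is
# a hypothetical name; the real class is Args in methods/OC_PPO.py.
from dataclasses import dataclass


@dataclass
class ArgsSketch:
    env_id: str = "procgen-bigfish"
    total_timesteps: int = 100000
    learning_rate: float = 5e-4
    num_envs: int = 8
    num_steps: int = 256
    num_minibatches: int = 4
    num_options: int = 25
    # Placeholders, computed at runtime:
    batch_size: int = 0
    minibatch_size: int = 0
    num_iterations: int = 0


args = ArgsSketch()
# With the defaults above:
args.batch_size = args.num_envs * args.num_steps               # 8 * 256 = 2048
args.minibatch_size = args.batch_size // args.num_minibatches  # 2048 // 4 = 512
args.num_iterations = args.total_timesteps // args.batch_size  # 100000 // 2048 = 48
print(args.batch_size, args.minibatch_size, args.num_iterations)
```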
## Citation

If you use PPOC in your research, please cite the following papers:

- PPOC paper: "Learnings Options End-to-End for Continuous Action Tasks" by Klissarov et al.
- PPO paper: "Proximal Policy Optimization Algorithms" by John Schulman et al.
- OC paper: "The Option-Critic Architecture" by Pierre-Luc Bacon et al.
- "Accelerating Task Generalisation Using Multi-Level Hierarchical Options" by Cannon and Simsek
## License

This project is licensed under the MIT License. See the LICENSE file for more details.
## Contact

For questions or feedback, please contact x4nnon@gmail.com.