```
BenchNetRL/
├── README.md             # This file
├── requirements.txt      # Python dependencies
├── env_utils.py          # Environment wrappers and creators
├── exp_utils.py          # Experiment argument parsing and logging utilities
├── gae.py                # Generalized Advantage Estimation implementation
├── layers.py             # Neural network layer utilities and transformer modules
├── ppo.py                # Vanilla PPO implementation
├── ppo_lstm.py           # PPO with LSTM / GRU recurrent policies
├── ppo_mamba.py          # PPO with Mamba / Mamba-2 recurrent SSM
├── ppo_trxl.py           # PPO with Transformer-XL (TrXL) / GTrXL memory
│
├── envs/                 # Custom environment implementations for quick memory tests
│   ├── poc_memory_env.py # Proof-of-concept memory environment (PocMemoryEnv)
│   └── pom_env.py        # Proof-of-memory Gym environment (PoMEnv)
│
└── scripts/              # Baseline experiment scripts
    └── baselines/
        ├── atari.sh           # Atari benchmark commands
        ├── classic_control.sh # Classic control benchmark commands
        ├── minigrid.sh        # MiniGrid benchmark commands
        └── mujoco.sh          # MuJoCo benchmark commands
```
1. Clone the repository:

   ```bash
   git clone https://github.com/SafeRL-Lab/BenchNetRL.git
   cd BenchNetRL
   ```

2. Create a Python environment (conda or virtualenv recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate   # on Linux/Mac
   venv\Scripts\activate      # on Windows
   ```

3. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```
Before installing CUDA-enabled PyTorch, make sure you have NVIDIA's CUDA toolkit installed and your drivers up to date. You can download and install CUDA 12.4 from NVIDIA:

- Visit the CUDA Toolkit Archive: https://developer.nvidia.com/cuda-toolkit-archive
- Select CUDA Toolkit 12.4 for your operating system and follow the installation guide.
The Mamba and Mamba-2 recurrent state-space models are required for ppo_mamba.py and ppo_mamba2.py. These modules are not included in this repository and must be installed separately; ensure you are on a Linux system with a compatible CUDA version.

```bash
git clone https://github.com/state-spaces/mamba.git
```

Note: Mamba requires Linux and specific CUDA drivers. Refer to the Mamba repository for installation details and supported CUDA versions.
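Once installed, a quick way to confirm the package works is a single forward pass through a small Mamba block. This is a minimal smoke test only; the layer sizes below are arbitrary examples, not the settings used by ppo_mamba.py:

```python
# Minimal smoke test for the mamba-ssm package (CUDA required).
# The dimensions here are arbitrary, not BenchNetRL's training settings.
import torch
from mamba_ssm import Mamba

block = Mamba(
    d_model=64,  # channel dimension of the input
    d_state=16,  # SSM state dimension
    d_conv=4,    # local convolution width
    expand=2,    # block expansion factor
).to("cuda")

x = torch.randn(2, 32, 64, device="cuda")  # (batch, seq_len, d_model)
y = block(x)
print(y.shape)  # torch.Size([2, 32, 64]): output keeps the input shape
```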
If you encounter an AttributeError related to torch.cuda.reset_peak_memory_stats, you likely have a CPU-only or otherwise incompatible PyTorch build. To resolve:

1. Uninstall any existing torch packages:

   ```bash
   pip uninstall -y torch torchvision torchaudio
   ```

2. Reinstall CUDA-enabled PyTorch (matching your CUDA toolkit, e.g. 12.4):

   ```bash
   pip install --index-url https://download.pytorch.org/whl/cu124 \
       torch torchvision torchaudio
   ```
3. Verify CUDA is available:

   ```bash
   python - <<EOF
   import torch
   print("Torch version:", torch.__version__)
   print("CUDA available:", torch.cuda.is_available())
   EOF
   ```
Optional guard in ppo.py: in case some setups still lack the function, open ppo.py and replace:

```python
torch.cuda.reset_peak_memory_stats()
```

with:

```python
if torch.cuda.is_available() and hasattr(torch.cuda, "reset_peak_memory_stats"):
    torch.cuda.reset_peak_memory_stats()
```
Use the provided scripts under scripts/baselines/ to launch the benchmark experiments. For example:

```bash
bash scripts/baselines/atari.sh
```
Example command for PPO + Mamba on Breakout:

```bash
python ppo_mamba.py \
    --gym-id ALE/Breakout-v5 \
    --total-timesteps 10000000 \
    --num-envs 16 \
    --num-minibatches 8 \
    --hidden-dim 450 \
    --expand 1 \
    --track \
    --wandb-project-name atari-bench \
    --exp-name ppo_mamba
```
Replace the script name (ppo.py, ppo_lstm.py, ppo_mamba.py, ppo_trxl.py) and flags as needed.
- env_utils.py: Wraps Gym environments with preprocessing such as frame stacking, masking, and video recording.
- exp_utils.py: Command-line argument parsing and logging setup.
- gae.py: Advantage and return computation (GAE); see the sketch after this list.
- layers.py: layer_init, attention modules, Transformer, and SSM interfaces (a representative layer_init sketch also follows this list).
- ppo*.py: The PPO variants (vanilla, LSTM/GRU, Mamba, Mamba-2, Transformer-XL).
- envs/: Custom memory-focused Gym environments.
- scripts/baselines/: Shell scripts for reproducible benchmarks.
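For reference, GAE as computed in gae.py follows the standard recursion: the one-step TD error is δ_t = r_t + γ·V(s_{t+1})·(1 − done_{t+1}) − V(s_t), and the advantage is A_t = δ_t + γλ(1 − done_{t+1})·A_{t+1}. Below is a minimal illustrative sketch of that recursion; the function name and signature are hypothetical, not taken from gae.py:

```python
# Illustrative sketch of Generalized Advantage Estimation (GAE).
# Names and signature are hypothetical; see gae.py for the actual code.
import torch

def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    """rewards, values, dones: tensors of shape (num_steps, num_envs).

    dones[t] flags whether the episode had terminated before step t,
    the rollout convention used by CleanRL-style PPO loops.
    """
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    lastgaelam = torch.zeros_like(next_value)
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            nextnonterminal = 1.0 - next_done
            nextvalues = next_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalues = values[t + 1]
        # One-step TD error, zeroed across episode boundaries.
        delta = rewards[t] + gamma * nextvalues * nextnonterminal - values[t]
        # GAE recursion: exponentially weighted sum of TD errors.
        lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
        advantages[t] = lastgaelam
    returns = advantages + values
    return advantages, returns
```

Likewise, layer_init in layers.py presumably follows the common PPO initialization pattern (orthogonal weights, constant bias). A representative sketch; the actual helper in layers.py may differ:

```python
# Common PPO-style layer initialization; the real layers.py helper may differ.
import numpy as np
import torch.nn as nn

def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer

value_head = layer_init(nn.Linear(64, 1), std=1.0)  # e.g. a value head
```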
- PPO-1: Standard PPO with 1-frame observations (no frame stacking).
- PPO-4: PPO with 4-frame observation stacking (temporal context via stacked frames; see the wrapper sketch below).
- LSTM, GRU, TrXL, GTrXL, Mamba, Mamba-2: Sequence-based models with varying architectures to capture temporal dependencies in environment dynamics.
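The 4-frame stacking used by PPO-4 corresponds to a standard Gymnasium wrapper. A minimal sketch, assuming a recent gymnasium release where the wrapper is named FrameStackObservation (older releases call it FrameStack with a num_stack argument); env_utils.py may implement this differently:

```python
# Sketch of 4-frame observation stacking with a Gymnasium wrapper.
# Assumes gymnasium >= 1.0 (FrameStackObservation); older versions
# provide gymnasium.wrappers.FrameStack(env, num_stack=4) instead.
import gymnasium as gym
from gymnasium.wrappers import FrameStackObservation

env = gym.make("CartPole-v1")
env = FrameStackObservation(env, stack_size=4)  # obs becomes (4, *obs_shape)

obs, info = env.reset()
print(obs.shape)  # (4, 4): four stacked copies of CartPole's 4-dim state
```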
| Metric | PPO-1 | PPO-4 | LSTM | GRU | TrXL | GTrXL | Mamba | Mamba-2 |
|---|---|---|---|---|---|---|---|---|
| Steps Per Second (↑) | 3539 | 3305 | 604 | 701 | 1856 | 1890 | 2734 | 2455 |
| Training Time (min) (↓) | 16.59 | 18.84 | 121.90 | 91.04 | 30.33 | 29.42 | 21.20 | 22.97 |
| Inference Latency (ms) (↓) | 0.856 | 0.899 | 1.006 | 0.971 | 2.171 | 2.147 | 1.304 | 1.489 |
| GPU Mem. Allocated (GB) (↓) | 0.035 | 0.660 | 0.194 | 0.194 | 1.765 | 1.330 | 0.217 | 0.219 |
| GPU Mem. Reserved (GB) (↓) | 0.327 | 0.983 | 0.343 | 0.349 | 5.508 | 4.968 | 0.362 | 0.662 |
Below are key performance metrics visualized by architecture group. Each architecture is color-coded by family for quick reference.
If you find this repository useful, please cite the study:

```bibtex
@article{ivan2025benchnetrl,
  title={RLBenchNet: The Right Network for the Right Reinforcement Learning Task},
  author={Smirnov, Ivan and Gu, Shangding},
  journal={arXiv preprint},
  year={2025}
}
```