
Commit de589d4

Running benchmarks (#812)
* Add script to run benchmarks on slurm
* Increase memory and decrease QOS for benchmark runs.
* Add bash script to run the entire benchmark without any parallelization.
* Add scripts to generate benchmark summary information and compute the probability of improvement.
* Speed up the numpy transform when loading trajectories from a huggingface dataset.
* Add asserts and explanation for the sample matrix.
* Add script to export sacred runs to csv file.
* Add mean/std/IQM and confidence intervals to the markdown summary script.
* Explain how to run the entire benchmarking suite and how to compare a new algorithm to the benchmark runs.
* Switch from choco's deprecated --side-by-side option to --allow-downgrade.
* Split up the python and openssl/ffmpeg installation to ensure that the step properly fails when the installation of one of the packages fails.
* Add tests for sacred file parsing.
* Add link to the rliable library to the benchmarking README.
* Add no-cover pragma for warnings about incomplete runs.
1 parent d833d9e commit de589d4

11 files changed: +1086, -13 lines

.circleci/config.yml

Lines changed: 7 additions & 5 deletions
@@ -141,11 +141,13 @@ commands:
             - v11win-dependencies-{{ checksum "setup.py" }}-{{ checksum "ci/build_and_activate_venv.ps1" }}

       - run:
-          name: install python and binary dependencies
-          command: |
-            choco install --side-by-side -y python --version=3.8.10
-            choco install -y ffmpeg
-            choco install -y openssl
+          name: install python
+          command: choco install --allow-downgrade -y python --version=3.8.10
+          shell: powershell.exe
+
+      - run:
+          name: install openssl and ffmpeg
+          command: choco install -y ffmpeg openssl
           shell: powershell.exe

       - run:

benchmarking/README.md

Lines changed: 173 additions & 7 deletions

# Benchmarking imitation

The imitation library is benchmarked by running the algorithms BC, DAgger, AIRL, and GAIL
on five different environments from the
[seals environment suite](https://github.com/HumanCompatibleAI/seals),
each with 10 different random seeds.

## Running a Single Benchmark

To run a single benchmark from the command line, you may use:

```bash
python -m imitation.scripts.<train_script> <algo> with <algo>_<env>
```

There are two different train scripts, `train_imitation` and `train_adversarial`, each running different algorithms:

| train_script      | algo       |
|-------------------|------------|
| train_imitation   | bc, dagger |
| train_adversarial | gail, airl |

There are five environment configurations for which we have tuned hyperparameters:

| environment        |
|--------------------|
| seals_ant          |
| seals_half_cheetah |
| seals_hopper       |
| seals_swimmer      |
| seals_walker       |
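
For example, combining one entry from each table above, benchmarking BC on the
`seals_half_cheetah` configuration becomes:

```bash
python -m imitation.scripts.train_imitation bc with bc_seals_half_cheetah
```
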

If you want to run the same benchmark from a Python script, you can use the following code:

```python
...
from imitation.scripts.<train_script> import <train_script>_ex
<train_script>_ex.run(command_name="<algo>", named_configs=["<algo>_<env>"])
```
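
For example, the adversarial case with GAIL on `seals_ant` fills in as follows:

```python
# Equivalent to: python -m imitation.scripts.train_adversarial gail with gail_seals_ant
from imitation.scripts.train_adversarial import train_adversarial_ex

train_adversarial_ex.run(command_name="gail", named_configs=["gail_seals_ant"])
```
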

### Inputs

The tuned hyperparameters can be found in `src/imitation/scripts/config/tuned_hps`.
For v0.4.0, they correspond to the hyperparameters used in the paper
[imitation: Clean Imitation Learning Implementations](https://arxiv.org/abs/2211.11972).
You may be able to get reasonable performance by using hyperparameters tuned for a similar environment;
alternatively, you can tune the hyperparameters using the `tuning` script (see the Tuning Hyperparameters section below).

The experts and expert demonstrations are loaded from the HuggingFace model hub and
are grouped under the [HumanCompatibleAI Organization](https://huggingface.co/HumanCompatibleAI).
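
To browse what is available, one option (a sketch using the separate `huggingface_hub`
client library, not one of the benchmarking scripts) is to list the models published
under that organization:

```python
# Sketch: list pretrained experts hosted under the HumanCompatibleAI organization.
# Requires the `huggingface_hub` package.
from huggingface_hub import list_models

for model in list_models(author="HumanCompatibleAI"):
    print(model.id)
```
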

### Outputs

The training scripts are [sacred experiments](https://sacred.readthedocs.io) which place
their output in an output folder structured like this:

```
output
├── airl
│   └── seals-Swimmer-v1
│       └── 20231012_121226_c5c0e4
│           └── sacred -> ../../../sacred/train_adversarial/2
├── dagger
│   └── seals-CartPole-v0
│       └── 20230927_095917_c29dc2
│           └── sacred -> ../../../sacred/train_imitation/1
└── sacred
    ├── train_adversarial
    │   ├── 1
    │   ├── 2
    │   ├── 3
    │   ├── 4
    │   ├── ...
    │   └── _sources
    └── train_imitation
        ├── 1
        └── _sources
```

In the `sacred` folder, all runs are grouped by training script, and each run gets a
folder named after its run id.
That run folder contains:
- a `config.json` file with the hyperparameters used for that run
- a `run.json` file with run information, including the final score and the expert score
- a `cout.txt` file with the stdout of the run

Additionally, there are run folders grouped by algorithm and environment.
They contain further log files and model checkpoints, as well as a symlink to the
corresponding sacred run folder.

Important entries in the json files are:
- `run.json`
  - `command`: the name of the algorithm
  - `result.imit_stats.monitor_return_mean`: the score of a run
  - `result.expert_stats.monitor_return_mean`: the score of the expert policy that was used for a run
- `config.json`
  - `environment.gym_id`: the environment name of the run
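
To illustrate how these entries fit together, here is a sketch of a hypothetical helper
(not one of the provided scripts) that walks a sacred output folder and prints one row
per completed run using exactly the keys listed above:

```python
# Sketch of reading sacred run folders; assumes the layout and json keys
# described above (one run.json and config.json per run directory).
import json
from pathlib import Path


def iter_run_summaries(sacred_dir: str):
    """Yield (algo, env, score, expert_score) for every completed run."""
    for run_json in Path(sacred_dir).rglob("run.json"):
        run_dir = run_json.parent
        run = json.loads(run_json.read_text())
        config = json.loads((run_dir / "config.json").read_text())
        result = run.get("result") or {}
        if not result:  # skip incomplete or failed runs
            continue
        yield (
            run["command"],
            config["environment"]["gym_id"],
            result["imit_stats"]["monitor_return_mean"],
            result["expert_stats"]["monitor_return_mean"],
        )


for row in iter_run_summaries("output/sacred"):
    print(*row, sep=", ")
```
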

## Running the Complete Benchmark Suite

To execute the entire benchmarking suite with 10 seeds for each configuration,
you can use the `run_all_benchmarks.sh` script.
This script runs all configurations consecutively.
To speed this up, consider parallelizing:
you can either send all commands to GNU Parallel,
use SLURM by invoking `run_all_benchmarks_on_slurm.sh`, or
split the lines up into multiple scripts to run on multiple machines manually.
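
For example, a GNU Parallel invocation could look like the following sketch, which
assumes every non-comment line of `run_all_benchmarks.sh` is an independent command:

```shell
# Run the benchmark commands 8 at a time; adjust --jobs to your machine.
grep -vE '^\s*(#|$)' run_all_benchmarks.sh | parallel --jobs 8
```
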

### Generating Benchmark Summaries

There are scripts to summarize all runs in a folder as a CSV file or as a markdown file.
For the CSV, run:

```shell
python sacred_output_to_csv.py output/sacred > summary.csv
```

This generates a CSV file like this:

```csv
algo, env, score, expert_score
gail, seals/Walker2d-v1, 2298.883520464286, 2502.8930135576925
gail, seals/Swimmer-v1, 287.33667667857145, 295.40472964423077
airl, seals/Walker2d-v1, 310.4065185178571, 2502.8930135576925
...
```
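
Once you have such a CSV, further analysis is easy; for example, this sketch (using
pandas, not one of the provided scripts) computes expert-normalized scores per
algorithm and environment:

```python
# Sketch: summarize the generated summary.csv with pandas.
# skipinitialspace handles the space after each comma in the CSV.
import pandas as pd

df = pd.read_csv("summary.csv", skipinitialspace=True)
df["normalized_score"] = df["score"] / df["expert_score"]
print(df.groupby(["algo", "env"])["normalized_score"].mean())
```
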

For a more comprehensive summary that includes aggregate statistics such as
the mean, standard deviation, and IQM (interquartile mean) with confidence intervals,
as recommended by the [rliable library](https://github.com/google-research/rliable),
use the following command:

```shell
python sacred_output_to_markdown_summary.py output/sacred --output summary.md
```

This will produce a markdown summary file named `summary.md`.

**Hint:**
If you have multiple output folders, because you ran different parts of the
benchmark on different machines, you can copy the output folders into a common root
folder.
The above scripts will search all nested directories for folders that contain
a `run.json` and a `config.json` file.
For example, calling `python sacred_output_to_csv.py benchmark_runs/ > summary.csv`
on an output folder structured like this:

```
benchmark_runs
├── first_batch
│   ├── 1
│   ├── 2
│   ├── 3
│   ├── ...
└── second_batch
    ├── 1
    ├── 2
    ├── 3
    ├── ...
```

will aggregate all runs from both `first_batch` and `second_batch` into a single
CSV file.

## Comparing an Algorithm against the Benchmark Runs

If you modified one of the existing algorithms or implemented a new one, you might want
to compare it to the benchmark runs to see whether there is a significant improvement.

If your algorithm produces the same file output format as described above, you can use the
`compute_probability_of_improvement.py` script to do the comparison.
It uses the "Probability of Improvement" metric as recommended by the
[rliable library](https://github.com/google-research/rliable).

```shell
python compute_probability_of_improvement.py <your_runs_dir> <baseline_runs_dir> --baseline-algo <algo>
```

where:
- `your_runs_dir` is the directory containing the runs for your algorithm
- `baseline_runs_dir` is the directory containing runs for a known algorithm. Hint: you do not need to re-run our benchmarks; we provide our run folders as release artifacts.
- `algo` is the baseline algorithm you want to compare against

If `your_runs_dir` contains runs for more than one algorithm, you will have to
disambiguate using the `--algo` option.
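
For example, to compare your runs against released GAIL baseline runs (the directory
names below are placeholders), the call would look like:

```shell
# "my_runs/sacred" and "benchmark_runs/sacred" are placeholder paths.
python compute_probability_of_improvement.py my_runs/sacred benchmark_runs/sacred --baseline-algo gail
```
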

# Tuning Hyperparameters

The hyperparameters of any algorithm in imitation can be tuned using `src/imitation/scripts/tuning.py`.
