# Benchmarking imitation

The imitation library is benchmarked by running the algorithms BC, DAgger, AIRL, and GAIL
on five different environments from the
[seals environment suite](https://github.com/HumanCompatibleAI/seals),
each with 10 different random seeds.

## Running a Single Benchmark

To run a single benchmark from the command line, you can use:

```bash
python -m imitation.scripts.<train_script> <algo> with <algo>_<env>
```

There are two different train scripts, `train_imitation` and `train_adversarial`, each running different algorithms:

| train_script      | algo       |
|-------------------|------------|
| train_imitation   | bc, dagger |
| train_adversarial | gail, airl |

There are five environment configurations for which we have tuned hyperparameters:

| environment        |
|--------------------|
| seals_ant          |
| seals_half_cheetah |
| seals_hopper       |
| seals_swimmer      |
| seals_walker       |
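
For example, filling in the placeholders according to the tables above (so the named config
follows the `<algo>_<env>` pattern, here `dagger_seals_half_cheetah`), training DAgger on the
half-cheetah configuration looks like this:

```bash
python -m imitation.scripts.train_imitation dagger with dagger_seals_half_cheetah
```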

If you want to run the same benchmark from a Python script, you can use the following code:

```python
...
from imitation.scripts.<train_script> import <train_script>_ex
<train_script>_ex.run(command_name="<algo>", named_configs=["<algo>_<env>"])
```
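
As a concrete instance of the same template, running GAIL with the tuned swimmer configuration
from Python (again assuming the `gail_seals_swimmer` named config implied by the tables above)
might look like this:

```python
from imitation.scripts.train_adversarial import train_adversarial_ex

# Run GAIL with the tuned hyperparameters for the seals swimmer environment.
run = train_adversarial_ex.run(
    command_name="gail",
    named_configs=["gail_seals_swimmer"],
)
print(run.result)  # aggregate statistics returned by the sacred experiment
```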

### Inputs

The tuned hyperparameters can be found in `src/imitation/scripts/config/tuned_hps`.
For v0.4.0, they correspond to the hyperparameters used in the paper
[imitation: Clean Imitation Learning Implementations](https://arxiv.org/abs/2211.11972).
For environments without tuned hyperparameters, you may be able to get reasonable performance
by using hyperparameters tuned for a similar environment; alternatively, you can tune them
yourself using the tuning script described below.

The experts and expert demonstrations are loaded from the HuggingFace model hub and
are grouped under the [HumanCompatibleAI Organization](https://huggingface.co/HumanCompatibleAI).
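
If you want to inspect one of these experts yourself, a minimal sketch using `huggingface_sb3`
(not part of the benchmarking scripts; the repo id and filename below are illustrative, so check
the HumanCompatibleAI organization page for the actual model names) could look like this:

```python
from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO

# Download a checkpoint from the HuggingFace model hub; the repo id and filename
# are illustrative placeholders, not guaranteed to match the published models.
checkpoint = load_from_hub(
    repo_id="HumanCompatibleAI/ppo-seals-Swimmer-v1",
    filename="ppo-seals-Swimmer-v1.zip",
)
expert = PPO.load(checkpoint)
```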

### Outputs

The training scripts are [sacred experiments](https://sacred.readthedocs.io), which place
their output in a folder structured like this:

```
output
├── airl
│   └── seals-Swimmer-v1
│       └── 20231012_121226_c5c0e4
│           └── sacred -> ../../../sacred/train_adversarial/2
├── dagger
│   └── seals-CartPole-v0
│       └── 20230927_095917_c29dc2
│           └── sacred -> ../../../sacred/train_imitation/1
└── sacred
    ├── train_adversarial
    │   ├── 1
    │   ├── 2
    │   ├── 3
    │   ├── 4
    │   ├── ...
    │   └── _sources
    └── train_imitation
        ├── 1
        └── _sources
```

In the `sacred` folder, all runs are grouped by training script, and each run gets a
folder named after its run id.
That run folder contains:
- a `config.json` file with the hyperparameters used for that run
- a `run.json` file with run information, including the final score and the expert score
- a `cout.txt` file with the stdout of the run

Additionally, there are run folders grouped by algorithm and environment.
They contain further log files and model checkpoints, as well as a symlink to the
corresponding sacred run folder.

Important entries in the JSON files are:
- `run.json`
  - `command`: the name of the algorithm
  - `result.imit_stats.monitor_return_mean`: the score of a run
  - `result.expert_stats.monitor_return_mean`: the score of the expert policy that was used for the run
- `config.json`
  - `environment.gym_id`: the environment name of the run
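
As a sketch of how you might read these values back programmatically (assuming the dotted keys
above map to nested JSON objects, and using an illustrative run directory):

```python
import json
from pathlib import Path

run_dir = Path("output/sacred/train_adversarial/2")  # illustrative run folder

run_info = json.loads((run_dir / "run.json").read_text())
config = json.loads((run_dir / "config.json").read_text())

algo = run_info["command"]
env = config["environment"]["gym_id"]
score = run_info["result"]["imit_stats"]["monitor_return_mean"]
expert_score = run_info["result"]["expert_stats"]["monitor_return_mean"]
print(f"{algo} on {env}: {score:.1f} (expert: {expert_score:.1f})")
```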

## Running the Complete Benchmark Suite

To execute the entire benchmark suite with 10 seeds for each configuration,
use the `run_all_benchmarks.sh` script, which runs all configurations consecutively.
To speed this up, you can parallelize: either send all commands to GNU Parallel,
use SLURM by invoking `run_all_benchmarks_on_slurm.sh`, or
split the lines into multiple scripts to run manually on multiple machines.
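
For instance, a minimal sketch of the GNU Parallel route, assuming `run_all_benchmarks.sh` is a
flat list with one command per line (comment and shebang lines filtered out):

```shell
grep -v '^#' run_all_benchmarks.sh | parallel --jobs 8
```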

### Generating Benchmark Summaries

There are scripts to summarize all runs in a folder as either a CSV file or a markdown file.
For the CSV, run:

```shell
python sacred_output_to_csv.py output/sacred > summary.csv
```

This generates a CSV file like this:

```csv
algo, env, score, expert_score
gail, seals/Walker2d-v1, 2298.883520464286, 2502.8930135576925
gail, seals/Swimmer-v1, 287.33667667857145, 295.40472964423077
airl, seals/Walker2d-v1, 310.4065185178571, 2502.8930135576925
...
```
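
The CSV is easy to post-process. For example, a small sketch computing expert-normalized scores
with pandas (column names taken from the header above):

```python
import pandas as pd

# skipinitialspace handles the space after each comma in the generated CSV.
df = pd.read_csv("summary.csv", skipinitialspace=True)
df["normalized_score"] = df["score"] / df["expert_score"]

# Mean and standard deviation of the normalized score across seeds, per algorithm and environment.
print(df.groupby(["algo", "env"])["normalized_score"].agg(["mean", "std"]))
```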

For a more comprehensive summary that includes aggregate statistics such as
the mean, standard deviation, and IQM (interquartile mean) with confidence intervals,
as recommended by the [rliable library](https://github.com/google-research/rliable),
use the following command:

```shell
python sacred_output_to_markdown_summary.py output/sacred --output summary.md
```

This will produce a markdown summary file named `summary.md`.

**Hint:**
If you have multiple output folders because you ran different parts of the
benchmark on different machines, you can copy them into a common root folder.
The above scripts search all nested directories for folders that contain
a `run.json` and a `config.json` file.
For example, calling `python sacred_output_to_csv.py benchmark_runs/ > summary.csv`
on an output folder structured like this:
```
benchmark_runs
├── first_batch
│   ├── 1
│   ├── 2
│   ├── 3
│   └── ...
└── second_batch
    ├── 1
    ├── 2
    ├── 3
    └── ...
```
will aggregate all runs from both `first_batch` and `second_batch` into a single
CSV file.

## Comparing an Algorithm against the Benchmark Runs

If you modified one of the existing algorithms or implemented a new one, you might want
to compare it to the benchmark runs to see whether it yields a significant improvement.

If your algorithm produces the same output format as described above, you can use the
`compute_probability_of_improvement.py` script to do the comparison.
It uses the "Probability of Improvement" metric, as recommended by the
[rliable library](https://github.com/google-research/rliable).

```shell
python compute_probability_of_improvement.py <your_runs_dir> <baseline_runs_dir> --baseline-algo <algo>
```

where:
- `your_runs_dir` is the directory containing the runs for your algorithm
- `baseline_runs_dir` is the directory containing runs for a known algorithm. Hint: you do not need to re-run our benchmarks; we provide our run folders as release artifacts.
- `algo` is the algorithm you want to compare against

If `your_runs_dir` contains runs for more than one algorithm, you will have to
disambiguate using the `--algo` option.
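
For example (directory names purely illustrative), comparing your own GAIL runs against the
provided GAIL baseline runs might look like this:

```shell
python compute_probability_of_improvement.py my_runs/sacred benchmark_runs/sacred \
    --baseline-algo gail --algo gail
```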

# Tuning Hyperparameters

The hyperparameters of any algorithm in imitation can be tuned using `src/imitation/scripts/tuning.py`.