|
| 1 | +# Building |
| 2 | + |
| 3 | +These benchmarks are automatically built when ```WITH_CAFFE2=ON``` is passed. |
| 4 | +If you have been following the instructions given [here](https://facebookresearch.github.io/TensorComprehensions/installation.html), you can use the command: |
| 5 | + |
| 6 | +``` |
| 7 | +BUILD_TYPE=Release WITH_CAFFE2=ON CLANG_PREFIX=$(${CONDA_PREFIX}/bin/llvm-config --prefix) ./build.sh |
| 8 | +``` |
| 9 | + |
| 10 | +# Running the autotuner manually |
| 11 | +By default, a full evolutionary search is run with 25 generations and 100 candidates per generation. This will take some time for some of the kernels. These settings can be changed with the gflags options ```--tuner_gen_generations``` and ```--tuner_gen_pop_size```. |
| 12 | + |
| 13 | +For instance, a shorter tuning search could be run as follows: |
| 14 | +``` |
| 15 | +./build/tc/benchmarks/benchmark_batchmatmul --autotune=true --tuner_gen_generations=10 --tuner_gen_pop_size=20 |
| 16 | +``` |
| 17 | + |
| 18 | +When running manually, the number of CPU compilation threads and the GPUs used for evaluation can be controlled via the gflags options |
| 19 | +```--tuner_threads``` and ```--tuner_devices```. |
| 20 | + |
| 21 | +For instance, on a 4-GPU system with 20 compilation threads: |
| 22 | +``` |
| 23 | +./build/tc/benchmarks/benchmark_batchmatmul --autotune=true --tuner_gen_generations=10 --tuner_gen_pop_size=10 --tuner_threads=20 --tuner_devices="0,1,2,3" |
| 24 | +``` |
| 25 | + |
| 26 | +# Running the autotuner with provided scripts |
| 27 | +These examples are run as part of ```test.sh``` but can also be run with full autotuning. |
| 28 | + |
| 29 | +If you are the lucky owner of a supercomputer with ```slurm``` and ```sbatch``` you can run: |
| 30 | +``` |
| 31 | +sbatch --array=1-40 ./tc/benchmarks/scripts/autotuner_parallel.sh |
| 32 | +``` |
| 33 | + |
| 34 | +Results and logs will appear in the subdirectory ```tc/benchmarks/results_xxx```. You can tail the ```*.INFO``` files to see the best performance found by the autotuner. |
| 35 | + |
| 36 | +To control the CPU compilation threads and the GPUs used for evaluation, please use the environment variables ```TUNER_THREADS``` and ```TUNER_GPUS```. |
| 37 | +For instance, on a 4 GPU machine: |
| 38 | +``` |
| 39 | +for f in $(seq 1 14); do TUNER_THREADS=20 TUNER_GPUS="0,1,2,3" SLURM_ARRAY_JOB_ID=local SLURM_ARRAY_TASK_ID=$f ./tc/benchmarks/scripts/autotuner_parallel.sh ; done |
| 40 | +``` |
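
As a rough illustration of how the manual flags described under "Running the autotuner manually" fit together, the sketch below assembles a command line without running it. The function name `tuner_cmd`, its argument order, and the choice to read `TUNER_THREADS` and `TUNER_GPUS` from the environment are assumptions made for this example. Only the flag names, the default values, and the binary path come from the text above. It does not claim to reproduce what `autotuner_parallel.sh` does internally.

```shell
# Hypothetical helper (not part of the repository): prints a manual
# autotuner command line instead of executing it.
# The body runs in a subshell "( ... )" so its variables stay local.
tuner_cmd() (
  benchmark="$1"          # e.g. benchmark_batchmatmul
  generations="${2:-25}"  # default from the text above: 25 generations
  pop_size="${3:-100}"    # default from the text above: 100 candidates
  cmd="./build/tc/benchmarks/${benchmark} --autotune=true"
  cmd="${cmd} --tuner_gen_generations=${generations}"
  cmd="${cmd} --tuner_gen_pop_size=${pop_size}"
  # Assumption: forward thread and GPU settings only when they are set.
  if [ -n "${TUNER_THREADS:-}" ]; then
    cmd="${cmd} --tuner_threads=${TUNER_THREADS}"
  fi
  if [ -n "${TUNER_GPUS:-}" ]; then
    cmd="${cmd} --tuner_devices=${TUNER_GPUS}"
  fi
  printf '%s\n' "${cmd}"
)

# Example: print the command for a short search on GPUs 0-3 with 20 threads.
TUNER_THREADS=20 TUNER_GPUS=0,1,2,3 tuner_cmd benchmark_batchmatmul 10 10
```

Printing the command rather than running it lets you check the flags before committing to a long tuning run. The device list is emitted unquoted here, which the shell treats the same as the quoted form used in the examples above.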