
Commit 009aaeb

Authored by ChongWei905
docs: change mpirun with msrun and add other notices (#805)
Co-authored-by: ChongWei905 <weichong4@huawei.com>
1 parent e38f625 commit 009aaeb

58 files changed, +194 −175 lines changed
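In essence, the commit swaps the OpenMPI launcher for MindSpore's `msrun` across the training examples. A minimal before/after sketch of the substitution (the DenseNet config and ImageNet path are simply the placeholders used in the docs below):

```shell
# before: OpenMPI launcher; -n sets the process count and, when run as root,
# --allow-run-as-root had to be added
mpirun --allow-run-as-root -n 8 python train.py \
    --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet

# after: MindSpore msrun launcher; -n N becomes --worker_num N,
# --allow-run-as-root is no longer needed, and --bind_core=True is added
# to bind cores for better performance
msrun --bind_core=True --worker_num 8 python train.py \
    --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet
```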


README.md

Lines changed: 13 additions & 4 deletions
@@ -124,15 +124,24 @@ It is easy to train your model on a standard or customized dataset using `train.py`.
 
 - Distributed Training
 
-For large datasets like ImageNet, it is necessary to do training in distributed mode on multiple devices. This can be achieved with `mpirun` and parallel features supported by MindSpore.
+For large datasets like ImageNet, it is necessary to do training in distributed mode on multiple devices. This can be achieved with `msrun` and parallel features supported by MindSpore.
 
 ```shell
 # distributed training
 # assume you have 4 GPUs/NPUs
-mpirun -n 4 python train.py --distribute \
+msrun --bind_core=True --worker_num 4 python train.py --distribute \
     --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet
 ```
-> Notes: If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
+
+Notice that if you are using msrun startup with 2 devices, please add `--bind_core=True` to improve performance. For example:
+
+```shell
+msrun --bind_core=True --worker_num=2 --local_worker_num=2 --master_port=8118 \
+    --log_dir=msrun_log --join=True --cluster_time_out=300 \
+    python train.py --distribute --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet
+```
+
+> For more information, please refer to https://www.mindspore.cn/tutorials/experts/en/r2.3.1/parallel/startup_method.html
 
 Detailed parameter definitions can be seen in `config.py` and checked by running `python train.py --help`.
 
@@ -143,7 +152,7 @@ It is easy to train your model on a standard or customized dataset using `train.py`.
 You can configure your model and other components either by specifying external parameters or by writing a yaml config file. Here is an example of training using a preset yaml file.
 
 ```shell
-mpirun --allow-run-as-root -n 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
+msrun --bind_core=True --worker_num 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
 ```
 
 **Pre-defined Training Strategies:**
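By the same pattern, a full-flag launch on 4 local devices would presumably look as follows; this is only a sketch extrapolated from the 2-device example in the hunk above (the port, log directory, and timeout values are reused from that example, not prescribed):

```shell
# hypothetical 4-device variant of the 2-device example above:
# --worker_num (total workers) and --local_worker_num (workers on this node)
# are both raised to 4; the remaining flags are unchanged
msrun --bind_core=True --worker_num=4 --local_worker_num=4 --master_port=8118 \
    --log_dir=msrun_log --join=True --cluster_time_out=300 \
    python train.py --distribute --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet
```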

README_CN.md

Lines changed: 13 additions & 3 deletions
@@ -117,15 +117,25 @@ python infer.py --model=swin_tiny --image_path='./dog.jpg'
 
 - Distributed Training
 
-For large datasets like ImageNet, it is necessary to train in distributed mode on multiple devices. Building on MindSpore's solid support for distributed features, users can launch distributed model training with `mpirun`.
+For large datasets like ImageNet, it is necessary to train in distributed mode on multiple devices. Building on MindSpore's solid support for distributed features, users can launch distributed model training with `msrun`.
 
 ```shell
 # distributed training
 # assume you have 4 GPU or NPU cards
-mpirun --allow-run-as-root -n 4 python train.py --distribute \
+msrun --bind_core=True --worker_num 4 python train.py --distribute \
     --model densenet121 --dataset imagenet --data_dir ./datasets/imagenet
 ```
 
+Note that if you choose msrun as the launcher in a 2-device environment, please add the option `--bind_core=True` to enable core binding and improve 2-device performance. Example:
+
+```shell
+msrun --bind_core=True --worker_num=2 --local_worker_num=2 --master_port=8118 \
+    --log_dir=msrun_log --join=True --cluster_time_out=300 \
+    python train.py --distribute --model=densenet121 --dataset=imagenet --data_dir=/path/to/imagenet
+```
+
+> For more detailed instructions, please refer to https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.1/parallel/startup_method.html
+
 The full list of parameters and their descriptions is defined in `config.py`; run `python train.py --help` for a quick look.
 
 To resume training, please specify the `--ckpt_path` and `--ckpt_save_dir` arguments; the script will load the model weights and optimizer state from the given path and resume the interrupted training process.
@@ -135,7 +145,7 @@ python infer.py --model=swin_tiny --image_path='./dog.jpg'
 You can write a yaml file or set external parameters to specify the data, model, optimizer, and other components and their hyperparameters. Below is an example of training a model with a preset training strategy (yaml file).
 
 ```shell
-mpirun --allow-run-as-root -n 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
+msrun --bind_core=True --worker_num 4 python train.py -c configs/squeezenet/squeezenet_1.0_gpu.yaml
 ```
 
 **Pre-defined Training Strategies**

configs/README.md

Lines changed: 2 additions & 3 deletions
@@ -59,17 +59,16 @@ Illustration:
 
 #### Training Script Format
 
-For consistency, it is recommended to provide distributed training commands based on `mpirun -n {num_devices} python train.py`, instead of using a shell script such as `distributed_train.sh`.
+For consistency, it is recommended to provide distributed training commands based on `msrun --bind_core=True --worker_num {num_devices} python train.py`, instead of using a shell script such as `distributed_train.sh`.
 
 ```shell
 # standalone training on a gpu or ascend device
 python train.py --config configs/densenet/densenet_121_gpu.yaml --data_dir /path/to/dataset --distribute False
 
 # distributed training on gpu or ascend devices
-mpirun -n 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/densenet/densenet_121_ascend.yaml --data_dir /path/to/imagenet
 
 ```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
 
 #### URL and Hyperlink Format
 Please use **absolute path** in the hyperlink or url for linking the target resource in the readme file and table.

configs/bit/README.md

Lines changed: 2 additions & 3 deletions
@@ -58,11 +58,10 @@ It is easy to reproduce the reported results with the pre-defined training recipes.
 
 ```shell
 # distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/bit/bit_resnet50_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/bit/bit_resnet50_ascend.yaml --data_dir /path/to/imagenet
 ```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
 
-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.
 
 For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).
 
configs/cmt/README.md

Lines changed: 2 additions & 3 deletions
@@ -54,11 +54,10 @@ It is easy to reproduce the reported results with the pre-defined training recipes.
 
 ```shell
 # distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/cmt/cmt_small_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/cmt/cmt_small_ascend.yaml --data_dir /path/to/imagenet
 ```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
 
-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.
 
 For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).
 
configs/coat/README.md

Lines changed: 2 additions & 3 deletions
@@ -48,12 +48,11 @@ It is easy to reproduce the reported results with the pre-defined training recipes.
 
 ```shell
 # distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/coat/coat_lite_tiny_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/coat/coat_lite_tiny_ascend.yaml --data_dir /path/to/imagenet
 ```
 
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`
 
-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.
 
 For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).
 
configs/convit/README.md

Lines changed: 2 additions & 3 deletions
@@ -68,11 +68,10 @@ It is easy to reproduce the reported results with the pre-defined training recipes.
 
 ```shell
 # distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/convit/convit_tiny_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/convit/convit_tiny_ascend.yaml --data_dir /path/to/imagenet
 ```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
 
-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.
 
 For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).
 
configs/convnext/README.md

Lines changed: 2 additions & 3 deletions
@@ -66,12 +66,11 @@ It is easy to reproduce the reported results with the pre-defined training recipes.
 
 ```shell
 # distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/convnext/convnext_tiny_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/convnext/convnext_tiny_ascend.yaml --data_dir /path/to/imagenet
 ```
 
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
 
-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.
 
 For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).
 
configs/convnextv2/README.md

Lines changed: 2 additions & 3 deletions
@@ -63,12 +63,11 @@ It is easy to reproduce the reported results with the pre-defined training recipes.
 
 ```shell
 # distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/convnextv2/convnextv2_tiny_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/convnextv2/convnextv2_tiny_ascend.yaml --data_dir /path/to/imagenet
 ```
 
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
 
-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.
 
 For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).
 
configs/crossvit/README.md

Lines changed: 2 additions & 3 deletions
@@ -62,11 +62,10 @@ It is easy to reproduce the reported results with the pre-defined training recipes.
 
 ```shell
 # distributed training on multiple GPU/Ascend devices
-mpirun -n 8 python train.py --config configs/crossvit/crossvit_15_ascend.yaml --data_dir /path/to/imagenet
+msrun --bind_core=True --worker_num 8 python train.py --config configs/crossvit/crossvit_15_ascend.yaml --data_dir /path/to/imagenet
 ```
-> If the script is executed by the root user, the `--allow-run-as-root` parameter must be added to `mpirun`.
 
-Similarly, you can train the model on multiple GPU devices with the above `mpirun` command.
+Similarly, you can train the model on multiple GPU devices with the above `msrun` command.
 
 For detailed illustration of all hyper-parameters, please refer to [config.py](https://github.com/mindspore-lab/mindcv/blob/main/config.py).
 