Commit 0f87f47

Fix(docker): update docker image and dockerfile for new version (#200)
1 parent aa3e9c4 commit 0f87f47

File tree

10 files changed: 93 additions, 82 deletions

README-zh-Hans.md

Lines changed: 3 additions & 3 deletions
@@ -17,9 +17,9 @@
 [![使用文档](https://readthedocs.org/projects/internevo/badge/?version=latest)](https://internevo.readthedocs.io/zh_CN/latest/?badge=latest)
 [![license](./doc/imgs/license.svg)](./LICENSE)
 
-[📘使用教程](./doc/en/usage.md) |
-[🛠️安装指引](./doc/en/install.md) |
-[📊框架性能](./doc/en/train_performance.md) |
+[📘使用教程](./doc/usage.md) |
+[🛠️安装指引](./doc/install.md) |
+[📊框架性能](./doc/train_performance.md) |
 [🤔问题报告](https://github.com/InternLM/InternEvo/issues/new)
 
 [English](./README.md) |

doc/en/install.md

Lines changed: 16 additions & 8 deletions
@@ -78,7 +78,10 @@ cd ../../../../
 Install Apex (version 23.05):
 ```bash
 cd ./third_party/apex
-pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
+# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
+pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
+# otherwise
+pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
 cd ../../
 ```
 
@@ -88,31 +91,36 @@ pip install git+https://github.com/databricks/megablocks@v0.3.2 # MOE need
 ```
 
 ### Environment Image
-Users can use the provided dockerfile combined with docker.Makefile to build their own images, or obtain images with InternEvo runtime environment installed from https://hub.docker.com/r/internlm/internlm.
+Users can use the provided dockerfile combined with docker.Makefile to build their own images, or obtain images with InternEvo runtime environment installed from https://hub.docker.com/r/internlm/internevo/tags.
 
 #### Image Configuration and Build
 The configuration and build of the Dockerfile are implemented through the docker.Makefile. To build the image, execute the following command in the root directory of InternEvo:
 ``` bash
 make -f docker.Makefile BASE_OS=centos7
 ```
-In docker.Makefile, you can customize the basic image, environment version, etc., and the corresponding parameters can be passed directly through the command line. For BASE_OS, ubuntu20.04 and centos7 are respectively supported.
+In docker.Makefile, you can customize the basic image, environment version, etc., and the corresponding parameters can be passed directly through the command line. The default is the recommended environment version. For BASE_OS, ubuntu20.04 and centos7 are respectively supported.
 
 #### Pull Standard Image
 The standard image based on ubuntu and centos has been built and can be directly pulled:
 
 ```bash
 # ubuntu20.04
-docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-ubuntu20.04
+docker pull internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-ubuntu20.04
 # centos7
-docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7
+docker pull internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7
 ```
 
 #### Run Container
 For the local standard image built with dockerfile or pulled, use the following command to run and enter the container:
 ```bash
-docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7 bash
+docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name internevo_centos internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7 bash
+```
+
+#### Start Training
+The default directory in the container is `/InternEvo`, please start training according to the [Usage](./usage.md). The default 7B model starts the single-machine with 8-GPU training command example as follows:
+```bash
+torchrun --nproc_per_node=8 --nnodes=1 train.py --config configs/7B_sft.py --launcher torch
 ```
-The default directory in the container is `/InternLM`, please start training according to the [Usage](./usage.md).
 
 ## Environment Installation (NPU)
 For machines with NPU, the version of the installation environment can refer to that of GPU. Use Ascend's torch_npu instead of torch on NPU machines. Additionally, Flash-Attention and Apex are no longer supported for installation on NPU. The corresponding functionalities have been internally implemented in the InternEvo codebase. The following tutorial is only for installing torch_npu.
@@ -135,4 +143,4 @@ pip3 install pyyaml
 pip3 install setuptools
 wget https://gitee.com/ascend/pytorch/releases/download/v6.0.rc1-pytorch2.1.0/torch_npu-2.1.0.post3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
 pip install torch_npu-2.1.0.post3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
-```
+```
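Note that the two Apex install commands added in the first hunk are alternatives, not a sequence: only one should run, chosen by the local pip version. The following is a minimal shell sketch of that selection, shown for illustration only (it is not part of the commit and assumes pip is on PATH and the repository layout from the docs above):

```bash
# Choose the Apex install invocation based on the installed pip version.
# pip >= 23.1 accepts repeated --config-settings with the same key; older
# releases still need the deprecated --global-option flags.
cd ./third_party/apex
pip_version=$(pip --version | awk '{print $2}')
major=${pip_version%%.*}
rest=${pip_version#*.}
minor=${rest%%.*}
if [ "$major" -gt 23 ] || { [ "$major" -eq 23 ] && [ "$minor" -ge 1 ]; }; then
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
        --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
else
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
        --global-option="--cpp_ext" --global-option="--cuda_ext" ./
fi
cd ../../
```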

doc/install.md

Lines changed: 14 additions & 7 deletions
@@ -78,7 +78,10 @@ cd ../../../../
 安装 Apex (version 23.05):
 ```bash
 cd ./third_party/apex
-pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
+# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
+pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
+# otherwise
+pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
 cd ../../
 ```
 
@@ -88,32 +91,36 @@ pip install git+https://github.com/databricks/megablocks@v0.3.2 # MOE相关
 ```
 
 ### 环境镜像
-用户可以使用提供的 dockerfile 结合 docker.Makefile 来构建自己的镜像,或者也可以从 https://hub.docker.com/r/internlm/internlm 获取安装了 InternEvo 运行环境的镜像。
+用户可以使用提供的 dockerfile 结合 docker.Makefile 来构建自己的镜像,或者也可以从 https://hub.docker.com/r/internlm/internevo/tags 获取安装了 InternEvo 运行环境的镜像。
 
 #### 镜像配置及构造
 dockerfile 的配置以及构造均通过 docker.Makefile 文件实现,在 InternEvo 根目录下执行如下命令即可 build 镜像:
 ``` bash
 make -f docker.Makefile BASE_OS=centos7
 ```
-在 docker.Makefile 中可自定义基础镜像,环境版本等内容,对应参数可直接通过命令行传递。对于 BASE_OS 分别支持 ubuntu20.04 和 centos7。
+在 docker.Makefile 中可自定义基础镜像,环境版本等内容,对应参数可直接通过命令行传递,默认为推荐的环境版本。对于 BASE_OS 分别支持 ubuntu20.04 和 centos7。
 
 #### 镜像拉取
 基于 ubuntu 和 centos 的标准镜像已经 build 完成也可直接拉取使用:
 
 ```bash
 # ubuntu20.04
-docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-ubuntu20.04
+docker pull internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-ubuntu20.04
 # centos7
-docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7
+docker pull internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7
 ```
 
 #### 容器启动
 对于使用 dockerfile 构建或拉取的本地标准镜像,使用如下命令启动并进入容器:
 ```bash
-docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7 bash
+docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name internevo_centos internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7 bash
 ```
-容器内默认目录即 `/InternLM`,根据[使用文档](./usage.md)即可启动训练。
 
+#### 训练启动
+容器内默认目录即 `/InternEvo`,参考[使用文档](./usage.md)可获取具体使用方法。默认7B模型启动单机8卡训练命令样例:
+```bash
+torchrun --nproc_per_node=8 --nnodes=1 train.py --config configs/7B_sft.py --launcher torch
+```
 
 ## 环境安装(NPU)
 在搭载NPU的机器上安装环境的版本可参考GPU,在NPU上使用昇腾torch_npu代替torch,同时Flash-Attention和Apex不再支持安装,相应功能已由InternEvo代码内部实现。以下教程仅为torch_npu安装。
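For reference, the pull, run, and training commands from the two install documents can also be combined into a single non-interactive launch. A hedged sketch, assuming the pulled centos7 image and the default configs/7B_sft.py are used unchanged:

```bash
# Pull the standard image, then start the default single-node 8-GPU 7B training
# run directly; the working directory inside the container is /InternEvo.
docker pull internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7
docker run --gpus all -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK \
    --shm-size 20g --network=host --name internevo_centos \
    internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7 \
    bash -c "torchrun --nproc_per_node=8 --nnodes=1 train.py --config configs/7B_sft.py --launcher torch"
```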

docker.Makefile

Lines changed: 10 additions & 14 deletions
@@ -1,12 +1,11 @@
 DOCKER_REGISTRY ?= docker.io
-DOCKER_ORG ?= my
-DOCKER_IMAGE ?= internlm
+DOCKER_ORG ?= internlm
+DOCKER_IMAGE ?= internevo
 DOCKER_FULL_NAME = $(DOCKER_REGISTRY)/$(DOCKER_ORG)/$(DOCKER_IMAGE)
 
-CUDA_VERSION = 11.7.1
-GCC_VERSION = 10.2.0
-
+CUDA_VERSION = 11.8.0
 CUDNN_VERSION = 8
+
 BASE_RUNTIME =
 # ubuntu20.04 centos7
 BASE_OS = centos7
@@ -17,9 +16,10 @@ CUDA_CHANNEL = nvidia
 INSTALL_CHANNEL ?= pytorch
 
 PYTHON_VERSION ?= 3.10
-PYTORCH_VERSION ?= 1.13.1
-TORCHVISION_VERSION ?= 0.14.1
-TORCHAUDIO_VERSION ?= 0.13.1
+PYTORCH_TAG ?= 2.1.0
+PYTORCH_VERSION ?= 2.1.0+cu118
+TORCHVISION_VERSION ?= 0.16.0+cu118
+TORCHAUDIO_VERSION ?= 2.1.0+cu118
 BUILD_PROGRESS ?= auto
 TRITON_VERSION ?=
 GMP_VERSION ?= 6.2.1
@@ -28,18 +28,14 @@ MPC_VERSION ?= 1.2.1
 GCC_VERSION ?= 10.2.0
 HTTPS_PROXY_I ?=
 HTTP_PROXY_I ?=
-FLASH_ATTEN_VERSION ?= 1.0.5
+FLASH_ATTEN_VERSION ?= 2.2.1
 FLASH_ATTEN_TAG ?= v${FLASH_ATTEN_VERSION}
 
 BUILD_ARGS = --build-arg BASE_IMAGE=$(BASE_IMAGE) \
 --build-arg PYTHON_VERSION=$(PYTHON_VERSION) \
---build-arg CUDA_VERSION=$(CUDA_VERSION) \
---build-arg CUDA_CHANNEL=$(CUDA_CHANNEL) \
 --build-arg PYTORCH_VERSION=$(PYTORCH_VERSION) \
 --build-arg TORCHVISION_VERSION=$(TORCHVISION_VERSION) \
 --build-arg TORCHAUDIO_VERSION=$(TORCHAUDIO_VERSION) \
---build-arg INSTALL_CHANNEL=$(INSTALL_CHANNEL) \
---build-arg TRITON_VERSION=$(TRITON_VERSION) \
 --build-arg GMP_VERSION=$(GMP_VERSION) \
 --build-arg MPFR_VERSION=$(MPFR_VERSION) \
 --build-arg MPC_VERSION=$(MPC_VERSION) \
@@ -98,7 +94,7 @@ all: devel-image
 
 .PHONY: devel-image
 devel-image: BASE_IMAGE := $(BASE_DEVEL)
-devel-image: DOCKER_TAG := torch${PYTORCH_VERSION}-cuda${CUDA_VERSION}-flashatten${FLASH_ATTEN_VERSION}-${BASE_OS}
+devel-image: DOCKER_TAG := torch${PYTORCH_TAG}-cuda${CUDA_VERSION}-flashatten${FLASH_ATTEN_VERSION}-${BASE_OS}
 devel-image:
 $(DOCKER_BUILD)
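With these changes the image tag is assembled from PYTORCH_TAG rather than PYTORCH_VERSION, so the tag stays free of the +cu118 suffix. A usage sketch (values are the defaults set in this commit; the command-line overrides are an assumption based on the `?=` assignments above, not commands from the commit):

```bash
# Build the devel image for ubuntu20.04 with the defaults from this commit;
# DOCKER_TAG expands to torch${PYTORCH_TAG}-cuda${CUDA_VERSION}-flashatten${FLASH_ATTEN_VERSION}-${BASE_OS},
# i.e. torch2.1.0-cuda11.8.0-flashatten2.2.1-ubuntu20.04.
make -f docker.Makefile BASE_OS=ubuntu20.04

# Variables can also be overridden on the command line, e.g. a different flash-attention tag:
make -f docker.Makefile BASE_OS=centos7 FLASH_ATTEN_VERSION=2.2.1
```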

docker/Dockerfile-centos

Lines changed: 9 additions & 6 deletions
@@ -107,18 +107,18 @@ ENV CXX=${GCC_HOME}/bin/c++
 
 
 ##############################################################################
-# Install InternLM development environment, including flash-attention and apex
+# Install InternEvo development environment, including flash-attention and apex
 ##############################################################################
 FROM dep as intrenlm-dev
-COPY . /InternLM
-WORKDIR /InternLM
+COPY . /InternEvo
+WORKDIR /InternEvo
 ARG https_proxy
 ARG http_proxy
 ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
 RUN git submodule update --init --recursive \
 && /opt/conda/bin/pip --no-cache-dir install -r requirements/torch.txt \
 && /opt/conda/bin/pip --no-cache-dir install -r requirements/runtime.txt \
-&& cd /InternLM/third_party/flash-attention \
+&& cd /InternEvo/third_party/flash-attention \
 && /opt/conda/bin/python setup.py install \
 && cd ./csrc \
 && cd fused_dense_lib && /opt/conda/bin/pip install -v . \
@@ -127,6 +127,9 @@ RUN git submodule update --init --recursive \
 && cd ../layer_norm && /opt/conda/bin/pip install -v . \
 && cd ../../../../ \
 && cd ./third_party/apex \
-&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
+&& /opt/conda/bin/pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ \
+&& /opt/conda/bin/pip install pytorch-extension \
 && /opt/conda/bin/pip cache purge \
-&& rm -rf ~/.cache/pip
+&& rm -rf ~/.cache/pip \
+&& /opt/conda/bin/conda init \
+&& . ~/.bashrc

docker/Dockerfile-ubuntu

Lines changed: 9 additions & 6 deletions
@@ -88,18 +88,18 @@ ENV CXX=${GCC_HOME}/bin/c++
 
 
 ##############################################################################
-# Install InternLM development environment, including flash-attention and apex
+# Install InternEvo development environment, including flash-attention and apex
 ##############################################################################
 FROM dep as intrenlm-dev
-COPY . /InternLM
-WORKDIR /InternLM
+COPY . /InternEvo
+WORKDIR /InternEvo
 ARG https_proxy
 ARG http_proxy
 ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
 RUN git submodule update --init --recursive \
 && /opt/conda/bin/pip --no-cache-dir install -r requirements/torch.txt \
 && /opt/conda/bin/pip --no-cache-dir install -r requirements/runtime.txt \
-&& cd /InternLM/third_party/flash-attention \
+&& cd /InternEvo/third_party/flash-attention \
 && /opt/conda/bin/python setup.py install \
 && cd ./csrc \
 && cd fused_dense_lib && /opt/conda/bin/pip install -v . \
@@ -108,6 +108,9 @@ RUN git submodule update --init --recursive \
 && cd ../layer_norm && /opt/conda/bin/pip install -v . \
 && cd ../../../../ \
 && cd ./third_party/apex \
-&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
+&& /opt/conda/bin/pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ \
+&& /opt/conda/bin/pip install pytorch-extension \
 && /opt/conda/bin/pip cache purge \
-&& rm -rf ~/.cache/pip
+&& rm -rf ~/.cache/pip \
+&& /opt/conda/bin/conda init \
+&& . ~/.bashrc

experiment/Dockerfile-centos

Lines changed: 13 additions & 10 deletions
@@ -106,11 +106,11 @@ ENV CXX=${GCC_HOME}/bin/c++
 
 
 ##############################################################################
-# Install InternLM development environment, including flash-attention and apex
+# Install InternEvo development environment, including flash-attention and apex
 ##############################################################################
 FROM dep as intrenlm-dev
-COPY . /InternLM
-WORKDIR /InternLM
+COPY . /InternEvo
+WORKDIR /InternEvo
 ARG https_proxy
 ARG http_proxy
 ARG PYTORCH_VERSION
@@ -134,11 +134,11 @@ RUN /opt/conda/bin/pip --no-cache-dir install \
 torch-scatter \
 pyecharts \
 py-libnuma \
--f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}+cu117.html \
+-f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}.html \
 && /opt/conda/bin/pip --no-cache-dir install \
---extra-index-url https://download.pytorch.org/whl/cu117 \
-torch==${PYTORCH_VERSION}+cu117 \
-torchvision==${TORCHVISION_VERSION}+cu117 \
+--extra-index-url https://download.pytorch.org/whl/cu118 \
+torch==${PYTORCH_VERSION} \
+torchvision==${TORCHVISION_VERSION} \
 torchaudio==${TORCHAUDIO_VERSION}
 
 ARG https_proxy
@@ -147,7 +147,7 @@ ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
 ARG FLASH_ATTEN_TAG
 
 RUN git submodule update --init --recursive \
-&& cd /InternLM/third_party/flash-attention \
+&& cd /InternEvo/third_party/flash-attention \
 && git checkout ${FLASH_ATTEN_TAG} \
 && /opt/conda/bin/python setup.py install \
 && cd ./csrc \
@@ -157,6 +157,9 @@ RUN git submodule update --init --recursive \
 && cd ../layer_norm && /opt/conda/bin/pip install -v . \
 && cd ../../../../ \
 && cd ./third_party/apex \
-&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
+&& /opt/conda/bin/pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ \
+&& /opt/conda/bin/pip install pytorch-extension \
 && /opt/conda/bin/pip cache purge \
-&& rm -rf ~/.cache/pip
+&& rm -rf ~/.cache/pip \
+&& /opt/conda/bin/conda init \
+&& . ~/.bashrc

experiment/Dockerfile-ubuntu

Lines changed: 13 additions & 10 deletions
@@ -87,11 +87,11 @@ ENV CXX=${GCC_HOME}/bin/c++
 
 
 ##############################################################################
-# Install InternLM development environment, including flash-attention and apex
+# Install InternEvo development environment, including flash-attention and apex
 ##############################################################################
 FROM dep as intrenlm-dev
-COPY . /InternLM
-WORKDIR /InternLM
+COPY . /InternEvo
+WORKDIR /InternEvo
 ARG https_proxy
 ARG http_proxy
 ARG PYTORCH_VERSION
@@ -115,11 +115,11 @@ RUN /opt/conda/bin/pip --no-cache-dir install \
 torch-scatter \
 pyecharts \
 py-libnuma \
--f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}+cu117.html \
+-f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}.html \
 && /opt/conda/bin/pip --no-cache-dir install \
---extra-index-url https://download.pytorch.org/whl/cu117 \
-torch==${PYTORCH_VERSION}+cu117 \
-torchvision==${TORCHVISION_VERSION}+cu117 \
+--extra-index-url https://download.pytorch.org/whl/cu118 \
+torch==${PYTORCH_VERSION} \
+torchvision==${TORCHVISION_VERSION} \
 torchaudio==${TORCHAUDIO_VERSION}
 
 ARG https_proxy
@@ -128,7 +128,7 @@ ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
 ARG FLASH_ATTEN_TAG
 
 RUN git submodule update --init --recursive \
-&& cd /InternLM/third_party/flash-attention \
+&& cd /InternEvo/third_party/flash-attention \
 && git checkout ${FLASH_ATTEN_TAG} \
 && /opt/conda/bin/python setup.py install \
 && cd ./csrc \
@@ -138,6 +138,9 @@ RUN git submodule update --init --recursive \
 && cd ../layer_norm && /opt/conda/bin/pip install -v . \
 && cd ../../../../ \
 && cd ./third_party/apex \
-&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
+&& /opt/conda/bin/pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ \
+&& /opt/conda/bin/pip install pytorch-extension \
 && /opt/conda/bin/pip cache purge \
-&& rm -rf ~/.cache/pip
+&& rm -rf ~/.cache/pip \
+&& /opt/conda/bin/conda init \
+&& . ~/.bashrc
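With the docker.Makefile defaults introduced above (PYTORCH_VERSION=2.1.0+cu118, TORCHVISION_VERSION=0.16.0+cu118, TORCHAUDIO_VERSION=2.1.0+cu118), the rewritten pip step in the experiment Dockerfiles resolves roughly as follows. This is an illustrative expansion of the existing layer, not an additional command introduced by the commit:

```bash
# The +cu118 local version now lives in the version pins themselves, so the
# hard-coded +cu117 suffixes and the cu117 index URL are no longer needed.
/opt/conda/bin/pip --no-cache-dir install \
    torch-scatter \
    pyecharts \
    py-libnuma \
    -f https://data.pyg.org/whl/torch-2.1.0+cu118.html \
    && /opt/conda/bin/pip --no-cache-dir install \
    --extra-index-url https://download.pytorch.org/whl/cu118 \
    torch==2.1.0+cu118 \
    torchvision==0.16.0+cu118 \
    torchaudio==2.1.0+cu118
```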
