Commit 0f87f47

Fix(docker): update docker image and dockerfile for new version (#200)
1 parent aa3e9c4 commit 0f87f47

File tree

10 files changed: 93 additions, 82 deletions

README-zh-Hans.md

Lines changed: 3 additions & 3 deletions
@@ -17,9 +17,9 @@
 [![使用文档](https://readthedocs.org/projects/internevo/badge/?version=latest)](https://internevo.readthedocs.io/zh_CN/latest/?badge=latest)
 [![license](./doc/imgs/license.svg)](./LICENSE)
 
-[📘使用教程](./doc/en/usage.md) |
-[🛠️安装指引](./doc/en/install.md) |
-[📊框架性能](./doc/en/train_performance.md) |
+[📘使用教程](./doc/usage.md) |
+[🛠️安装指引](./doc/install.md) |
+[📊框架性能](./doc/train_performance.md) |
 [🤔问题报告](https://github.com/InternLM/InternEvo/issues/new)
 
 [English](./README.md) |

doc/en/install.md

Lines changed: 16 additions & 8 deletions
@@ -78,7 +78,10 @@ cd ../../../../
 Install Apex (version 23.05):
 ```bash
 cd ./third_party/apex
-pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
+# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
+pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
+# otherwise
+pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
 cd ../../
 ```
 
@@ -88,31 +91,36 @@ pip install git+https://github.com/databricks/megablocks@v0.3.2 # MOE need
 ```
 
 ### Environment Image
-Users can use the provided dockerfile combined with docker.Makefile to build their own images, or obtain images with InternEvo runtime environment installed from https://hub.docker.com/r/internlm/internlm.
+Users can use the provided dockerfile combined with docker.Makefile to build their own images, or obtain images with InternEvo runtime environment installed from https://hub.docker.com/r/internlm/internevo/tags.
 
 #### Image Configuration and Build
 The configuration and build of the Dockerfile are implemented through the docker.Makefile. To build the image, execute the following command in the root directory of InternEvo:
 ``` bash
 make -f docker.Makefile BASE_OS=centos7
 ```
-In docker.Makefile, you can customize the basic image, environment version, etc., and the corresponding parameters can be passed directly through the command line. For BASE_OS, ubuntu20.04 and centos7 are respectively supported.
+In docker.Makefile, you can customize the basic image, environment version, etc., and the corresponding parameters can be passed directly through the command line. The default is the recommended environment version. For BASE_OS, ubuntu20.04 and centos7 are respectively supported.
 
 #### Pull Standard Image
 The standard image based on ubuntu and centos has been built and can be directly pulled:
 
 ```bash
 # ubuntu20.04
-docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-ubuntu20.04
+docker pull internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-ubuntu20.04
 # centos7
-docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7
+docker pull internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7
 ```
 
 #### Run Container
 For the local standard image built with dockerfile or pulled, use the following command to run and enter the container:
 ```bash
-docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7 bash
+docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name internevo_centos internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7 bash
+```
+
+#### Start Training
+The default directory in the container is `/InternEvo`, please start training according to the [Usage](./usage.md). The default 7B model starts the single-machine with 8-GPU training command example as follows:
+```bash
+torchrun --nproc_per_node=8 --nnodes=1 train.py --config configs/7B_sft.py --launcher torch
 ```
-The default directory in the container is `/InternLM`, please start training according to the [Usage](./usage.md).
 
 ## Environment Installation (NPU)
 For machines with NPU, the version of the installation environment can refer to that of GPU. Use Ascend's torch_npu instead of torch on NPU machines. Additionally, Flash-Attention and Apex are no longer supported for installation on NPU. The corresponding functionalities have been internally implemented in the InternEvo codebase. The following tutorial is only for installing torch_npu.
@@ -135,4 +143,4 @@ pip3 install pyyaml
 pip3 install setuptools
 wget https://gitee.com/ascend/pytorch/releases/download/v6.0.rc1-pytorch2.1.0/torch_npu-2.1.0.post3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
 pip install torch_npu-2.1.0.post3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
-```
+```
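Note that the two Apex install commands added in the first hunk are alternatives, not a sequence: only one should run, chosen by the local pip version. The following is a minimal shell sketch of that selection, shown for illustration only (it is not part of the commit and assumes pip is on PATH and the repository layout from the docs above):

```bash
# Choose the Apex install invocation based on the installed pip version.
# pip >= 23.1 accepts repeated --config-settings with the same key; older
# releases still need the deprecated --global-option flags.
cd ./third_party/apex
pip_version=$(pip --version | awk '{print $2}')
major=${pip_version%%.*}
rest=${pip_version#*.}
minor=${rest%%.*}
if [ "$major" -gt 23 ] || { [ "$major" -eq 23 ] && [ "$minor" -ge 1 ]; }; then
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
        --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
else
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
        --global-option="--cpp_ext" --global-option="--cuda_ext" ./
fi
cd ../../
```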

doc/install.md

Lines changed: 14 additions & 7 deletions
@@ -78,7 +78,10 @@ cd ../../../../
 安装 Apex (version 23.05):
 ```bash
 cd ./third_party/apex
-pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
+# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
+pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
+# otherwise
+pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
 cd ../../
 ```
 
@@ -88,32 +91,36 @@ pip install git+https://github.com/databricks/megablocks@v0.3.2 # MOE相关
 ```
 
 ### 环境镜像
-用户可以使用提供的 dockerfile 结合 docker.Makefile 来构建自己的镜像,或者也可以从 https://hub.docker.com/r/internlm/internlm 获取安装了 InternEvo 运行环境的镜像。
+用户可以使用提供的 dockerfile 结合 docker.Makefile 来构建自己的镜像,或者也可以从 https://hub.docker.com/r/internlm/internevo/tags 获取安装了 InternEvo 运行环境的镜像。
 
 #### 镜像配置及构造
 dockerfile 的配置以及构造均通过 docker.Makefile 文件实现,在 InternEvo 根目录下执行如下命令即可 build 镜像:
 ``` bash
 make -f docker.Makefile BASE_OS=centos7
 ```
-在 docker.Makefile 中可自定义基础镜像,环境版本等内容,对应参数可直接通过命令行传递。对于 BASE_OS 分别支持 ubuntu20.04 和 centos7。
+在 docker.Makefile 中可自定义基础镜像,环境版本等内容,对应参数可直接通过命令行传递,默认为推荐的环境版本。对于 BASE_OS 分别支持 ubuntu20.04 和 centos7。
 
 #### 镜像拉取
 基于 ubuntu 和 centos 的标准镜像已经 build 完成也可直接拉取使用:
 
 ```bash
 # ubuntu20.04
-docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-ubuntu20.04
+docker pull internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-ubuntu20.04
 # centos7
-docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7
+docker pull internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7
 ```
 
 #### 容器启动
 对于使用 dockerfile 构建或拉取的本地标准镜像,使用如下命令启动并进入容器:
 ```bash
-docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7 bash
+docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name internevo_centos internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7 bash
 ```
-容器内默认目录即 `/InternLM`,根据[使用文档](./usage.md)即可启动训练。
 
+#### 训练启动
+容器内默认目录即 `/InternEvo`,参考[使用文档](./usage.md)可获取具体使用方法。默认7B模型启动单机8卡训练命令样例:
+```bash
+torchrun --nproc_per_node=8 --nnodes=1 train.py --config configs/7B_sft.py --launcher torch
+```
 
 ## 环境安装(NPU)
 在搭载NPU的机器上安装环境的版本可参考GPU,在NPU上使用昇腾torch_npu代替torch,同时Flash-Attention和Apex不再支持安装,相应功能已由InternEvo代码内部实现。以下教程仅为torch_npu安装。
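For reference, the pull, run, and training commands from the two install documents can also be combined into a single non-interactive launch. A hedged sketch, assuming the pulled centos7 image and the default configs/7B_sft.py are used unchanged:

```bash
# Pull the standard image, then start the default single-node 8-GPU 7B training
# run directly; the working directory inside the container is /InternEvo.
docker pull internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7
docker run --gpus all -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK \
    --shm-size 20g --network=host --name internevo_centos \
    internlm/internevo:torch2.1.0-cuda11.8.0-flashatten2.2.1-centos7 \
    bash -c "torchrun --nproc_per_node=8 --nnodes=1 train.py --config configs/7B_sft.py --launcher torch"
```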

docker.Makefile

Lines changed: 10 additions & 14 deletions
@@ -1,12 +1,11 @@
 DOCKER_REGISTRY ?= docker.io
-DOCKER_ORG ?= my
-DOCKER_IMAGE ?= internlm
+DOCKER_ORG ?= internlm
+DOCKER_IMAGE ?= internevo
 DOCKER_FULL_NAME = $(DOCKER_REGISTRY)/$(DOCKER_ORG)/$(DOCKER_IMAGE)
 
-CUDA_VERSION = 11.7.1
-GCC_VERSION = 10.2.0
-
+CUDA_VERSION = 11.8.0
 CUDNN_VERSION = 8
+
 BASE_RUNTIME =
 # ubuntu20.04 centos7
 BASE_OS = centos7
@@ -17,9 +16,10 @@ CUDA_CHANNEL = nvidia
 INSTALL_CHANNEL ?= pytorch
 
 PYTHON_VERSION ?= 3.10
-PYTORCH_VERSION ?= 1.13.1
-TORCHVISION_VERSION ?= 0.14.1
-TORCHAUDIO_VERSION ?= 0.13.1
+PYTORCH_TAG ?= 2.1.0
+PYTORCH_VERSION ?= 2.1.0+cu118
+TORCHVISION_VERSION ?= 0.16.0+cu118
+TORCHAUDIO_VERSION ?= 2.1.0+cu118
 BUILD_PROGRESS ?= auto
 TRITON_VERSION ?=
 GMP_VERSION ?= 6.2.1
@@ -28,18 +28,14 @@ MPC_VERSION ?= 1.2.1
 GCC_VERSION ?= 10.2.0
 HTTPS_PROXY_I ?=
 HTTP_PROXY_I ?=
-FLASH_ATTEN_VERSION ?= 1.0.5
+FLASH_ATTEN_VERSION ?= 2.2.1
 FLASH_ATTEN_TAG ?= v${FLASH_ATTEN_VERSION}
 
 BUILD_ARGS = --build-arg BASE_IMAGE=$(BASE_IMAGE) \
 --build-arg PYTHON_VERSION=$(PYTHON_VERSION) \
---build-arg CUDA_VERSION=$(CUDA_VERSION) \
---build-arg CUDA_CHANNEL=$(CUDA_CHANNEL) \
 --build-arg PYTORCH_VERSION=$(PYTORCH_VERSION) \
 --build-arg TORCHVISION_VERSION=$(TORCHVISION_VERSION) \
 --build-arg TORCHAUDIO_VERSION=$(TORCHAUDIO_VERSION) \
---build-arg INSTALL_CHANNEL=$(INSTALL_CHANNEL) \
---build-arg TRITON_VERSION=$(TRITON_VERSION) \
 --build-arg GMP_VERSION=$(GMP_VERSION) \
 --build-arg MPFR_VERSION=$(MPFR_VERSION) \
 --build-arg MPC_VERSION=$(MPC_VERSION) \
@@ -98,7 +94,7 @@ all: devel-image
 
 .PHONY: devel-image
 devel-image: BASE_IMAGE := $(BASE_DEVEL)
-devel-image: DOCKER_TAG := torch${PYTORCH_VERSION}-cuda${CUDA_VERSION}-flashatten${FLASH_ATTEN_VERSION}-${BASE_OS}
+devel-image: DOCKER_TAG := torch${PYTORCH_TAG}-cuda${CUDA_VERSION}-flashatten${FLASH_ATTEN_VERSION}-${BASE_OS}
 devel-image:
 $(DOCKER_BUILD)
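With these changes the image tag is assembled from PYTORCH_TAG rather than PYTORCH_VERSION, so the tag stays free of the +cu118 suffix. A usage sketch (values are the defaults set in this commit; the command-line overrides are an assumption based on the `?=` assignments above, not commands from the commit):

```bash
# Build the devel image for ubuntu20.04 with the defaults from this commit;
# DOCKER_TAG expands to torch${PYTORCH_TAG}-cuda${CUDA_VERSION}-flashatten${FLASH_ATTEN_VERSION}-${BASE_OS},
# i.e. torch2.1.0-cuda11.8.0-flashatten2.2.1-ubuntu20.04.
make -f docker.Makefile BASE_OS=ubuntu20.04

# Variables can also be overridden on the command line, e.g. a different flash-attention tag:
make -f docker.Makefile BASE_OS=centos7 FLASH_ATTEN_VERSION=2.2.1
```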

docker/Dockerfile-centos

Lines changed: 9 additions & 6 deletions
@@ -107,18 +107,18 @@ ENV CXX=${GCC_HOME}/bin/c++
 
 
 ##############################################################################
-# Install InternLM development environment, including flash-attention and apex
+# Install InternEvo development environment, including flash-attention and apex
 ##############################################################################
 FROM dep as intrenlm-dev
-COPY . /InternLM
-WORKDIR /InternLM
+COPY . /InternEvo
+WORKDIR /InternEvo
 ARG https_proxy
 ARG http_proxy
 ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
 RUN git submodule update --init --recursive \
 && /opt/conda/bin/pip --no-cache-dir install -r requirements/torch.txt \
 && /opt/conda/bin/pip --no-cache-dir install -r requirements/runtime.txt \
-&& cd /InternLM/third_party/flash-attention \
+&& cd /InternEvo/third_party/flash-attention \
 && /opt/conda/bin/python setup.py install \
 && cd ./csrc \
 && cd fused_dense_lib && /opt/conda/bin/pip install -v . \
@@ -127,6 +127,9 @@ RUN git submodule update --init --recursive \
 && cd ../layer_norm && /opt/conda/bin/pip install -v . \
 && cd ../../../../ \
 && cd ./third_party/apex \
-&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
+&& /opt/conda/bin/pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ \
+&& /opt/conda/bin/pip install pytorch-extension \
 && /opt/conda/bin/pip cache purge \
-&& rm -rf ~/.cache/pip
+&& rm -rf ~/.cache/pip \
+&& /opt/conda/bin/conda init \
+&& . ~/.bashrc

docker/Dockerfile-ubuntu

Lines changed: 9 additions & 6 deletions
@@ -88,18 +88,18 @@ ENV CXX=${GCC_HOME}/bin/c++
 
 
 ##############################################################################
-# Install InternLM development environment, including flash-attention and apex
+# Install InternEvo development environment, including flash-attention and apex
 ##############################################################################
 FROM dep as intrenlm-dev
-COPY . /InternLM
-WORKDIR /InternLM
+COPY . /InternEvo
+WORKDIR /InternEvo
 ARG https_proxy
 ARG http_proxy
 ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
 RUN git submodule update --init --recursive \
 && /opt/conda/bin/pip --no-cache-dir install -r requirements/torch.txt \
 && /opt/conda/bin/pip --no-cache-dir install -r requirements/runtime.txt \
-&& cd /InternLM/third_party/flash-attention \
+&& cd /InternEvo/third_party/flash-attention \
 && /opt/conda/bin/python setup.py install \
 && cd ./csrc \
 && cd fused_dense_lib && /opt/conda/bin/pip install -v . \
@@ -108,6 +108,9 @@ RUN git submodule update --init --recursive \
 && cd ../layer_norm && /opt/conda/bin/pip install -v . \
 && cd ../../../../ \
 && cd ./third_party/apex \
-&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
+&& /opt/conda/bin/pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ \
+&& /opt/conda/bin/pip install pytorch-extension \
 && /opt/conda/bin/pip cache purge \
-&& rm -rf ~/.cache/pip
+&& rm -rf ~/.cache/pip \
+&& /opt/conda/bin/conda init \
+&& . ~/.bashrc

experiment/Dockerfile-centos

Lines changed: 13 additions & 10 deletions
@@ -106,11 +106,11 @@ ENV CXX=${GCC_HOME}/bin/c++
 
 
 ##############################################################################
-# Install InternLM development environment, including flash-attention and apex
+# Install InternEvo development environment, including flash-attention and apex
 ##############################################################################
 FROM dep as intrenlm-dev
-COPY . /InternLM
-WORKDIR /InternLM
+COPY . /InternEvo
+WORKDIR /InternEvo
 ARG https_proxy
 ARG http_proxy
 ARG PYTORCH_VERSION
@@ -134,11 +134,11 @@ RUN /opt/conda/bin/pip --no-cache-dir install \
 torch-scatter \
 pyecharts \
 py-libnuma \
--f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}+cu117.html \
+-f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}.html \
 && /opt/conda/bin/pip --no-cache-dir install \
---extra-index-url https://download.pytorch.org/whl/cu117 \
-torch==${PYTORCH_VERSION}+cu117 \
-torchvision==${TORCHVISION_VERSION}+cu117 \
+--extra-index-url https://download.pytorch.org/whl/cu118 \
+torch==${PYTORCH_VERSION} \
+torchvision==${TORCHVISION_VERSION} \
 torchaudio==${TORCHAUDIO_VERSION}
 
 ARG https_proxy
@@ -147,7 +147,7 @@ ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
 ARG FLASH_ATTEN_TAG
 
 RUN git submodule update --init --recursive \
-&& cd /InternLM/third_party/flash-attention \
+&& cd /InternEvo/third_party/flash-attention \
 && git checkout ${FLASH_ATTEN_TAG} \
 && /opt/conda/bin/python setup.py install \
 && cd ./csrc \
@@ -157,6 +157,9 @@ RUN git submodule update --init --recursive \
 && cd ../layer_norm && /opt/conda/bin/pip install -v . \
 && cd ../../../../ \
 && cd ./third_party/apex \
-&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
+&& /opt/conda/bin/pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ \
+&& /opt/conda/bin/pip install pytorch-extension \
 && /opt/conda/bin/pip cache purge \
-&& rm -rf ~/.cache/pip
+&& rm -rf ~/.cache/pip \
+&& /opt/conda/bin/conda init \
+&& . ~/.bashrc

experiment/Dockerfile-ubuntu

Lines changed: 13 additions & 10 deletions
@@ -87,11 +87,11 @@ ENV CXX=${GCC_HOME}/bin/c++
 
 
 ##############################################################################
-# Install InternLM development environment, including flash-attention and apex
+# Install InternEvo development environment, including flash-attention and apex
 ##############################################################################
 FROM dep as intrenlm-dev
-COPY . /InternLM
-WORKDIR /InternLM
+COPY . /InternEvo
+WORKDIR /InternEvo
 ARG https_proxy
 ARG http_proxy
 ARG PYTORCH_VERSION
@@ -115,11 +115,11 @@ RUN /opt/conda/bin/pip --no-cache-dir install \
 torch-scatter \
 pyecharts \
 py-libnuma \
--f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}+cu117.html \
+-f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}.html \
 && /opt/conda/bin/pip --no-cache-dir install \
---extra-index-url https://download.pytorch.org/whl/cu117 \
-torch==${PYTORCH_VERSION}+cu117 \
-torchvision==${TORCHVISION_VERSION}+cu117 \
+--extra-index-url https://download.pytorch.org/whl/cu118 \
+torch==${PYTORCH_VERSION} \
+torchvision==${TORCHVISION_VERSION} \
 torchaudio==${TORCHAUDIO_VERSION}
 
 ARG https_proxy
@@ -128,7 +128,7 @@ ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
 ARG FLASH_ATTEN_TAG
 
 RUN git submodule update --init --recursive \
-&& cd /InternLM/third_party/flash-attention \
+&& cd /InternEvo/third_party/flash-attention \
 && git checkout ${FLASH_ATTEN_TAG} \
 && /opt/conda/bin/python setup.py install \
 && cd ./csrc \
@@ -138,6 +138,9 @@ RUN git submodule update --init --recursive \
 && cd ../layer_norm && /opt/conda/bin/pip install -v . \
 && cd ../../../../ \
 && cd ./third_party/apex \
-&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
+&& /opt/conda/bin/pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ \
+&& /opt/conda/bin/pip install pytorch-extension \
 && /opt/conda/bin/pip cache purge \
-&& rm -rf ~/.cache/pip
+&& rm -rf ~/.cache/pip \
+&& /opt/conda/bin/conda init \
+&& . ~/.bashrc
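With the docker.Makefile defaults introduced above (PYTORCH_VERSION=2.1.0+cu118, TORCHVISION_VERSION=0.16.0+cu118, TORCHAUDIO_VERSION=2.1.0+cu118), the rewritten pip step in the experiment Dockerfiles resolves roughly as follows. This is an illustrative expansion of the existing layer, not an additional command introduced by the commit:

```bash
# The +cu118 local version now lives in the version pins themselves, so the
# hard-coded +cu117 suffixes and the cu117 index URL are no longer needed.
/opt/conda/bin/pip --no-cache-dir install \
    torch-scatter \
    pyecharts \
    py-libnuma \
    -f https://data.pyg.org/whl/torch-2.1.0+cu118.html \
    && /opt/conda/bin/pip --no-cache-dir install \
    --extra-index-url https://download.pytorch.org/whl/cu118 \
    torch==2.1.0+cu118 \
    torchvision==0.16.0+cu118 \
    torchaudio==2.1.0+cu118
```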
