fix(docker) rocm 6.3 based image #8152

Open · wants to merge 18 commits into base: main

Changes from 4 commits (18 commits total)
53 changes: 39 additions & 14 deletions docker/Dockerfile
@@ -31,7 +31,8 @@ RUN --mount=type=cache,target=/var/cache/apt \
libglx-mesa0 \
build-essential \
libopencv-dev \
libstdc++-10-dev
libstdc++-10-dev \
wget

ENV \
PYTHONUNBUFFERED=1 \
@@ -44,7 +45,6 @@ ENV \
UV_MANAGED_PYTHON=1 \
UV_LINK_MODE=copy \
UV_PROJECT_ENVIRONMENT=/opt/venv \
UV_INDEX="https://download.pytorch.org/whl/cu124" \
INVOKEAI_ROOT=/invokeai \
INVOKEAI_HOST=0.0.0.0 \
INVOKEAI_PORT=9090 \
@@ -54,6 +54,10 @@ ENV \

ARG GPU_DRIVER=cuda

ARG CUDA_TORCH="https://download.pytorch.org/whl/cu124"
ARG CPU_TORCH="https://download.pytorch.org/whl/cpu"
ARG ROCM_TORCH="https://download.pytorch.org/whl/rocm6.2.4"

# Install `uv` for package management
COPY --from=ghcr.io/astral-sh/uv:0.6.9 /uv /uvx /bin/

@@ -72,23 +76,41 @@ WORKDIR ${INVOKEAI_SRC}
# x86_64/CUDA is the default
RUN --mount=type=cache,target=/root/.cache/uv \
--mount=type=bind,source=pyproject.toml,target=pyproject.toml \
--mount=type=bind,source=uv.lock,target=uv.lock \
# Cannot use uv sync and uv.lock as that is locked to CUDA version packages, which breaks rocm...
# --mount=type=bind,source=uv.lock,target=uv.lock \
# this is just to get the package manager to recognize that the project exists, without making changes to the docker layer
--mount=type=bind,source=invokeai/version,target=invokeai/version \
if [ "$TARGETPLATFORM" = "linux/arm64" ] || [ "$GPU_DRIVER" = "cpu" ]; then UV_INDEX="https://download.pytorch.org/whl/cpu"; \
elif [ "$GPU_DRIVER" = "rocm" ]; then UV_INDEX="https://download.pytorch.org/whl/rocm6.2"; \
ulimit -n 30000 && \
if [ "$TARGETPLATFORM" = "linux/arm64" ] || [ "$GPU_DRIVER" = "cpu" ]; then export UV_INDEX="$CPU_TORCH"; \
elif [ "$GPU_DRIVER" = "rocm" ]; then export UV_INDEX="$ROCM_TORCH"; \
else export UV_INDEX="$CUDA_TORCH"; \
fi && \
uv sync --frozen
uv venv --python 3.12 && \
# Use the public version to install existing known dependencies but using the UV_INDEX, not the hardcoded URLs within the uv.lock
uv pip install invokeai
Contributor Author

I could make this conditional, using the uv.lock for CUDA and the UV_INDEX override for CPU and ROCm, to reduce the risk of this change, but I went with this approach for consistency.
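
A rough sketch of that conditional approach, for comparison (hypothetical; it reuses the ARGs and mounts already defined in this Dockerfile):

RUN --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,source=pyproject.toml,target=pyproject.toml \
    --mount=type=bind,source=uv.lock,target=uv.lock \
    --mount=type=bind,source=invokeai/version,target=invokeai/version \
    ulimit -n 30000 && \
    # CUDA keeps the locked resolution, consistent with the official installer
    if [ "$GPU_DRIVER" = "cuda" ] && [ "$TARGETPLATFORM" != "linux/arm64" ]; then \
        uv sync --frozen; \
    else \
        # CPU and ROCm resolve against the matching PyTorch index instead
        if [ "$TARGETPLATFORM" = "linux/arm64" ] || [ "$GPU_DRIVER" = "cpu" ]; then export UV_INDEX="$CPU_TORCH"; \
        else export UV_INDEX="$ROCM_TORCH"; fi && \
        uv venv --python 3.12 && \
        uv pip install invokeai; \
    fi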

Member

It would be preferable to continue using uv.lock for the CUDA images, if possible, to keep it consistent with the installations produced by the official installer.

Ideally (if you're willing to work on this) we should find a way to support both CUDA and ROCm dependencies in a single uv.lock/pyproject.toml, perhaps by leveraging uv dependency groups: https://docs.astral.sh/uv/concepts/projects/config/#conflicting-dependencies
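
For reference, a minimal sketch of what that could look like in pyproject.toml, using uv's conflicting-extras mechanism (the extras and index names here are illustrative, not part of this PR):

[project.optional-dependencies]
cu124 = ["torch"]
rocm = ["torch"]

[tool.uv]
# declare the two accelerator extras as mutually exclusive
conflicts = [[{ extra = "cu124" }, { extra = "rocm" }]]

[tool.uv.sources]
torch = [
  { index = "pytorch-cu124", extra = "cu124" },
  { index = "pytorch-rocm", extra = "rocm" },
]

[[tool.uv.index]]
name = "pytorch-cu124"
url = "https://download.pytorch.org/whl/cu124"
explicit = true

[[tool.uv.index]]
name = "pytorch-rocm"
url = "https://download.pytorch.org/whl/rocm6.2.4"
explicit = true

With something along those lines, both images could in principle keep uv sync --frozen, selecting --extra cu124 or --extra rocm rather than overriding UV_INDEX.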

Contributor Author

Updated the uv.lock; there are some notes about things in the pyproject.toml that I would like your input on.


RUN --mount=type=cache,target=/var/cache/apt \
--mount=type=cache,target=/var/lib/apt \
if [ "$GPU_DRIVER" = "rocm" ]; then \
wget -O /tmp/amdgpu-install.deb \
https://repo.radeon.com/amdgpu-install/6.2.4/ubuntu/noble/amdgpu-install_6.2.60204-1_all.deb && \
apt install -y /tmp/amdgpu-install.deb && \
apt update && \
amdgpu-install --usecase=rocm -y && \
apt-get autoclean && \
apt clean && \
rm -rf /tmp/* /var/tmp/* && \
Member

This is likely unnecessary. The GPU driver should be provided by the kernel, and ROCm itself is usually not needed in the image because it's already bundled with PyTorch. That is, unless something changed in the most recent torch/ROCm that makes this a requirement.

(To be clear, the video/render group additions for the ubuntu user are needed and should be kept.)

Contributor Author

Skipped the ROCm install but kept the groups, and got:

invokeai-rocm-1 | RuntimeError: No HIP GPUs are available

But there are 4 AMD GPUs on my system, so it's failing.

I looked at the rocm-pytorch Docker image, and they install the full rocmdev; I limited it to just the rocm binaries (I also tried hip alone, but that still errored).

Suggestions?

Member

To be sure: are you using the amd-container-toolkit and the AMD runtime for Docker?

Contributor Author

No, and that's my goal: I don't want to have to modify the host, and I want the container to have everything it needs. I'm running a Proxmox host with a Docker LXC.
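
For context, the usual alternative without the amd-container-toolkit is plain device pass-through plus the video/render groups added in the image, roughly like the following (image name and mount path are placeholders, untested under Proxmox/LXC):

# pass the ROCm devices straight through to the container
docker run --rm -it \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --group-add render \
  --security-opt seccomp=unconfined \
  -v ./invokeai:/invokeai \
  -p 9090:9090 \
  <invokeai-rocm-image>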

Contributor Author

If this isn't ideal, I can split that logic out into my own image and have this one build the minimal way, or make it another config (rocm-standalone?).

usermod -a -G render ubuntu && \
usermod -a -G video ubuntu && \
echo "\\n/opt/rocm/lib\\n/opt/rocm/lib64" >> /etc/ld.so.conf.d/rocm.conf && \
ldconfig && \
update-alternatives --auto rocm; \
fi

# build patchmatch
RUN cd /usr/lib/$(uname -p)-linux-gnu/pkgconfig/ && ln -sf opencv4.pc opencv.pc
RUN python -c "from patchmatch import patch_match"

# Link amdgpu.ids for ROCm builds
# contributed by https://github.com/Rubonnek
RUN mkdir -p "/opt/amdgpu/share/libdrm" &&\
ln -s "/usr/share/libdrm/amdgpu.ids" "/opt/amdgpu/share/libdrm/amdgpu.ids"

RUN mkdir -p ${INVOKEAI_ROOT} && chown -R ${CONTAINER_UID}:${CONTAINER_GID} ${INVOKEAI_ROOT}

COPY docker/docker-entrypoint.sh ./
@@ -105,9 +127,12 @@ COPY invokeai ${INVOKEAI_SRC}/invokeai
# in a previous layer
RUN --mount=type=cache,target=/root/.cache/uv \
--mount=type=bind,source=pyproject.toml,target=pyproject.toml \
--mount=type=bind,source=uv.lock,target=uv.lock \
if [ "$TARGETPLATFORM" = "linux/arm64" ] || [ "$GPU_DRIVER" = "cpu" ]; then UV_INDEX="https://download.pytorch.org/whl/cpu"; \
elif [ "$GPU_DRIVER" = "rocm" ]; then UV_INDEX="https://download.pytorch.org/whl/rocm6.2"; \
# Cannot use the uv.lock as that is locked to CUDA version packages, which breaks rocm...
# --mount=type=bind,source=uv.lock,target=uv.lock \
ulimit -n 30000 && \
Member

This ulimit doesn't affect much; I'm wondering what the reason for it is here, and why the value of 30000?

Contributor Author

CUDA and CPU don't hit the limit, but with ROCm the install fails because too many files are opened. I can try to lower the limit if it concerns you; I just picked something high so I could continue, and never went back.

Member

It doesn't matter much, since this only applies during the build; it's just really weird that it's needed at all.

if [ "$TARGETPLATFORM" = "linux/arm64" ] || [ "$GPU_DRIVER" = "cpu" ]; then export UV_INDEX="$CPU_TORCH"; \
elif [ "$GPU_DRIVER" = "rocm" ]; then export UV_INDEX="$ROCM_TORCH"; \
else export UV_INDEX="$CUDA_TORCH"; \
fi && \
uv pip install -e .

4 changes: 2 additions & 2 deletions docker/run.sh
@@ -13,7 +13,7 @@ run() {

# parse .env file for build args
build_args=$(awk '$1 ~ /=[^$]/ && $0 !~ /^#/ {print "--build-arg " $0 " "}' .env) &&
profile="$(awk -F '=' '/GPU_DRIVER/ {print $2}' .env)"
profile="$(awk -F '=' '/GPU_DRIVER=/ {print $2}' .env)"

# default to 'cuda' profile
[[ -z "$profile" ]] && profile="cuda"
@@ -30,7 +30,7 @@ run() {

printf "%s\n" "starting service $service_name"
docker compose --profile "$profile" up -d "$service_name"
docker compose logs -f
docker compose --profile "$profile" logs -f
}

run
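
For context, run.sh turns every non-comment VAR=value line in .env into a --build-arg and uses GPU_DRIVER to pick the compose profile, so a hypothetical .env for this PR's ROCm path would be just:

# hypothetical docker/.env (filename assumed from the awk lines above)
GPU_DRIVER=rocm

With that in place, ./run.sh builds with --build-arg GPU_DRIVER=rocm and starts (and, with this change, also tails) the service under the rocm profile.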