
feat(vllm-tensorizer): Upgrade vLLM version and Resolve Related Build Compatibility Issues #98


Merged: 42 commits merged into main from jp/testing/update-vllm on Jun 18, 2025

Conversation

@JustinPerlman (Contributor) commented Jun 17, 2025

This PR updates the vLLM version used in the vllm-tensorizer Docker image and resolves the various bugs and compatibility issues that arose from the upgrade.

Key Changes:

  • Updated Base Image: Switched to ghcr.io/coreweave/ml-containers/torch-extras:es-compute-12.0-67208ca-nccl-cuda12.9.0-ubuntu22.04-nccl2.27.3-1-torch2.7.1-vision0.22.1-audio2.7.1-abi1. This moves from torch:base to torch:nccl for the CUDA development tools, and aligns with PyTorch 2.7.1 and CUDA 12.9.0.
  • Upstream vLLM: Updated the vLLM source to pull from the official vllm-project/vllm.git repository (v0.9.1) rather than CoreWeave's vLLM fork.
  • Build Compatibility: Resolved various compilation errors (see the Dockerfile sketch after this list), including:
    • FileNotFoundError: .../nvcc and CUDA::nvToolsExt not found, fixed by using the correct torch:nccl base image.
    • PyTorch version-mismatch errors, fixed by integrating use_existing_torch.py to satisfy vLLM's pinned PyTorch expectations.
    • Out-of-memory failures during Flash Attention compilation, fixed by setting MAX_JOBS=16.
  • Dockerfile Hygiene: Improved Dockerfile clarity by consistently uppercasing AS keywords.
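
A hedged sketch of how these changes might fit together in the Dockerfile; the stage name, layout, and install steps are illustrative and may not match the PR's actual Dockerfile:

```dockerfile
FROM ghcr.io/coreweave/ml-containers/torch-extras:es-compute-12.0-67208ca-nccl-cuda12.9.0-ubuntu22.04-nccl2.27.3-1-torch2.7.1-vision0.22.1-audio2.7.1-abi1 AS builder

# Cap parallel compile jobs so the Flash Attention build does not exhaust memory.
ENV MAX_JOBS=16

ARG VLLM_COMMIT=v0.9.1
WORKDIR /workspace
RUN git clone https://github.com/vllm-project/vllm.git
WORKDIR /workspace/vllm
RUN git checkout "${VLLM_COMMIT}"

# use_existing_torch.py (shipped in the vLLM repo) rewrites vLLM's
# requirements so the build accepts the PyTorch already present in the
# base image (2.7.1 here) instead of the version vLLM pins.
RUN python use_existing_torch.py

# Assumes build tooling (cmake, ninja, CUDA toolkit) is provided by the
# torch:nccl base image; --no-build-isolation keeps the existing torch visible.
RUN pip install . --no-build-isolation
```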

This commit updates the default VLLM_COMMIT_HASH used in the GitHub Actions workflow for the vllm-tensorizer image. This change points the build to a more recent commit of the vLLM project.
…ion with PyTorch 2.7.1; uppercase all `as` keywords
@JustinPerlman requested a review from Eta0 June 17, 2025 17:11
@JustinPerlman self-assigned this Jun 17, 2025
@Eta0 added the enhancement label Jun 17, 2025
@Eta0 linked an issue Jun 17, 2025 that may be closed by this pull request
@arsenetar (Collaborator) left a comment

How does the result of this build compare to the official vllm image? Do we have all the same libraries, etc.? https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile

The CI here needs updates to build new versions without manually specifying the versions of the base images to use.

I would also like to rename this to vllm-openai to align with the vllm project's upstream image.

@Eta0 (Collaborator) commented Jun 17, 2025

@JustinPerlman I think the way to go here, to better gather the different workflow variables (vLLM version, base image version, etc.) in one place, is to move to a matrix build strategy, even if the matrix only has a single element. We can work on this together, as there is a template I put together for doing this.
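
As a rough illustration of the idea (job layout and variable names here are hypothetical, not the actual template), a single-element matrix could gather all of the versions in one place:

```yaml
jobs:
  build:
    strategy:
      matrix:
        include:
          - vllm-version: v0.9.1
            base-image: ghcr.io/coreweave/ml-containers/torch-extras:es-compute-12.0-67208ca-nccl-cuda12.9.0-ubuntu22.04-nccl2.27.3-1-torch2.7.1-vision0.22.1-audio2.7.1-abi1
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build vllm-tensorizer
        # Adding a second matrix entry later builds another version with no
        # other workflow changes.
        run: >-
          docker build
          --build-arg BASE_IMAGE=${{ matrix.base-image }}
          --build-arg VLLM_COMMIT=${{ matrix.vllm-version }}
          --tag vllm-tensorizer:${{ matrix.vllm-version }}
          .
```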

@Eta0 (Collaborator) commented Jun 17, 2025

@arsenetar, about this:

The CI here needs updates to build new version without manually specifying the version of base images to use.

What would you see as not manually specifying it? Automatically pulling the most recent image from main? Currently the only build in this repository that does that is torch-extras, and it has some issues in practice: we often have changes in downstream images that require corresponding changes in the torch base image, which in turn requires a new branch, so hardcoding the build to look for images from main doesn't work very well for development.

@arsenetar (Collaborator)

What would you see as not manually specifying it? Automatically pulling the most recent image from main? Currently the only build in this repository that does that is torch-extras, and it has some issues in practice: we often have changes in downstream images that require corresponding changes in the torch base image, which in turn requires a new branch, so hardcoding the build to look for images from main doesn't work very well for development.

I would expect the CI to set the ARG for the base image; there are likely cases where we will want to build a matrix of versions, similar to the other torch images. Minimally, that is something that should be changed so the CI can define or pass it.
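
For concreteness, a minimal sketch of that parameterization, assuming a hypothetical BASE_IMAGE build argument:

```dockerfile
# The CI would supply the tag, e.g.
#   docker build --build-arg BASE_IMAGE=<pinned torch-extras tag> .
# rather than the tag being hardcoded in the Dockerfile.
ARG BASE_IMAGE
FROM ${BASE_IMAGE} AS builder
```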

With respect to upstream images breaking things and requiring updates, that is somewhat expected to happen. There is a difference between development builds and builds that should be considered validated releases, and this repo's package structure does not really help delineate that difference.

JustinPerlman and others added 2 commits June 17, 2025 14:53
Co-authored-by: Eta <esyra@coreweave.com>

JustinPerlman and others added 2 commits June 17, 2025 15:18
This commit additionally renames VLLM_COMMIT_HASH to VLLM_COMMIT,
makes git clones more efficient, and reformats some YAML lists.
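
The clone change itself isn't shown in this thread; a common way to make such clones cheaper, sketched here as a guess rather than the PR's actual implementation, is a shallow fetch of only the pinned commit:

```dockerfile
# Hypothetical shallow fetch: download only the pinned commit instead of the
# full vLLM history. GitHub allows fetching an arbitrary commit by SHA.
ARG VLLM_COMMIT=v0.9.1
RUN git init vllm && \
    cd vllm && \
    git remote add origin https://github.com/vllm-project/vllm.git && \
    git fetch --depth 1 origin "${VLLM_COMMIT}" && \
    git checkout FETCH_HEAD
```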
@Eta0 (Collaborator) commented Jun 17, 2025

How does the result of this build compare to the official vllm image? Do we have all the same libraries, etc.?

As far as I can tell, the only libraries we don't install that they do are ffmpeg libsm6 libxext6 libgl1, which is a bit of a strange set. I'm not sure we really need those, but if it comes up, we can add them. We also don't install uv to replace pip.
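
If those libraries do turn out to be needed (they are commonly pulled in for OpenCV-style image and video handling in multimodal workloads), adding them would be a small change, roughly:

```dockerfile
# Hypothetical addition, only if parity with the upstream image's library set
# turns out to matter in practice:
RUN apt-get update && \
    apt-get install -y --no-install-recommends ffmpeg libsm6 libxext6 libgl1 && \
    rm -rf /var/lib/apt/lists/*
```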

@arsenetar (Collaborator)

As far as I can tell, the only libraries we don't install that they do are ffmpeg libsm6 libxext6 libgl1, which is a bit of a strange set. I'm not sure we really need those, but if it comes up, we can add them. We also don't install uv to replace pip.

Out of all of that, I could only see ffmpeg being of value, if it's used by libraries for the multimodal support.

@Eta0 (Collaborator) commented Jun 17, 2025

I would also like to rename this to vllm-openai to align with the vllm project's upstream image.

This would create a new package listing alongside vllm-tensorizer in the repo package list, so I'd prefer not to rename it unless there's value added by doing so.


@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/15718383514
Image: ``

@JustinPerlman requested a review from sangstar June 18, 2025 18:32
@sangstar (Contributor) left a comment

Tested; LGTM.

@Eta0 merged commit c87fc8f into main Jun 18, 2025
2 checks passed
@Eta0 deleted the jp/testing/update-vllm branch June 18, 2025 18:43
Successfully merging this pull request may close this issue:

Hi, Would consider update an vllm image update to latest?