[Add] Blog post on transformers backend integration with vLLM #50
Merged
---
layout: post
title: "Transformers backend integration in vLLM"
author: "The Hugging Face Team"
image: /assets/figures/transformers-backend/transformers-backend.png
thumbnail-img: /assets/figures/transformers-backend/transformers-backend.png
share-img: /assets/figures/transformers-backend/transformers-backend.png
---

The [Hugging Face Transformers library](https://huggingface.co/docs/transformers/main/en/index)
offers a flexible, unified interface to a vast ecosystem of model architectures. From research to
fine-tuning on custom datasets, transformers is the go-to toolkit.

But when it comes to *deploying* these models at scale, inference speed and efficiency often take
center stage. Enter [vLLM](https://docs.vllm.ai/en/latest/), a library engineered for high-throughput
inference, pulling models from the Hugging Face Hub and optimizing them for production-ready performance.

A recent addition to the vLLM codebase makes it possible to use transformers as a backend for running
models: vLLM optimizes throughput and latency on top of existing transformers model implementations.
In this post, we’ll explore how vLLM leverages the transformers backend to combine **flexibility**
with **efficiency**, enabling you to deploy state-of-the-art models faster and smarter.

## Transformers and vLLM: Inference in Action

Let’s start with a simple text generation task using the `meta-llama/Llama-3.2-1B` model to see how
these libraries stack up.

**Infer with transformers**

The transformers library shines in its simplicity and versatility. Using its `pipeline` API, inference is a breeze:

```py
from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
result = pipe("The future of AI is")

print(result[0]["generated_text"])
```

This approach is perfect for prototyping or small-scale tasks, but it’s not optimized for high-volume
inference or low-latency deployment.

**Infer with vLLM**

vLLM takes a different tack, prioritizing efficiency with features like `PagedAttention`
(a memory-efficient attention mechanism) and continuous batching. Here’s the same task in vLLM:

```py
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B")
params = SamplingParams(max_tokens=20)
outputs = llm.generate("The future of AI is", sampling_params=params)
print(f"Generated text: {outputs[0].outputs[0].text}")
```

vLLM’s inference is noticeably faster and more resource-efficient, especially under load: it can batch
and serve many concurrent requests while keeping GPU memory usage lower thanks to its paged KV cache.

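To get a feel for how this behaves under load, you can hand `generate` a whole batch of prompts in a
single call and let vLLM schedule them together. A minimal sketch, reusing the model above (the prompts
and token budget are arbitrary):

```py
from vllm import LLM, SamplingParams

# vLLM batches these prompts internally and schedules them onto the GPU together.
llm = LLM(model="meta-llama/Llama-3.2-1B")
params = SamplingParams(max_tokens=20)

prompts = [
    "The future of AI is",
    "The capital of France is",
    "Large language models are",
]

outputs = llm.generate(prompts, sampling_params=params)
for output in outputs:
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")
```
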
## vLLM’s Deployment Superpower: OpenAI Compatibility

Beyond raw performance, vLLM offers an OpenAI-compatible API, making it a drop-in replacement for
external services. Launch a server:

```bash
vllm serve meta-llama/Llama-3.2-1B
```

Then query it with curl:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.2-1B", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0}'
```

Or use Python’s OpenAI client:

```py
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
completion = client.completions.create(
    model="meta-llama/Llama-3.2-1B",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0
)
print("Completion result:", completion.choices[0].text)
```

This compatibility slashes costs and boosts control, letting you scale inference locally with vLLM’s optimizations.

## Why do we need the transformers backend?

The transformers library is optimized for contributions and
[addition of new models](https://huggingface.co/docs/transformers/en/add_new_model). Adding a new
model to vLLM, on the other hand, is a little
[more involved](https://docs.vllm.ai/en/latest/contributing/model/index.html).

In an **ideal world**, we would be able to use a new model in vLLM as soon as it is added to
transformers. With the integration of the transformers backend, we take a step towards that ideal world.

Here is the [official documentation](https://docs.vllm.ai/en/latest/models/supported_models.html#remote-code)
on how to make your transformers model compatible with vLLM so that the integration kicks in.
We followed it and made `modeling_gpt2.py` compatible with the integration! You can follow the
changes in this [transformers pull request](https://github.com/huggingface/transformers/pull/36934).

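Since GPT-2 is now compatible, it makes a handy end-to-end sanity check for the backend. A minimal
sketch (the `openai-community/gpt2` checkpoint and the prompt are just examples; the `model_impl`
flag is explained below):

```py
from vllm import LLM, SamplingParams

# Explicitly ask vLLM to run the model through the transformers implementation,
# even though vLLM also ships a native GPT-2 class.
llm = LLM(model="openai-community/gpt2", model_impl="transformers")
params = SamplingParams(max_tokens=20)

outputs = llm.generate("The transformers backend lets vLLM", sampling_params=params)
print(outputs[0].outputs[0].text)
```
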
For a model already in transformers (and compatible with vLLM), this is all we would need to do:

```py
llm = LLM(model="new-transformers-model", model_impl="transformers")
```

> [!NOTE]
> It is not strictly necessary to pass the `model_impl` parameter. vLLM switches to the transformers
> implementation on its own if the model is not natively supported in vLLM.

Or for a custom model from the Hugging Face Hub:

```py
llm = LLM(model="custom-hub-model", model_impl="transformers", trust_remote_code=True)
```

This backend acts as a **bridge**, marrying transformers’ plug-and-play flexibility with vLLM’s
inference prowess. You get the best of both worlds: rapid prototyping with transformers
and optimized deployment with vLLM.

## Case Study: Helium

[Kyutai Team’s Helium](https://huggingface.co/docs/transformers/en/model_doc/helium) is not yet natively supported by vLLM. You might want to run optimized inference on the model with vLLM, and this is where the transformers backend shines.

Let’s see this in action:

```bash
vllm serve kyutai/helium-1-preview-2b --model-impl transformers
```

Query it with the OpenAI API:

```py
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

completion = client.completions.create(model="kyutai/helium-1-preview-2b", prompt="What is AI?")
print("Completion result:", completion)
```

Here, vLLM efficiently processes inputs, leveraging the transformers backend to load
`kyutai/helium-1-preview-2b` seamlessly. Compared to running this natively in transformers,
vLLM delivers lower latency and better resource utilization.

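The same flag also works for offline inference, without standing up a server. A minimal sketch that
mirrors the earlier `LLM` examples (the prompt and token budget are arbitrary):

```py
from vllm import LLM, SamplingParams

# Load Helium through the transformers backend, driven directly from Python.
llm = LLM(model="kyutai/helium-1-preview-2b", model_impl="transformers")
params = SamplingParams(max_tokens=32)

outputs = llm.generate("What is AI?", sampling_params=params)
print(outputs[0].outputs[0].text)
```
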
By pairing Transformers’ model ecosystem with vLLM’s inference optimizations, you unlock a workflow
that’s both flexible and scalable. Whether you’re prototyping a new model, deploying a custom
creation, or scaling a multimodal app, this combination accelerates your path from research to production.