Commit 0d914c8

[Docs] Rewrite offline inference guide (#20594)
Signed-off-by: Ricardo Decal <rdecal@anyscale.com>
1 parent 6e428cd commit 0d914c8

1 file changed: docs/serving/offline_inference.md (19 additions, 8 deletions)

@@ -3,27 +3,38 @@ title: Offline Inference
 ---
 [](){ #offline-inference }

-You can run vLLM in your own code on a list of prompts.
-
-The offline API is based on the [LLM][vllm.LLM] class.
-To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.
+Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.

 For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
 and runs it in vLLM using the default configuration.

 ```python
 from vllm import LLM

+# Initialize the vLLM engine.
 llm = LLM(model="facebook/opt-125m")
 ```

-After initializing the `LLM` instance, you can perform model inference using various APIs.
-The available APIs depend on the type of model that is being run:
+After initializing the `LLM` instance, use the available APIs to perform model inference.
+The available APIs depend on the model type:

 - [Generative models][generative-models] output logprobs which are sampled from to obtain the final output text.
 - [Pooling models][pooling-models] output their hidden states directly.

-Please refer to the above pages for more details about each API.
-
 !!! info
     [API Reference][offline-inference-api]
+
+### Ray Data LLM API
+
+Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine.
+This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:
+
+- Streaming execution processes datasets that exceed aggregate cluster memory.
+- Automatic sharding, load balancing, and autoscaling distribute work across a Ray cluster with built-in fault tolerance.
+- Continuous batching keeps vLLM replicas saturated and maximizes GPU utilization.
+- Transparent support for tensor and pipeline parallelism enables efficient multi-GPU inference.
+
+The following example shows how to run batched inference with Ray Data and vLLM:
+<gh-file:examples/offline_inference/batch_llm_inference.py>
+
+For more information about the Ray Data LLM API, see the [Ray Data LLM documentation](https://docs.ray.io/en/latest/data/working-with-llms.html).
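
As a quick illustration of the generative path described in the guide, the following sketch runs a complete offline `generate` call on the same `facebook/opt-125m` model. The prompts and sampling settings are placeholder values chosen for illustration, not taken from the commit.

```python
from vllm import LLM, SamplingParams

# Initialize the vLLM engine with the model used in the guide.
llm = LLM(model="facebook/opt-125m")

# Placeholder prompts and sampling settings (illustrative only).
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# Run offline batched inference over all prompts at once.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each RequestOutput holds the prompt and its sampled completion(s).
    print(output.prompt, "->", output.outputs[0].text)
```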
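
The commit links to `<gh-file:examples/offline_inference/batch_llm_inference.py>` rather than inlining it. Below is a minimal sketch of a Ray Data + vLLM batch job, assuming the `ray.data.llm` processor API (`vLLMEngineProcessorConfig`, `build_llm_processor`) described in the linked Ray documentation; the model name, dataset, and engine settings are illustrative assumptions and may differ from the referenced example file.

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Configure the vLLM replicas that Ray Data will manage.
# Model and engine settings are illustrative assumptions.
config = vLLMEngineProcessorConfig(
    model_source="Qwen/Qwen2.5-0.5B-Instruct",
    engine_kwargs={"max_model_len": 2048},
    concurrency=1,   # number of vLLM replicas to run
    batch_size=64,   # rows per batch sent to each replica
)

# Build a processor that turns dataset rows into requests and back.
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.3, max_tokens=64),
    ),
    postprocess=lambda row: dict(prompt=row["prompt"], answer=row["generated_text"]),
)

# A toy in-memory dataset; real jobs typically stream from Parquet, S3, etc.
ds = ray.data.from_items(
    [{"prompt": "What is Ray Data?"}, {"prompt": "Summarize vLLM in one sentence."}]
)

# Ray Data shards the dataset, keeps the replicas fed, and executes lazily.
ds = processor(ds)
ds.show(limit=2)
```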
