@@ -3,27 +3,38 @@ title: Offline Inference
---
[](){ #offline-inference }

- You can run vLLM in your own code on a list of prompts.
-
- The offline API is based on the [LLM][vllm.LLM] class.
- To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.
+ Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.

For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
and runs it in vLLM using the default configuration.

```python
from vllm import LLM

+ # Initialize the vLLM engine.
llm = LLM(model="facebook/opt-125m")
```
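+
+ If you want to deviate from the defaults, engine options can be passed to the constructor as keyword arguments. A minimal sketch; the option names below are standard `LLM` arguments, but the values are illustrative only:
+
+ ```python
+ from vllm import LLM
+
+ # Same model, with a few engine options set explicitly instead of relying on defaults.
+ llm = LLM(
+     model="facebook/opt-125m",
+     dtype="auto",                # weight/activation data type
+     gpu_memory_utilization=0.9,  # fraction of GPU memory vLLM may reserve
+     max_model_len=2048,          # maximum context length
+     seed=0,                      # seed for reproducible sampling
+ )
+ ```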

- After initializing the `LLM` instance, you can perform model inference using various APIs.
- The available APIs depend on the type of model that is being run:
+ After initializing the `LLM` instance, use the available APIs to perform model inference.
+ The available APIs depend on the model type:

- [Generative models][generative-models] output logprobs, from which the final output text is sampled (see the sketch after this list).
- [Pooling models][pooling-models] output their hidden states directly.

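+
+ For example, generating text with the `facebook/opt-125m` instance from above might look like the following sketch (the prompts and sampling values are placeholders, not recommendations):
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(model="facebook/opt-125m")
+
+ # Sampling parameters control decoding for generative models.
+ sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
+
+ # Run generation over a list of prompts; each result carries the prompt and its completions.
+ outputs = llm.generate(["Hello, my name is", "The capital of France is"], sampling_params)
+ for output in outputs:
+     print(output.prompt, output.outputs[0].text)
+ ```
+
+ Pooling models are queried through `LLM.encode` instead, which returns the model's hidden states rather than sampled text.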
- Please refer to the above pages for more details about each API.
-
!!! info
    [API Reference][offline-inference-api]
+
+ ### Ray Data LLM API
+
+ Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine.
+ This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:
+
+ - Streaming execution processes datasets that exceed aggregate cluster memory.
+ - Automatic sharding, load balancing, and autoscaling distribute work across a Ray cluster with built-in fault tolerance.
+ - Continuous batching keeps vLLM replicas saturated and maximizes GPU utilization.
+ - Transparent support for tensor and pipeline parallelism enables efficient multi-GPU inference.
+
+ The following example shows how to run batched inference with Ray Data and vLLM:
+ <gh-file:examples/offline_inference/batch_llm_inference.py>
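+
+ For orientation, the core pattern in that example looks roughly like the sketch below. It is based on the Ray Data LLM documentation; the names used here (`vLLMEngineProcessorConfig`, `build_llm_processor`, `model_source`, `generated_text`) come from Ray's API and may vary between Ray versions, and the model is just an illustrative chat-capable choice:
+
+ ```python
+ import ray
+ from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor
+
+ # Configure a pool of vLLM replicas that Ray Data keeps saturated with batches.
+ config = vLLMEngineProcessorConfig(
+     model_source="Qwen/Qwen2.5-0.5B-Instruct",  # illustrative model choice
+     engine_kwargs={"max_model_len": 2048},      # forwarded to the vLLM engine
+     concurrency=1,  # number of vLLM replicas
+     batch_size=64,  # rows per batch sent to each replica
+ )
+
+ # Build a processor that maps prompts to generations over a streaming dataset.
+ processor = build_llm_processor(
+     config,
+     preprocess=lambda row: dict(
+         messages=[{"role": "user", "content": row["prompt"]}],
+         sampling_params=dict(temperature=0.3, max_tokens=64),
+     ),
+     postprocess=lambda row: dict(answer=row["generated_text"]),
+ )
+
+ ds = ray.data.from_items([{"prompt": "What is the capital of France?"}])
+ ds = processor(ds)
+ ds.show(limit=1)
+ ```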
+
+ For more information about the Ray Data LLM API, see the [Ray Data LLM documentation](https://docs.ray.io/en/latest/data/working-with-llms.html).