@@ -30,8 +30,31 @@ This API adds several batteries-included capabilities that simplify large-scale,
 - Automatic sharding, load balancing, and autoscaling distribute work across a Ray cluster with built-in fault tolerance.
 - Continuous batching keeps vLLM replicas saturated and maximizes GPU utilization.
 - Transparent support for tensor and pipeline parallelism enables efficient multi-GPU inference.
-
-The following example shows how to run batched inference with Ray Data and vLLM:
-<gh-file:examples/offline_inference/batch_llm_inference.py>
+- Reading and writing to most popular file formats and cloud object storage.
+- Scaling up the workload without code changes.
+
+??? code
+
+    ```python
+    import ray  # Requires ray>=2.44.1
+    from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor
+
+    config = vLLMEngineProcessorConfig(model_source="unsloth/Llama-3.2-1B-Instruct")
+    processor = build_llm_processor(
+        config,
+        preprocess=lambda row: {
+            "messages": [
+                {"role": "system", "content": "You are a bot that completes unfinished haikus."},
+                {"role": "user", "content": row["item"]},
+            ],
+            "sampling_params": {"temperature": 0.3, "max_tokens": 250},
+        },
+        postprocess=lambda row: {"answer": row["generated_text"]},
+    )
+
+    ds = ray.data.from_items(["An old silent pond..."])
+    ds = processor(ds)
+    ds.write_parquet("local:///tmp/data/")
+    ```
 
 For more information about the Ray Data LLM API, see the [Ray Data LLM documentation](https://docs.ray.io/en/latest/data/working-with-llms.html).
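The example added by this commit runs each vLLM replica on a single GPU. The tensor-parallelism bullet can be exercised through the same config object; the sketch below is not part of the commit and is only a minimal illustration, assuming the `engine_kwargs`, `concurrency`, and `batch_size` fields of `vLLMEngineProcessorConfig` as described in the Ray Data LLM documentation, with hypothetical values for a deployment that wants two GPUs per replica.

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Illustrative sketch, not taken from this commit: shard each vLLM replica
# across two GPUs (tensor parallelism) and run two replicas. Field names follow
# the Ray Data LLM docs; the model and the numeric values are placeholders.
config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.2-1B-Instruct",  # swap in a model that actually needs 2 GPUs
    engine_kwargs={"tensor_parallel_size": 2, "max_model_len": 4096},
    concurrency=2,  # number of vLLM engine replicas to launch
    batch_size=64,  # rows handed to each replica per batch
)
processor = build_llm_processor(
    config,
    preprocess=lambda row: {
        "messages": [{"role": "user", "content": row["item"]}],
        "sampling_params": {"temperature": 0.3, "max_tokens": 250},
    },
    postprocess=lambda row: {"answer": row["generated_text"]},
)

ds = processor(ray.data.from_items(["An old silent pond..."]))
ds.write_parquet("local:///tmp/data/")
```

Under these assumed settings, Ray would reserve roughly `concurrency × tensor_parallel_size` GPUs (four here) for the job, while the preprocess/postprocess code stays identical to the single-GPU example above.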