# Batch inference (internal request queuing and dynamic batching)
outputs = llm.generate(prompts, sampling_params)

# Output results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
```
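
The generate call above assumes that `prompts`, `sampling_params`, and the `llm` engine were created earlier in the example. A minimal sketch of that setup, with an illustrative model path and parameter values, might look like this:

```python
from fastdeploy import LLM, SamplingParams

# A small batch of raw text prompts
prompts = ["Hello, my name is", "The largest ocean on Earth is"]

# Sampling parameters (values are illustrative)
sampling_params = SamplingParams(top_p=0.95, max_tokens=6400)

# Load the model; the path is illustrative
llm = LLM(model="baidu/ERNIE-4.5-0.3B-Paddle", max_model_len=8192)
```
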
### Chat Interface (LLM.chat)
```python
from fastdeploy import LLM, SamplingParams
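
# NOTE: everything below this line is an illustrative sketch; the messages,
# model path, and parameter values are assumptions, not fixed requirements.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Write a short poem about large language models."},
]

# Sampling parameters (values are illustrative)
sampling_params = SamplingParams(top_p=0.95, max_tokens=6400)

# Load the model; the path is illustrative
llm = LLM(model="baidu/ERNIE-4.5-0.3B-Paddle", max_model_len=8192)

# Run chat inference (requests are queued and batched internally)
outputs = llm.chat(messages, sampling_params)

# Output results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
```
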
Documentation for `SamplingParams`, `LLM.generate`, `LLM.chat`, and output structure `RequestOutput` is provided below.
> Note: For reasoning models, you need to specify the `reasoning_parser` parameter when loading the model. Additionally, during the request, you can toggle the reasoning feature on or off via the `enable_thinking` parameter within `chat_template_kwargs` (see the sketch after these notes).
> Note: The text completion interface suits scenarios where the user has already prepared the context input and expects the model to output only the continuation. No additional `prompt` concatenation is applied during inference.
> For the `chat` model, it is recommended to use the Chat Interface (`LLM.chat`).
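
As a concrete illustration of the notes above, the sketch below toggles the reasoning feature off for a chat request. It assumes the model is loaded with a `reasoning_parser` and that `LLM.chat` accepts a `chat_template_kwargs` argument; the parser name, model path, and message are illustrative.

```python
from fastdeploy import LLM, SamplingParams

# The reasoning_parser value and model path are illustrative assumptions.
llm = LLM(
    model="baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
    reasoning_parser="ernie-45-vl",
    max_model_len=32768,
)

messages = [{"role": "user", "content": "How many r's are there in 'strawberry'?"}]
sampling_params = SamplingParams(top_p=0.95, max_tokens=1024)

# Turn the thinking phase off for this request via chat_template_kwargs.
outputs = llm.chat(
    messages,
    sampling_params,
    chat_template_kwargs={"enable_thinking": False},
)
print(outputs[0].outputs.text)
```
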
For multimodal models, such as `baidu/ERNIE-4.5-VL-28B-A3B-Paddle`, when calling the `generate` interface you need to provide a prompt that includes images. The usage is as follows:
```python
import io
import os
import requests
from PIL import Image

from fastdeploy.entrypoints.llm import LLM
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.input.ernie_tokenizer import ErnieBotTokenizer
```
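
Building on these imports, the overall flow is sketched below. The image URL, prompt text, and especially the layout of the image-bearing prompt (shown as a dict with `prompt` and `multimodal_data` keys) are illustrative assumptions rather than a confirmed API contract; the `ErnieBotTokenizer` import above suggests the full example applies a chat template to the text part of the prompt, while a plain prompt string is used here for brevity.

```python
# Illustrative sketch only: the URL, prompt text, and prompt-dict layout below
# are assumptions, not a confirmed API contract.
model_path = "baidu/ERNIE-4.5-VL-28B-A3B-Paddle"

# Fetch an example image (URL is illustrative)
image_bytes = requests.get("https://example.com/demo.jpg").content
image = Image.open(io.BytesIO(image_bytes))

sampling_params = SamplingParams(top_p=0.95, max_tokens=1024)
llm = LLM(model=model_path, max_model_len=32768)

# Pass the decoded image alongside the text prompt (dict layout is assumed).
outputs = llm.generate(
    prompts={
        "prompt": "Describe this picture.",
        "multimodal_data": {"image": [image]},
    },
    sampling_params=sampling_params,
)

for output in outputs:
    print(output.outputs.text)
```
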
> Note: The `generate` interface does not currently support passing parameters to control the thinking function (on/off). It always uses the model's default parameters.
## 2. API Documentation
### 2.1 fastdeploy.LLM
For `LLM` configuration, refer to [Parameter Documentation](parameters.md).
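
As an illustration, a typical construction might look like the sketch below; apart from `model`, the parameter names are assumptions, so consult [Parameter Documentation](parameters.md) for the authoritative options and defaults.

```python
from fastdeploy import LLM

# Parameter names other than `model` are assumed for illustration; see
# parameters.md for the supported configuration options and their defaults.
llm = LLM(
    model="baidu/ERNIE-4.5-0.3B-Paddle",  # local path or model name (illustrative)
    tensor_parallel_size=1,               # assumed name: GPUs used for tensor parallelism
    max_model_len=8192,                   # assumed name: maximum context length in tokens
    block_size=64,                        # assumed name: KV Cache block size (default 64)
)
```
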
> 2. After startup, the service logs KV Cache block count (e.g. `total_block_num:640`). Multiply this by block_size (default 64) to get total cacheable tokens.
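> For example, if the log reports `total_block_num:640` and `block_size` is the default 64, the service can cache 640 × 64 = 40,960 tokens in total.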