
Commit 6a9e6b2

[doc] fold long code block (#20795)
Signed-off-by: reidliu41 <reid201711@gmail.com>
1 parent 5d09152 commit 6a9e6b2

1 file changed: +53 -53 lines changed


docs/features/lora.md

Lines changed: 53 additions & 53 deletions
@@ -279,64 +279,64 @@ Some models, e.g., [Granite Speech](https://huggingface.co/ibm-granite/granite-s
 
 To this end, we allow registration of default multimodal LoRAs to handle this automatically, where users can map each modality to a LoRA adapter to automatically apply it when the corresponding inputs are present. Note that currently, we only allow one LoRA per prompt; if several modalities are provided, each of which are registered to a given modality, none of them will be applied.
 
-Example usage for offline inference:
+??? code "Example usage for offline inference"
 
-```python
-from transformers import AutoTokenizer
-from vllm import LLM, SamplingParams
-from vllm.assets.audio import AudioAsset
-
-model_id = "ibm-granite/granite-speech-3.3-2b"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-
-def get_prompt(question: str, has_audio: bool):
-    """Build the input prompt to send to vLLM."""
-    if has_audio:
-        question = f"<|audio|>{question}"
-    chat = [
-        {
-            "role": "user",
-            "content": question
+    ```python
+    from transformers import AutoTokenizer
+    from vllm import LLM, SamplingParams
+    from vllm.assets.audio import AudioAsset
+
+    model_id = "ibm-granite/granite-speech-3.3-2b"
+    tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+    def get_prompt(question: str, has_audio: bool):
+        """Build the input prompt to send to vLLM."""
+        if has_audio:
+            question = f"<|audio|>{question}"
+        chat = [
+            {
+                "role": "user",
+                "content": question
+            }
+        ]
+        return tokenizer.apply_chat_template(chat, tokenize=False)
+
+
+    model = LLM(
+        model=model_id,
+        enable_lora=True,
+        max_lora_rank=64,
+        max_model_len=2048,
+        limit_mm_per_prompt={"audio": 1},
+        # Will always pass a `LoRARequest` with the `model_id`
+        # whenever audio is contained in the request data.
+        default_mm_loras = {"audio": model_id},
+        enforce_eager=True,
+    )
+
+    question = "can you transcribe the speech into a written format?"
+    prompt_with_audio = get_prompt(
+        question=question,
+        has_audio=True,
+    )
+    audio = AudioAsset("mary_had_lamb").audio_and_sample_rate
+
+    inputs = {
+        "prompt": prompt_with_audio,
+        "multi_modal_data": {
+            "audio": audio,
         }
-    ]
-    return tokenizer.apply_chat_template(chat, tokenize=False)
-
-
-model = LLM(
-    model=model_id,
-    enable_lora=True,
-    max_lora_rank=64,
-    max_model_len=2048,
-    limit_mm_per_prompt={"audio": 1},
-    # Will always pass a `LoRARequest` with the `model_id`
-    # whenever audio is contained in the request data.
-    default_mm_loras = {"audio": model_id},
-    enforce_eager=True,
-)
-
-question = "can you transcribe the speech into a written format?"
-prompt_with_audio = get_prompt(
-    question=question,
-    has_audio=True,
-)
-audio = AudioAsset("mary_had_lamb").audio_and_sample_rate
-
-inputs = {
-    "prompt": prompt_with_audio,
-    "multi_modal_data": {
-        "audio": audio,
     }
-}
 
 
-outputs = model.generate(
-    inputs,
-    sampling_params=SamplingParams(
-        temperature=0.2,
-        max_tokens=64,
-    ),
-)
-```
+    outputs = model.generate(
+        inputs,
+        sampling_params=SamplingParams(
+            temperature=0.2,
+            max_tokens=64,
+        ),
+    )
+    ```
 
 You can also pass a json dictionary of `--default-mm-loras` mapping modalities to LoRA model IDs. For example, when starting the server:
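
The server-start example itself sits just past this hunk and is not shown in the diff. As a rough sketch only, reusing the same Granite Speech checkpoint and the LoRA settings from the offline example above (the `--default-mm-loras` flag is named in the doc text; the remaining `vllm serve` flags are assumptions based on vLLM's standard LoRA options), such an invocation might look like:

```bash
# Sketch only: mirrors the offline example above (LoRA enabled, rank 64,
# 2048-token context) and registers the audio LoRA as a default via JSON.
vllm serve ibm-granite/granite-speech-3.3-2b \
    --enable-lora \
    --max-lora-rank 64 \
    --max-model-len 2048 \
    --default-mm-loras '{"audio": "ibm-granite/granite-speech-3.3-2b"}'
```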