I followed the steps in the official Unsloth notebook (Alpaca + Llama-3 8b full example), fine-tuned a Llama 3 8B model, and wanted to serve it with vLLM. However, it does not seem to work.
This is the command I used to serve the local model, with "/content/merged_llama3" being the directory that contains all the model files, and it returns an error:
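For context, serving or loading a local merged checkpoint with vLLM should need nothing more than pointing it at the directory. The snippet below is a minimal sketch using the Python API rather than my exact command; the path is the one above, and the dtype, context length, and prompt are placeholders:

```python
from vllm import LLM, SamplingParams

# Minimal sketch: load the locally merged checkpoint directly with vLLM's
# Python API. Roughly equivalent to launching the OpenAI-compatible server:
#   python -m vllm.entrypoints.openai.api_server --model /content/merged_llama3
# The dtype, context length, and prompt below are placeholders.
llm = LLM(model="/content/merged_llama3", dtype="bfloat16", max_model_len=4096)

sampling = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain LoRA fine-tuning in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```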
I don't think I should ever need to provide a quantization method, since that should already be written in the config file; the error looks like a mistake while reading those files. In addition, I did save the model and push it to the hub using the code given in the Unsloth notebook.
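The saving and pushing step was along the lines of the snippet below, a minimal sketch of the notebook cells I ran: `model` and `tokenizer` are the fine-tuned objects from the earlier training cells, and the Hub repo name is a placeholder.

```python
# Minimal sketch of the saving cells from the Unsloth notebook.
# `model` and `tokenizer` are the fine-tuned LoRA model and tokenizer from
# the earlier training cells; the Hub repo name is a placeholder.
model.save_pretrained_merged(
    "/content/merged_llama3", tokenizer, save_method="merged_16bit"
)
model.push_to_hub_merged(
    "my-username/merged_llama3", tokenizer, save_method="merged_16bit"
)
```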
My model files:
What went wrong?
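For completeness, this is a quick way to inspect what vLLM would read from the merged directory; it is a minimal sketch that only assumes the /content/merged_llama3 path from above:

```python
import json
import os

# Minimal sketch: list the merged checkpoint files and inspect config.json.
# If a quantization_config entry is still present (e.g. from a 4-bit base),
# vLLM may treat the checkpoint as quantized even without --quantization.
merged_dir = "/content/merged_llama3"  # directory from above

print(sorted(os.listdir(merged_dir)))

with open(os.path.join(merged_dir, "config.json")) as f:
    config = json.load(f)

print(config.get("quantization_config", "no quantization_config entry"))
```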
Replies: 1 comment

So... is it possible to serve the model this way?