vLLM + Mixtral AWQ question about chat template and tokenizer #3092
Michelklingler announced in General
Replies: 0 comments
I'm currently running an instance of "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ" on an RTX A6000 Ada.
For some reason I get weird responses when I talk with the model, or at least not as good as the ones I was getting when I used Ollama as an inference server. I was wondering if I need to explicitly specify a chat template location or a tokenizer (I've sketched below what I assume that would look like)?
This is the command I use to run the server:
python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --quantization gptq --dtype half --api-key BLANK --gpu-memory-utilization 0.87
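If those do need to be set explicitly, I assume the invocation would look roughly like this; the --tokenizer value just repeats the model repo, and mixtral-instruct.jinja is a placeholder path for a local template file on my side, not something that ships with the model:

python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --quantization gptq --dtype half --api-key BLANK --gpu-memory-utilization 0.87 --tokenizer TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --chat-template ./mixtral-instruct.jinja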
And this is the launch log I get:
python -u -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --quantization gptq --dtype half --api-key BLANK --gpu-memory-utilization 0.87
INFO 02-28 15:04:29 api_server.py:229] args: Namespace(host='0.0.0.0', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='BLANK', served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.87, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='gptq', enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='cuda', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 02-28 15:04:29 config.py:577] Casting torch.bfloat16 to torch.float16.
WARNING 02-28 15:04:29 config.py:186] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 02-28 15:04:29 llm_engine.py:79] Initializing an LLM engine with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 02-28 15:04:32 weight_utils.py:163] Using model weights format ['*.safetensors']
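For context, I'm sending requests through the OpenAI-compatible chat completions endpoint; a simplified request looks roughly like this (the prompt is just an example):

# chat request against the local vLLM server (port 8000 is the default shown in the log above)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer BLANK" \
  -d '{"model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ", "messages": [{"role": "user", "content": "Hello, how are you?"}]}'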
Any advice or support would be appreciated.
Thanks,
Michel