V100 GPU: inference with the quantized model Yi-34B-Chat-4bits is very slow #484

@zxdposter

Description

Reminder

  • I have searched the GitHub Discussions and issues and have not found anything similar.

Environment

- OS: CentOS 7.9
- Python: 3.11.6
- PyTorch: 2.1.2
- CUDA: 12.4

Current Behavior

On a V100 GPU, loading the quantized model Yi-34B-Chat-4bits, inference is very slow: a single reply takes around 200 seconds. GPU memory usage is about 20 GB, with 10 GB still free.
Is there any way to fix this?

I have searched the issues; several people have hit this, but none of the threads has a solution.

Expected Behavior

No response

Steps to Reproduce

import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = '/home/Yi-34B-Chat-4bits'

nf4_config = GPTQConfig(
    bits=4,
    use_exllama=True,
    max_input_length=2048,
    use_cuda_fp16=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    quantization_config=nf4_config,
).eval()

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False, trust_remote_code=True)

DEVICE = "cuda"
DEVICE_ID = "0"
CUDA_DEVICE = f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE
device_use = torch.device(CUDA_DEVICE)
# Note: with device_map="auto" the model is already dispatched to the GPU,
# so this extra .to() is redundant (newer accelerate versions warn or error
# when moving an already-dispatched model).
model = model.to(device_use)

def chat(input):
    input_ids = tokenizer.apply_chat_template(conversation=[{"role": "user", "content": input}], tokenize=True,
                                              add_generation_prompt=True,
                                              return_tensors='pt')
    output_ids = model.generate(input_ids.to(device_use))
    response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
    return response

time1 = time.time()
result = chat('你是谁')  # prompt: "Who are you?"
print(f'耗时{time.time() - time1}', result)  # 耗时 = "elapsed"

Output:
Elapsed 194.49595594406128 s. Response: "I am an intelligent assistant developed by 01.AI; my name is Yi. I was trained by 01.AI's researchers on large amounts of text data, learning the patterns and associations of language, so I can generate text, answer questions, and translate between languages. I can help users answer questions, provide information, and handle various language-related tasks. I am not a real person but am built from code and algorithms; still, I do my best to mimic human communication so I can interact with users better. If you have any questions or need help, just let me know!"
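One thing the 194 s wall time conflates is per-token speed versus generation length: `generate()` is called without `max_new_tokens`, so it runs until EOS, and the reply above is on the order of a hundred tokens. A minimal sketch to separate the two, reusing the `model`, `tokenizer`, and `device_use` objects from the repro above (the helper names `timed_generate` and `tokens_per_second` are my own, not part of any API):

```python
import time

def tokens_per_second(new_tokens, elapsed):
    """Raw decode throughput; ~120 new tokens in ~194 s would be ~0.6 tok/s."""
    return new_tokens / elapsed

def timed_generate(model, tokenizer, input_ids, device, max_new_tokens=128):
    """Generate with an explicit token budget and report throughput."""
    start = time.time()
    output_ids = model.generate(input_ids.to(device), max_new_tokens=max_new_tokens)
    elapsed = time.time() - start
    new_tokens = output_ids.shape[1] - input_ids.shape[1]
    return output_ids, tokens_per_second(new_tokens, elapsed)
```

If throughput stays well below a few tokens per second even with a small `max_new_tokens`, the bottleneck is the dequantization kernels themselves rather than a long generation; in that case, comparing against a `GPTQConfig` with `use_exllama=False` would isolate the kernel backend (an assumption worth testing, not a confirmed fix).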

Anything Else?

No response

Metadata

Assignees: no one assigned
Labels: question (further information is requested)