Replies: 2 comments
-
Hello @tommycwh! I'm here to help you solve bugs, answer your questions, and assist you in becoming a contributor. While you wait for a human maintainer, feel free to share your issue, and I'll do my best to assist you!

The issue you're experiencing, degraded generation output when using HuggingFaceLLM, has come up before. I found a similar unsolved discussion, "HuggingfaceLLM Meta Llama3.1 8B generation doesn't stop", which might provide additional insights or shared experiences [2].

To continue talking to Dosu, mention @dosu.
-
I just found that I made a very simple mistake here. I am sorry about that. I forgot to pass the tokenizer when I wrap the HF model as a `HuggingFaceLLM`. With this fixed, the code below generates reasonable text.
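A minimal sketch of the fixed wrapping, assuming a pre-loaded `transformers` model; the model id is a placeholder, substitute whatever checkpoint you are testing:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms.huggingface import HuggingFaceLLM

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder checkpoint

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Passing the tokenizer is the fix: without it, HuggingFaceLLM falls back to
# a default tokenizer that does not match the model's vocabulary, which is
# what produced the repeated/meaningless text described below.
llm = HuggingFaceLLM(
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
)

print(llm.complete("Compose a short travel blog post about Hawaii."))
```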
-
I have been using HuggingFace, and now I am trying out LlamaIndex to see whether I can use it to load HF models. However, when I load a HF model as a `llama_index.llms.huggingface.HuggingFaceLLM` model, it does not generate reasonable text, e.g., it produces repeated words or meaningless output. So, I want to ask what I am missing here that makes my HF models not work in LlamaIndex.

Let me explain my test case. To begin with, I am using a prompt from the `MiniMtBenchSingleGradingDataset` dataset, following the steps in the LlamaIndex example "Benchmarking LLM Evaluators On A Mini MT-Bench (Single Grading)". I mainly use the first prompt from the dataset:
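For context, a sketch of how that example notebook fetches the dataset, assuming the `download_llama_dataset` helper; the download directory is arbitrary:

```python
from llama_index.core.llama_dataset import download_llama_dataset

# Fetch the benchmark dataset used in the example notebook.
evaluator_dataset, _ = download_llama_dataset(
    "MiniMtBenchSingleGradingDataset", "./mini_mt_bench_data"
)

# Inspect the first example; its query is the prompt used in the tests below.
print(evaluator_dataset.examples[0].query)
```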
If I only use the HF packages, the models I tested generate reasonable text. For example:
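Something along these lines, using only `transformers`; the prompt and model id shown here are placeholders standing in for the ones from the dataset:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Compose an engaging travel blog post about a recent trip to Hawaii."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decoding with the matching tokenizer yields coherent text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```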
However, when I load the HF model as a `HuggingFaceLLM`, following the steps in the same example, it does not give reasonable results (see the sketch below). In some of my other trials, it does generate a sentence, but then just repeats it, probably until `max_new_tokens` is reached.
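A sketch of the wrapping that showed the problem; in hindsight (see the reply above), the missing `tokenizer` argument is the culprit. The model id is again a placeholder:

```python
from transformers import AutoModelForCausalLM
from llama_index.llms.huggingface import HuggingFaceLLM

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Only the model is passed, no tokenizer. HuggingFaceLLM then silently uses
# its default tokenizer, whose vocabulary does not match this model, and
# generation degrades into repeated words or meaningless text.
llm = HuggingFaceLLM(model=model, max_new_tokens=256)

print(llm.complete("Compose an engaging travel blog post about Hawaii."))
```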
Does anyone have any idea why this happens, or has anyone had a similar experience? Any comments on this will be very helpful. Thank you!