Replies: 2 comments
-
🤖 Hi @austinmw, it's good to see you again and I hope you're doing well. The question you've raised about bias in the evaluation of different LLMs for a RAG system is an insightful one. In theory, if you use GPT-3.5 to generate Q/A pairs and then evaluate GPT-3.5 and Claude-V2 against those pairs, there could be a bias towards GPT-3.5, because the pairs were generated by GPT-3.5 and may therefore be more aligned with its style, context understanding, and knowledge representation.

That said, the LlamaIndex framework does not have any built-in functionality specifically for bias detection or mitigation when evaluating multiple LLMs for a RAG system. Evaluation is based on the metrics provided by evaluators such as FaithfulnessEvaluator, RelevancyEvaluator, CorrectnessEvaluator, SemanticSimilarityEvaluator, PairwiseComparisonEvaluator, and RetrieverEvaluator. These assess, respectively, faithfulness to the source text, relevancy to the query, correctness of the information provided, semantic similarity to the expected response, performance in pairwise comparisons, and retrieval quality. While these evaluators give a comprehensive picture of an LLM's performance, they do not specifically account for the bias that might be introduced when the same LLM is used both to generate the Q/A pairs and as the model being evaluated.

To mitigate this potential bias, you might consider generating Q/A pairs with several different LLMs and then evaluating each candidate LLM on the combined set of pairs. This helps ensure that the evaluation is not tied to any single generator model.

I hope this helps! If you have any other questions or need further clarification, please don't hesitate to ask. Best,
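As a rough illustration of that pooling idea, here is a minimal sketch (not taken from the LlamaIndex docs). It assumes a llama-index version where `DatasetGenerator`, `ServiceContext`, and the `OpenAI`/`Anthropic` LLM wrappers are available; the model names, the `./data` path, and the question count are placeholders, and exact imports/signatures may differ by version:

```python
# Sketch: pool Q/A pairs generated by two different LLMs so that the
# evaluation set is not produced by a single candidate model.
# Model names, paths, and the question count are illustrative only.
from llama_index import ServiceContext, SimpleDirectoryReader
from llama_index.evaluation import DatasetGenerator
from llama_index.llms import Anthropic, OpenAI

documents = SimpleDirectoryReader("./data").load_data()

# One question generator per candidate LLM.
generator_llms = {
    "gpt-3.5-turbo": OpenAI(model="gpt-3.5-turbo"),
    "claude-2": Anthropic(model="claude-2"),
}

combined_questions = []
for name, llm in generator_llms.items():
    ctx = ServiceContext.from_defaults(llm=llm)
    dataset_generator = DatasetGenerator.from_documents(documents, service_context=ctx)
    questions = dataset_generator.generate_questions_from_nodes(num=25)
    combined_questions.extend(questions)

# `combined_questions` is now a mixed evaluation set: each candidate RAG
# pipeline (GPT-3.5-backed, Claude-backed, ...) can be run over the same
# pooled questions rather than only over questions it generated itself.
```

The same idea applies to whichever generation utility you are using; the key point is that the pooled question set is shared across all candidates.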
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
-
Thanks for the question, @austinmw! Moving this into Discussions.
-
Question Validation
Question
Hi, I was recently checking out QuestionGeneration.
I'm curious: when evaluating multiple different LLMs for a RAG system, would you expect performance to be biased toward the same LLM that was used to generate the Q/A pairs?
For example, if I use GPT-3.5 to generate Q/A pairs and then compare overall RAG performance with GPT-3.5 vs. Claude-V2, would you expect a bias towards GPT-3.5 performing better, since it was also used to generate the Q/A pairs?
(The RAG performance metrics I'm currently exploring are Context Precision, Context Recall, Faithfulness and Answer Relevancy from the Ragas library.)
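(A minimal sketch of computing these four metrics with Ragas, for reference — not from the original post. It assumes the Ragas API where `evaluate` takes a Hugging Face `Dataset`; the sample rows are placeholders, and column names such as `ground_truths` vary between Ragas versions:)

```python
# Sketch of a Ragas evaluation run; dataset contents are placeholders.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One row per (question, retrieved contexts, generated answer, reference answer).
eval_dataset = Dataset.from_dict({
    "question": ["What does the warranty cover?"],
    "contexts": [["The warranty covers manufacturing defects for 12 months."]],
    "answer": ["It covers manufacturing defects for one year."],
    "ground_truths": [["Manufacturing defects are covered for 12 months."]],
})

result = evaluate(
    eval_dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(result)  # per-metric scores, e.g. context_precision, faithfulness, ...
```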