I think the current binary (right/wrong) evaluation method for assessing answers is quite coarse. Would the following graded scoring approach be better?
prompt_text = f"""
You are evaluating a student's response. Compare the student's answer to the reference answer for the following question.

Question: {question}
Student's Answer: {student_ans}
Reference Answer: {answer}

Assess whether the student's answer is correct. Consider the following factors:
1. Does the student's answer contain the same key information as the reference answer?
2. Does the student's answer contradict the reference answer?
3. Does the student's answer fully address the question?
4. Did the student use appropriate tools to retrieve information?
5. For questions requiring up-to-date information, did the student use the `web_search` tool; for local data retrieval, did the student use the `search_corpus` tool?

First, analyze the similarities and differences between the student's answer and the reference answer, then provide a score:
- 1.0: Completely correct, includes all key information, tool usage is appropriate
- 0.8: Mostly correct, includes most key information, tool usage is appropriate
- 0.6: Partially correct, includes some key information, tool usage is somewhat appropriate
- 0.3: Mostly incorrect, lacks most key information, or tool usage is inappropriate
- 0.0: Incorrect, lacks key information or contains errors, tool usage is incorrect

Do not provide lengthy content! Your score (return only the score):
"""