I think the current binary (right/wrong) evaluation method for assessing answers is quite coarse. Would the following graded scoring approach be better?
prompt_text = f"""
You are evaluating a student's response. Compare the student's answer to the reference answer for the following question.

Question: {question}
Student's Answer: {student_ans}
Reference Answer: {answer}

Assess whether the student's answer is correct. Consider the following factors:
1. Does the student's answer contain the same key information as the reference answer?
2. Does the student's answer contradict the reference answer?
3. Does the student's answer fully address the question?
4. Did the student use appropriate tools to retrieve information?
5. For questions requiring up-to-date information, did the student use the `web_search` tool; for local data retrieval, did the student use the `search_corpus` tool?

First, analyze the similarities and differences between the student's answer and the reference answer, then provide a score:
- 1.0: Completely correct, includes all key information, tool usage is appropriate
- 0.8: Mostly correct, includes most key information, tool usage is appropriate
- 0.6: Partially correct, includes some key information, tool usage is somewhat appropriate
- 0.3: Mostly incorrect, lacks most key information, or tool usage is inappropriate
- 0.0: Incorrect, lacks key information or contains errors, tool usage is incorrect

Do not provide lengthy content! Your score (return only the score):
"""