FinalResponseMatchV2Evaluator returns a score of 0 for identical Chinese strings #3162

@LiuYuWei

Bug Report

Title: [Bug]: FinalResponseMatchV2Evaluator returns a score of 0 for identical Chinese strings

Describe the bug

The FinalResponseMatchV2Evaluator (which uses an LLM as a Judge) consistently returns a response_match_score of 0.0 when comparing two identical Chinese strings. This occurs even when the actual_response from the agent and the expected_response from the evalset are confirmed to be identical, including punctuation and without any hidden characters.

The issue appears to be a persistent, reproducible error within the Judge LLM's evaluation process for certain Chinese sentences, rather than an issue with the user's data or code.

To Reproduce

  1. Create agent.py:

    # ./weather/agent.py
    import random

    from google.adk.agents.llm_agent import Agent

    def query_weather(city_name: str) -> str:
        """Returns a canned weather report (in Chinese) for the given city."""
        weather_list = ['晴天']  # 'sunny'
        weather = random.choice(weather_list)
        return f"{city_name}的天氣是:{weather}。"  # "The weather in <city> is: sunny."

    # Instruction (Chinese): "You are a weather-forecast assistant. Based on the
    # Chinese city name provided by the user, reply in Chinese with that city's weather."
    root_agent = Agent(
        model='gemini-2.5-flash',
        name='weather_agent',
        description='weather agent',
        instruction="""
        你是一個天氣預報助理,請根據使用者提供的中文城市名稱,中文回覆該城市的天氣狀況。
        """,
        tools=[query_weather],
    )
  2. Create cut_word.evalset.json:

    // ./weather/cut_word.evalset.json
    {
      "eval_set_id": "cut_word",
      "eval_cases": [
        {
          "eval_id": "taipei",
          "conversation": [
            {
              "user_content": { "parts": [{ "text": "台北天氣" }] },
              "final_response": { "parts": [{ "text": "台北的天氣是:晴天。" }] }
            }
          ]
        }
      ]
    }
  3. Create test_config.json:

    // ./weather/test_config.json
    {
      "criteria": {
        "response_match_score": 0.5
      }
    }
  4. Run the evaluation command from the project's root directory:

    adk eval ./weather/ ./weather/cut_word.evalset.json --config_file_path=./weather/test_config.json

Expected behavior

The response_match_score should be 1.0 and the final_eval_status should be PASSED (1), since the actual and expected responses are identical.

Screenshots


Below is the JSON output from the adk eval command, which serves as evidence. It clearly shows that actual_invocation and expected_invocation have identical final_response text, yet the score is 0.0.

{
  "eval_id": "taipei",
  "final_eval_status": 2,
  "overall_eval_metric_results": [
    {
      "metric_name": "response_match_score",
      "threshold": 0.5,
      "score": 0.0,
      "eval_status": 2
    }
  ],
  "eval_metric_result_per_invocation": [
    {
      "actual_invocation": {
        "final_response": {
          "parts": [
            { "text": "台北的天氣是:晴天。" }
          ]
        }
      },
      "expected_invocation": {
        "final_response": {
          "parts": [
            { "text": "台北的天氣是:晴天。" }
          ]
        }
      },
      "eval_metric_results": [
        {
          "metric_name": "response_match_score",
          "score": 0.0,
          "eval_status": 2
        }
      ]
    }
  ]
}
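
The string equality visible in this output can also be verified programmatically. Below is a minimal sketch (not part of the original repro) that assumes the JSON above has been saved to a hypothetical file named eval_result.json; it relies only on the field structure shown above.

    # verify_output.py -- hypothetical helper, not part of the repro
    import json

    with open("eval_result.json", encoding="utf-8") as f:
        result = json.load(f)

    invocation = result["eval_metric_result_per_invocation"][0]
    actual = invocation["actual_invocation"]["final_response"]["parts"][0]["text"]
    expected = invocation["expected_invocation"]["final_response"]["parts"][0]["text"]

    print(actual == expected)                             # True: the texts are identical
    print(invocation["eval_metric_results"][0]["score"])  # 0.0, despite the exact match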

Desktop (please complete the following information):

  • OS: macOS
  • Python version (python -V): Python 3.13
  • ADK version (pip show google-adk): 1.13.0

Model Information:

  • Are you using LiteLLM: No
  • Which model is being used (e.g. gemini-2.5-pro): The agent uses gemini-2.5-flash. The judge model inside FinalResponseMatchV2Evaluator is internal to the ADK framework and is not specified by the user.

Additional context

We have gone through an extensive debugging process to isolate the cause:

  1. Hidden Characters: Ruled out by programmatically rewriting the .evalset.json file to ensure clean strings (a minimal verification is sketched after this list). The issue persisted.
  2. Punctuation Mismatch: Ruled out by standardizing the punctuation (e.g., the full-width 。 versus the half-width .) in both the agent's output and the eval set. The issue persisted.
  3. Sentence Structure: Ruled out by changing the sentence structure (e.g., adding a colon :) in both the agent and the eval set. The issue persisted with the new, but still identical, strings.
  4. Code Flow: The internal ADK function get_text_from_content was analyzed and confirmed not to alter the strings in this scenario.
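
A minimal sketch of the kind of check used for items 1 and 4 above, using the literal strings from the repro; it rules out hidden characters or visually similar but distinct code points:

    import unicodedata

    actual = "台北的天氣是:晴天。"    # agent's final response
    expected = "台北的天氣是:晴天。"  # expected_response in cut_word.evalset.json

    # Byte-level, NFC-normalized, and code-point-level comparisons all agree.
    print(actual.encode("utf-8") == expected.encode("utf-8"))                              # True
    print(unicodedata.normalize("NFC", actual) == unicodedata.normalize("NFC", expected))  # True
    print([hex(ord(ch)) for ch in actual] == [hex(ord(ch)) for ch in expected])            # True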

Conclusion:

After eliminating all data- and code-flow-related causes, the only remaining explanation is a persistent, reproducible bug in the Judge LLM when it is asked to compare these specific, identical Chinese strings.
