FinalResponseMatchV2Evaluator returns a score of 0 for identical Chinese strings #3162

@LiuYuWei

Bug Report

Title: [Bug]: FinalResponseMatchV2Evaluator returns a score of 0 for identical Chinese strings

Describe the bug

The FinalResponseMatchV2Evaluator (which uses an LLM as a Judge) consistently returns a response_match_score of 0.0 when comparing two identical Chinese strings. This occurs even when the actual_response from the agent and the expected_response from the evalset are confirmed to be identical, including punctuation and without any hidden characters.

The issue appears to be a persistent, reproducible error within the Judge LLM's evaluation process for certain Chinese sentences, rather than an issue with the user's data or code.

To Reproduce

  1. Create agent.py:

    # ./weather/agent.py
    import random

    from google.adk.agents.llm_agent import Agent

    def query_weather(city_name: str) -> str:
        """Returns a canned weather report (in Chinese) for the given city."""
        weather_list = ['晴天']  # 'sunny'
        weather = random.choice(weather_list)
        return f"{city_name}的天氣是:{weather}。"  # "The weather in <city> is: sunny."

    # Instruction (Chinese): "You are a weather-forecast assistant. Based on the
    # Chinese city name provided by the user, reply in Chinese with that city's weather."
    root_agent = Agent(
        model='gemini-2.5-flash',
        name='weather_agent',
        description='weather agent',
        instruction="""
        你是一個天氣預報助理,請根據使用者提供的中文城市名稱,中文回覆該城市的天氣狀況。
        """,
        tools=[query_weather],
    )
  2. Create cut_word.evalset.json:

    // ./weather/cut_word.evalset.json
    {
      "eval_set_id": "cut_word",
      "eval_cases": [
        {
          "eval_id": "taipei",
          "conversation": [
            {
              "user_content": { "parts": [{ "text": "台北天氣" }] },
              "final_response": { "parts": [{ "text": "台北的天氣是:晴天。" }] }
            }
          ]
        }
      ]
    }
  3. Create test_config.json:

    // ./weather/test_config.json
    {
      "criteria": {
        "response_match_score": 0.5
      }
    }
  4. Run the evaluation command from the project's root directory:

    adk eval ./weather/ ./weather/cut_word.evalset.json --config_file_path=./weather/test_config.json

Expected behavior

The response_match_score should be 1.0 and the final_eval_status should be PASSED (1), since the actual and expected responses are identical.

Screenshots


Below is the JSON output from the adk eval command, which serves as evidence. It clearly shows that actual_invocation and expected_invocation have identical final_response text, yet the score is 0.0.

{
  "eval_id": "taipei",
  "final_eval_status": 2,
  "overall_eval_metric_results": [
    {
      "metric_name": "response_match_score",
      "threshold": 0.5,
      "score": 0.0,
      "eval_status": 2
    }
  ],
  "eval_metric_result_per_invocation": [
    {
      "actual_invocation": {
        "final_response": {
          "parts": [
            { "text": "台北的天氣是:晴天。" }
          ]
        }
      },
      "expected_invocation": {
        "final_response": {
          "parts": [
            { "text": "台北的天氣是:晴天。" }
          ]
        }
      },
      "eval_metric_results": [
        {
          "metric_name": "response_match_score",
          "score": 0.0,
          "eval_status": 2
        }
      ]
    }
  ]
}
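
The string equality visible in this output can also be verified programmatically. Below is a minimal sketch (not part of the original repro) that assumes the JSON above has been saved to a hypothetical file named eval_result.json; it relies only on the field structure shown above.

    # verify_output.py -- hypothetical helper, not part of the repro
    import json

    with open("eval_result.json", encoding="utf-8") as f:
        result = json.load(f)

    invocation = result["eval_metric_result_per_invocation"][0]
    actual = invocation["actual_invocation"]["final_response"]["parts"][0]["text"]
    expected = invocation["expected_invocation"]["final_response"]["parts"][0]["text"]

    print(actual == expected)                             # True: the texts are identical
    print(invocation["eval_metric_results"][0]["score"])  # 0.0, despite the exact match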

Desktop (please complete the following information):

  • OS: macOS
  • Python version (python -V): Python 3.13
  • ADK version (pip show google-adk): 1.13.0

Model Information:

  • Are you using LiteLLM: No
  • Which model is being used (e.g. gemini-2.5-pro): The agent uses gemini-2.5-flash. The judge model inside FinalResponseMatchV2Evaluator is internal to the ADK framework and is not specified by the user.

Additional context

We have gone through an extensive debugging process to isolate the cause:

  1. Hidden Characters: Ruled out by programmatically rewriting the .evalset.json file to ensure clean strings (a minimal verification is sketched after this list). The issue persisted.
  2. Punctuation Mismatch: Ruled out by standardizing the punctuation (e.g., the full-width 。 versus the half-width .) in both the agent's output and the eval set. The issue persisted.
  3. Sentence Structure: Ruled out by changing the sentence structure (e.g., adding a colon :) in both the agent and the eval set. The issue persisted with the new, but still identical, strings.
  4. Code Flow: The internal ADK function get_text_from_content was analyzed and confirmed not to alter the strings in this scenario.
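
A minimal sketch of the kind of check used for items 1 and 4 above, using the literal strings from the repro; it rules out hidden characters or visually similar but distinct code points:

    import unicodedata

    actual = "台北的天氣是:晴天。"    # agent's final response
    expected = "台北的天氣是:晴天。"  # expected_response in cut_word.evalset.json

    # Byte-level, NFC-normalized, and code-point-level comparisons all agree.
    print(actual.encode("utf-8") == expected.encode("utf-8"))                              # True
    print(unicodedata.normalize("NFC", actual) == unicodedata.normalize("NFC", expected))  # True
    print([hex(ord(ch)) for ch in actual] == [hex(ord(ch)) for ch in expected])            # True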

Conclusion:

After eliminating all data- and code-flow-related causes, the only remaining explanation is a persistent, reproducible bug in the Judge LLM when it is asked to compare these specific, identical Chinese strings.
