Skip to content

kakao/FunctionChat-Bench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

18 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

FunctionChat-Bench: Comprehensive Evaluation of Language Model's Generative Capabilities in Korean Tool-use Dialogs

Introduction

The FunctionChat-Bench is a benchmark dataset specifically designed to evaluate the Tool-Use (Function Calling) capabilities of Language Models within conversational settings.

This dataset is built on Korean dialogs data and is meticulously crafted to precisely assess various functionalities required in both single-turn and multi-turn situations.

FunctionChat-Bench-img

Dataset Composition

The FunctionChat-Bench consists of the following datasets:

  • SingleCall
    • SingleCall evaluates how accurately the LM can select and call the necessary function among several options.

    • SingleCall contains four single-turn prompts for each of 25 different functions.

      • For example, for the function 'informDday', there are four dialogs prompts related to it:

        โ€์˜ค๋Š˜์ด ๊ฒฐํ˜ผํ•œ์ง€ ๋ฉฐ์น ์งธ์•ผ?โ€ (How many days have i been married?)
        โ€œํฌ๋ฆฌ์Šค๋งˆ์Šค๊นŒ์ง€ ์–ผ๋งˆ๋‚˜ ๋‚จ์•˜๋‚˜์š”?โ€ (How many days are left until Christmas?)
        โ€œ1์ฐจ ์‹ฌ์‚ฌ์ผ์ด ์–ธ์ œ์ธ๊ฐ€์š”?โ€ (When is the first round of judging?)
        โ€œ๋””๋ฐ์ด๋ชฉ๋ก์—์„œ ์›๊ณ ๋งˆ๊ฐ์ผ ์ฐพ์•„์ค˜โ€ (Find the manuscript deadline in the D-Day list.)
        
    • Five types of tools are defined as follows.

      • 1_exact: Only the target function is provided to the Assistant as a candidate.
      • 4_random: The target function along with 3 randomly selected functions are provided as candidates to the Assistant.
      • 4_close: The target function and 3 functions from a similar domain are provided as candidates to the Assistant.
      • 8_random: The target function along with 7 randomly selected functions are provided as candidates to the Assistant.
      • 8_close: The target function and 7 functions from a similar domain are provided as candidates to the Assistant.
    • The dataset contains 500 single dialogue turns, combining single-turn prompts built around 25 designated functions with various tools types.

  • Dialog
    • Dialog contains 45 diverse dialogs scenarios.
    • Each scenario reflects multi-turn interactions between real users and LM.
    • Defines four types of situations and evaluates how accurately the LM provides appropriate responses in each.
    • Examples of each of the four types of situations are included below to aid understanding within the dialog evaluation.
    • In the dialog evaluation, it assesses how accurately the assistant provides appropriate responses for each type of situation
      • call: An LM must accurately select functions and extract the necessary parameters to respond to a user prompt

        tools:
         [{"type":"function","function":{"name":"informWeather","description":"ํŠน์ • ์ง€์—ญ์˜ ํ˜„์žฌ ๋‚ ์”จ ์ •๋ณด ์ œ๊ณต","parameters":{"type":"object","properties":{"location":{"type":"string","description":"๋‚ ์”จ ์ •๋ณด๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ง€์—ญ ์ด๋ฆ„"}},"required":["location"]}}},{"type":"function","function":{"name":"add_task","description":"to-do list์— ์ƒˆ๋กœ์šด ํ•  ์ผ์„ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.","parameters":{"type":"object","properties":{"task_name":{"type":"string","description":"The name of the task"},"deadline":{"type":"string","description":"The deadline for the task(์‚ฌ์šฉ์ž์˜ ํ‘œํ˜„ ๊ทธ๋Œ€๋กœ ์ถ”์ถœ. ์˜ˆ: ๋‹ค์Œ์ฃผ ์ˆ˜์š”์ผ, ๋‚ด์ผ 7์‹œ ๋“ฑ)"}},"required":["task_name","deadline"]}}},{"type":"function","function":{"name":"setupDday","description":"์ด๋ฆ„๊ณผ ๋‚ ์งœ๋ฅผ ์ž…๋ ฅ๋ฐ›์•„ ๋””๋ฐ์ด ๋ชฉ๋ก์— ์ƒˆ๋กœ์šด ๋””๋ฐ์ด๋ฅผ ์ƒ์„ฑํ•œ๋‹ค.","parameters":{"type":"object","properties":{"ddayName":{"type":"string","description":"๋””๋ฐ์ด์˜ ์ด๋ฆ„"},"ddayDate":{"type":"string","description":"๋””๋ฐ์ด ๋‚ ์งœ(YYYY-MM-DD)"},"includeStartDay":{"type":"boolean","description":"๋‚จ์€ ์ผ์ˆ˜ ๋˜๋Š” ์ง€๋‚œ ์ผ์ˆ˜ ๊ณ„์‚ฐ์‹œ ๋””๋ฐ์ด ๋‹น์ผ ๋‚ ์งœ๋ฅผ 1์ผ๋กœ ํฌํ•จํ•ด ๊ณ„์‚ฐํ• ์ง€ ์—ฌ๋ถ€.(true์ด๋ฉด ๋‹น์ผ์ด 1์ผ, false์ด๋ฉด ๋‹น์ผ์ด 0์ผ)"}},"required":["ddayName","ddayDate","includeStartDay"]}}},{"type":"function","function":{"name":"informDday","description":"์ €์žฅ๋œ ๋””๋ฐ์ด ๋ชฉ๋ก์„ ๊ฒ€์ƒ‰ํ•ด ํŠน์ • ๋””๋ฐ์ด ์ •๋ณด๋ฅผ ๋ฐ˜ํ™˜ํ•œ๋‹ค.","parameters":{"type":"object","properties":{"searchTerm":{"type":"string","description":"๋””๋ฐ์ด ๋ชฉ๋ก์„ ๊ฒ€์ƒ‰ํ•  ๋””๋ฐ์ด์˜ ์ด๋ฆ„์ด๋‚˜ ํ‚ค์›Œ๋“œ"}},"required":["searchTerm"]}}},{"type":"function","function":{"name":"addMemo","description":"์ƒˆ๋กœ์šด ๋ฉ”๋ชจ๋ฅผ ์ถ”๊ฐ€","parameters":{"type":"object","properties":{"title":{"type":"string","description":"๋ฉ”๋ชจ์˜ ์ œ๋ชฉ(์‚ฌ์šฉ์ž๊ฐ€ ์ œ๋ชฉ์„ ์ง์ ‘ ์–ธ๊ธ‰ํ•˜์ง€ ์•Š์œผ๋ฉด ๋ฉ”๋ชจ ๋‚ด์šฉ์„ ํ†ตํ•ด ์ ์ ˆํ•œ ์ œ๋ชฉ์„ ์ƒ์„ฑํ•ด ์ง€์ •)"},"content":{"type":"string","description":"๋ฉ”๋ชจ ๋‚ด์šฉ"}},"required":["title","content"]}}}]
        
        context:
         user: ์ œ๋ฆฌ ์ถœ๊ตญ๋‚ ์ด ์–ธ์ œ์˜€์ง€?
        
        assistant: tool_calls 
                    {"type": "function", "function": {"name": "informDday", "arguments": "{\"searchTerm\": \"์ œ๋ฆฌ ์ถœ๊ตญ๋‚ \"}"}}
        
      • completion: An LM must generate appropriate responses based on the results of the tool.

        tools:
         [{"type":"function","function":{"name":"informWeather","description":"ํŠน์ • ์ง€์—ญ์˜ ํ˜„์žฌ ๋‚ ์”จ ์ •๋ณด ์ œ๊ณต","parameters":{"type":"object","properties":{"location":{"type":"string","description":"๋‚ ์”จ ์ •๋ณด๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ง€์—ญ ์ด๋ฆ„"}},"required":["location"]}}},{"type":"function","function":{"name":"add_task","description":"to-do list์— ์ƒˆ๋กœ์šด ํ•  ์ผ์„ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.","parameters":{"type":"object","properties":{"task_name":{"type":"string","description":"The name of the task"},"deadline":{"type":"string","description":"The deadline for the task(์‚ฌ์šฉ์ž์˜ ํ‘œํ˜„ ๊ทธ๋Œ€๋กœ ์ถ”์ถœ. ์˜ˆ: ๋‹ค์Œ์ฃผ ์ˆ˜์š”์ผ, ๋‚ด์ผ 7์‹œ ๋“ฑ)"}},"required":["task_name","deadline"]}}},{"type":"function","function":{"name":"setupDday","description":"์ด๋ฆ„๊ณผ ๋‚ ์งœ๋ฅผ ์ž…๋ ฅ๋ฐ›์•„ ๋””๋ฐ์ด ๋ชฉ๋ก์— ์ƒˆ๋กœ์šด ๋””๋ฐ์ด๋ฅผ ์ƒ์„ฑํ•œ๋‹ค.","parameters":{"type":"object","properties":{"ddayName":{"type":"string","description":"๋””๋ฐ์ด์˜ ์ด๋ฆ„"},"ddayDate":{"type":"string","description":"๋””๋ฐ์ด ๋‚ ์งœ(YYYY-MM-DD)"},"includeStartDay":{"type":"boolean","description":"๋‚จ์€ ์ผ์ˆ˜ ๋˜๋Š” ์ง€๋‚œ ์ผ์ˆ˜ ๊ณ„์‚ฐ์‹œ ๋””๋ฐ์ด ๋‹น์ผ ๋‚ ์งœ๋ฅผ 1์ผ๋กœ ํฌํ•จํ•ด ๊ณ„์‚ฐํ• ์ง€ ์—ฌ๋ถ€.(true์ด๋ฉด ๋‹น์ผ์ด 1์ผ, false์ด๋ฉด ๋‹น์ผ์ด 0์ผ)"}},"required":["ddayName","ddayDate","includeStartDay"]}}},{"type":"function","function":{"name":"informDday","description":"์ €์žฅ๋œ ๋””๋ฐ์ด ๋ชฉ๋ก์„ ๊ฒ€์ƒ‰ํ•ด ํŠน์ • ๋””๋ฐ์ด ์ •๋ณด๋ฅผ ๋ฐ˜ํ™˜ํ•œ๋‹ค.","parameters":{"type":"object","properties":{"searchTerm":{"type":"string","description":"๋””๋ฐ์ด ๋ชฉ๋ก์„ ๊ฒ€์ƒ‰ํ•  ๋””๋ฐ์ด์˜ ์ด๋ฆ„์ด๋‚˜ ํ‚ค์›Œ๋“œ"}},"required":["searchTerm"]}}},{"type":"function","function":{"name":"addMemo","description":"์ƒˆ๋กœ์šด ๋ฉ”๋ชจ๋ฅผ ์ถ”๊ฐ€","parameters":{"type":"object","properties":{"title":{"type":"string","description":"๋ฉ”๋ชจ์˜ ์ œ๋ชฉ(์‚ฌ์šฉ์ž๊ฐ€ ์ œ๋ชฉ์„ ์ง์ ‘ ์–ธ๊ธ‰ํ•˜์ง€ ์•Š์œผ๋ฉด ๋ฉ”๋ชจ ๋‚ด์šฉ์„ ํ†ตํ•ด ์ ์ ˆํ•œ ์ œ๋ชฉ์„ ์ƒ์„ฑํ•ด ์ง€์ •)"},"content":{"type":"string","description":"๋ฉ”๋ชจ ๋‚ด์šฉ"}},"required":["title","content"]}}}]
        
        context: 
         user: ์ œ๋ฆฌ ์ถœ๊ตญ๋‚ ์ด ์–ธ์ œ์˜€์ง€?
         assistant: tool_calls 
                    {"type": "function", "function": {"name": "informDday", "arguments": "{\"searchTerm\": \"์ œ๋ฆฌ ์ถœ๊ตญ๋‚ \"}"}}
         tool: {"name": "informDday", "content": "{\"ddayName\": \"์ œ๋ฆฌ ์ถœ๊ตญ๋‚ \", \"ddayDate\": \"2024-04-23\", \"daysRemaining\": 48, \"daysSince\": None}"}
        
        assistant: ์ œ๋ฆฌ ์ถœ๊ตญ๋‚ ์€ 2024๋…„ 4์›” 23์ผ์ž…๋‹ˆ๋‹ค. ์•ž์œผ๋กœ 48์ผ ๋‚จ์•˜์Šต๋‹ˆ๋‹ค.
        
      • slot: An LM must query the user for the necessary parameters to make a function call.

        tools:
         [{"type":"function","function":{"name":"informWeather","description":"ํŠน์ • ์ง€์—ญ์˜ ํ˜„์žฌ ๋‚ ์”จ ์ •๋ณด ์ œ๊ณต","parameters":{"type":"object","properties":{"location":{"type":"string","description":"๋‚ ์”จ ์ •๋ณด๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ง€์—ญ ์ด๋ฆ„"}},"required":["location"]}}},{"type":"function","function":{"name":"add_task","description":"to-do list์— ์ƒˆ๋กœ์šด ํ•  ์ผ์„ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.","parameters":{"type":"object","properties":{"task_name":{"type":"string","description":"The name of the task"},"deadline":{"type":"string","description":"The deadline for the task(์‚ฌ์šฉ์ž์˜ ํ‘œํ˜„ ๊ทธ๋Œ€๋กœ ์ถ”์ถœ. ์˜ˆ: ๋‹ค์Œ์ฃผ ์ˆ˜์š”์ผ, ๋‚ด์ผ 7์‹œ ๋“ฑ)"}},"required":["task_name","deadline"]}}},{"type":"function","function":{"name":"setupDday","description":"์ด๋ฆ„๊ณผ ๋‚ ์งœ๋ฅผ ์ž…๋ ฅ๋ฐ›์•„ ๋””๋ฐ์ด ๋ชฉ๋ก์— ์ƒˆ๋กœ์šด ๋””๋ฐ์ด๋ฅผ ์ƒ์„ฑํ•œ๋‹ค.","parameters":{"type":"object","properties":{"ddayName":{"type":"string","description":"๋””๋ฐ์ด์˜ ์ด๋ฆ„"},"ddayDate":{"type":"string","description":"๋””๋ฐ์ด ๋‚ ์งœ(YYYY-MM-DD)"},"includeStartDay":{"type":"boolean","description":"๋‚จ์€ ์ผ์ˆ˜ ๋˜๋Š” ์ง€๋‚œ ์ผ์ˆ˜ ๊ณ„์‚ฐ์‹œ ๋””๋ฐ์ด ๋‹น์ผ ๋‚ ์งœ๋ฅผ 1์ผ๋กœ ํฌํ•จํ•ด ๊ณ„์‚ฐํ• ์ง€ ์—ฌ๋ถ€.(true์ด๋ฉด ๋‹น์ผ์ด 1์ผ, false์ด๋ฉด ๋‹น์ผ์ด 0์ผ)"}},"required":["ddayName","ddayDate","includeStartDay"]}}},{"type":"function","function":{"name":"informDday","description":"์ €์žฅ๋œ ๋””๋ฐ์ด ๋ชฉ๋ก์„ ๊ฒ€์ƒ‰ํ•ด ํŠน์ • ๋””๋ฐ์ด ์ •๋ณด๋ฅผ ๋ฐ˜ํ™˜ํ•œ๋‹ค.","parameters":{"type":"object","properties":{"searchTerm":{"type":"string","description":"๋””๋ฐ์ด ๋ชฉ๋ก์„ ๊ฒ€์ƒ‰ํ•  ๋””๋ฐ์ด์˜ ์ด๋ฆ„์ด๋‚˜ ํ‚ค์›Œ๋“œ"}},"required":["searchTerm"]}}},{"type":"function","function":{"name":"addMemo","description":"์ƒˆ๋กœ์šด ๋ฉ”๋ชจ๋ฅผ ์ถ”๊ฐ€","parameters":{"type":"object","properties":{"title":{"type":"string","description":"๋ฉ”๋ชจ์˜ ์ œ๋ชฉ(์‚ฌ์šฉ์ž๊ฐ€ ์ œ๋ชฉ์„ ์ง์ ‘ ์–ธ๊ธ‰ํ•˜์ง€ ์•Š์œผ๋ฉด ๋ฉ”๋ชจ ๋‚ด์šฉ์„ ํ†ตํ•ด ์ ์ ˆํ•œ ์ œ๋ชฉ์„ ์ƒ์„ฑํ•ด ์ง€์ •)"},"content":{"type":"string","description":"๋ฉ”๋ชจ ๋‚ด์šฉ"}},"required":["title","content"]}}}]
        
        context: 
         user: ์ œ๋ฆฌ ์ถœ๊ตญ๋‚ ์ด ์–ธ์ œ์˜€์ง€?
         assistant: tool_calls 
                    {"type": "function", "function": {"name": "informDday", "arguments": "{\"searchTerm\": \"์ œ๋ฆฌ ์ถœ๊ตญ๋‚ \"}"}}
         tool: {"name": "informDday", "content": "{\"ddayName\": \"์ œ๋ฆฌ ์ถœ๊ตญ๋‚ \", \"ddayDate\": \"2024-04-23\", \"daysRemaining\": 48, \"daysSince\": None}"}
         assistant: ์ œ๋ฆฌ ์ถœ๊ตญ๋‚ ์€ 2024๋…„ 4์›” 23์ผ์ž…๋‹ˆ๋‹ค. ์•ž์œผ๋กœ 48์ผ ๋‚จ์•˜์Šต๋‹ˆ๋‹ค.
         user: ์†ก๋ณ„ํšŒ ์ผ์ • ์žก๊ธฐ to do list์— ์ถ”๊ฐ€ํ•ด์ค˜.
        
        assistant: ์•Œ๊ฒ ์Šต๋‹ˆ๋‹ค. ๋ฐ๋“œ๋ผ์ธ์ด ์–ธ์ œ์ธ๊ฐ€์š”?
        
      • relevance: An LM must generate an appropriate response when it cannot provide a function for a user prompt.

        tools:
         [{"type":"function","function":{"name":"informWeather","description":"ํŠน์ • ์ง€์—ญ์˜ ํ˜„์žฌ ๋‚ ์”จ ์ •๋ณด ์ œ๊ณต","parameters":{"type":"object","properties":{"location":{"type":"string","description":"๋‚ ์”จ ์ •๋ณด๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ง€์—ญ ์ด๋ฆ„"}},"required":["location"]}}},{"type":"function","function":{"name":"add_task","description":"to-do list์— ์ƒˆ๋กœ์šด ํ•  ์ผ์„ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.","parameters":{"type":"object","properties":{"task_name":{"type":"string","description":"The name of the task"},"deadline":{"type":"string","description":"The deadline for the task(์‚ฌ์šฉ์ž์˜ ํ‘œํ˜„ ๊ทธ๋Œ€๋กœ ์ถ”์ถœ. ์˜ˆ: ๋‹ค์Œ์ฃผ ์ˆ˜์š”์ผ, ๋‚ด์ผ 7์‹œ ๋“ฑ)"}},"required":["task_name","deadline"]}}},{"type":"function","function":{"name":"setupDday","description":"์ด๋ฆ„๊ณผ ๋‚ ์งœ๋ฅผ ์ž…๋ ฅ๋ฐ›์•„ ๋””๋ฐ์ด ๋ชฉ๋ก์— ์ƒˆ๋กœ์šด ๋””๋ฐ์ด๋ฅผ ์ƒ์„ฑํ•œ๋‹ค.","parameters":{"type":"object","properties":{"ddayName":{"type":"string","description":"๋””๋ฐ์ด์˜ ์ด๋ฆ„"},"ddayDate":{"type":"string","description":"๋””๋ฐ์ด ๋‚ ์งœ(YYYY-MM-DD)"},"includeStartDay":{"type":"boolean","description":"๋‚จ์€ ์ผ์ˆ˜ ๋˜๋Š” ์ง€๋‚œ ์ผ์ˆ˜ ๊ณ„์‚ฐ์‹œ ๋””๋ฐ์ด ๋‹น์ผ ๋‚ ์งœ๋ฅผ 1์ผ๋กœ ํฌํ•จํ•ด ๊ณ„์‚ฐํ• ์ง€ ์—ฌ๋ถ€.(true์ด๋ฉด ๋‹น์ผ์ด 1์ผ, false์ด๋ฉด ๋‹น์ผ์ด 0์ผ)"}},"required":["ddayName","ddayDate","includeStartDay"]}}},{"type":"function","function":{"name":"informDday","description":"์ €์žฅ๋œ ๋””๋ฐ์ด ๋ชฉ๋ก์„ ๊ฒ€์ƒ‰ํ•ด ํŠน์ • ๋””๋ฐ์ด ์ •๋ณด๋ฅผ ๋ฐ˜ํ™˜ํ•œ๋‹ค.","parameters":{"type":"object","properties":{"searchTerm":{"type":"string","description":"๋””๋ฐ์ด ๋ชฉ๋ก์„ ๊ฒ€์ƒ‰ํ•  ๋””๋ฐ์ด์˜ ์ด๋ฆ„์ด๋‚˜ ํ‚ค์›Œ๋“œ"}},"required":["searchTerm"]}}},{"type":"function","function":{"name":"addMemo","description":"์ƒˆ๋กœ์šด ๋ฉ”๋ชจ๋ฅผ ์ถ”๊ฐ€","parameters":{"type":"object","properties":{"title":{"type":"string","description":"๋ฉ”๋ชจ์˜ ์ œ๋ชฉ(์‚ฌ์šฉ์ž๊ฐ€ ์ œ๋ชฉ์„ ์ง์ ‘ ์–ธ๊ธ‰ํ•˜์ง€ ์•Š์œผ๋ฉด ๋ฉ”๋ชจ ๋‚ด์šฉ์„ ํ†ตํ•ด ์ ์ ˆํ•œ ์ œ๋ชฉ์„ ์ƒ์„ฑํ•ด ์ง€์ •)"},"content":{"type":"string","description":"๋ฉ”๋ชจ ๋‚ด์šฉ"}},"required":["title","content"]}}}]
        
        context:
         user: ์ œ๋ฆฌ ์ถœ๊ตญ๋‚ ์ด ์–ธ์ œ์˜€์ง€?
         assistant: tool_calls 
                    {"type": "function", "function": {"name": "informDday", "arguments": "{\"searchTerm\": \"์ œ๋ฆฌ ์ถœ๊ตญ๋‚ \"}"}}
         tool: {"name": "informDday", "content": "{\"ddayName\": \"์ œ๋ฆฌ ์ถœ๊ตญ๋‚ \", \"ddayDate\": \"2024-04-23\", \"daysRemaining\": 48, \"daysSince\": None}"}
         assistant: ์ œ๋ฆฌ ์ถœ๊ตญ๋‚ ์€ 2024๋…„ 4์›” 23์ผ์ž…๋‹ˆ๋‹ค. ์•ž์œผ๋กœ 48์ผ ๋‚จ์•˜์Šต๋‹ˆ๋‹ค.
         user: ์†ก๋ณ„ํšŒ ์ผ์ • ์žก๊ธฐ to do list์— ์ถ”๊ฐ€ํ•ด์ค˜.
         assistant: ์•Œ๊ฒ ์Šต๋‹ˆ๋‹ค. ๋ฐ๋“œ๋ผ์ธ์ด ์–ธ์ œ์ธ๊ฐ€์š”?
         user: ๋‹ค์Œ์ฃผ ํ† ์š”์ผ
         assistant: tool_calls
                     {"type": "function", "function": {"name": "add_task", "arguments": "{\"task_name\": \"์†ก๋ณ„ํšŒ ์ผ์ • ์žก๊ธฐ\", \"deadline\": \"๋‹ค์Œ์ฃผ ํ† ์š”์ผ\"}"}}
         tool: {"name": "add_task", "content": "{\"status\": \"success\"}"}
         assistant: ํ•  ์ผ์„ ๋“ฑ๋กํ–ˆ์Šต๋‹ˆ๋‹ค
         user: ๋‹ค๋นˆ์ดํ•œํ…Œ ๊ดœ์ฐฎ์„ ๋•Œ ์ „ํ™” ํ•œ๋ฒˆ ๋‹ฌ๋ผ๊ณ  ๋ฌธ์ž ๋‚จ๊ฒจ์ค˜.
         
        assistant: ๋ฌธ์ž ์ „์†ก ๊ธฐ๋Šฅ์€ ์—†์Šต๋‹ˆ๋‹ค.
        
  • CallDecision

Evaluation Method

The FunctionChat-Bench utilizes a rubric evaluation method called LLM-as-Judge, where OpenAI GPT-4 serves as the evaluator. This specially designed evaluation system quantitatively measures the performance of each dialogs and function call without human intervention. Using the evaluation rubric, OpenAI GPT-4 assesses the accuracy and relevance of the responses generated by LMs and assigns scores accordingly.

Installation

cd FunctionChat-Bench
pip3 install -r requirements.txt

Config

API settings required for evaluation. The evaluation API is configured in config/openai.cfg.

openai config format

{
  "api_type": "openai",
  "api_key": "__YOUR_OPENAI_KEY__",
  "api_version": "gpt-4-1106-preview",
  "temperature": 0.1,
  "max_tokens": 4096,
  "n": 3
}

openai azure config format

{
  "api_type": "azure",
  "api_key": "__YOUR_OPENAI_KEY__",
  "api_base": "__AZURE_ENDPOINT__",
  "api_version": "gpt-4-1106-preview",
  "instance": "__AZURE_INSTANCE_NAME__"",
  "temperature": 0.1,
  "max_tokens": 4096,
  "n": 3
}

Evaluation

Evaluation for openai api

# run dialog evaluation
python3 evaluate.py dialog \
--input_path data/FunctionChat-Dialog.jsonl \
--system_prompt_path data/system_prompt.txt \
--temperature 0.1 \
--model {model_name} \
--api_key {api_key} 

# run singlecall evaluation
python3 evaluate.py singlecall \
--input_path data/FunctionChat-Singlecall.jsonl \
--tools_type all \
--system_prompt_path data/system_prompt.txt \
--temperature 0.1 \
--model {model_name} \
--api_key {api_key} 

# run calldecision evaluation
python3 evaluate.py common \
--input_path data/FunctionChat-CallDecision.jsonl \
--temperature 0.1 \
--model {model_name} \
--api_key {api_key} 
  • A model_name like gpt-3.5-turbo-0125 is needed.

Evaluation for local api

# run dialog evaluation
python3 evaluate.py dialog \
--input_path data/FunctionChat-Dialog.jsonl \
--system_prompt_path data/system_prompt.txt \
--temperature 0.1 \
--model inhouse \
--base_url {base_url} \
--api_key {api_key} \
--served_model_name {model_name}

# run singlecall evaluation
python3 evaluate.py singlecall \
--input_path data/FunctionChat-Singlecall.jsonl \
--tools_type all \
--system_prompt_path data/system_prompt.txt \
--temperature 0.1 \
--model inhouse \
--base_url {base_url} \
--api_key {api_key} \
--served_model_name {model_name}

# run calldecision evaluation
python3 evaluate.py common \
--input_path data/FunctionChat-CallDecision.jsonl \
--temperature 0.1 \
--model inhouse \
--base_url {base_url} \
--api_key {api_key} 
--served_model_name {model_name}
  • If the model_path is required in the request header, add the --model_path parameter.
  • Follows OpenAI's API specifications.

Evaluation for gemini, claude api(alphachat api)

base_url="http://alpha-gateway-dev.dev.onkakao.net/v1"
model_name="gemini-2.5-pro" # or "claude-opus-4"
api_key="sk-*****"

# run dialog evaluation
python3 evaluate.py dialog \
--input_path data/FunctionChat-Dialog.jsonl \
--system_prompt_path data/system_prompt.txt \
--temperature 0.1 \
--model ${model_name} \
--api_key ${api_key} \
--base_url ${base_url}

# run singlecall evaluation
python3 evaluate.py singlecall \
--input_path data/FunctionChat-Singlecall.jsonl \
--tools_type all \
--system_prompt_path data/system_prompt.txt \
--temperature 0.1 \
--model ${model_name} \
--api_key ${api_key} \
--base_url ${base_url}

# run calldecision evaluation
python3 evaluate.py common \
--input_path data/FunctionChat-CallDecision.jsonl \
--temperature 0.1 \
--model ${model_name} \
--api_key ${api_key} \
--base_url ${base_url}

Evaluation for gemini api

# run dialog evaluation
python3 evaluate.py dialog \
--input_path data/FunctionChat-Dialog.jsonl \
--system_prompt_path data/system_prompt.txt \
--temperature 0.1 \
--model {gemini_model_name} \
--gcloud_project_id {base_url} \
--gcloud_location {api_key} 

# run singlecall evaluation
python3 evaluate.py singlecall \
--input_path data/FunctionChat-Singlecall.jsonl \
--tools_type all \
--system_prompt_path data/system_prompt.txt \
--temperature 0.1 \
--model {gemini_model_name} \
--gcloud_project_id {base_url} \
--gcloud_location {api_key} 

# run calldecision evaluation
python3 evaluate.py common \
--input_path data/FunctionChat-CallDecision.jsonl \
--temperature 0.1 \
--model {gemini_model_name} \
--gcloud_project_id {base_url} \
--gcloud_location {api_key} 

Evaluation for mistral api

# run dialog evaluation
python3 evaluate.py dialog \
--input_path data/FunctionChat-Dialog.jsonl \
--system_prompt_path data/system_prompt.txt \
--temperature 0.1 \
--model {mistral_model_name} \
--api_key {api_key} 

# run singlecall evaluation
python3 evaluate.py singlecall \
--input_path data/FunctionChat-Singlecall.jsonl \
--tools_type all \
--system_prompt_path data/system_prompt.txt \
--temperature 0.1 \
--model {mistral_model_name} \
--api_key {api_key} 

# run calldecision evaluation
python3 evaluate.py common \
--input_path data/FunctionChat-CallDecision.jsonl \
--temperature 0.1 \
--model {mistral_model_name} \
--api_key {api_key} 
  • A mistral_model_name like mistral-small-latest is needed.

Evaluation for solar api

# run dialog evaluation
python3 evaluate.py dialog \
--input_path data/FunctionChat-Dialog.jsonl \
--system_prompt_path data/system_prompt.txt \
--temperature 0.1 \
--model {solar_model_name} \
--base_url {base_url} \
--api_key {api_key} 

# run singlecall evaluation
python3 evaluate.py singlecall \
--input_path data/FunctionChat-Singlecall.jsonl \
--tools_type all \
--system_prompt_path data/system_prompt.txt \
--temperature 0.1 \
--model {solar_model_name} \
--base_url {base_url} \
--api_key {api_key} 

# run calldecision evaluation
python3 evaluate.py common \
--input_path data/FunctionChat-CallDecision.jsonl \
--temperature 0.1 \
--model {solar_model_name} \
--api_key {api_key} 
  • A solar_model_name like solar-1-mini-chat-240502 is needed.

Additional option - common

An option added to support a highly compatible data structure that allows for flexible content specification.

# openai evaluation
python3 evaluate.py common \
--input_path data/{common-evaluation-file}.jsonl \
--temperature 0.1 \
--model {model_name} \
--api_key {api_key} 

# inhouse evaluation
python3 evaluate.py common \
--input_path data/{common-evaluation-file}.jsonl \
--temperature 0.1 \
--model inhouse \
--base_url {base_url} \
--api_key {api_key} \
--model_path {model_path}
  • It is an option developed for the expansion of the evaluation set.
  • {common-evaluation-file}.jsonl : An evaluation dataset file in a format that follows the common option.
  • Currently, the only evaluation set compatible with the common option is FunctionChat-CallDecision.jsonl.

Additional option - local-inference

python3 evaluate.py common \
--input_path data/{common-evaluation-file}.jsonl \
--model inhouse-local \
--model_path {model_path} \
--tool_parser {template_name}
  • It must be run in a GPU environment compatible with vLLM.

local-inference example

python3 evaluate.py common \
--input_path data/FunctionChat-CallDecision.jsonl \
--model inhouse-local \
--model_path /data/nlp-public_338/models/decoder/internal/kanana-essence-8b-fc-v1.0.1-stage1-rc.20 \
--tool_parser functionary_v3_llama_31

License

This software is licensed under the Apache 2 license, quoted below.

Copyright 2024 Kakao Corp. http://www.kakaocorp.com

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this project except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Citation

@misc{lee2024functionchatbenchcomprehensiveevaluationlanguage,
      title={FunctionChat-Bench: Comprehensive Evaluation of Language Models' Generative Capabilities in Korean Tool-use Dialogs}, 
      author={Shinbok Lee and Gaeun Seo and Daniel Lee and Byeongil Ko and Sunghee Jung and Myeongcheol Shin},
      year={2024},
      eprint={2411.14054},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.14054}, 
}

About

No description, website, or topics provided.

Resources

License

Contributing

Stars

Watchers

Forks

Packages

No packages published

Contributors 6

Languages