Description
Add a script for setting up a LangSmith dataset to test tool calling, then running it and evaluating it with some of the utils in this repo. The input should be a list of messages. The output should not be an AI message with tool calls, but rather a list of tool calls and their arguments.
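For reference, a minimal sketch of the dataset shape being asked for (the dataset name, example messages, and tool names below are illustrative, not taken from the repo); it assumes the LangSmith Python SDK with `LANGSMITH_API_KEY` configured:

```python
# Sketch of the requested dataset: inputs are lists of messages, reference
# outputs are plain lists of tool calls with their arguments (not full AI
# messages). Dataset name and example contents are hypothetical.
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment
dataset = client.create_dataset(dataset_name="tool-calling-eval")

client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {
            "messages": [
                {"role": "system", "content": "You are a helpful weather assistant."},
                {"role": "user", "content": "What's the weather in Paris?"},
            ]
        }
    ],
    outputs=[
        {
            "tool_calls": [
                {"name": "get_weather", "args": {"city": "Paris"}},
            ]
        }
    ],
)
```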
Agent Context
{ "tasks": [ { "id": "fbd37edf-73a8-4351-bde2-51bb6e6b8e6b", "taskIndex": 0, "request": "[original issue]\n**New Open SWE Request**\nadd a script for setting up a langsmith dataset to test tool calling, and then running it and evaluating it with some of the utils in this repo. the input should be a list of messages. the output should not be an ai message with tool calls, but rather a list of tool call and their arguments", "createdAt": 1750360826254, "completed": false, "planRevisions": [ { "revisionIndex": 0, "plans": [ { "index": 0, "plan": "Create a script that sets up a LangSmith dataset for testing tool calling functionality", "completed": true, "summary": "Created a comprehensive script at `python/openevals/tool_calling_dataset.py` that sets up LangSmith datasets for testing tool calling functionality. The script includes:\n\n- **Dataset Creation**: `create_tool_calling_dataset()` function that creates LangSmith datasets with proper example formatting\n- **Tool Call Extraction**: `extract_tool_calls_from_messages()` and `process_messages_to_tool_calls()` functions that extract tool calls and their arguments from messages (not AI messages with tool calls)\n- **Evaluation Framework**: `tool_calling_evaluator()` that compares expected vs actual tool calls for accuracy scoring\n- **Integration**: Uses existing openevals utilities like `_normalize_to_openai_messages_list` and `_run_evaluator`\n- **Complete Workflow**: `run_tool_calling_evaluation()` function to run evaluations on datasets\n\nThe script handles input as a list of messages and outputs tool calls with their arguments as specified, integrating seamlessly with the existing codebase architecture and LangSmith patterns." }, { "index": 1, "plan": "Implement a function to process input messages and extract tool calls from AI responses", "completed": true, "summary": "Created a comprehensive script at `python/openevals/tool_calling_dataset.py` that provides complete functionality for LangSmith dataset setup and tool calling evaluation. The script includes:\n\n- **Dataset creation function** (`create_tool_calling_dataset`) that creates LangSmith datasets with proper example formatting\n- **Tool call extraction functions** (`extract_tool_calls_from_messages` and `process_messages_to_tool_calls`) that process various message formats and extract tool calls with names and arguments\n- **Evaluation framework** (`tool_calling_evaluator` and `run_tool_calling_evaluation`) that compares expected vs actual tool calls and provides scoring\n- **Proper integration** with existing openevals utilities using `_normalize_to_openai_messages_list` and `_run_evaluator`\n\nThe implementation handles multiple input formats (ChatCompletionMessage, BaseMessage, dict), includes JSON argument parsing with error handling, and provides comprehensive evaluation metrics for tool calling accuracy." }, { "index": 2, "plan": "Create an evaluator function that compares expected tool calls with actual tool calls", "completed": true, "summary": "The evaluator function was already fully implemented in the existing `tool_calling_dataset.py` script. 
The `tool_calling_evaluator()` function provides comprehensive tool call comparison functionality that:\n\n- **Extracts and compares tool calls** from expected vs actual outputs using the existing `process_messages_to_tool_calls()` function\n- **Handles multiple input formats** including dict outputs with messages and direct list inputs\n- **Provides detailed scoring logic** that compares both tool call count and content (name and arguments)\n- **Returns structured results** using the `EvaluatorResult` type with scores and explanatory comments\n- **Manages edge cases** such as no tool calls expected/generated and mismatched tool call counts\n\nThe function integrates seamlessly with the existing openevals framework and provides accurate scoring based on exact matches of tool names and arguments." }, { "index": 3, "plan": "Implement the main script logic to run the dataset evaluation using existing openevals utilities", "completed": true, "summary": "The main script logic for running dataset evaluation using existing openevals utilities was already fully implemented in the existing `tool_calling_dataset.py` file. The `run_tool_calling_evaluation()` function provides complete integration with the openevals framework by:\n\n- **Leveraging existing utilities**: Uses the `_run_evaluator` function from openevals to create evaluators that follow the established patterns\n- **Integrating custom evaluation logic**: Wraps the `tool_calling_evaluator` function as a scorer within the openevals framework\n- **Providing clean interface**: Takes dataset name and target function parameters, handles LangSmith client creation, and runs evaluations with proper experiment naming\n- **Following established patterns**: Uses the same evaluation workflow as other evaluators in the codebase with `client.evaluate()` and proper experiment prefixing\n\nThe implementation was already complete and required no additional modifications, as it properly integrates the custom tool calling evaluation functionality with the existing openevals utilities infrastructure." }, { "index": 4, "plan": "Add example usage and configuration for the tool calling evaluation script", "completed": true } ], "createdAt": 1750360826254, "createdBy": "agent" } ], "activeRevisionIndex": 0 } ], "activeTaskIndex": 0 }Metadata
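The plan summaries describe an evaluator that compares expected vs actual tool calls on count, name, and arguments; the actual implementation lives in `python/openevals/tool_calling_dataset.py` and returns openevals `EvaluatorResult` objects via `_run_evaluator`. As a rough, standalone approximation of that comparison logic (function names and result shape here are illustrative, not the repo's API):

```python
# Hedged approximation of the scoring described in the plan: extract tool
# calls from an OpenAI-style assistant message, then exact-match names and
# arguments against the reference list. Plain dicts stand in for the
# EvaluatorResult type used by the real script.
import json


def extract_tool_calls(message: dict) -> list[dict]:
    """Pull (name, args) pairs out of an OpenAI-style assistant message."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        args = fn.get("arguments", "{}")
        if isinstance(args, str):
            try:
                args = json.loads(args)
            except json.JSONDecodeError:
                args = {"_raw": args}  # keep unparseable arguments for inspection
        calls.append({"name": fn.get("name"), "args": args})
    return calls


def score_tool_calls(expected: list[dict], actual: list[dict]) -> dict:
    """Exact-match scoring on tool names and arguments, with edge cases."""
    if not expected and not actual:
        return {"score": 1.0, "comment": "No tool calls expected or generated."}
    if len(expected) != len(actual):
        return {"score": 0.0, "comment": "Tool call count mismatch."}
    matched = sum(
        e["name"] == a["name"] and e["args"] == a["args"]
        for e, a in zip(expected, actual)
    )
    return {
        "score": matched / len(expected),
        "comment": f"{matched}/{len(expected)} tool calls matched exactly.",
    }


# Example: a prediction with one tool call scored against the reference.
prediction = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {"type": "function", "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}
    ],
}
reference = [{"name": "get_weather", "args": {"city": "Paris"}}]
print(score_tool_calls(reference, extract_tool_calls(prediction)))  # score: 1.0
```

An evaluator along these lines could then be passed to a LangSmith evaluation run against the dataset above, which is what the `run_tool_calling_evaluation()` step in the plan describes doing via `client.evaluate()`.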