
New Open SWE Request #109

@hwchase17

Description

Add a script for setting up a LangSmith dataset to test tool calling, and then running it and evaluating it with some of the utils in this repo. The input should be a list of messages. The output should not be an AI message with tool calls, but rather a list of tool calls and their arguments.
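
For concreteness, one dataset example under this request might look like the following sketch. The `messages` and `tool_calls` field names are illustrative assumptions, not a schema defined by the repo.

```python
# Hypothetical example shape for the requested dataset (field names are assumptions).
example_inputs = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the weather in San Francisco?"},
    ]
}

# The reference output is a list of tool calls and their arguments,
# not an AI message that happens to contain tool calls.
example_outputs = {
    "tool_calls": [
        {"name": "get_weather", "args": {"city": "San Francisco"}},
    ]
}
```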

Agent Context { "tasks": [ { "id": "fbd37edf-73a8-4351-bde2-51bb6e6b8e6b", "taskIndex": 0, "request": "[original issue]\n**New Open SWE Request**\nadd a script for setting up a langsmith dataset to test tool calling, and then running it and evaluating it with some of the utils in this repo. the input should be a list of messages. the output should not be an ai message with tool calls, but rather a list of tool call and their arguments", "createdAt": 1750360826254, "completed": false, "planRevisions": [ { "revisionIndex": 0, "plans": [ { "index": 0, "plan": "Create a script that sets up a LangSmith dataset for testing tool calling functionality", "completed": true, "summary": "Created a comprehensive script at `python/openevals/tool_calling_dataset.py` that sets up LangSmith datasets for testing tool calling functionality. The script includes:\n\n- **Dataset Creation**: `create_tool_calling_dataset()` function that creates LangSmith datasets with proper example formatting\n- **Tool Call Extraction**: `extract_tool_calls_from_messages()` and `process_messages_to_tool_calls()` functions that extract tool calls and their arguments from messages (not AI messages with tool calls)\n- **Evaluation Framework**: `tool_calling_evaluator()` that compares expected vs actual tool calls for accuracy scoring\n- **Integration**: Uses existing openevals utilities like `_normalize_to_openai_messages_list` and `_run_evaluator`\n- **Complete Workflow**: `run_tool_calling_evaluation()` function to run evaluations on datasets\n\nThe script handles input as a list of messages and outputs tool calls with their arguments as specified, integrating seamlessly with the existing codebase architecture and LangSmith patterns." }, { "index": 1, "plan": "Implement a function to process input messages and extract tool calls from AI responses", "completed": true, "summary": "Created a comprehensive script at `python/openevals/tool_calling_dataset.py` that provides complete functionality for LangSmith dataset setup and tool calling evaluation. The script includes:\n\n- **Dataset creation function** (`create_tool_calling_dataset`) that creates LangSmith datasets with proper example formatting\n- **Tool call extraction functions** (`extract_tool_calls_from_messages` and `process_messages_to_tool_calls`) that process various message formats and extract tool calls with names and arguments\n- **Evaluation framework** (`tool_calling_evaluator` and `run_tool_calling_evaluation`) that compares expected vs actual tool calls and provides scoring\n- **Proper integration** with existing openevals utilities using `_normalize_to_openai_messages_list` and `_run_evaluator`\n\nThe implementation handles multiple input formats (ChatCompletionMessage, BaseMessage, dict), includes JSON argument parsing with error handling, and provides comprehensive evaluation metrics for tool calling accuracy." }, { "index": 2, "plan": "Create an evaluator function that compares expected tool calls with actual tool calls", "completed": true, "summary": "The evaluator function was already fully implemented in the existing `tool_calling_dataset.py` script. 
The `tool_calling_evaluator()` function provides comprehensive tool call comparison functionality that:\n\n- **Extracts and compares tool calls** from expected vs actual outputs using the existing `process_messages_to_tool_calls()` function\n- **Handles multiple input formats** including dict outputs with messages and direct list inputs\n- **Provides detailed scoring logic** that compares both tool call count and content (name and arguments)\n- **Returns structured results** using the `EvaluatorResult` type with scores and explanatory comments\n- **Manages edge cases** such as no tool calls expected/generated and mismatched tool call counts\n\nThe function integrates seamlessly with the existing openevals framework and provides accurate scoring based on exact matches of tool names and arguments." }, { "index": 3, "plan": "Implement the main script logic to run the dataset evaluation using existing openevals utilities", "completed": true, "summary": "The main script logic for running dataset evaluation using existing openevals utilities was already fully implemented in the existing `tool_calling_dataset.py` file. The `run_tool_calling_evaluation()` function provides complete integration with the openevals framework by:\n\n- **Leveraging existing utilities**: Uses the `_run_evaluator` function from openevals to create evaluators that follow the established patterns\n- **Integrating custom evaluation logic**: Wraps the `tool_calling_evaluator` function as a scorer within the openevals framework\n- **Providing clean interface**: Takes dataset name and target function parameters, handles LangSmith client creation, and runs evaluations with proper experiment naming\n- **Following established patterns**: Uses the same evaluation workflow as other evaluators in the codebase with `client.evaluate()` and proper experiment prefixing\n\nThe implementation was already complete and required no additional modifications, as it properly integrates the custom tool calling evaluation functionality with the existing openevals utilities infrastructure." }, { "index": 4, "plan": "Add example usage and configuration for the tool calling evaluation script", "completed": true } ], "createdAt": 1750360826254, "createdBy": "agent" } ], "activeRevisionIndex": 0 } ], "activeTaskIndex": 0 }
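
The extraction step described in the plan summaries above (`extract_tool_calls_from_messages()` / `process_messages_to_tool_calls()`) lives in `python/openevals/tool_calling_dataset.py`. The sketch below is not that implementation; it only illustrates the general idea of pulling tool names and parsed arguments out of OpenAI-format assistant messages.

```python
import json
from typing import Any


def extract_tool_calls(messages: list[dict]) -> list[dict[str, Any]]:
    """Illustrative sketch: collect tool calls (name + parsed args) from
    OpenAI-format assistant messages. Not the repo's actual implementation."""
    tool_calls = []
    for message in messages:
        if message.get("role") != "assistant":
            continue
        for call in message.get("tool_calls") or []:
            function = call.get("function", {})
            raw_args = function.get("arguments", "{}")
            try:
                args = json.loads(raw_args) if isinstance(raw_args, str) else raw_args
            except json.JSONDecodeError:
                args = {"_raw": raw_args}  # keep unparseable arguments for inspection
            tool_calls.append({"name": function.get("name"), "args": args})
    return tool_calls
```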
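Similarly, the comparison logic attributed to `tool_calling_evaluator()` (exact matching on tool name and arguments, with edge cases for missing or extra calls) could be sketched roughly as follows. Per the summaries, the actual evaluator integrates with openevals' `_run_evaluator` and returns an `EvaluatorResult`; the standalone scorer below only mirrors the matching idea.

```python
def score_tool_calls(expected: list[dict], actual: list[dict]) -> dict:
    """Illustrative scorer: fraction of expected tool calls matched exactly
    (same name, same arguments) by an actual tool call."""
    if not expected:
        # No tool calls expected: perfect score only if none were produced.
        return {"key": "tool_call_accuracy", "score": float(not actual)}

    remaining = list(actual)
    matched = 0
    for call in expected:
        for candidate in remaining:
            if candidate["name"] == call["name"] and candidate["args"] == call["args"]:
                remaining.remove(candidate)  # each actual call can match at most once
                matched += 1
                break
    return {"key": "tool_call_accuracy", "score": matched / len(expected)}
```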
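Finally, the end-to-end flow that `create_tool_calling_dataset()` and `run_tool_calling_evaluation()` are described as providing (create a LangSmith dataset, then evaluate a target against it) could look roughly like this with the LangSmith SDK. The dataset name, target, and evaluator wiring are placeholders reusing the sketches above, not the script's actual interface.

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# 1. Create the dataset and upload one example (inputs: messages, outputs: tool calls).
dataset = client.create_dataset(
    dataset_name="tool-calling-test",  # placeholder name
    description="Messages in, expected tool calls and their arguments out.",
)
client.create_examples(
    inputs=[example_inputs],    # from the sketch above
    outputs=[example_outputs],  # from the sketch above
    dataset_id=dataset.id,
)


# 2. The target takes dataset inputs and returns tool calls; a real target would
#    call a model here and run extract_tool_calls() on its response messages.
def target(inputs: dict) -> dict:
    model_messages = inputs["messages"]  # placeholder: substitute model output
    return {"tool_calls": extract_tool_calls(model_messages)}


# 3. Adapt the scorer to LangSmith's (run, example) evaluator signature and run it.
def tool_call_accuracy(run, example) -> dict:
    return score_tool_calls(
        expected=example.outputs["tool_calls"],
        actual=run.outputs["tool_calls"],
    )


evaluate(
    target,
    data="tool-calling-test",
    evaluators=[tool_call_accuracy],
    experiment_prefix="tool-calling",
)
```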
