
[WIP] React use dspy.ToolCalls #8472


Draft · chenmoneygithub wants to merge 21 commits into main

Conversation

@chenmoneygithub (Collaborator) commented on Jun 28, 2025

Refactor dspy.ReAct to use dspy.ToolCalls for consistency.

We are keeping the behavior that when JSONAdapter is used with ToolCalls, we redirect it to ChatAdapter for quality, because we have consistently noticed that models do poorly when combining structured output with dict[str, Any]. In our experiments, native tool calling can mitigate this issue, but it is not producing promising results yet, and we are still experimenting there.
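
A minimal sketch of that behavior, assuming a hand-rolled signature (the `PickTools` class and its field names are illustrative, not part of this PR):

```python
import dspy


# Illustrative signature with a dspy.ToolCalls output field. Per the note
# above, when JSONAdapter is configured, DSPy keeps formatting such
# signatures via ChatAdapter, since structured output + dict[str, Any]
# has been hurting quality.
class PickTools(dspy.Signature):
    """Decide which tools to call for the given question."""

    question: str = dspy.InputField()
    tool_calls: dspy.ToolCalls = dspy.OutputField()


dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"), adapter=dspy.JSONAdapter())
predictor = dspy.Predict(PickTools)  # formatting is delegated to ChatAdapter here
```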

Did a quick benchmark on the Hover dataset for this PR, and we see a pretty clear quality regression:

Benchmark result with dspy.ToolCalls:

|              |   chat_adapter |   json_adapter |   xml_adapter |
|--------------|----------------|----------------|---------------|
| 4o-mini      |             76 |          73.33 |         75.33 |
| 4o           |             76 |          71.33 |         76    |
| llama3-3-70b |             76 |          68    |         70    |


Benchmark result without dspy.ToolCalls in dspy.ReAct:

|              |   chat_adapter |   json_adapter |   xml_adapter |
|--------------|----------------|----------------|---------------|
| 4o-mini      |          81.33 |          71.33 |         76    |
| 4o           |          77.33 |          75.33 |         78.67 |
| llama3-3-70b |          73.33 |          74.67 |         79.33 |

My theory is that all these LMs do a worse job of understanding deeply nested output types than flat ones. In detail, dspy.ToolCalls is a nested type that holds a list of dspy.ToolCalls.ToolCall, each of which has two fields: a string for the tool name and a dict for the tool args. By comparison, the current ReAct uses next_tool_name, which is a single string, and next_tool_args, which is a dict. So this PR introduces too much nesting for the LMs to maintain decent quality.
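
A small sketch of the two output shapes (field names `tool_calls`, `name`, and `args` follow the description above; the example values are made up):

```python
from typing import Any

import dspy

# Flat shape used by the current dspy.ReAct: two sibling output fields.
next_tool_name: str = "search_wikipedia"
next_tool_args: dict[str, Any] = {"query": "David Gregory"}

# Nested shape this PR asks the LM to emit: a dspy.ToolCalls value wrapping
# a list of ToolCall entries, each with a tool name plus an args dict.
nested = dspy.ToolCalls(
    tool_calls=[
        dspy.ToolCalls.ToolCall(name="search_wikipedia", args={"query": "David Gregory"}),
    ]
)
```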

All benchmarks are done on 50 examples from the Hover dataset, with ChatAdapter. Benchmark script:

```python
import dspy

lm_4o_mini = dspy.LM("openai/gpt-4o-mini", cache=False)
lm_4o = dspy.LM("openai/gpt-4o", cache=False)
llama = dspy.LM("databricks/databricks-meta-llama-3-3-70b-instruct", cache=False)
dspy.configure(
    lm=lm_4o_mini,
)


from pydantic import BaseModel

from dspy.datasets import DataLoader

kwargs = dict(fields=("claim", "supporting_facts", "hpqa_id", "num_hops"), input_keys=("claim",))
hover = DataLoader().from_huggingface(dataset_name="hover-nlp/hover", split="train", trust_remote_code=True, **kwargs)

hpqa_ids = set()
filtered_hover = []
for x in hover:
    if x["num_hops"] == 3 and x["hpqa_id"] not in hpqa_ids:
        hpqa_ids.add(x["hpqa_id"])
        filtered_hover.append(
            dspy.Example(claim=x.claim, titles=list(set([y["key"] for y in x.supporting_facts]))).with_inputs("claim")
        )
hover = filtered_hover

trainset, devset, testset = hover[:100], hover[100:150], hover[650:]

example = trainset[0]

print("Claim:", example.claim)
print("Pages that must be retrieved:", example.titles)

DOCS = {}


class SearchInput(BaseModel):
    search_query: str


def search(query: str, k: int) -> list[str]:
    results = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")(query, k=k)
    results = [x["text"] for x in results]

    for result in results:
        title, text = result.split(" | ", 1)
        DOCS[title] = text

    return results


def search_wikipedia(query: str) -> list[str]:
    """Returns top-5 results and then the titles of the top-5 to top-30 results."""

    topK = search(query, 30)
    titles, topK = [f"`{x.split(' | ')[0]}`" for x in topK[5:30]], topK[:5]
    return topK + [f"Other retrieved pages have titles: {', '.join(titles)}."]


def lookup_wikipedia(title: str) -> str:
    """Returns the text of the Wikipedia page, if it exists."""

    if title in DOCS:
        return DOCS[title]

    results = [x for x in search(title, 10) if x.startswith(title + " | ")]
    if not results:
        return f"No Wikipedia page found for title: {title}"
    return results[0]


instructions = "Find all Wikipedia titles relevant to verifying (or refuting) the claim."
signature = dspy.Signature("claim -> titles: list[str]", instructions)

tools = [dspy.Tool(search_wikipedia), dspy.Tool(lookup_wikipedia)]

react = dspy.ReAct(signature, tools=tools, max_iters=5)

output = react(claim="David Gregory was born in 1625.")
print(output)

dspy.inspect_history(n=3)


def top5_recall(example, pred, trace=None):
    gold_titles = example.titles
    recall = sum(x in pred.titles[:5] for x in gold_titles) / len(gold_titles)

    # If we're "bootstrapping" for optimization, return True if and only if the recall is perfect.
    if trace is not None:
        return recall >= 1.0

    # If we're just doing inference, just measure the recall.
    return recall


evaluate = dspy.Evaluate(devset=devset[:100], metric=top5_recall, num_threads=8, display_progress=True, display_table=5)


def safe_react(claim: str):
    try:
        return react(claim=claim)
    except Exception:
        return dspy.Prediction(titles=[])


evaluate(safe_react)
```
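
The non-default table columns were presumably produced by swapping the adapter and LM before re-running the evaluation; a hypothetical example (the exact invocations used are not shown in this PR):

```python
# Hypothetical: re-run the same evaluation under a different adapter / LM.
dspy.configure(lm=lm_4o, adapter=dspy.XMLAdapter())
evaluate(safe_react)

dspy.configure(lm=llama, adapter=dspy.JSONAdapter())
evaluate(safe_react)
```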

@chenmoneygithub marked this pull request as draft on June 28, 2025 at 02:50