
[WIP] React use dspy.ToolCalls #8472


Draft · chenmoneygithub wants to merge 21 commits into main

Conversation

@chenmoneygithub (Collaborator) commented on Jun 28, 2025

Refactor dspy.ReAct to use dspy.ToolCalls for consistency.

We are keeping the behavior that when JSONAdapter is used with ToolCalls, we redirect it to ChatAdapter for quality, because we have consistently noticed that models do poorly when combining structured output with dict[str, Any]. In our experiments, native tool calling can mitigate this issue, but it is not producing promising results yet, and we are still experimenting there.
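
A minimal sketch of that behavior, assuming a hand-rolled signature (the `PickTools` class and its field names are illustrative, not part of this PR):

```python
import dspy


# Illustrative signature with a dspy.ToolCalls output field. Per the note
# above, when JSONAdapter is configured, DSPy keeps formatting such
# signatures via ChatAdapter, since structured output + dict[str, Any]
# has been hurting quality.
class PickTools(dspy.Signature):
    """Decide which tools to call for the given question."""

    question: str = dspy.InputField()
    tool_calls: dspy.ToolCalls = dspy.OutputField()


dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"), adapter=dspy.JSONAdapter())
predictor = dspy.Predict(PickTools)  # formatting is delegated to ChatAdapter here
```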

Did a quick benchmark on the Hover dataset for this PR, and we see a pretty clear quality regression:

Benchmark result with dspy.ToolCalls:

|              |   chat_adapter |   json_adapter |   xml_adapter |
|--------------|----------------|----------------|---------------|
| 4o-mini      |             76 |          73.33 |         75.33 |
| 4o           |             76 |          71.33 |         76    |
| llama3-3-70b |             76 |          68    |         70    |


Benchmark result without dspy.ToolCalls in dspy.ReAct:

|              |   chat_adapter |   json_adapter |   xml_adapter |
|--------------|----------------|----------------|---------------|
| 4o-mini      |          81.33 |          71.33 |         76    |
| 4o           |          77.33 |          75.33 |         78.67 |
| llama3-3-70b |          73.33 |          74.67 |         79.33 |

My theory is that all these LMs do a worse job of understanding deeply nested output types than flat ones. In detail, dspy.ToolCalls is a nested type that holds a list of dspy.ToolCalls.ToolCall, each of which has two fields: a string for the tool name and a dict for the tool args. By comparison, the current ReAct uses next_tool_name, which is a single string, and next_tool_args, which is a dict. So this PR introduces too much nesting for the LMs to maintain decent quality.
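
A small sketch of the two output shapes (field names `tool_calls`, `name`, and `args` follow the description above; the example values are made up):

```python
from typing import Any

import dspy

# Flat shape used by the current dspy.ReAct: two sibling output fields.
next_tool_name: str = "search_wikipedia"
next_tool_args: dict[str, Any] = {"query": "David Gregory"}

# Nested shape this PR asks the LM to emit: a dspy.ToolCalls value wrapping
# a list of ToolCall entries, each with a tool name plus an args dict.
nested = dspy.ToolCalls(
    tool_calls=[
        dspy.ToolCalls.ToolCall(name="search_wikipedia", args={"query": "David Gregory"}),
    ]
)
```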

All benchmarks are done on 50 examples from the Hover dataset, with ChatAdapter. Benchmark script:

```python
import dspy

lm_4o_mini = dspy.LM("openai/gpt-4o-mini", cache=False)
lm_4o = dspy.LM("openai/gpt-4o", cache=False)
llama = dspy.LM("databricks/databricks-meta-llama-3-3-70b-instruct", cache=False)
dspy.configure(
    lm=lm_4o_mini,
)


from pydantic import BaseModel

from dspy.datasets import DataLoader

kwargs = dict(fields=("claim", "supporting_facts", "hpqa_id", "num_hops"), input_keys=("claim",))
hover = DataLoader().from_huggingface(dataset_name="hover-nlp/hover", split="train", trust_remote_code=True, **kwargs)

hpqa_ids = set()
filtered_hover = []
for x in hover:
    if x["num_hops"] == 3 and x["hpqa_id"] not in hpqa_ids:
        hpqa_ids.add(x["hpqa_id"])
        filtered_hover.append(
            dspy.Example(claim=x.claim, titles=list(set([y["key"] for y in x.supporting_facts]))).with_inputs("claim")
        )
hover = filtered_hover

trainset, devset, testset = hover[:100], hover[100:150], hover[650:]

example = trainset[0]

print("Claim:", example.claim)
print("Pages that must be retrieved:", example.titles)

DOCS = {}


class SearchInput(BaseModel):
    search_query: str


def search(query: str, k: int) -> list[str]:
    results = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")(query, k=k)
    results = [x["text"] for x in results]

    for result in results:
        title, text = result.split(" | ", 1)
        DOCS[title] = text

    return results


def search_wikipedia(query: str) -> list[str]:
    """Returns top-5 results and then the titles of the top-5 to top-30 results."""

    topK = search(query, 30)
    titles, topK = [f"`{x.split(' | ')[0]}`" for x in topK[5:30]], topK[:5]
    return topK + [f"Other retrieved pages have titles: {', '.join(titles)}."]


def lookup_wikipedia(title: str) -> str:
    """Returns the text of the Wikipedia page, if it exists."""

    if title in DOCS:
        return DOCS[title]

    results = [x for x in search(title, 10) if x.startswith(title + " | ")]
    if not results:
        return f"No Wikipedia page found for title: {title}"
    return results[0]


instructions = "Find all Wikipedia titles relevant to verifying (or refuting) the claim."
signature = dspy.Signature("claim -> titles: list[str]", instructions)

tools = [dspy.Tool(search_wikipedia), dspy.Tool(lookup_wikipedia)]

react = dspy.ReAct(signature, tools=tools, max_iters=5)

output = react(claim="David Gregory was born in 1625.")
print(output)

dspy.inspect_history(n=3)


def top5_recall(example, pred, trace=None):
    gold_titles = example.titles
    recall = sum(x in pred.titles[:5] for x in gold_titles) / len(gold_titles)

    # If we're "bootstrapping" for optimization, return True if and only if the recall is perfect.
    if trace is not None:
        return recall >= 1.0

    # If we're just doing inference, just measure the recall.
    return recall


evaluate = dspy.Evaluate(devset=devset[:100], metric=top5_recall, num_threads=8, display_progress=True, display_table=5)


def safe_react(claim: str):
    try:
        return react(claim=claim)
    except Exception:
        return dspy.Prediction(titles=[])


evaluate(safe_react)
```
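
The non-default table columns were presumably produced by swapping the adapter and LM before re-running the evaluation; a hypothetical example (the exact invocations used are not shown in this PR):

```python
# Hypothetical: re-run the same evaluation under a different adapter / LM.
dspy.configure(lm=lm_4o, adapter=dspy.XMLAdapter())
evaluate(safe_react)

dspy.configure(lm=llama, adapter=dspy.JSONAdapter())
evaluate(safe_react)
```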

@chenmoneygithub marked this pull request as draft on June 28, 2025 at 02:50