What feature would you like to see?
TL;DR
- XMLAdapter only handles flat key/value XML.
- Nested models, lists, or any repeated tags raise AdapterParseError.
- Root cause: a one-level regex (<(\w+)>(.*?)</\1>) with no recursion, so the content of a matched tag is never parsed further.
- Effect: structured-data extraction use cases are broken in DSPy.
- I am preparing a PR that rewrites the parser to perform proper XML traversal; details & full analysis are linked below.
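To illustrate the root cause outside of DSPy, here is a small standalone sketch that applies the pattern quoted above to nested and repeated tags. It is not the adapter's actual code path; the `re.DOTALL` flag and the variable names are assumptions made for the demonstration.

```python
import re

# Pattern as quoted in this issue. re.DOTALL is an assumption so "." can span newlines.
FIELD_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

nested = "<person><name>John Doe</name><age>30</age></person>"
repeated = "<tag>a</tag><tag>b</tag>"

# Nested case: only the outer tag is matched; its content stays a raw XML string.
nested_matches = FIELD_PATTERN.findall(nested)

# Repeated case: collecting matches into a dict keeps only the last occurrence.
repeated_fields = dict(FIELD_PATTERN.findall(repeated))

print(nested_matches)
print(repeated_fields)
```

In the nested case the value for `person` is an unparsed string of inner XML rather than a mapping, which cannot be validated as a `Person`. In the repeated case earlier items are silently overwritten.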
Minimal Reproduction
from typing import List

from pydantic import BaseModel
import dspy


class Person(BaseModel):
    name: str
    age: int


class PersonExtraction(dspy.Signature):
    text: str = dspy.InputField()
    person: Person = dspy.OutputField()


sig = PersonExtraction()
lm_output = """
<person>
<name>John Doe</name>
<age>30</age>
</person>
"""
dspy.XMLAdapter(sig).parse(lm_output)  # Raises AdapterParseError
Expected result
{"person": Person(name="John Doe", age=30)}
Actual result
AdapterParseError: Failed to parse LM response …
Environment
DSPy : 3.0.0b1
Python : 3.11.
OS : macOS (Apple Silicon)
Impact
- Any signature with nested Pydantic models or List[...] cannot be parsed.
- Few-shot examples generated by format_field_with_value teach the LM an incorrect JSON-in-XML format, compounding errors.
- Users are pushed to JSONAdapter even when XML would be preferable.
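The JSON-in-XML point can be made concrete with a small standalone comparison. This is only a sketch: `to_json_in_xml` and `to_nested_xml` are hypothetical helpers, not DSPy functions, and emitting one repeated tag per list item is an assumption about what the canonical XML format would look like.

```python
import json
from xml.sax.saxutils import escape


def to_json_in_xml(name, value):
    """Hypothetical: wrap a JSON dump of the value in a single XML tag."""
    return f"<{name}>{json.dumps(value)}</{name}>"


def to_nested_xml(name, value):
    """Hypothetical: emit nested XML, one element per field."""
    if isinstance(value, dict):
        inner = "".join(to_nested_xml(key, item) for key, item in value.items())
        return f"<{name}>{inner}</{name}>"
    if isinstance(value, (list, tuple)):
        # Assumption: a list is written as the same tag repeated once per item.
        return "".join(to_nested_xml(name, item) for item in value)
    # Scalars: escape XML special characters in the text content.
    return f"<{name}>{escape(str(value))}</{name}>"


person = {"name": "John Doe", "age": 30}
print(to_json_in_xml("person", person))
print(to_nested_xml("person", person))
```

A few-shot example in the first form shows the LM JSON inside a tag, while the parser is expected to read XML. The second form matches the shape used in the reproduction above.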
Proposed Fix
Replace the one-level regex with an XML traversal (e.g., xml.etree.ElementTree) that:
- Recursively walks the tree,
- Builds Python containers (dict, list) matching the signature,
- Delegates scalar conversion to parse_value.
Update format_field_with_value to emit canonical XML rather than JSON-in-XML.
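A rough, standard-library-only sketch of the recursive traversal is shown below. It is not the implementation from the planned PR: `element_to_python` and `parse_fields` are hypothetical names, the synthetic `<root>` wrapper is an assumption to allow several top-level fields, and the sketch is schema-unaware, so it only produces a list when a tag actually repeats. A real fix would likely consult the signature's type annotations instead and hand leaf strings to parse_value.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def element_to_python(elem):
    """Recursively convert an element into str, dict, or list values."""
    children = list(elem)
    if not children:
        # Leaf node: return the raw text; type conversion is left to the caller.
        return (elem.text or "").strip()
    grouped = defaultdict(list)
    for child in children:
        grouped[child.tag].append(element_to_python(child))
    # A tag that appears once becomes a single value; repeated tags become a list.
    return {tag: vals[0] if len(vals) == 1 else vals for tag, vals in grouped.items()}


def parse_fields(completion):
    """Wrap the completion in a synthetic root so multiple fields can be parsed."""
    root = ET.fromstring(f"<root>{completion}</root>")
    return element_to_python(root)


print(parse_fields("<person><name>John Doe</name><age>30</age></person>"))
print(parse_fields("<tags><tag>a</tag><tag>b</tag></tags>"))
```

Leaf values stay strings here (for example `"30"`), which is where scalar conversion and Pydantic validation would take over.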
Add unit tests covering:
- Nested models,
- Repeated tags → lists,
- Mixed scalar & container fields.
I have an implementation that passes these tests and will open a draft PR once this issue is acknowledged.
Would you like to contribute?
- Yes, I'd like to help implement this.
- No, I just want to request it.
Additional Context
No response