Poor quality results with reasoning models and structured output #15670

matthew-at-qamcom · 2025-03-28T03:26:05Z

I'm using QwQ-32B [1] with structured outputs. I'm finding the results are of very low quality. Am I doing something wrong?

I have modified some of the example code that asks if the capital of France is either Paris or London.

With a temperature of 1, it picks London in 54 out of 100 attempts!

Changing the temperature to 0 has no real impact (44 out of 100 runs picked London).

This is less than ideal. Without structured output, I doubt QwQ would ever get the answer wrong.
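
For context, a quick significance check suggests that both counts are statistically hard to tell apart from a coin flip between the two regex options. The sketch below uses an exact two-sided binomial test with only the standard library; the helper name `two_sided_binomial_p` is made up for illustration and is not part of vLLM or the example code.

```python
from math import comb


def two_sided_binomial_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than
    observing k successes in n trials with success probability p.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    # Small tolerance so outcomes tied with k are not lost to float rounding.
    threshold = pmf[k] * (1 + 1e-9)
    return sum(x for x in pmf if x <= threshold)


# London counts reported above: 54/100 at temperature 1, 44/100 at temperature 0.
for london in (54, 44):
    print(london, round(two_sided_binomial_p(london, 100), 3))
```

A p-value well above 0.05 for both 54 and 44 means neither run gives evidence that the constrained output differs from picking one of the two options at random.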

Any suggestions?

Thanks in advance.

I'm running the model using:

VLLM_USE_V1=0 vllm serve ospatch/QwQ-32B-INT8-W8A8 --tensor-parallel-size 4 --max_num_batched_tokens 32768 --max_num_seqs 1024 --enable-reasoning --reasoning-parser deepseek_r1

My modifications to the example code:

import asyncio

from openai import AsyncOpenAI

# Point the OpenAI client at vLLM's OpenAI-compatible API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = AsyncOpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Guided decoding by regex
prompt = "What is the capital of France?"


async def create_task(semaphore, model):
    # The semaphore caps the number of requests in flight.
    async with semaphore:
        result = await client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": prompt,
            }],
            extra_body={
                "guided_regex": "(Paris|London)",
            },
            temperature=0,
        )
        return result


async def main():
    # Use the first model reported by the server.
    models = await client.models.list()
    model = models.data[0].id

    semaphore = asyncio.Semaphore(30)
    tasks = [create_task(semaphore, model) for _ in range(100)]
    results = await asyncio.gather(*tasks)

    contents = [result.choices[0].message.content for result in results]
    london_count = sum(1 for content in contents if content == "London")

    # Fraction of runs that answered "London".
    print(london_count / len(contents))


# In a notebook, use `await main()` instead.
asyncio.run(main())

[1] Specifically, I'm using ospatch/QwQ-32B-INT8-W8A8

matthew-at-qamcom · 2025-03-30T21:57:53Z

Great news! The JSON component of structured output doesn't suffer the same issues.

Here are some results from different approaches I tried.

Guided decoding by Regex (like in the example)

Paris: 58
London: 42

Guided decoding by Regex (like in the example) but this time with temperature set to zero

Paris: 48
London: 52

Turning off the guided Regex and going back to a temperature of one

Paris: 100
London: 0

(I'm just looking to see if "Paris" or "London" appears in the response.)
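
The substring check described above could be sketched roughly as follows. The helper name `tally` and the sample strings are hypothetical; the post does not show the counting code used for the unconstrained runs.

```python
from collections import Counter


def tally(contents, options=("Paris", "London")):
    """Count how many responses mention each option (substring match)."""
    counts = Counter({option: 0 for option in options})
    for content in contents:
        # Treat a missing message body as an empty string.
        text = content or ""
        for option in options:
            if option in text:
                counts[option] += 1
    return counts


# Illustrative inputs only, not real model output.
sample = ["Paris", "The capital of France is Paris.", "London", None]
counts = tally(sample)
print(f"Paris: {counts['Paris']}")
print(f"London: {counts['London']}")
```

Note that a response mentioning both cities is counted once under each, so the two totals need not add up to the number of runs.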

Trying a one-shot prompt with guided regex

I modified the prompt to be
"Q: What is the capital of New Zealand? A: Wellington. Q: What is the capital of France? A:"

Paris: 6
London: 94

Wow.

Same one-shot prompt without guided regex

Paris: 100
London: 0

Testing out guided JSON

from pydantic import BaseModel

class Answer(BaseModel):
    country: str
    capital: str

answer_schema = Answer.model_json_schema()

prompt = """
What is the capital of France?  

Please follow the JSON template:
{
  "country": "<country>",
  "capital": "<capital of country>"
}
"""

Paris: 100
London: 0

Much better :o)

Guided JSON that's closer to the original problem

from pydantic import BaseModel

class Answer(BaseModel):
    capital: str

answer_schema = Answer.model_json_schema()

prompt = """
What is the capital of France?  

Please follow the JSON template:
{
  "capital": "<capital of country>"
}
"""

Paris: 100
London: 0

So my solution will be to use guided JSON and not Regex.
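
The snippets above do not show how the schema was sent to the server. One plausible way, assuming vLLM's `guided_json` field passed through `extra_body` in the same manner as `guided_regex` in the original example, is sketched below. The schema is written out by hand so the block runs without pydantic; it is assumed to match the single-field `Answer` model only in spirit, and `build_request` and `extract_capital` are hypothetical helper names.

```python
import json

# Hand-written JSON schema, assumed to approximate Answer.model_json_schema()
# for the single-field model above.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {"capital": {"type": "string"}},
    "required": ["capital"],
}


def build_request(model, prompt, temperature=1.0):
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        # vLLM-specific field, forwarded by the OpenAI client via extra_body.
        "extra_body": {"guided_json": ANSWER_SCHEMA},
    }


def extract_capital(content):
    """Parse the JSON reply and return the capital, or None if it is malformed."""
    try:
        return json.loads(content).get("capital")
    except (json.JSONDecodeError, AttributeError, TypeError):
        return None
```

With the client from the first post, a request would then look like `await client.chat.completions.create(**build_request(model, prompt))`, and `extract_capital(result.choices[0].message.content)` would give the value to count.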
