Nested <think>
and </think>
tag in deepseek-r1 reasoning parser
#12708
BillZhou233
announced in
Q&A
Replies: 1 comment
-
Example payload: {
"model": "/home/top/.cache/modelscope/hub/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B/",
"messages": [
{
"role": "user",
"content": "I am processing the output of a large model, and its output usually looks like this ```<think>thinking content...</think>main content...```. If I use the following Python code to match the content within the think tags ```re.compile(rf\"<think>(.*?)</think>\", re.DOTALL)```, and assuming there is the following content ```<think>this text contains <think> and </think>, ending here</think>this is the main content```, will the matching result have any issues?"
}
],
"stream": false,
"temperature": 0.1
} Observed response pattern:** {
"id": "chatcmpl-7c335b1466f44e8b9a682dd7c7215d58",
"object": "chat.completion",
"created": 1738610376,
"model": "/home/top/.cache/modelscope/hub/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B/",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "\nOkay, so I'm trying to figure out how to correctly extract the content between the \"think\" tags from a string. The string I have is something like \"this text contains think and think, ending here\" followed by \"this is the main content\". I remember that the user mentioned using a regular expression with re.compile(rf\"think(.*?)main\", re.DOTALL). \n\nWait, but in the example given, the string starts with \"this text contains think and think, ending here\" and then \"this is the main content\". So the regex is looking for \"think\", then capturing everything until \"main\", right? But I'm a bit confused because the string doesn't have \"think\" and \"main\" as separate tags but rather as parts of the text.\n\nLet me think about how the regex works. The pattern is \"think\" followed by (.*?) which is a non-greedy match for any characters, and then \"main\". The re.DOTALL flag makes the dot match any character, including newlines. So in the example, the string is \"this text contains think and think, ending here this is the main content\". \n\nWhen the regex runs, it looks for \"think\" and then captures everything until it finds \"main\". But in the string, after \"think\", there's more text, including another \"think\" and then \"ending here this is the main content\". So the regex will start at the first \"think\" and capture everything until the first occurrence of \"main\". \n\nWait, but in the string, \"main\" comes after \"ending here this is the main content\". So the captured group would be from the first \"think\" up to \"ending here this is the main content\", but does it stop at the first \"main\"? Or does it include everything until the last \"main\"?\n\nHmm, no, because the regex is looking for \"main\" as a literal string. So in the example, the first \"think\" is at position where the string is \"think and think, ending here this is the main content\". So the regex will capture from the first \"think\" up to the first \"main\", which is at the end of the string. So the captured content would be \" and think, ending here this is the main content\".\n\nWait, but in the example, the string is \"this text contains think and think, ending here this is the main content\". So the regex starts at the first \"think\" and captures everything until \"main\", which is at the very end. So the captured group would be \" and think, ending here this is the main content\".\n\nBut wait, the user's example output was \"this is the main content\", but according to the regex, it's capturing everything from the first \"think\" to \"main\", which includes more than just the main content. So maybe the regex isn't correctly isolating the main content.\n\nAlternatively, maybe the user intended to have the main content after the \"main\" tag, but in the example, the \"main\" is part of the main content itself. So perhaps the regex is not correctly structured.\n\nI think the issue is that the regex is matching \"think\" followed by any characters until \"main\", but in the example, \"main\" is part of the main content, so it's capturing too much. Maybe the user wants to capture the content between the \"think\" and \"main\" tags, but in the example, the \"main\" is part of the content, not a tag.\n\nSo perhaps the regex should be adjusted to match the tags properly, like using \"<think>\" and \"",
"content": "\" as delimiters. But in the user's case, the tags are just \"think\" and \"main\", without any delimiters. So maybe the regex is correct, but the way the content is structured is causing it to capture more than intended.\n\nAlternatively, maybe the user should use a positive lookbehind and lookahead to ensure that \"think\" and \"main\" are standalone tags. But without knowing the exact structure, it's hard to say.\n\nIn the example given, the regex would capture from the first \"think\" to the first \"main\", which includes everything in between, including another \"think\" and the rest of the text. So the result would be \" and think, ending here this is the main content\", which might not be what the user wants.\n\nSo perhaps the regex needs to be adjusted to match the entire content between the \"think\" and \"main\" tags, ensuring that it's capturing only the intended part. Maybe using word boundaries or other regex features to make it more precise.\n\nAlternatively, if the \"think\" and \"main\" are part of the content and not actual tags, then the regex might not be the right approach, and another method would be better to split the string into parts.\n\nI think the main issue is that the regex is capturing from the first occurrence of \"think\" to the first occurrence of \"main\", which may include more text than intended, especially if \"think\" and \"main\" appear multiple times in the content.\n</think>\n\nThe regular expression `re.compile(rf\"think(.*?)main\", re.DOTALL)` will capture all text starting from the first occurrence of \"think\" up to and including the first occurrence of \"main\". In the provided example, this will result in capturing more text than intended, including additional \"think\" and \"main\" instances. \n\n**Answer:**\n\nThe regex will capture from the first \"think\" to the first \"main\", including all text in between, which may include unintended content. To correctly isolate the desired section, consider using more precise delimiters or adjusting the regex to match specific tags if they exist.",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 100,
"total_tokens": 1273,
"completion_tokens": 1173,
"prompt_tokens_details": null
},
"prompt_logprobs": null
} |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
I wonder if there will be nested
<think>
and</think>
tag in the output of deepseek-r1 like this situation:I attempted to modify the original test unit to run the testcase above and it fails:
Looks like that the parser marks the end of reasoning at the first
</think>
but actually reasoning can be ended at the second one. This may be ambigious. Also, when streaming context, it just omitted the second<think>
.Considering another situation:
By simply parsing the output text cannot determine the real end of reasoning part.
The hacked test case can be viewed here
plz forgive my poor English...
Beta Was this translation helpful? Give feedback.
All reactions