[Bugfix] Fix the bug in Hermes streaming parsing #20824

Open. Wants to merge 2 commits into main.

1 change: 1 addition & 0 deletions requirements/common.txt
@@ -28,6 +28,7 @@ lark == 1.2.2
 xgrammar == 0.1.19; platform_machine == "x86_64" or platform_machine == "aarch64" or platform_machine == "arm64"
 typing_extensions >= 4.10
 filelock >= 3.16.1 # need to contain https://github.com/tox-dev/filelock/pull/317
+json_repair # used for repairing JSON outputs
 partial-json-parser # used for parsing partial JSON outputs
 pyzmq >= 25.0.0
 msgspec
@@ -7,7 +7,7 @@

 import partial_json_parser
 import regex as re
-from partial_json_parser.core.options import Allow
+from json_repair import repair_json

 from vllm.entrypoints.chat_utils import random_tool_call_id
 from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
@@ -170,6 +170,7 @@ def extract_tool_calls_streaming(
 # something with tools with this diff.
 # flags for partial JSON parting. exported constants from
 # "Allow" are handled via BIT MASK
+from partial_json_parser.core.options import Allow
 flags = Allow.ALL if self.current_tool_name_sent \
     else Allow.ALL & ~Allow.STR

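The flag expression in this hunk relies on bit-mask arithmetic: start from every allowed partial type and clear the string bit until the tool name has been sent, presumably so that a half-streamed string is not surfaced as a value. The sketch below illustrates only that masking pattern. `ToyAllow` and `flags_for` are hypothetical stand-ins built on the standard library's `enum.IntFlag`; they are not the `Allow` definitions from `partial_json_parser`, whose members and values may differ.

```python
from enum import IntFlag


class ToyAllow(IntFlag):
    """Hypothetical stand-in for a bit-mask of partially parseable JSON types."""
    STR = 1
    NUM = 2
    ARR = 4
    OBJ = 8
    ALL = STR | NUM | ARR | OBJ


def flags_for(name_sent: bool) -> ToyAllow:
    """Allow everything once the name is sent; otherwise clear the STR bit."""
    return ToyAllow.ALL if name_sent else ToyAllow.ALL & ~ToyAllow.STR


before_name = flags_for(False)  # partial strings are not allowed yet
after_name = flags_for(True)    # every partial type is allowed
```

Membership can then be checked with `in`, for example `ToyAllow.STR in after_name`.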
@@ -237,6 +238,9 @@ def extract_tool_calls_streaming(
     return delta

 try:
+    if tool_call_portion is not None:
+        # repair the JSON if needed
+        tool_call_portion = repair_json(tool_call_portion)
Comment on lines +241 to +243
Contributor

medium

Calling repair_json on every streaming delta might introduce a performance bottleneck, potentially leading to quadratic complexity (O(N^2)) for long tool calls as tool_call_portion grows with each token. Consider attempting to parse with partial_json_parser first and only calling repair_json if that fails.
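The parse-first, repair-on-failure order suggested above could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: `parse_with_fallback`, `toy_repair`, and `counting_repair` are hypothetical names, and the standard library's `json.loads` stands in for the parser so the block runs without third-party packages. In the parser under review, `parse` would presumably wrap `partial_json_parser.loads` (with its flags) and `repair` would be `repair_json`; which exception types to catch there is an assumption to verify against the library.

```python
import json
from typing import Any, Callable, Tuple, Type


def parse_with_fallback(
    text: str,
    parse: Callable[[str], Any],
    repair: Callable[[str], str],
    errors: Tuple[Type[Exception], ...] = (ValueError,),
) -> Any:
    """Try `parse` first; call `repair` only if that first attempt fails."""
    try:
        return parse(text)
    except errors:
        # Slow path: repair once, then parse the repaired text.
        return parse(repair(text))


def toy_repair(text: str) -> str:
    """Hypothetical repair step: swap single quotes for double quotes."""
    return text.replace("'", '"')


repair_calls = []


def counting_repair(text: str) -> str:
    """Record each repair call so the fast path can be observed."""
    repair_calls.append(text)
    return toy_repair(text)


# Valid input takes the fast path; repair is never invoked.
ok = parse_with_fallback('{"name": "get_weather"}', json.loads, counting_repair)
# Invalid input (single quotes) triggers exactly one repair call.
fixed = parse_with_fallback("{'name': 'get_weather'}", json.loads, counting_repair)
```

With this ordering, well-formed deltas never pay for the repair step, and the repair cost is incurred only on inputs the parser rejects.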

Author

in streaming mode, normally N = 1
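To put numbers on this exchange, here is a rough sketch of the arithmetic. It assumes that each step re-processes the whole accumulated `tool_call_portion` (as the review comment describes) and that the cost of a step is proportional to the buffer length at that step. `total_chars_processed` is a hypothetical helper for illustration, not part of vLLM.

```python
from typing import List


def total_chars_processed(delta_lengths: List[int]) -> int:
    """Sum the accumulated buffer length over all deltas.

    Models a loop that re-reads the full buffer after every delta.
    """
    total = 0
    buffered = 0
    for n in delta_lengths:
        buffered += n      # the buffer grows by this delta
        total += buffered  # the whole buffer is processed again
    return total


# 1000 deltas of one character each: 1 + 2 + ... + 1000 = 1000 * 1001 / 2
work = total_chars_processed([1] * 1000)
```

Under these assumptions, even when every delta is a single character, the total work grows with the square of the tool call's length, because the size of each delta and the size of the buffer being re-processed are different quantities.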


     current_tool_call = partial_json_parser.loads(
         tool_call_portion or "{}",