
[Model] Add ToolParser and MoE Config for Hunyuan A13B #20820

Open

kzjeef wants to merge 5 commits into main from hy-tool-parser-submit

Conversation

@kzjeef kzjeef commented Jul 11, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

  • Add ToolParser support for the Hunyuan A13B model and make it work with Hunyuan's reasoning parser (a parsing sketch follows below).
    Also fix some minor errors in the Hunyuan reasoning parser.
  • Add a MoE config tuned on H20.
  • Add tuning support for Hunyuan in the MoE benchmark.
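
As a rough illustration of what the new parser targets, here is a minimal non-streaming sketch. It assumes the model wraps tool calls as a JSON list inside <tool_calls>...</tool_calls> tags (the tag name appears in the test results below); the actual hunyuan_a13b parser in this PR also handles streaming and edge cases.

import json
import re

# Illustrative sketch only, not the PR's parser: pull a JSON list of tool calls
# out of <tool_calls>...</tool_calls> tags in the model output.
TOOL_CALLS_RE = re.compile(r"<tool_calls>(.*?)</tool_calls>", re.DOTALL)

def extract_tool_calls(model_output: str):
    """Return (content_without_tool_calls, tool_calls_or_None)."""
    match = TOOL_CALLS_RE.search(model_output)
    if match is None:
        return model_output, None
    try:
        tool_calls = json.loads(match.group(1))
    except json.JSONDecodeError:
        # Malformed payload: fall back to plain content.
        return model_output, None
    content = (model_output[:match.start()] + model_output[match.end():]).strip()
    return content, tool_calls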

Test Plan

Unit test:

pytest tests/entrypoints/openai/tool_parsers/test_hunyuan_a13b_tool_parser.py 

OpenAI examples

Auto tool choice

python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --enable-auto-tool-choice \
    --tool-call-parser hunyuan_a13b \
    --reasoning-parser hunyuan_a13b \
    --enable_reasoning \
    --tensor-parallel-size 2 \
    --enforce-eager \
    --port 8000 \
    --model tencent/Hunyuan-A13B-Instruct \
    --trust_remote_code

OpenAI client test without reasoning:

python3 examples/online_serving/openai_chat_completion_client_with_tools.py

OpenAI client test with reasoning:

python3 examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py
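
For reference, a minimal client call of the kind those example scripts make against the server started above; the tool schema and prompt here are illustrative, not the scripts' actual contents.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Illustrative tool definition; the example scripts ship their own.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Shenzhen?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)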

Test Result

Unit Test:

tests/entrypoints/openai/tool_parsers/test_hunyuan_a13b_tool_parser.py .........x    
 9 passed, 1 xfailed, 1 warning in 3.96s 

Note: nested JSON parameters in streaming mode are not supported in this version; an expected-failure (xfail) test case covers this (see the illustrative example below).
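
To show what "nested JSON parameters" means here, a hypothetical payload of that shape (the actual xfail test case may differ): a call like the following is parsed in non-streaming mode, but its nested "filters" object is not reconstructed incrementally in streaming mode in this version.

# Hypothetical tool call with nested JSON arguments; not the real test payload.
nested_tool_call = {
    "name": "search_products",
    "arguments": {
        "query": "gpu",
        "filters": {"price": {"max": 2000}, "in_stock": True},
    },
}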

OpenAI examples

Without reasoning, tool_choice was tested in the following cases:

  • auto, with and without streaming
  • required

With reasoning, tool_choice was tested in the following cases:

  • auto, with and without streaming
  • required
  • named function

Note: tool_choice="required" does not work correctly because it does not go through the tool parser, so the arguments still contain meta strings such as <tool_calls> </tool_calls> that are not filtered out by the chat server.

(Optional) Documentation Update

  • Reasoning Outputs: Add Hunyuan information.
  • Tool Calling: Add Hunyuan tool calling information.

@kzjeef kzjeef requested a review from aarnphm as a code owner July 11, 2025 14:05

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀


@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @kzjeef, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on integrating the Hunyuan A13B model with the system's tool calling and reasoning capabilities. It introduces a new tool parser tailored for Hunyuan A13B's specific output format and includes necessary adjustments to the core chat serving logic to support this new integration, particularly for streaming responses.

Highlights

  • New Tool Parser: Introduced a dedicated ToolParser for the Hunyuan A13B model, enabling it to correctly parse and extract tool calls from the model's output, both in full and streaming modes.
  • Hunyuan Reasoning Integration: Ensured compatibility and proper functioning of the new tool parser with Hunyuan's existing reasoning parser, including minor fixes to improve its behavior.
  • Streaming Output Enhancements: Improved the serving_chat.py logic to handle streaming tool call deltas more robustly, specifically addressing potential None value issues when concatenating token IDs and allowing tool parsers to modify the final message content (see the illustrative sketch below).
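
As a rough illustration of the None-safe token-ID handling described in the last highlight; the function and variable names below are hypothetical, not the actual serving_chat.py code.

from typing import Optional

def concat_token_ids(previous: Optional[list[int]],
                     delta: Optional[list[int]]) -> list[int]:
    # Treat a missing side as empty so streaming deltas never hit `None + list`.
    return (previous or []) + (delta or [])

# Example: the first streamed delta arrives before any previous ids exist.
assert concat_token_ids(None, [101, 102]) == [101, 102]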

@kzjeef kzjeef requested a review from hmellor as a code owner July 11, 2025 14:06
@mergify mergify bot added the documentation Improvements or additions to documentation label Jul 11, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces tool parsing support for the Hunyuan A13B model. My review focuses on improving the robustness and maintainability of the new parser. I've highlighted a potential high-severity issue with the regex for parsing nested JSON, and provided suggestions to make the code more concise and to refactor complex logic for better clarity.


mergify bot commented Jul 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kzjeef.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 12, 2025
@kzjeef kzjeef force-pushed the hy-tool-parser-submit branch from a44378c to d462300 Compare July 12, 2025 14:37
@mergify mergify bot added performance Performance-related issues and removed needs-rebase labels Jul 12, 2025
@kzjeef kzjeef marked this pull request as draft July 14, 2025 08:43
@kzjeef kzjeef force-pushed the hy-tool-parser-submit branch 2 times, most recently from ede5da2 to da46bfd Compare July 15, 2025 06:58
@kzjeef kzjeef changed the title [Model] Add ToolParser for Hunyuan A13B. [Model] Add ToolParser and MoE Config for Hunyuan A13B Jul 15, 2025
@kzjeef kzjeef marked this pull request as ready for review July 15, 2025 07:15

kzjeef commented Jul 15, 2025

I checked the entrypoints test.

It hits an error when starting a Qwen2.5-1.5B model with a max length of 8192.

See the log:

[2025-07-15T08:51:02Z] INFO 07-15 01:51:02 [core.py:69] Initializing a V1 LLM engine (v0.9.2rc2.dev201+gda46bfdeb) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.DUMMY, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}

[2025-07-15T08:50:37Z] INFO 07-15 01:50:37 [config.py:1500] Using max model len 8192

and it then hits an error because the input is too long for the 8192-token limit.

see:

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131] Error in preprocessing prompt inputs

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131] Traceback (most recent call last):

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 123, in create_completion

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131]     request_prompts, engine_prompts = await self._preprocess_completion(

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131]                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 806, in _preprocess_completion

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131]     ) = await self._tokenize_prompt_input_or_inputs_async(

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 756, in _tokenize_prompt_input_or_inputs_async

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131]     results = await asyncio.gather(*tasks)

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 564, in _normalize_prompt_text_to_input

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131]     return self._validate_input(request, input_ids, input_text)

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 636, in _validate_input

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131]     raise ValueError(

[2025-07-15T08:51:07Z] ERROR 07-15 01:51:07 [serving_completion.py:131] ValueError: This model's maximum context length is 8192 tokens. However, you requested 10010 tokens (10000 in the messages, 10 in the completion). Please reduce the length of the messages or completion.

So how can I change the length used by this test case?
@youkaichao

@DarkLight1337 DarkLight1337 added ready ONLY add when PR is ready to merge/full CI is needed and removed ready ONLY add when PR is ready to merge/full CI is needed labels Jul 16, 2025
kzjeef added 4 commits July 16, 2025 15:29
- add stream and non stream support
- reason parser use regex package.
- reason parser: add missing function.

Signed-off-by: Asher Zhang <asherszhang@tencent.com>
Signed-off-by: Asher Zhang <asherszhang@tencent.com>
- add test for hunyuan a13b tool parser.
- fix mypy error on tool parser
- refine reason parser test.
- refactor tool parser stream function.

Signed-off-by: Asher Zhang <asherszhang@tencent.com>
Signed-off-by: Asher Zhang <asherszhang@tencent.com>
- tune fused moe config.
- benchmark: add hunyuan in moe benchmark

Signed-off-by: Asher Zhang <asherszhang@tencent.com>
@kzjeef kzjeef force-pushed the hy-tool-parser-submit branch from da46bfd to af5c48a Compare July 16, 2025 07:38