
[Feature] limit thinking tokens #20859


Draft · wants to merge 7 commits into base: main from feat/thinking-budget

Conversation

llsj14
Contributor

@llsj14 llsj14 commented Jul 12, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

  • Supports limiting the number of thinking tokens via a sampling parameter ([Feature]: Limit thinking tokens #15418).
  • This feature is intended to prevent uncontrolled long reasoning loops and to support an explicit thinking limit (budget).

Implementation

  • If the number of thinking tokens exceeds the max_think_tokens sampling parameter, the logits processor forcibly inserts the thinking-end token ID to terminate the thinking section (a sketch follows below).
  • This feature extends the built-in logits processors, taking into account the changes introduced in PR #16728.
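
A minimal, self-contained sketch of the idea (illustrative only; the class name and the way per-request state is tracked here are simplified assumptions, not the PR's actual vLLM integration):

import torch

class MaxThinkTokensSketch:
    """Mask logits so that only the think-end token can be sampled once
    a request has spent its thinking budget."""

    def __init__(self, think_start_token_id: int,
                 think_end_token_id: int, max_think_tokens: int):
        self.think_start_token_id = think_start_token_id
        self.think_end_token_id = think_end_token_id
        self.max_think_tokens = max_think_tokens
        self.in_think = False   # currently inside <think> ... </think>?
        self.think_count = 0    # tokens generated inside the thinking section

    def observe(self, new_token_id: int) -> None:
        # Called once per generated token to update the thinking-section state.
        if new_token_id == self.think_start_token_id:
            self.in_think, self.think_count = True, 0
        elif new_token_id == self.think_end_token_id:
            self.in_think = False
        elif self.in_think:
            self.think_count += 1

    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        # Budget exhausted: force the think-end token by masking everything else.
        if self.in_think and self.think_count >= self.max_think_tokens:
            forced = torch.full_like(logits, float("-inf"))
            forced[self.think_end_token_id] = 0.0
            return forced
        return logits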

Test Plan

Unit tests will be added.

Test Result

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the deepseek (Related to DeepSeek models), frontend, and v1 labels Jul 12, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @llsj14, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial feature to manage and limit the length of 'thinking' or 'reasoning' phases in large language models that employ explicit reasoning tokens. By allowing users to set a max_think_tokens budget, the system can prevent uncontrolled long reasoning loops, ensuring more predictable and efficient model behavior. The core of this feature is a new logits processor that monitors token generation within designated thinking sections and intervenes to terminate them if the specified limit is exceeded.

Highlights

  • New max_think_tokens parameter: Introduced a max_think_tokens parameter in SamplingParams and exposed it via the OpenAI protocol's ChatCompletionRequest. This allows users to specify a maximum token limit for the 'thinking' phase of models that utilize explicit reasoning tokens.
  • ReasoningConfig and Dynamic Token ID Management: Added a new ReasoningConfig class to vllm/config.py to encapsulate think_start_token_id and think_end_token_id. These IDs are now dynamically populated in GpuModelRunner based on the configured reasoning backend (e.g., DeepSeek R1), ensuring the system correctly identifies and manages reasoning sections.
  • MaxThinkTokensLogitsProcessor Implementation: Implemented a new MaxThinkTokensLogitsProcessor in vllm/v1/sample/logits_processor.py. This processor actively monitors the number of tokens generated within a thinking section. If the max_think_tokens limit is reached, it modifies the logits to forcibly generate the think_end_token_id, effectively terminating the reasoning loop.
  • Enhanced State Tracking for Logits Processors: Modified the AddedRequest tuple in vllm/v1/sample/logits_processor.py and vllm/v1/worker/gpu_input_batch.py to include prompt_tok_ids. This provides logits processors, especially the new MaxThinkTokensLogitsProcessor, with more complete context for tracking token counts from the beginning of a request's generation.
  • Integration Across the Stack: The new max_think_tokens parameter and the ReasoningConfig are integrated throughout the system, from the API request parsing to the SamplingParams, GpuInputBatch, and finally into the LogitsProcessorManager to ensure the thinking token limit is enforced during the token generation process.
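
For orientation, a hedged example of how a client could exercise the new parameter through the OpenAI-compatible endpoint (the server URL, model name, and prompt below are placeholders; max_think_tokens follows the naming in this PR and is passed via extra_body because it is not part of the standard OpenAI schema):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Cap only the reasoning section at 256 tokens; max_tokens still bounds the
# total generation length.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Solve 37 * 43 step by step."}],
    max_tokens=1024,
    extra_body={"max_think_tokens": 256},
)
print(response.choices[0].message.content)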
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@llsj14 llsj14 force-pushed the feat/thinking-budget branch from c13ccf9 to 3a072f0 Compare July 12, 2025 09:25
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a feature to limit the number of "thinking" tokens generated by a model, which is a great way to prevent uncontrolled reasoning loops and manage computational budgets. The implementation adds a max_think_tokens parameter and a corresponding MaxThinkTokensLogitsProcessor to enforce this limit. I've identified a couple of issues related to correctness, particularly in edge cases and state management, which I've detailed below. Addressing these will make the feature more robust.

@llsj14 llsj14 force-pushed the feat/thinking-budget branch 3 times, most recently from 35cad4f to 4d64881 Compare July 14, 2025 04:56

mergify bot commented Jul 14, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @llsj14.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 14, 2025
llsj14 added 3 commits July 14, 2025 05:08
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
@llsj14 llsj14 force-pushed the feat/thinking-budget branch from 4d64881 to 3c4fc40 Compare July 14, 2025 05:09
@mergify mergify bot removed the needs-rebase label Jul 14, 2025
@llsj14 llsj14 force-pushed the feat/thinking-budget branch 4 times, most recently from 5d8490d to d5b9de1 Compare July 14, 2025 05:56
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
@llsj14 llsj14 force-pushed the feat/thinking-budget branch from d5b9de1 to 4c4251d Compare July 14, 2025 06:12
@@ -493,8 +495,113 @@ def apply(self, logits: torch.Tensor) -> torch.Tensor:
return logits


def init_builtin_logitsprocs(pin_memory_available: bool, max_num_reqs: int,
device: torch.device) -> LogitsProcessorManager:
class MaxThinkTokensLogitsProcessor(LogitsProcessor):
Contributor


It seems more appropriate to split this into separate files.

Contributor Author


I think it’s good to separate the files, but I’m just concerned about the divergence of different kinds of logits processors at the moment, since some are declared in the ops directory (e.g., bad words, penalties, top-k, top-p), while the built-in logits processors are declared in this logits_processor.py file.

Collaborator


You can probably create a logit_processors dir, then put the different logits processors there.

The default ones can just live under logit_processors/__init__.py, and the others can each have their own file.

Contributor Author


Good, I’ll update it.

… missing

Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
vllm/config.py Outdated
Comment on lines 4408 to 4420
class ReasoningConfig:
    """Configuration for reasoning models."""

    think_start_token_id: Optional[int] = None
    """Token ID that indicates the start of reasoning."""
    think_end_token_id: Optional[int] = None
    """Token ID that indicates the end of reasoning."""

    def __init__(self,
                 think_start_token_id: Optional[int] = None,
                 think_end_token_id: Optional[int] = None):
        self.think_start_token_id = think_start_token_id
        self.think_end_token_id = think_end_token_id
Collaborator


Let's not introduce another class for this here. I think we can couple this with the reasoning parser.

Contributor Author

@llsj14 llsj14 Jul 14, 2025


It was quite hard to pass the reasoning parser information to the logits processors. If I don't use ReasoningConfig, I might still need to pass the reasoning parser object to the logits processor anyway, so that the logits processor can get the think start/end token IDs.

Collaborator

@aarnphm aarnphm left a comment


Quick drive-by comments on configuration.

@@ -404,6 +404,7 @@ class ChatCompletionRequest(OpenAIBaseModel):
    prompt_logprobs: Optional[int] = None
    allowed_token_ids: Optional[list[int]] = None
    bad_words: list[str] = Field(default_factory=list)
    max_think_tokens: Optional[int] = None
Collaborator


Can we introduce a heuristic based on reasoning_effort? I'm thinking:

  • low -> 1024
  • medium -> 2048
  • high -> 8192

Then we can also have this as an additional extra_body field for users to override if they have a custom context length set on the vLLM server.
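
A rough sketch of that heuristic, assuming it is resolved server-side before building the sampling parameters (the function and constant names below are hypothetical):

from typing import Optional

# Suggested defaults from the comment above: low -> 1024, medium -> 2048, high -> 8192.
REASONING_EFFORT_TO_BUDGET = {"low": 1024, "medium": 2048, "high": 8192}

def resolve_think_budget(reasoning_effort: Optional[str],
                         thinking_tokens_budget: Optional[int]) -> Optional[int]:
    # An explicit budget from extra_body overrides the reasoning_effort heuristic.
    if thinking_tokens_budget is not None:
        return thinking_tokens_budget
    if reasoning_effort is not None:
        return REASONING_EFFORT_TO_BUDGET.get(reasoning_effort)
    return None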

Contributor Author

@llsj14 llsj14 Jul 14, 2025


Sounds reasonable. So the user should only provide "reasoning_effort": [low, medium, high] as the sampling parameter? What I’m a bit concerned about is that it’s hard to control at the token level, and it’s only configurable when the server loads.

Collaborator

@aarnphm aarnphm Jul 14, 2025


reasoning_effort is mostly for the OpenAI-compatible endpoint. If users want more control, we then respect thinking_token_budget (or some other name) in the body instead of reasoning_effort.

Two scenarios:

  • Users who already use reasoning_effort from the OpenAI frontend: nothing changes for them
  • If they want to increase the thinking budget, knowing that the model context length supports it:
    client.chat.completions.create(..., 
                                   reasoning_effort="medium", # we ignore reasoning_effort here for thinking_tokens_budget
                                   extra_body={"thinking_tokens_budget": 16384}
                                  )

Collaborator


This should also be included in the max_tokens calculation.

@@ -4461,6 +4476,8 @@ class VllmConfig:
    # some opaque config, only used to provide additional information
    # for the hash computation, mainly used for testing, debugging or out of
    # tree config registration.
    reasoning_config: Optional[ReasoningConfig] = None
Collaborator


ditto

@@ -23,8 +23,8 @@ class DeepSeekR1ReasoningParser(ReasoningParser):
text. This parser extracts the reasoning content from the model output.
"""

start_token_id: int
end_token_id: int
think_start_token_id: int
Collaborator


let's avoid changing this, I don't think this is related to this PR.

Contributor Author

@llsj14 llsj14 Jul 14, 2025


I changed this part because the logits processor needs the start/end token IDs from the reasoning parser, i.e., the starting point and the end point of thinking mode.
I reference this as reasoning_parser.think_start_token_id for both the Qwen and DeepSeek models.

Contributor


let's avoid changing this, I don't think this is related to this PR.

+1.

Also, as shown in hunyuan_a13b_reasoning_parser.py, think_start_ids consists of three token IDs. Using reasoning_parser.think_start_token_id directly doesn’t seem like a good approach—I suggest using a @property instead.
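
A small sketch of that suggestion (illustrative only; the base-class shape and attribute names are assumptions, not the actual ReasoningParser interface):

class ReasoningParserSketch:
    # Parsers store the thinking markers as token-ID sequences, so a marker
    # that tokenizes to several IDs (as in hunyuan_a13b) fits the same shape.
    think_start_token_ids: list[int] = []
    think_end_token_ids: list[int] = []

    @property
    def think_start_token_id(self) -> int:
        # Convenience accessor for parsers whose start marker is a single token.
        assert len(self.think_start_token_ids) == 1
        return self.think_start_token_ids[0]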

Contributor Author


@chaunceyjiang Yes, I’ll update this for extensibility.
For now, I just wanted this PR to support only Qwen and DeepSeek models, which use a single token id to start and finish the thinking mode. I think we’ll need a different workflow for reasoning models that require multiple token ids, for example, they may need partial prefill after forcing multiple tokens at the end. In that case, I’m not sure if using only logits processors is the right approach. Maybe we’ll need partial prefill workflows or some help from guided decoding. What do you think about this?

Collaborator


I don't think structured outputs are relevant here.

I think frontend-related features should use a logits processor to avoid performance issues, but the new logits processor should be performant enough.


I think we could also circumvent the use of a logits processor and instead use a concept of remaining budget, which would make the thinking budget compatible with both speculative and non-speculative decoding. I have implemented some changes that support a thinking budget with speculative decoding while avoiding the logits processor; this is the draft PR: #20949.
Let me know what you think!

Contributor Author

@llsj14 llsj14 Jul 15, 2025


It is quite hard to handle multiple think end tokens using a logits processor. That’s why I’m also considering implementing this feature in the serving_chat part, the scheduler, or with guided decoding.

There are several ways to implement this, each with its own drawbacks:

  1. Logits processor: I would have to enforce multiple think-end tokens across multiple decode steps, which means some performance degradation (though it may still be reasonable).
  2. serving_chat: I could make the reasoning_parser count think tokens and enforce think-end tokens. This would be quite easy to implement, but with the current implementation it seems hard to make the reasoning_parser check the sampling parameters of every request. It is also challenging to implement for the non-streaming API.
  3. Scheduler: Similar to the verification stage of speculative decoding, we could enforce multiple tokens and make the forward step perform a partial prefill. However, it seems quite difficult and complex to make only part of the requests in a batch build a KV cache for multiple tokens. @rishitdholakia13's implementation appears to follow this approach, but if we need to handle multiple tokens, it would get more complex.
  4. Guided decoding: Guided decoding or structured outputs have similar needs, for example forcing certain tokens, but I think this is also complex to manage given the prior implementations and the use of external libraries.

Contributor Author


I decided to emit multiple think-end tokens using logits processors, since the methods described above (options 2–4) are difficult to implement at the moment. The logits processors will therefore produce the think-end tokens across multiple forward steps.
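
A minimal sketch of what forcing an end sequence across steps could look like (a standalone illustration with a hypothetical helper, not the PR's code):

import torch

def force_token_sequence(logits: torch.Tensor, forced_ids: list[int],
                         step: int) -> torch.Tensor:
    # At each decode step after the budget is hit, mask the logits so that the
    # next token of the forced end sequence (e.g. the tokenization of
    # "\n</think>\n\n") is the only possible sample.
    if step < len(forced_ids):
        masked = torch.full_like(logits, float("-inf"))
        masked[forced_ids[step]] = 0.0
        return masked
    return logits  # sequence fully emitted; stop intervening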

Contributor Author

@llsj14 llsj14 Jul 16, 2025


With this new commit, I made this feature work with start/end tokens defined as token sequences (multiple tokens).
Since the reasoning parsers do not all expose such a property, I needed a new config argument to provide the think start/end strings (e.g., think_end_str="\n\nConsidering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>\n\n").
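
As a side note, a hedged sketch of how such a string could be turned into the token-ID sequence that gets forced (the tokenizer name is a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
think_end_str = "\n</think>\n\n"
# The resulting ID sequence is what would be forced, one token per decode step.
think_end_token_ids = tokenizer.encode(think_end_str, add_special_tokens=False)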


@aarnphm
Collaborator

aarnphm commented Jul 14, 2025

fyi #19912

@@ -248,6 +249,9 @@ class SamplingParams(
    bad_words: Optional[list[str]] = None
    _bad_words_token_ids: Optional[list[list[int]]] = None

    # Maximum number of tokens allowed for thinking operations.
    max_think_tokens: Optional[int] = None


Can we rename this to thinking_budget? That would help provide consistency in naming, since the max thinking here refers to the thinking budget provided by the user.

llsj14 added 2 commits July 15, 2025 12:49
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>

mergify bot commented Jul 16, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @llsj14.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
