[Feature] limit thinking tokens #20859
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small and essential subset of CI tests runs automatically to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run full CI to test the changes comprehensively before merging. 🚀
Summary of Changes
Hello @llsj14, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a feature to manage and limit the length of 'thinking' or 'reasoning' phases in large language models that employ explicit reasoning tokens. By allowing users to set a `max_think_tokens` budget, the system can prevent uncontrolled long reasoning loops, ensuring more predictable and efficient model behavior. The core of this feature is a new logits processor that monitors token generation within designated thinking sections and intervenes to terminate them if the specified limit is exceeded.
Highlights
- New `max_think_tokens` parameter: Introduced a `max_think_tokens` parameter in `SamplingParams` and exposed it via the OpenAI protocol's `ChatCompletionRequest`. This allows users to specify a maximum token limit for the 'thinking' phase of models that utilize explicit reasoning tokens.
- `ReasoningConfig` and Dynamic Token ID Management: Added a new `ReasoningConfig` class to `vllm/config.py` to encapsulate `think_start_token_id` and `think_end_token_id`. These IDs are now dynamically populated in `GpuModelRunner` based on the configured reasoning backend (e.g., DeepSeek R1), ensuring the system correctly identifies and manages reasoning sections.
- `MaxThinkTokensLogitsProcessor` Implementation: Implemented a new `MaxThinkTokensLogitsProcessor` in `vllm/v1/sample/logits_processor.py`. This processor actively monitors the number of tokens generated within a thinking section. If the `max_think_tokens` limit is reached, it modifies the logits to forcibly generate the `think_end_token_id`, effectively terminating the reasoning loop.
- Enhanced State Tracking for Logits Processors: Modified the `AddedRequest` tuple in `vllm/v1/sample/logits_processor.py` and `vllm/v1/worker/gpu_input_batch.py` to include `prompt_tok_ids`. This provides logits processors, especially the new `MaxThinkTokensLogitsProcessor`, with more complete context for tracking token counts from the beginning of a request's generation.
- Integration Across the Stack: The new `max_think_tokens` parameter and the `ReasoningConfig` are integrated throughout the system, from API request parsing to `SamplingParams`, `GpuInputBatch`, and finally the `LogitsProcessorManager`, ensuring the thinking token limit is enforced during the token generation process.
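The summary above describes the processor only in prose. As a rough, self-contained illustration of the technique (not the PR's actual `MaxThinkTokensLogitsProcessor`, which operates on batched state in `vllm/v1/sample/logits_processor.py`), a single-request version might look like the sketch below; the class name, call signature, and bookkeeping here are assumptions for clarity.

```python
import torch


class SimpleMaxThinkTokensProcessor:
    """Illustrative sketch only: count tokens generated inside a thinking
    section and, once a budget is exhausted, mask the logits so that only
    the think-end token can be sampled."""

    def __init__(self, think_start_token_id: int, think_end_token_id: int,
                 max_think_tokens: int):
        self.start_id = think_start_token_id
        self.end_id = think_end_token_id
        self.budget = max_think_tokens
        self.in_think = False   # currently inside a thinking section?
        self.think_count = 0    # tokens generated since the start token

    def __call__(self, output_token_ids: list[int],
                 logits: torch.Tensor) -> torch.Tensor:
        # Update state based on the most recently sampled token.
        if output_token_ids:
            last = output_token_ids[-1]
            if last == self.start_id:
                self.in_think, self.think_count = True, 0
            elif last == self.end_id:
                self.in_think = False
            elif self.in_think:
                self.think_count += 1

        # Budget exhausted: force the think-end token by masking everything
        # else to -inf. `logits` is a 1-D tensor over the vocabulary.
        if self.in_think and self.think_count >= self.budget:
            forced = torch.full_like(logits, float("-inf"))
            forced[self.end_id] = logits[self.end_id]
            return forced
        return logits
```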
Force-pushed from c13ccf9 to 3a072f0
Code Review
This pull request introduces a feature to limit the number of "thinking" tokens generated by a model, which is a great way to prevent uncontrolled reasoning loops and manage computational budgets. The implementation adds a `max_think_tokens` parameter and a corresponding `MaxThinkTokensLogitsProcessor` to enforce this limit. I've identified a couple of issues related to correctness, particularly in edge cases and state management, which I've detailed below. Addressing these will make the feature more robust.
Force-pushed from 35cad4f to 4d64881
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Force-pushed from 4d64881 to 3c4fc40
Force-pushed from 5d8490d to d5b9de1
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Force-pushed from d5b9de1 to 4c4251d
```diff
@@ -493,8 +495,113 @@ def apply(self, logits: torch.Tensor) -> torch.Tensor:
         return logits


-def init_builtin_logitsprocs(pin_memory_available: bool, max_num_reqs: int,
-                             device: torch.device) -> LogitsProcessorManager:
+class MaxThinkTokensLogitsProcessor(LogitsProcessor):
```
It seems more appropriate to split this into separate files.
I think it’s good to separate the files, but I’m just concerned about the divergence of different kinds of logits processors at the moment, since some are declared in the `ops` directory (e.g., bad words, penalties, top-k, top-p), while the built-in logits processors are declared in this `logits_processor.py` file.
You can probably create a `logit_processors` dir, then put the different logits processors there. The default ones can just live under `logit_processors/__init__.py`, and each of the others can have its own file.
Good, I’ll update it.
… missing Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
vllm/config.py
Outdated
```python
class ReasoningConfig:
    """Configuration for reasoning models."""

    think_start_token_id: Optional[int] = None
    """Token ID that indicates the start of reasoning."""
    think_end_token_id: Optional[int] = None
    """Token ID that indicates the end of reasoning."""

    def __init__(self,
                 think_start_token_id: Optional[int] = None,
                 think_end_token_id: Optional[int] = None):
        self.think_start_token_id = think_start_token_id
        self.think_end_token_id = think_end_token_id
```
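For context on how such a config could be filled in, here is a hypothetical wiring sketch (not code from this PR): it resolves the start/end IDs from a DeepSeek-R1-style tokenizer, assuming `<think>` and `</think>` exist as single added tokens, and uses the `ReasoningConfig` class shown in the diff above.

```python
from transformers import AutoTokenizer

from vllm.config import ReasoningConfig  # class added by this PR (see diff above)

# Assumption: the model's tokenizer exposes <think> / </think> as single tokens.
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
reasoning_config = ReasoningConfig(
    think_start_token_id=tok.convert_tokens_to_ids("<think>"),
    think_end_token_id=tok.convert_tokens_to_ids("</think>"),
)
```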
Let's not introduce another class for this here. I think we can couple this with the reasoning parser.
It was quite hard to pass the reasoning parser information to the logits processors. If I don’t use `ReasoningConfig`, I would still need to pass the reasoning parser object to the logits processor anyway, so that it can get the think start/end token IDs.
Quick drive-by comments on configuration.
```diff
@@ -404,6 +404,7 @@ class ChatCompletionRequest(OpenAIBaseModel):
     prompt_logprobs: Optional[int] = None
     allowed_token_ids: Optional[list[int]] = None
     bad_words: list[str] = Field(default_factory=list)
+    max_think_tokens: Optional[int] = None
```
Can we introduce some heuristic tied to `reasoning_effort`? I'm thinking:
- low -> 1024
- medium -> 2048
- high -> 8192

Then we can also have this as an additional `extra_body` field for users to override if they have a custom context length set on the vLLM server.
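As a sketch of this suggestion (the numbers and field names are only the proposal from this comment, not an agreed-on API), the resolution could look like:

```python
from typing import Optional

# Proposed mapping from this comment; not part of the merged API.
REASONING_EFFORT_TO_BUDGET = {"low": 1024, "medium": 2048, "high": 8192}


def resolve_thinking_budget(reasoning_effort: Optional[str],
                            thinking_tokens_budget: Optional[int]) -> Optional[int]:
    """Explicit extra_body override wins; otherwise fall back to the effort map."""
    if thinking_tokens_budget is not None:
        return thinking_tokens_budget
    if reasoning_effort is not None:
        return REASONING_EFFORT_TO_BUDGET.get(reasoning_effort)
    return None
```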
Sounds reasonable. So the user should only provide `"reasoning_effort"` (one of `low`, `medium`, `high`) as the sampling parameter? What I’m a bit concerned about is that it’s hard to control at the token level, and it’s only configurable when the server loads.
`reasoning_effort` is mostly for the OpenAI-compatible endpoint. If users want more control, we then respect `thinking_token_budget` (or some other name) in the body instead of `reasoning_effort`.
Two scenarios:
- Users who already use `reasoning_effort` from the OpenAI frontend: nothing changes for them.
- If they want to increase the thinking budget, knowing that the model context length supports it:

```python
client.chat.completions.create(
    ...,
    reasoning_effort="medium",  # ignored here in favor of thinking_tokens_budget
    extra_body={"thinking_tokens_budget": 16384},
)
```
Also, this should be included in the `max_tokens` calculation.
```diff
@@ -4461,6 +4476,8 @@ class VllmConfig:
     # some opaque config, only used to provide additional information
     # for the hash computation, mainly used for testing, debugging or out of
     # tree config registration.
+    reasoning_config: Optional[ReasoningConfig] = None
```
ditto
```diff
@@ -23,8 +23,8 @@ class DeepSeekR1ReasoningParser(ReasoningParser):
     text. This parser extracts the reasoning content from the model output.
     """

-    start_token_id: int
-    end_token_id: int
+    think_start_token_id: int
```
Let's avoid changing this; I don't think it's related to this PR.
I changed this part because the logits processor needs the start/end token IDs from the reasoning parser, i.e., the starting and ending points of the thinking mode. I referenced it as `reasoning_parser.think_start_token_id` for both the Qwen and DeepSeek models.
> Let's avoid changing this; I don't think it's related to this PR.

+1. Also, as shown in `hunyuan_a13b_reasoning_parser.py`, `think_start_ids` consists of three token IDs. Using `reasoning_parser.think_start_token_id` directly doesn’t seem like a good approach; I suggest using a `@property` instead.
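A minimal sketch of the `@property` idea (illustrative only, attribute names assumed): keep the parser's existing fields and expose the think start/end IDs through properties, so callers such as the logits processor do not depend on renamed attributes.

```python
class DeepSeekR1ReasoningParserSketch:
    """Illustrative only; not the actual vLLM parser."""

    def __init__(self, start_token_id: int, end_token_id: int):
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id

    @property
    def think_start_token_id(self) -> int:
        return self.start_token_id

    @property
    def think_end_token_id(self) -> int:
        return self.end_token_id
```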
@chaunceyjiang Yes, I’ll update this for extensibility.
For now, I just wanted this PR to support only Qwen and DeepSeek models, which use a single token id to start and finish the thinking mode. I think we’ll need a different workflow for reasoning models that require multiple token ids, for example, they may need partial prefill after forcing multiple tokens at the end. In that case, I’m not sure if using only logits processors is the right approach. Maybe we’ll need partial prefill workflows or some help from guided decoding. What do you think about this?
I don't think structured outputs are relevant here.
I think frontend-related features should use a logits processor to avoid performance issues, but the new logits processor needs to be performant enough.
I think we could also circumvent the use of a logits processor and instead use a concept of remaining budget, which would make the thinking budget compatible with both speculative and non-speculative decoding. I have implemented some changes that support a thinking budget with speculative decoding while avoiding the logits processor; here is the draft PR: #20949.
Let me know what you all think!
It is quite hard to handle multiple think end tokens using a logits processor. That’s why I’m also considering implementing this feature in the serving_chat part, the scheduler, or with guided decoding.
There are several ways to implement this, each with its own drawbacks:
1. Logits processor: I would have to enforce multiple think end tokens across multiple decode steps, which means performance degradation (though it may still be reasonable).
2. serving_chat: I could make the `reasoning_parser` count think tokens and enforce think end tokens. This would be quite easy to implement, but with the current implementation it seems hard to make the `reasoning_parser` check the sampling parameters of every request, and it is challenging to support in the non-streaming API.
3. Scheduler: Similar to the verification stage of speculative decoding, we could enforce multiple tokens and make the forward step perform a partial prefill. However, it seems quite difficult and complex to make only part of the requests in a batch build a KV cache for multiple tokens. @rishitdholakia13's implementation appears to follow this approach, but if we need to handle multiple tokens, it would get more complex.
4. Guided decoding: Guided decoding or structured outputs have similar needs, for example forcing certain tokens. But I think it’s also complex to manage given the prior implementations and the use of external libraries.
I decided to emit multiple think end tokens using the logits processor. The methods I described above (options 2–4) are difficult to implement at the moment, so the logits processor will produce multiple think end tokens across multiple forward steps.
With this new commit, I made this feature work with start/end tokens defined as token sequences (multiple tokens).
Since the reasoning parsers do not expose these sequences through a common property, I needed a new config argument to provide the think start/end strings (e.g., `think_end_str="\n\nConsidering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>\n\n"`).
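A rough sketch of what forcing a multi-token end sequence across decode steps can look like (illustrative only, not the PR's implementation): since a logits processor can only constrain the next token, it walks through the end-sequence IDs one step at a time.

```python
import torch


class ForceEndSequenceSketch:
    """Illustrative only: once triggered, emit a fixed token sequence by
    masking the logits to a single allowed token per decode step."""

    def __init__(self, end_sequence_ids: list[int]):
        self.end_sequence_ids = end_sequence_ids
        self.pos = 0  # how many tokens of the sequence have been forced so far

    def done(self) -> bool:
        return self.pos >= len(self.end_sequence_ids)

    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        if self.done():
            return logits
        forced_id = self.end_sequence_ids[self.pos]
        self.pos += 1
        masked = torch.full_like(logits, float("-inf"))
        masked[forced_id] = 0.0  # only the forced token remains sampleable
        return masked
```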
```diff
@@ -493,8 +495,113 @@ def apply(self, logits: torch.Tensor) -> torch.Tensor:
         return logits


-def init_builtin_logitsprocs(pin_memory_available: bool, max_num_reqs: int,
-                             device: torch.device) -> LogitsProcessorManager:
+class MaxThinkTokensLogitsProcessor(LogitsProcessor):
```
You can probably create a `logit_processors` dir, then put the different logits processors there. The default ones can just live under `logit_processors/__init__.py`, and each of the others can have its own file.
FYI #19912
```diff
@@ -248,6 +249,9 @@ class SamplingParams(
     bad_words: Optional[list[str]] = None
     _bad_words_token_ids: Optional[list[list[int]]] = None

+    # Maximum number of tokens allowed for thinking operations.
+    max_think_tokens: Optional[int] = None
```
Can we rename this to `thinking_budget`? It would help provide consistency in naming, since the max think tokens here refer to the thinking budget provided by the user.
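For reference, a hypothetical usage sketch on a build that includes this PR, using the field name from the diff above (this comment proposes `thinking_budget` instead):

```python
from vllm import SamplingParams

# Allow up to 1024 output tokens overall, but cap the thinking section at 256.
params = SamplingParams(temperature=0.6, max_tokens=1024, max_think_tokens=256)
```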
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Essential Elements of an Effective PR Description Checklist
- Documentation update, such as `supported_models.md` and `examples` for a new model.

Purpose

Implementation
When the number of thinking tokens reaches the `max_think_tokens` sampling parameter, the logits processor will forcibly insert the thinking end token ID to terminate the thinking section.

Test Plan
Will add unit tests.

Test Result

(Optional) Documentation Update