[Feature] limit thinking tokens #20859
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small and essential subset of CI tests runs automatically to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run full CI to test the changes comprehensively before merging. 🚀
Summary of Changes
Hello @llsj14, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a feature to manage and limit the length of 'thinking' or 'reasoning' phases in large language models that employ explicit reasoning tokens. By allowing users to set a `max_think_tokens` budget, the system can prevent uncontrolled long reasoning loops, ensuring more predictable and efficient model behavior. The core of this feature is a new logits processor that monitors token generation within designated thinking sections and intervenes to terminate them if the specified limit is exceeded.
Highlights
- New `max_think_tokens` parameter: Introduced a `max_think_tokens` parameter in `SamplingParams` and exposed it via the OpenAI protocol's `ChatCompletionRequest`. This allows users to specify a maximum token limit for the 'thinking' phase of models that utilize explicit reasoning tokens.
- `ReasoningConfig` and Dynamic Token ID Management: Added a new `ReasoningConfig` class to `vllm/config.py` to encapsulate `think_start_token_id` and `think_end_token_id`. These IDs are now dynamically populated in `GpuModelRunner` based on the configured reasoning backend (e.g., DeepSeek R1), ensuring the system correctly identifies and manages reasoning sections.
- `MaxThinkTokensLogitsProcessor` Implementation: Implemented a new `MaxThinkTokensLogitsProcessor` in `vllm/v1/sample/logits_processor.py`. This processor actively monitors the number of tokens generated within a thinking section. If the `max_think_tokens` limit is reached, it modifies the logits to forcibly generate the `think_end_token_id`, effectively terminating the reasoning loop.
- Enhanced State Tracking for Logits Processors: Modified the `AddedRequest` tuple in `vllm/v1/sample/logits_processor.py` and `vllm/v1/worker/gpu_input_batch.py` to include `prompt_tok_ids`. This provides logits processors, especially the new `MaxThinkTokensLogitsProcessor`, with more complete context for tracking token counts from the beginning of a request's generation.
- Integration Across the Stack: The new `max_think_tokens` parameter and the `ReasoningConfig` are integrated throughout the system, from API request parsing to `SamplingParams`, `GpuInputBatch`, and finally the `LogitsProcessorManager`, ensuring the thinking token limit is enforced during the token generation process.
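The summary above describes the processor only in prose. As a rough, self-contained illustration of the technique (not the PR's actual `MaxThinkTokensLogitsProcessor`, which operates on batched state in `vllm/v1/sample/logits_processor.py`), a single-request version might look like the sketch below; the class name, call signature, and bookkeeping here are assumptions for clarity.

```python
import torch


class SimpleMaxThinkTokensProcessor:
    """Illustrative sketch only: count tokens generated inside a thinking
    section and, once a budget is exhausted, mask the logits so that only
    the think-end token can be sampled."""

    def __init__(self, think_start_token_id: int, think_end_token_id: int,
                 max_think_tokens: int):
        self.start_id = think_start_token_id
        self.end_id = think_end_token_id
        self.budget = max_think_tokens
        self.in_think = False   # currently inside a thinking section?
        self.think_count = 0    # tokens generated since the start token

    def __call__(self, output_token_ids: list[int],
                 logits: torch.Tensor) -> torch.Tensor:
        # Update state based on the most recently sampled token.
        if output_token_ids:
            last = output_token_ids[-1]
            if last == self.start_id:
                self.in_think, self.think_count = True, 0
            elif last == self.end_id:
                self.in_think = False
            elif self.in_think:
                self.think_count += 1

        # Budget exhausted: force the think-end token by masking everything
        # else to -inf. `logits` is a 1-D tensor over the vocabulary.
        if self.in_think and self.think_count >= self.budget:
            forced = torch.full_like(logits, float("-inf"))
            forced[self.end_id] = logits[self.end_id]
            return forced
        return logits
```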
Force-pushed from c13ccf9 to 3a072f0
Code Review
This pull request introduces a feature to limit the number of "thinking" tokens generated by a model, which is a great way to prevent uncontrolled reasoning loops and manage computational budgets. The implementation adds a `max_think_tokens` parameter and a corresponding `MaxThinkTokensLogitsProcessor` to enforce this limit. I've identified a couple of issues related to correctness, particularly in edge cases and state management, which I've detailed below. Addressing these will make the feature more robust.
Force-pushed from 35cad4f to 4d64881
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Force-pushed from 4d64881 to 3c4fc40
Force-pushed from 5d8490d to d5b9de1
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Force-pushed from d5b9de1 to 4c4251d
```diff
@@ -493,8 +495,113 @@ def apply(self, logits: torch.Tensor) -> torch.Tensor:
         return logits


-def init_builtin_logitsprocs(pin_memory_available: bool, max_num_reqs: int,
-                             device: torch.device) -> LogitsProcessorManager:
+class MaxThinkTokensLogitsProcessor(LogitsProcessor):
```
It seems more appropriate to split this into separate files.
I think it’s good to separate the files, but I’m just concerned about the divergence of different kinds of logits processors at the moment, since some are declared in the `ops` directory (e.g., bad words, penalties, top-k, top-p), while the built-in logits processors are declared in this `logits_processor.py` file.
You can probably create a `logit_processors` dir, then put the different logits processors there. The default ones can just live under `logit_processors/__init__.py`, and each of the others can have its own file.
Good, I’ll update it.
… missing Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
vllm/config.py
Outdated
```python
class ReasoningConfig:
    """Configuration for reasoning models."""

    think_start_token_id: Optional[int] = None
    """Token ID that indicates the start of reasoning."""
    think_end_token_id: Optional[int] = None
    """Token ID that indicates the end of reasoning."""

    def __init__(self,
                 think_start_token_id: Optional[int] = None,
                 think_end_token_id: Optional[int] = None):
        self.think_start_token_id = think_start_token_id
        self.think_end_token_id = think_end_token_id
```
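For context on how such a config could be filled in, here is a hypothetical wiring sketch (not code from this PR): it resolves the start/end IDs from a DeepSeek-R1-style tokenizer, assuming `<think>` and `</think>` exist as single added tokens, and uses the `ReasoningConfig` class shown in the diff above.

```python
from transformers import AutoTokenizer

from vllm.config import ReasoningConfig  # class added by this PR (see diff above)

# Assumption: the model's tokenizer exposes <think> / </think> as single tokens.
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
reasoning_config = ReasoningConfig(
    think_start_token_id=tok.convert_tokens_to_ids("<think>"),
    think_end_token_id=tok.convert_tokens_to_ids("</think>"),
)
```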
Let's not introduce another class for this here. I think we can couple this with the reasoning parser.
It was quite hard to pass the reasoning parser information to the logits processors. If I don’t use `ReasoningConfig`, I would still need to pass the reasoning parser object to the logits processor anyway, so that it can get the think start/end token IDs.
Quick drive-by comments on configuration.
```diff
@@ -404,6 +404,7 @@ class ChatCompletionRequest(OpenAIBaseModel):
     prompt_logprobs: Optional[int] = None
     allowed_token_ids: Optional[list[int]] = None
     bad_words: list[str] = Field(default_factory=list)
+    max_think_tokens: Optional[int] = None
```
Can we introduce some heuristic tied to `reasoning_effort`? I'm thinking:
- low -> 1024
- medium -> 2048
- high -> 8192

Then we can also have this as an additional `extra_body` field for users to override if they have a custom context length set on the vLLM server.
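As a sketch of this suggestion (the numbers and field names are only the proposal from this comment, not an agreed-on API), the resolution could look like:

```python
from typing import Optional

# Proposed mapping from this comment; not part of the merged API.
REASONING_EFFORT_TO_BUDGET = {"low": 1024, "medium": 2048, "high": 8192}


def resolve_thinking_budget(reasoning_effort: Optional[str],
                            thinking_tokens_budget: Optional[int]) -> Optional[int]:
    """Explicit extra_body override wins; otherwise fall back to the effort map."""
    if thinking_tokens_budget is not None:
        return thinking_tokens_budget
    if reasoning_effort is not None:
        return REASONING_EFFORT_TO_BUDGET.get(reasoning_effort)
    return None
```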
Sounds reasonable. So the user should only provide `"reasoning_effort"` (one of `low`, `medium`, `high`) as the sampling parameter? What I’m a bit concerned about is that it’s hard to control at the token level, and it’s only configurable when the server loads.
`reasoning_effort` is mostly for the OpenAI-compatible endpoint. If users want more control, we then respect `thinking_token_budget` (or some other name) in the body instead of `reasoning_effort`.
Two scenarios:
- Users who already use `reasoning_effort` from the OpenAI frontend: nothing changes for them.
- If they want to increase the thinking budget, knowing that the model context length supports it:

```python
client.chat.completions.create(
    ...,
    reasoning_effort="medium",  # ignored here in favor of thinking_tokens_budget
    extra_body={"thinking_tokens_budget": 16384},
)
```
Also, this should be included in the `max_tokens` calculation.
```diff
@@ -4461,6 +4476,8 @@ class VllmConfig:
     # some opaque config, only used to provide additional information
     # for the hash computation, mainly used for testing, debugging or out of
     # tree config registration.
+    reasoning_config: Optional[ReasoningConfig] = None
```
ditto
```diff
@@ -23,8 +23,8 @@ class DeepSeekR1ReasoningParser(ReasoningParser):
     text. This parser extracts the reasoning content from the model output.
     """

-    start_token_id: int
-    end_token_id: int
+    think_start_token_id: int
```
Let's avoid changing this; I don't think it's related to this PR.
I changed this part because the logits processor needs the start/end token IDs from the reasoning parser, i.e., the starting and ending points of the thinking mode. I referenced it as `reasoning_parser.think_start_token_id` for both the Qwen and DeepSeek models.
> Let's avoid changing this; I don't think it's related to this PR.

+1. Also, as shown in `hunyuan_a13b_reasoning_parser.py`, `think_start_ids` consists of three token IDs. Using `reasoning_parser.think_start_token_id` directly doesn’t seem like a good approach; I suggest using a `@property` instead.
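A minimal sketch of the `@property` idea (illustrative only, attribute names assumed): keep the parser's existing fields and expose the think start/end IDs through properties, so callers such as the logits processor do not depend on renamed attributes.

```python
class DeepSeekR1ReasoningParserSketch:
    """Illustrative only; not the actual vLLM parser."""

    def __init__(self, start_token_id: int, end_token_id: int):
        self.start_token_id = start_token_id
        self.end_token_id = end_token_id

    @property
    def think_start_token_id(self) -> int:
        return self.start_token_id

    @property
    def think_end_token_id(self) -> int:
        return self.end_token_id
```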
@chaunceyjiang Yes, I’ll update this for extensibility.
For now, I just wanted this PR to support only Qwen and DeepSeek models, which use a single token id to start and finish the thinking mode. I think we’ll need a different workflow for reasoning models that require multiple token ids, for example, they may need partial prefill after forcing multiple tokens at the end. In that case, I’m not sure if using only logits processors is the right approach. Maybe we’ll need partial prefill workflows or some help from guided decoding. What do you think about this?
I don't think structured outputs are relevant here.
I think frontend-related features should use a logits processor to avoid performance issues, but the new logits processor needs to be performant enough.
I think we could also circumvent the use of a logits processor and instead use a concept of remaining budget, which would make the thinking budget compatible with both speculative and non-speculative decoding. I have implemented some changes that support a thinking budget with speculative decoding while avoiding the logits processor; here is the draft PR: #20949.
Let me know what you all think!
It is quite hard to handle multiple think end tokens using a logits processor. That’s why I’m also considering implementing this feature in the serving_chat part, the scheduler, or with guided decoding.
There are several ways to implement this, each with its own drawbacks:
1. Logits processor: I would have to enforce multiple think end tokens across multiple decode steps, which means performance degradation (though it may still be reasonable).
2. serving_chat: I could make the `reasoning_parser` count think tokens and enforce think end tokens. This would be quite easy to implement, but with the current implementation it seems hard to make the `reasoning_parser` check the sampling parameters of every request, and it is challenging to support in the non-streaming API.
3. Scheduler: Similar to the verification stage of speculative decoding, we could enforce multiple tokens and make the forward step perform a partial prefill. However, it seems quite difficult and complex to make only part of the requests in a batch build a KV cache for multiple tokens. @rishitdholakia13's implementation appears to follow this approach, but if we need to handle multiple tokens, it would get more complex.
4. Guided decoding: Guided decoding or structured outputs have similar needs, for example forcing certain tokens. But I think it’s also complex to manage given the prior implementations and the use of external libraries.
I decided to emit multiple think end tokens using the logits processor. The methods I described above (options 2–4) are difficult to implement at the moment, so the logits processor will produce multiple think end tokens across multiple forward steps.
With this new commit, I made this feature work with start/end tokens defined as token sequences (multiple tokens).
Since the reasoning parsers do not expose these sequences through a common property, I needed a new config argument to provide the think start/end strings (e.g., `think_end_str="\n\nConsidering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>\n\n"`).
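A rough sketch of what forcing a multi-token end sequence across decode steps can look like (illustrative only, not the PR's implementation): since a logits processor can only constrain the next token, it walks through the end-sequence IDs one step at a time.

```python
import torch


class ForceEndSequenceSketch:
    """Illustrative only: once triggered, emit a fixed token sequence by
    masking the logits to a single allowed token per decode step."""

    def __init__(self, end_sequence_ids: list[int]):
        self.end_sequence_ids = end_sequence_ids
        self.pos = 0  # how many tokens of the sequence have been forced so far

    def done(self) -> bool:
        return self.pos >= len(self.end_sequence_ids)

    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        if self.done():
            return logits
        forced_id = self.end_sequence_ids[self.pos]
        self.pos += 1
        masked = torch.full_like(logits, float("-inf"))
        masked[forced_id] = 0.0  # only the forced token remains sampleable
        return masked
```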
```diff
@@ -493,8 +495,113 @@ def apply(self, logits: torch.Tensor) -> torch.Tensor:
         return logits


-def init_builtin_logitsprocs(pin_memory_available: bool, max_num_reqs: int,
-                             device: torch.device) -> LogitsProcessorManager:
+class MaxThinkTokensLogitsProcessor(LogitsProcessor):
```
You can probably create a `logit_processors` dir, then put the different logits processors there. The default ones can just live under `logit_processors/__init__.py`, and each of the others can have its own file.
FYI #19912
```diff
@@ -248,6 +249,9 @@ class SamplingParams(
     bad_words: Optional[list[str]] = None
     _bad_words_token_ids: Optional[list[list[int]]] = None

+    # Maximum number of tokens allowed for thinking operations.
+    max_think_tokens: Optional[int] = None
```
Can we rename this to `thinking_budget`? It would help provide consistency in naming, since the max think tokens here refer to the thinking budget provided by the user.
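For reference, a hypothetical usage sketch on a build that includes this PR, using the field name from the diff above (this comment proposes `thinking_budget` instead):

```python
from vllm import SamplingParams

# Allow up to 1024 output tokens overall, but cap the thinking section at 256.
params = SamplingParams(temperature=0.6, max_tokens=1024, max_think_tokens=256)
```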
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Essential Elements of an Effective PR Description Checklist
- Documentation update, such as `supported_models.md` and `examples` for a new model.

Purpose

Implementation
When the number of thinking tokens reaches the `max_think_tokens` sampling parameter, the logits processor will forcibly insert the thinking end token ID to terminate the thinking section.

Test Plan
Will add unit tests.

Test Result

(Optional) Documentation Update