feat: Add --enable-log-outputs flag for logging model generations #20707
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only fastcheck CI, a small and essential subset of tests, runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Summary of Changes
Hello @mizadri, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new, optional `--enable-log-outputs` command-line flag to vLLM, enabling comprehensive logging of model generations. This feature complements the existing input logging by providing detailed output information, which is crucial for debugging, monitoring, auditing, and development workflows. It supports various response types, including streaming deltas, complete responses, and tool calls, while ensuring full backward compatibility and minimal performance impact.
Highlights
- New Output Logging Flag: A new `--enable-log-outputs` CLI flag has been added to enable comprehensive logging of model generations (outputs), disabled by default for backward compatibility.
- Comprehensive Output Logging: The system now logs model outputs for both streaming (individual token deltas and final complete responses) and non-streaming requests, including support for tool calls and their arguments.
- Integration with Existing Logging: The new output logging functionality integrates seamlessly with the existing `RequestLogger` and respects the `--max-log-len` setting for truncation of long outputs and token ID lists.
- API Endpoint Coverage: Output logging has been implemented for both `/v1/chat/completions` (streaming and non-streaming) and `/v1/responses` endpoints, ensuring broad applicability.
Code Review
This pull request introduces an --enable-log-outputs flag to enable logging of model generations, complementing vLLM's existing input logging. The changes include modifications to RequestLogger, CLI arguments, and enhancements to OpenAIServingChat and OpenAIServingResponses. The new unit tests for the logger are not correctly validating the behavior, and there are a couple of suggestions to enhance the logging functionality.
tests/test_logger.py (Outdated)
```python
mock_logger.info.assert_called_once()
call_args = mock_logger.info.call_args[0]
assert "Generated response test-123" in call_args[0]
assert "Hello, world!" in call_args[1]
assert [1, 2, 3, 4] == call_args[2]
assert "stop" == call_args[3]
```
The assertions in this test, and all subsequent new tests for `log_outputs`, are asserting against the wrong argument indices from the mocked logger call. This means the tests are not correctly validating the intended behavior and are likely passing incorrectly.

For example, `call_args[0]` is asserted to be the `outputs` string, but it's actually the format string. The `outputs` argument is at index 3.

The arguments passed to `logger.info` are `(format_string, request_id, stream_info, outputs, output_token_ids, finish_reason)`. The assertions need to be updated to check the arguments at their correct positions.

This issue is present in:
- `test_request_logger_log_outputs`
- `test_request_logger_log_outputs_streaming_delta`
- `test_request_logger_log_outputs_streaming_complete`
- `test_request_logger_log_outputs_with_truncation`
- `test_request_logger_log_outputs_none_values`
- `test_request_logger_log_outputs_empty_output`
```python
mock_logger.info.assert_called_once()
call_args = mock_logger.info.call_args.args
# logger.info(format_string, request_id, stream_info, outputs, output_token_ids, finish_reason)
assert "Generated response %s%s" in call_args[0]
assert call_args[1] == "test-123"
assert call_args[3] == "Hello, world!"
assert call_args[4] == [1, 2, 3, 4]
assert call_args[5] == "stop"
```
```python
self.request_logger.log_outputs(
    request_id=request_id,
    outputs=f"<streaming_complete: {num_completion_tokens} tokens>",
    output_token_ids=None,
    finish_reason="streaming_complete",
    is_streaming=True,
    delta=False,
)
```
The log message for a completed streaming response currently only includes the number of generated tokens, not the full concatenated text of the response. This doesn't align with the goal of "comprehensive output logging" mentioned in the PR description and reduces the utility of these logs for debugging and auditing.

The comment `# For now, we'll log the completion tokens count as final output` indicates this might be a known limitation.

To make this more useful, I recommend accumulating the full response text from the streaming deltas and logging it here. This would provide a complete record of the generated output. You might be able to use or adapt the `previous_texts` variable for this purpose, though you'd need to ensure it's populated even when tools are not in use.
```python
for i in range(num_choices):
    full_text = (previous_texts[i]
                 if previous_texts and i < len(previous_texts)
                 else f"<streaming_complete: {previous_num_tokens[i]} tokens>")
    self.request_logger.log_outputs(
        request_id=request_id,
        outputs=full_text,
        output_token_ids=None,  # Consider also logging all token IDs
        finish_reason="streaming_complete",
        is_streaming=True,
        delta=False,
    )
```
```python
if delta_message.content:
    delta_content = delta_message.content
elif delta_message.tool_calls and delta_message.tool_calls[0].function and delta_message.tool_calls[0].function.arguments:
    delta_content = delta_message.tool_calls[0].function.arguments
```
The logic for extracting `delta_content` from tool calls only considers the first tool call in a delta message (`delta_message.tool_calls[0]`). While a delta typically contains a single tool call, the `tool_calls` attribute is a list, suggesting multiple could be present. If a delta ever contains more than one tool call, the arguments from subsequent tool calls will not be logged.

To make this more robust, I suggest iterating through all items in `delta_message.tool_calls` to ensure all arguments are captured.
```python
if delta_message.content:
    delta_content = delta_message.content
elif delta_message.tool_calls:
    delta_content = "".join(
        tc.function.arguments
        for tc in delta_message.tool_calls
        if tc.function and tc.function.arguments)
```
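As a quick standalone illustration of the suggested approach (mock objects only, not vLLM types; the argument values are made up), joining across the list captures the arguments of every tool call in a delta rather than only the first:

```python
from types import SimpleNamespace

def tool_call(arguments: str) -> SimpleNamespace:
    # Minimal stand-in for a streamed tool-call delta with function arguments.
    return SimpleNamespace(function=SimpleNamespace(arguments=arguments))

delta_message = SimpleNamespace(
    content=None,
    tool_calls=[tool_call('{"city": "Paris"}'), tool_call('{"unit": "C"}')],
)

if delta_message.content:
    delta_content = delta_message.content
elif delta_message.tool_calls:
    delta_content = "".join(
        tc.function.arguments
        for tc in delta_message.tool_calls
        if tc.function and tc.function.arguments)

print(delta_content)  # {"city": "Paris"}{"unit": "C"}
```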
Thanks for contributing! Can you resolve the pre-commit issues?
@aarnphm can you help review?
Add optional output logging functionality to complement existing input logging. By default, vLLM only logs incoming requests but not model outputs. This feature adds comprehensive output logging controlled by a new CLI flag.

Key features:
- New --enable-log-outputs CLI flag (disabled by default)
- Logs both streaming and non-streaming responses
- Supports individual token deltas in streaming mode
- Handles tool calls and function arguments
- Respects existing --max-log-len truncation settings
- Maintains full backward compatibility

Implementation:
- Added RequestLogger.log_outputs() method for output logging
- Enhanced OpenAIServingChat with output logging in both generators
- Enhanced OpenAIServingResponses with output logging support
- Added comprehensive test coverage for all scenarios

Usage:
python -m vllm.entrypoints.openai.api_server --model MODEL_NAME --enable-log-outputs

Docker:
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model MODEL_NAME --enable-log-outputs

This addresses the common need for debugging and monitoring model outputs while preserving the existing behavior by default.

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Fix type annotation and variable naming issues identified by mypy:
- Change output_token_ids parameter type from list[int] to Sequence[int] to handle compatibility with different sequence types from output objects
- Fix variable naming conflict in tool call logging (tool_call_info -> tool_call_descriptions)
- Add proper type conversion in log_outputs method for truncation
- Update test imports to include Sequence type

These fixes ensure the output logging feature passes mypy type checking while maintaining full functionality and backward compatibility.

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
- Break long conditional expressions into multiple lines
- Fix tool call logging lines exceeding 80 characters
- Remove trailing whitespace
- Maintain code readability and functionality

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Shorten comment from 81 to 71 characters to comply with E501 line length limit. The comment 'Log individual streaming delta if output logging is enabled' was shortened to 'Log streaming delta if output logging is enabled' while maintaining clarity and meaning. Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
I tried to fix the issues mentioned, but I am not sure why the pre-commit hooks are failing now.
It looks like the code is not formatted properly. You should install the pre-commit hook and run it locally before committing and pushing to the remote.
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Hey there @DarkLight1337, I addressed the formatting changes, but my last commit failed during the Lint and Deploy Charts job. It appears to be an issue related to Triton and does not seem to be related to my changes: the job failed because tl.int32 does not exist in the Triton installation. I'm not sure if we need to use tl.int64 or upgrade the triton package.
The assertions in log_outputs test methods were checking wrong argument indices from mocked logger calls, causing tests to validate incorrect behavior and pass incorrectly.

The logger.info call signature is:
logger.info(format_string, request_id, stream_info, outputs, output_token_ids, finish_reason)

Fixed argument index assertions in all affected test methods:
- test_request_logger_log_outputs
- test_request_logger_log_outputs_streaming_delta
- test_request_logger_log_outputs_streaming_complete
- test_request_logger_log_outputs_with_truncation
- test_request_logger_log_outputs_none_values
- test_request_logger_log_outputs_empty_output
- test_request_logger_log_outputs_integration

Tests now correctly validate outputs at index 3, output_token_ids at index 4, and finish_reason at index 5, instead of the previous incorrect indices 1, 2, and 3 respectively.

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Previously, the log message for a completed streaming response only included the number of generated tokens, which limited debugging and auditing capabilities.

This change:
- Modifies the streaming response logging to include the full concatenated text instead of just token counts
- Adds test coverage to verify the full text logging behavior
- Ensures all logger.info call argument indices are correct in tests

The change improves the utility of logs for debugging and auditing by providing complete output records.

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Previously only the first tool call’s arguments were captured when logging streaming delta content, which could miss information if multiple tool calls were present in a single delta. The extraction logic now concatenates the arguments from *all* tool calls, ensuring complete logging.

Additional changes:
* Updated unit tests to remain within Ruff line-length limits (E501).
* Auto-formatted touched files via project pre-commit hooks.

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
This pull request has merge conflicts that must be resolved before it can be merged.
Resolved merge conflicts in vllm/entrypoints/openai/api_server.py while preserving logger enhancements and SSE decoding added in this branch. All logger tests pass. Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Resolved all merge conflicts and brought in the latest changes from main.
tests/test_logger.py (Outdated)
```python
mock_logger.info.assert_called_once()
call_args = mock_logger.info.call_args.args
# logger.info(format_string, request_id, stream_info, outputs, output_token_ids, finish_reason)
```
Please remove the commented-out code
…outputs Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
…outputs Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
There is a single test failing in buildkite/fastcheck/pr that is related to the openai package; I am not sure it is related to my changes at all.
Retrying the test.
This pull request has merge conflicts that must be resolved before it can be merged.
Add --enable-log-outputs flag for logging model generations
📋 Summary
This PR adds optional output logging functionality to complement vLLM's existing input logging. By default, vLLM logs incoming requests (prompts, parameters, token IDs) but does not log model outputs. This feature adds comprehensive output logging controlled by a new CLI flag.
🚀 Motivation
vLLM already logs incoming requests (prompts, parameters, token IDs) but provides no visibility into what the model actually generated. Detailed output information is valuable for debugging, monitoring, auditing, and development workflows.
✨ Key Features
- New `--enable-log-outputs` CLI flag (disabled by default for backward compatibility)
- Logs both streaming and non-streaming responses, including tool calls and their arguments
- Respects existing `--max-log-len` settings for truncation
- Works with both `/v1/chat/completions` and `/v1/responses`
🔧 Implementation
Components Added/Modified:
- `RequestLogger.log_outputs()` method (`vllm/entrypoints/logger.py`); see the sketch after this list
- CLI argument (`vllm/entrypoints/openai/cli_args.py`): the `--enable-log-outputs` flag with help text
- OpenAIServingChat enhancements (`vllm/entrypoints/openai/serving_chat.py`): `chat_completion_stream_generator()` for streaming and `chat_completion_full_generator()` for non-streaming
- OpenAIServingResponses enhancements (`vllm/entrypoints/openai/serving_responses.py`): `responses_full_generator()` method
- Server initialization (`vllm/entrypoints/openai/api_server.py`): passes the `enable_log_outputs` flag to the serving classes
- Comprehensive tests (`tests/test_logger.py`)
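For reference, the serving paths invoke the new logger roughly as in the following sketch. The keyword arguments mirror the `log_outputs()` calls in this PR's diff; the surrounding names (`final_output`, `self.request_logger`, `self.enable_log_outputs`) are illustrative placeholders rather than exact code from the branch.

```python
# Sketch only: how a non-streaming serving path might record a finished
# generation when output logging is enabled. Keyword arguments follow the
# log_outputs() calls shown in this PR; variable names are placeholders.
if self.enable_log_outputs and self.request_logger is not None:
    self.request_logger.log_outputs(
        request_id=request_id,
        outputs=final_output.text,
        output_token_ids=final_output.token_ids,
        finish_reason=final_output.finish_reason,
        is_streaming=False,
        delta=False,
    )
```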
🧪 Testing
Manual Testing Performed:
- Verified truncation of long outputs with `--max-log-len`

Automated Tests:
- Unit tests for the `log_outputs()` method
- Integration tests combining `log_outputs()` with `log_inputs()`
📝 Usage Examples
Command Line:
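Start the OpenAI-compatible server with output logging enabled (`MODEL_NAME` is a placeholder for the model you want to serve):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model MODEL_NAME \
    --enable-log-outputs
```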
Docker:
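The same flag can be appended when running the official image:

```bash
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
    --model MODEL_NAME \
    --enable-log-outputs
```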
Environment Variables (SageMaker style):
export SM_VLLM_ENABLE_LOG_OUTPUTS=true
🔍 Log Output Examples
Input Logging (existing, always active):
Output Logging (NEW, with --enable-log-outputs):
Streaming Delta Logging (NEW):
🔄 Backward Compatibility
Output logging is disabled by default, so existing behavior is unchanged unless `--enable-log-outputs` is explicitly passed. Input logging continues to work exactly as before.
📊 Performance Impact
🔐 Security Considerations
🎯 Future Enhancements (Out of Scope)
📋 Checklist
🤝 Related Issues
This addresses common requests for output logging capability that have appeared in:
📸 Screenshots/Demo
// Does not let me attach screenshot
Successfully tested with DialoGPT-small model showing:
Ready for review! This feature provides a much-requested capability while maintaining full backward compatibility and following vLLM's existing patterns.