Skip to content

feat: Add --enable-log-outputs flag for logging model generations #20707

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

mizadri
Copy link

@mizadri mizadri commented Jul 9, 2025

Add --enable-log-outputs flag for logging model generations

📋 Summary

This PR adds optional output logging functionality to complement vLLM's existing input logging. By default, vLLM logs incoming requests (prompts, parameters, token IDs) but does not log model outputs. This feature adds comprehensive output logging controlled by a new CLI flag.

🚀 Motivation

  • Debugging: Enables developers to see both inputs and outputs for debugging inference issues
  • Monitoring: Allows production monitoring of model generations for quality control
  • Auditing: Provides complete request/response audit trails for compliance
  • Development: Helps with prompt engineering and model behavior analysis

✨ Key Features

  • New --enable-log-outputs CLI flag (disabled by default for backward compatibility)
  • Streaming support: Logs individual token deltas and complete responses
  • Non-streaming support: Logs final generated outputs
  • Tool call support: Properly handles function calls and arguments
  • Truncation support: Respects existing --max-log-len settings
  • Multiple endpoints: Works with /v1/chat/completions and /v1/responses
  • Full backward compatibility: No changes to default behavior

🔧 Implementation

Components Added/Modified:

  1. RequestLogger.log_outputs() method (vllm/entrypoints/logger.py)

    • Handles output logging with streaming/non-streaming modes
    • Supports truncation and proper formatting
  2. CLI argument (vllm/entrypoints/openai/cli_args.py)

    • Added --enable-log-outputs flag with help text
  3. OpenAIServingChat enhancements (vllm/entrypoints/openai/serving_chat.py)

    • Output logging in chat_completion_stream_generator() for streaming
    • Output logging in chat_completion_full_generator() for non-streaming
    • Proper handling of tool calls and function arguments
  4. OpenAIServingResponses enhancements (vllm/entrypoints/openai/serving_responses.py)

    • Output logging in responses_full_generator() method
  5. Server initialization (vllm/entrypoints/openai/api_server.py)

    • Pass enable_log_outputs flag to serving classes
  6. Comprehensive tests (tests/test_logger.py)

    • Tests for all logging modes and edge cases
    • Truncation, streaming, tool calls, error handling

🧪 Testing

Manual Testing Performed:

  • ✅ Server starts successfully with new flag
  • ✅ Non-streaming requests log outputs correctly
  • ✅ Streaming requests log individual deltas
  • ✅ Tool calls are properly logged
  • ✅ Truncation works with --max-log-len
  • ✅ Backward compatibility (no output logging without flag)

Automated Tests:

  • ✅ Unit tests for log_outputs() method
  • ✅ Tests for streaming delta and complete modes
  • ✅ Tests for truncation behavior
  • ✅ Tests for None/empty value handling
  • ✅ Integration tests with existing log_inputs()

📝 Usage Examples

Command Line:

python -m vllm.entrypoints.openai.api_server \
  --model microsoft/DialoGPT-small \
  --enable-log-outputs

Docker:

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model microsoft/DialoGPT-small \
  --enable-log-outputs

Environment Variables (SageMaker style):

export SM_VLLM_ENABLE_LOG_OUTPUTS=true

🔍 Log Output Examples

Input Logging (existing, always active):

INFO:vllm.entrypoints.logger:Received request chatcmpl-abc123: prompt: 'Hello, how are you?', params: SamplingParams(...), prompt_token_ids: [15496, 11, 1428, 527, 499, 30], ...

Output Logging (NEW, with --enable-log-outputs):

INFO:vllm.entrypoints.logger:Generated response chatcmpl-abc123: output: 'Hello! I am doing well, thank you for asking.', output_token_ids: [9906, 0, 358, 1097, 3815, 1664, 11, 9901, 499, 369, 10371, 13], finish_reason: stop

Streaming Delta Logging (NEW):

INFO:vllm.entrypoints.logger:Generated response chatcmpl-abc123 (streaming delta): output: 'Hello', output_token_ids: [9906], finish_reason: None

🔄 Backward Compatibility

  • Default behavior unchanged: Output logging is OFF by default
  • Existing logs preserved: Input logging continues exactly as before
  • No breaking changes: All existing APIs and functionality preserved
  • Optional feature: Users explicitly opt-in with --enable-log-outputs

📊 Performance Impact

  • Zero impact when disabled: No performance overhead without the flag
  • Minimal impact when enabled: Simple string logging operations
  • Efficient truncation: Respects existing max-log-len limits
  • No blocking operations: Uses existing logging infrastructure

🔐 Security Considerations

  • ⚠️ Sensitive data: Output logs may contain sensitive generated content
  • Controlled access: Logs go through existing logging infrastructure
  • Truncation support: Respects max-log-len for limiting exposure
  • Opt-in only: Feature is disabled by default

🎯 Future Enhancements (Out of Scope)

  • Structured logging formats (JSON, etc.)
  • Selective endpoint logging
  • Log filtering by content type
  • Export to external monitoring systems

📋 Checklist

  • Code follows project conventions
  • Comprehensive tests added and passing
  • Backward compatibility maintained
  • Documentation updated (CLI help text)
  • Manual testing performed
  • Performance impact considered
  • Security implications reviewed
  • Example usage provided

🤝 Related Issues

This addresses common requests for output logging capability that have appeared in:

  • Debugging workflows
  • Production monitoring needs
  • Development and testing scenarios

📸 Screenshots/Demo

// Does not let me attach screenshot

Successfully tested with DialoGPT-small model showing:

  • ✅ Input logging: "Received request..." messages
  • ✅ Output logging: "Generated response..." messages
  • ✅ Streaming deltas: Individual token logging
  • ✅ Tool calls: Function call logging

Ready for review! This feature provides a much-requested capability while maintaining full backward compatibility and following vLLM's existing patterns.

@mizadri mizadri requested a review from aarnphm as a code owner July 9, 2025 20:34
Copy link

github-actions bot commented Jul 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Jul 9, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @mizadri, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new, optional --enable-log-outputs command-line flag to vLLM, enabling comprehensive logging of model generations. This feature complements the existing input logging by providing detailed output information, which is crucial for debugging, monitoring, auditing, and development workflows. It supports various response types, including streaming deltas, complete responses, and tool calls, while ensuring full backward compatibility and minimal performance impact.

Highlights

  • New Output Logging Flag: A new --enable-log-outputs CLI flag has been added to enable comprehensive logging of model generations (outputs), disabled by default for backward compatibility.
  • Comprehensive Output Logging: The system now logs model outputs for both streaming (individual token deltas and final complete responses) and non-streaming requests, including support for tool calls and their arguments.
  • Integration with Existing Logging: The new output logging functionality integrates seamlessly with the existing RequestLogger and respects the --max-log-len setting for truncation of long outputs and token ID lists.
  • API Endpoint Coverage: Output logging has been implemented for both /v1/chat/completions (streaming and non-streaming) and /v1/responses endpoints, ensuring broad applicability.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an --enable-log-outputs flag to enable logging of model generations, complementing vLLM's existing input logging. The changes include modifications to RequestLogger, CLI arguments, and enhancements to OpenAIServingChat and OpenAIServingResponses. The new unit tests for the logger are not correctly validating the behavior, and there are a couple of suggestions to enhance the logging functionality.

Comment on lines 277 to 282
mock_logger.info.assert_called_once()
call_args = mock_logger.info.call_args[0]
assert "Generated response test-123" in call_args[0]
assert "Hello, world!" in call_args[1]
assert [1, 2, 3, 4] == call_args[2]
assert "stop" == call_args[3]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The assertions in this test, and all subsequent new tests for log_outputs, are asserting against the wrong argument indices from the mocked logger call. This means the tests are not correctly validating the intended behavior and are likely passing incorrectly.

For example, call_args[0] is asserted to be the outputs string, but it's actually the format string. The outputs argument is at index 3.

The arguments passed to logger.info are (format_string, request_id, stream_info, outputs, output_token_ids, finish_reason). The assertions need to be updated to check the arguments at their correct positions.

This issue is present in:

  • test_request_logger_log_outputs
  • test_request_logger_log_outputs_streaming_delta
  • test_request_logger_log_outputs_streaming_complete
  • test_request_logger_log_outputs_with_truncation
  • test_request_logger_log_outputs_none_values
  • test_request_logger_log_outputs_empty_output
mock_logger.info.assert_called_once()
        call_args = mock_logger.info.call_args.args
        # logger.info(format_string, request_id, stream_info, outputs, output_token_ids, finish_reason)
        assert "Generated response %s%s" in call_args[0]
        assert call_args[1] == "test-123"
        assert call_args[3] == "Hello, world!"
        assert call_args[4] == [1, 2, 3, 4]
        assert call_args[5] == "stop"

Comment on lines 970 to 980
self.request_logger.log_outputs(
request_id=request_id,
outputs=f"<streaming_complete: {num_completion_tokens} tokens>",
output_token_ids=None,
finish_reason="streaming_complete",
is_streaming=True,
delta=False,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The log message for a completed streaming response currently only includes the number of generated tokens, not the full concatenated text of the response. This doesn't align with the goal of "comprehensive output logging" mentioned in the PR description and reduces the utility of these logs for debugging and auditing.

The comment // For now, we'll log the completion tokens count as final output indicates this might be a known limitation.

To make this more useful, I recommend accumulating the full response text from the streaming deltas and logging it here. This would provide a complete record of the generated output. You might be able to use or adapt the previous_texts variable for this purpose, though you'd need to ensure it's populated even when tools are not in use.

for i in range(num_choices):
                    full_text = (previous_texts[i] if previous_texts
                                 and i < len(previous_texts) else
                                 f"<streaming_complete: {previous_num_tokens[i]} tokens>"
                                 )
                    self.request_logger.log_outputs(
                        request_id=request_id,
                        outputs=full_text,
                        output_token_ids=None,  # Consider also logging all token IDs
                        finish_reason="streaming_complete",
                        is_streaming=True,
                        delta=False,
                    )

Comment on lines 831 to 834
if delta_message.content:
delta_content = delta_message.content
elif delta_message.tool_calls and delta_message.tool_calls[0].function and delta_message.tool_calls[0].function.arguments:
delta_content = delta_message.tool_calls[0].function.arguments
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for extracting delta_content from tool calls only considers the first tool call in a delta message (delta_message.tool_calls[0]). While a delta typically contains a single tool call, the tool_calls attribute is a list, suggesting multiple could be present. If a delta ever contains more than one tool call, the arguments from subsequent tool calls will not be logged.

To make this more robust, I suggest iterating through all items in delta_message.tool_calls to ensure all arguments are captured.

if delta_message.content:
                            delta_content = delta_message.content
                        elif delta_message.tool_calls:
                            delta_content = "".join(
                                tc.function.arguments
                                for tc in delta_message.tool_calls
                                if tc.function and tc.function.arguments)

@DarkLight1337
Copy link
Member

Thanks for contributing! Can you resolve the pre-commit issues?

@mizadri mizadri force-pushed the feature/enable-log-outputs branch from 5c58910 to d356a3e Compare July 10, 2025 08:35
Copy link
Member

@DarkLight1337 DarkLight1337 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aarnphm can you help review?

mizadri added 4 commits July 10, 2025 15:04
Add optional output logging functionality to complement existing input logging.
By default, vLLM only logs incoming requests but not model outputs. This feature
adds comprehensive output logging controlled by a new CLI flag.

Key features:
- New --enable-log-outputs CLI flag (disabled by default)
- Logs both streaming and non-streaming responses
- Supports individual token deltas in streaming mode
- Handles tool calls and function arguments
- Respects existing --max-log-len truncation settings
- Maintains full backward compatibility

Implementation:
- Added RequestLogger.log_outputs() method for output logging
- Enhanced OpenAIServingChat with output logging in both generators
- Enhanced OpenAIServingResponses with output logging support
- Added comprehensive test coverage for all scenarios

Usage:
python -m vllm.entrypoints.openai.api_server --model MODEL_NAME --enable-log-outputs

Docker:
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model MODEL_NAME --enable-log-outputs

This addresses the common need for debugging and monitoring model outputs
while preserving the existing behavior by default.

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Fix type annotation and variable naming issues identified by mypy:
- Change output_token_ids parameter type from list[int] to Sequence[int]
  to handle compatibility with different sequence types from output objects
- Fix variable naming conflict in tool call logging (tool_call_info -> tool_call_descriptions)
- Add proper type conversion in log_outputs method for truncation
- Update test imports to include Sequence type

These fixes ensure the output logging feature passes mypy type checking
while maintaining full functionality and backward compatibility.

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
- Break long conditional expressions into multiple lines
- Fix tool call logging lines exceeding 80 characters
- Remove trailing whitespace
- Maintain code readability and functionality

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Shorten comment from 81 to 71 characters to comply with E501 line length limit.
The comment 'Log individual streaming delta if output logging is enabled'
was shortened to 'Log streaming delta if output logging is enabled' while
maintaining clarity and meaning.

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
@mizadri mizadri force-pushed the feature/enable-log-outputs branch from bab92a8 to 4a10460 Compare July 10, 2025 11:05
@mizadri
Copy link
Author

mizadri commented Jul 10, 2025

I tried to fix the issues mentioned, but I am not sure why the pre commit hooks are failing now

@DarkLight1337
Copy link
Member

It looks like the code is not formatted properly. You should install the pre-commit hook and run it locally before committing and pushing them to remote

mizadri added 2 commits July 11, 2025 10:21
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
@mizadri
Copy link
Author

mizadri commented Jul 11, 2025

Hey there @DarkLight1337 I addressed the formatting changes but in my last commit there was a fail during the Lint and Deploy charts, it appears to be an issue related to Triton, It does not seem to be related to my changes.

The job failed because tl.int32 does not exist in the Triton installation; Not sure if we need to use tl.int64 or upgrade the triton package.

INFO 07-11 06:58:17 [core.py:69] Initializing a V1 LLM engine (v0.9.2rc2.dev173+g681de6d3f) with config: model='/data/', speculative_config=None, tokenizer='/data/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
INFO 07-11 06:58:17 [importing.py:43] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 07-11 06:58:17 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
....
ERROR 07-11 06:58:17 [core.py:586]   File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/causal_conv1d.py", line 30, in <module>
ERROR 07-11 06:58:17 [core.py:586]     batch: tl.int32,  # actually padded_batch
ERROR 07-11 06:58:17 [core.py:586]            ^^^^^^^^
ERROR 07-11 06:58:17 [core.py:586] AttributeError: module 'triton.language' has no attribute 'int32'. Did you mean: 'int64'?

mizadri added 3 commits July 11, 2025 18:32
The assertions in log_outputs test methods were checking wrong argument
indices from mocked logger calls, causing tests to validate incorrect
behavior and pass incorrectly.

The logger.info call signature is:
logger.info(format_string, request_id, stream_info, outputs,
           output_token_ids, finish_reason)

Fixed argument index assertions in all affected test methods:
- test_request_logger_log_outputs
- test_request_logger_log_outputs_streaming_delta
- test_request_logger_log_outputs_streaming_complete
- test_request_logger_log_outputs_with_truncation
- test_request_logger_log_outputs_none_values
- test_request_logger_log_outputs_empty_output
- test_request_logger_log_outputs_integration

Tests now correctly validate outputs at index 3, output_token_ids at
index 4, and finish_reason at index 5, instead of the previous
incorrect indices 1, 2, and 3 respectively.

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Previously, the log message for a completed streaming response only included
the number of generated tokens, which limited debugging and auditing
capabilities. This change:

- Modifies the streaming response logging to include the full concatenated
  text instead of just token counts
- Adds test coverage to verify the full text logging behavior
- Ensures all logger.info call argument indices are correct in tests

The change improves the utility of logs for debugging and auditing by
providing complete output records.

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Previously only the first tool call’s arguments were captured when logging
streaming delta content, which could miss information if multiple tool calls
were present in a single delta.  The extraction logic now concatenates the
arguments from *all* tool calls ensuring complete logging.

Additional changes:
* Updated unit tests to remain within Ruff line-length limits (E501).
* Auto-formatted touched files via project pre-commit hooks.

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Copy link

mergify bot commented Jul 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mizadri.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 11, 2025
Resolved merge conflicts in vllm/entrypoints/openai/api_server.py while preserving
logger enhancements and SSE decoding added in this branch. All logger tests pass.

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
@mizadri
Copy link
Author

mizadri commented Jul 11, 2025

Solved all merge conflicts and brought latest changes from main

@mergify mergify bot removed the needs-rebase label Jul 11, 2025

mock_logger.info.assert_called_once()
call_args = mock_logger.info.call_args.args
# logger.info(format_string, request_id, stream_info, outputs, output_token_ids, finish_reason)
Copy link
Member

@DarkLight1337 DarkLight1337 Jul 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the commented-out code

@mizadri mizadri force-pushed the feature/enable-log-outputs branch from 940250b to cfe4146 Compare July 12, 2025 09:30
mizadri added 3 commits July 12, 2025 14:30
…outputs

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
@mizadri mizadri force-pushed the feature/enable-log-outputs branch from cfe4146 to 11543a0 Compare July 12, 2025 10:31
mizadri added 3 commits July 13, 2025 18:17
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
…outputs

Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai>
@mizadri
Copy link
Author

mizadri commented Jul 15, 2025

There is a single test that fails in buildkite/fastcheck/pr that is related to the openai package, I am not sure it is related to my changes at all:


[2025-07-14T09:30:22Z] =================================== FAILURES ===================================
--
  | [2025-07-14T09:30:22Z] __________ test_required_tool_use[False-HuggingFaceH4/zephyr-7b-beta] __________
  | [2025-07-14T09:30:22Z]
  | [2025-07-14T09:30:22Z] client = <openai.AsyncOpenAI object at 0x7fa376584ec0>, is_v1_server = False
  | [2025-07-14T09:30:22Z] model_name = 'HuggingFaceH4/zephyr-7b-beta'
  | [2025-07-14T09:30:22Z]
  | [2025-07-14T09:30:22Z]     @pytest.mark.asyncio
  | [2025-07-14T09:30:22Z]     @pytest.mark.parametrize("model_name", [MODEL_NAME])
  | [2025-07-14T09:30:22Z]     async def test_required_tool_use(client: openai.AsyncOpenAI,
  | [2025-07-14T09:30:22Z]                                      is_v1_server: bool, model_name: str):
  | [2025-07-14T09:30:22Z]         if is_v1_server:
  | [2025-07-14T09:30:22Z]             pytest.skip(
  | [2025-07-14T09:30:22Z]                 "tool_choice='required' requires features unsupported on V1")
  | [2025-07-14T09:30:22Z]
  | [2025-07-14T09:30:22Z]         tools = [
  | [2025-07-14T09:30:22Z]             {
  | [2025-07-14T09:30:22Z]                 "type": "function",
  | [2025-07-14T09:30:22Z]                 "function": {
  | [2025-07-14T09:30:22Z]                     "name": "get_current_weather",
  | [2025-07-14T09:30:22Z]                     "description": "Get the current weather in a given location",
  | [2025-07-14T09:30:22Z]                     "parameters": {
  | [2025-07-14T09:30:22Z]                         "type": "object",
  | [2025-07-14T09:30:22Z]                         "properties": {
  | [2025-07-14T09:30:22Z]                             "city": {
  | [2025-07-14T09:30:22Z]                                 "type": "string",
  | [2025-07-14T09:30:22Z]                                 "description":
  | [2025-07-14T09:30:22Z]                                 "The city to find the weather for, e.g. 'Vienna'",
  | [2025-07-14T09:30:22Z]                                 "default": "Vienna",
  | [2025-07-14T09:30:22Z]                             },
  | [2025-07-14T09:30:22Z]                             "country": {
  | [2025-07-14T09:30:22Z]                                 "type":
  | [2025-07-14T09:30:22Z]                                 "string",
  | [2025-07-14T09:30:22Z]                                 "description":
  | [2025-07-14T09:30:22Z]                                 "The country that the city is in, e.g. 'Austria'",
  | [2025-07-14T09:30:22Z]                             },
  | [2025-07-14T09:30:22Z]                             "unit": {
  | [2025-07-14T09:30:22Z]                                 "type": "string",
  | [2025-07-14T09:30:22Z]                                 "description":
  | [2025-07-14T09:30:22Z]                                 "The unit to fetch the temperature in",
  | [2025-07-14T09:30:22Z]                                 "enum": ["celsius", "fahrenheit"],
  | [2025-07-14T09:30:22Z]                             },
  | [2025-07-14T09:30:22Z]                         },
  | [2025-07-14T09:30:22Z]                         "required": ["country", "unit"],
  | [2025-07-14T09:30:22Z]                     },
  | [2025-07-14T09:30:22Z]                 },
  | [2025-07-14T09:30:22Z]             },
  | [2025-07-14T09:30:22Z]             {
  | [2025-07-14T09:30:22Z]                 "type": "function",
  | [2025-07-14T09:30:22Z]                 "function": {
  | [2025-07-14T09:30:22Z]                     "name": "get_forecast",
  | [2025-07-14T09:30:22Z]                     "description": "Get the weather forecast for a given location",
  | [2025-07-14T09:30:22Z]                     "parameters": {
  | [2025-07-14T09:30:22Z]                         "type": "object",
  | [2025-07-14T09:30:22Z]                         "properties": {
  | [2025-07-14T09:30:22Z]                             "city": {
  | [2025-07-14T09:30:22Z]                                 "type": "string",
  | [2025-07-14T09:30:22Z]                                 "description":
  | [2025-07-14T09:30:22Z]                                 "The city to get the forecast for, e.g. 'Vienna'",
  | [2025-07-14T09:30:22Z]                                 "default": "Vienna",
  | [2025-07-14T09:30:22Z]                             },
  | [2025-07-14T09:30:22Z]                             "country": {
  | [2025-07-14T09:30:22Z]                                 "type":
  | [2025-07-14T09:30:22Z]                                 "string",
  | [2025-07-14T09:30:22Z]                                 "description":
  | [2025-07-14T09:30:22Z]                                 "The country that the city is in, e.g. 'Austria'",
  | [2025-07-14T09:30:22Z]                             },
  | [2025-07-14T09:30:22Z]                             "days": {
  | [2025-07-14T09:30:22Z]                                 "type":
  | [2025-07-14T09:30:22Z]                                 "integer",
  | [2025-07-14T09:30:22Z]                                 "description":
  | [2025-07-14T09:30:22Z]                                 "Number of days to get the forecast for (1-7)",
  | [2025-07-14T09:30:22Z]                             },
  | [2025-07-14T09:30:22Z]                             "unit": {
  | [2025-07-14T09:30:22Z]                                 "type": "string",
  | [2025-07-14T09:30:22Z]                                 "description":
  | [2025-07-14T09:30:22Z]                                 "The unit to fetch the temperature in",
  | [2025-07-14T09:30:22Z]                                 "enum": ["celsius", "fahrenheit"],
  | [2025-07-14T09:30:22Z]                             },
  | [2025-07-14T09:30:22Z]                         },
  | [2025-07-14T09:30:22Z]                         "required": ["country", "days", "unit"],
  | [2025-07-14T09:30:22Z]                     },
  | [2025-07-14T09:30:22Z]                 },
  | [2025-07-14T09:30:22Z]             },
  | [2025-07-14T09:30:22Z]         ]
  | [2025-07-14T09:30:22Z]
  | [2025-07-14T09:30:22Z]         messages = [
  | [2025-07-14T09:30:22Z]             {
  | [2025-07-14T09:30:22Z]                 "role": "user",
  | [2025-07-14T09:30:22Z]                 "content": "Hi! How are you doing today?"
  | [2025-07-14T09:30:22Z]             },
  | [2025-07-14T09:30:22Z]             {
  | [2025-07-14T09:30:22Z]                 "role": "assistant",
  | [2025-07-14T09:30:22Z]                 "content": "I'm doing well! How can I help you?"
  | [2025-07-14T09:30:22Z]             },
  | [2025-07-14T09:30:22Z]             {
  | [2025-07-14T09:30:22Z]                 "role":
  | [2025-07-14T09:30:22Z]                 "user",
  | [2025-07-14T09:30:22Z]                 "content":
  | [2025-07-14T09:30:22Z]                 "Can you tell me what the current weather is in Berlin and the "\
  | [2025-07-14T09:30:22Z]                 "forecast for the next 5 days, in fahrenheit?",
  | [2025-07-14T09:30:22Z]             },
  | [2025-07-14T09:30:22Z]         ]
  | [2025-07-14T09:30:22Z]
  | [2025-07-14T09:30:22Z]         # Non-streaming test
  | [2025-07-14T09:30:22Z]         chat_completion = await client.chat.completions.create(
  | [2025-07-14T09:30:22Z]             messages=messages,
  | [2025-07-14T09:30:22Z]             model=model_name,
  | [2025-07-14T09:30:22Z]             tools=tools,
  | [2025-07-14T09:30:22Z]             tool_choice="required",
  | [2025-07-14T09:30:22Z]         )
  | [2025-07-14T09:30:22Z]
  | [2025-07-14T09:30:22Z]         assert chat_completion.choices[0].message.tool_calls is not None
  | [2025-07-14T09:30:22Z]         assert len(chat_completion.choices[0].message.tool_calls) > 0
  | [2025-07-14T09:30:22Z]
  | [2025-07-14T09:30:22Z]         # Streaming test
  | [2025-07-14T09:30:22Z]         stream = await client.chat.completions.create(
  | [2025-07-14T09:30:22Z]             messages=messages,
  | [2025-07-14T09:30:22Z]             model=model_name,
  | [2025-07-14T09:30:22Z]             tools=tools,
  | [2025-07-14T09:30:22Z]             tool_choice="required",
  | [2025-07-14T09:30:22Z]             stream=True,
  | [2025-07-14T09:30:22Z]         )
  | [2025-07-14T09:30:22Z]
  | [2025-07-14T09:30:22Z]         output = []
  | [2025-07-14T09:30:22Z] >       async for chunk in stream:
  | [2025-07-14T09:30:22Z]
  | [2025-07-14T09:30:22Z] entrypoints/openai/test_chat.py:862:
  | [2025-07-14T09:30:22Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
  | [2025-07-14T09:30:22Z] /usr/local/lib/python3.12/dist-packages/openai/_streaming.py:147: in __aiter__
  | [2025-07-14T09:30:22Z]     async for item in self._iterator:
  | [2025-07-14T09:30:22Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
  | [2025-07-14T09:30:22Z]
  | [2025-07-14T09:30:22Z] self = <openai.AsyncStream object at 0x7fa37649ebd0>
  | [2025-07-14T09:30:22Z]
  | [2025-07-14T09:30:22Z]     async def __stream__(self) -> AsyncIterator[_T]:
  | [2025-07-14T09:30:22Z]         cast_to = cast(Any, self._cast_to)
  | [2025-07-14T09:30:22Z]         response = self.response
  | [2025-07-14T09:30:22Z]         process_data = self._client._process_response_data
  | [2025-07-14T09:30:22Z]         iterator = self._iter_events()
  | [2025-07-14T09:30:22Z]
  | [2025-07-14T09:30:22Z]         async for sse in iterator:
  | [2025-07-14T09:30:22Z]             if sse.data.startswith("[DONE]"):
  | [2025-07-14T09:30:22Z]                 break
  | [2025-07-14T09:30:22Z]
  | [2025-07-14T09:30:22Z]             if sse.event is None or sse.event.startswith("response.") or sse.event.startswith("transcript."):
  | [2025-07-14T09:30:22Z]                 data = sse.json()
  | [2025-07-14T09:30:22Z]                 if is_mapping(data) and data.get("error"):
  | [2025-07-14T09:30:22Z]                     message = None
  | [2025-07-14T09:30:22Z]                     error = data.get("error")
  | [2025-07-14T09:30:22Z]                     if is_mapping(error):
  | [2025-07-14T09:30:22Z]                         message = error.get("message")
  | [2025-07-14T09:30:22Z]                     if not message or not isinstance(message, str):
  | [2025-07-14T09:30:22Z]                         message = "An error occurred during streaming"
  | [2025-07-14T09:30:22Z]
  | [2025-07-14T09:30:22Z] >                   raise APIError(
  | [2025-07-14T09:30:22Z]                         message=message,
  | [2025-07-14T09:30:22Z]                         request=self.response.request,
  | [2025-07-14T09:30:22Z]                         body=data["error"],
  | [2025-07-14T09:30:22Z]                     )
  | [2025-07-14T09:30:22Z] E                   openai.APIError: Expecting property name enclosed in double quotes: line 1 column 4 (char 3)
  | [2025-07-14T09:30:22Z]
  | [2025-07-14T09:30:22Z] /usr/local/lib/python3.12/dist-packages/openai/_streaming.py:174: APIError


@DarkLight1337
Copy link
Member

Retrying the test

Copy link

mergify bot commented Jul 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mizadri.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants