Commit 0c2f705

[Feat] Add Responses API - Routing Affinity logic for sessions (#10193)
* test for test_responses_api_routing_with_previous_response_id
* test_responses_api_routing_with_previous_response_id
* add ResponsesApiDeploymentCheck
* ResponsesApiDeploymentCheck
* ResponsesApiDeploymentCheck
* fix ResponsesApiDeploymentCheck
* test_responses_api_routing_with_previous_response_id
* ResponsesApiDeploymentCheck
* test_responses_api_deployment_check.py
* docs routing affinity
* simplify ResponsesApiDeploymentCheck
* test response id
* fix code quality check
1 parent 4eac0f6 commit 0c2f705

File tree

9 files changed (+862, -29 lines)


docs/my-website/docs/response_api.md

Lines changed: 118 additions & 0 deletions
@@ -520,3 +520,121 @@ for event in response:

| `azure_ai` | [See supported parameters here](https://github.com/BerriAI/litellm/blob/f39d9178868662746f159d5ef642c7f34f9bfe5f/litellm/responses/litellm_completion_transformation/transformation.py#L57) |
| All other llm api providers | [See supported parameters here](https://github.com/BerriAI/litellm/blob/f39d9178868662746f159d5ef642c7f34f9bfe5f/litellm/responses/litellm_completion_transformation/transformation.py#L57) |

## Load Balancing with Routing Affinity

When using the Responses API with multiple deployments of the same model (e.g., multiple Azure OpenAI endpoints), LiteLLM provides routing affinity for conversations. This ensures that follow-up requests using a `previous_response_id` are routed to the same deployment that generated the original response.

#### Example Usage

<Tabs>
<TabItem value="python-sdk" label="Python SDK">

```python showLineNumbers title="Python SDK with Routing Affinity"
import litellm

# Set up a router with multiple deployments of the same model
router = litellm.Router(
    model_list=[
        {
            "model_name": "azure-gpt4-turbo",
            "litellm_params": {
                "model": "azure/gpt-4-turbo",
                "api_key": "your-api-key-1",
                "api_version": "2024-06-01",
                "api_base": "https://endpoint1.openai.azure.com",
            },
        },
        {
            "model_name": "azure-gpt4-turbo",
            "litellm_params": {
                "model": "azure/gpt-4-turbo",
                "api_key": "your-api-key-2",
                "api_version": "2024-06-01",
                "api_base": "https://endpoint2.openai.azure.com",
            },
        },
    ],
    optional_pre_call_checks=["responses_api_deployment_check"],
)

# Initial request
response = await router.aresponses(
    model="azure-gpt4-turbo",
    input="Hello, who are you?",
    truncation="auto",
)

# Store the response ID
response_id = response.id

# Follow-up request - will be automatically routed to the same deployment
follow_up = await router.aresponses(
    model="azure-gpt4-turbo",
    input="Tell me more about yourself",
    truncation="auto",
    previous_response_id=response_id,  # This ensures routing to the same deployment
)
```

</TabItem>
<TabItem value="proxy-server" label="Proxy Server">

#### 1. Set up routing affinity in the proxy config.yaml

To enable routing affinity for the Responses API in your LiteLLM proxy, set `optional_pre_call_checks: ["responses_api_deployment_check"]` in your proxy config.yaml.

```yaml showLineNumbers title="config.yaml with Responses API Routing Affinity"
model_list:
  - model_name: azure-gpt4-turbo
    litellm_params:
      model: azure/gpt-4-turbo
      api_key: your-api-key-1
      api_version: 2024-06-01
      api_base: https://endpoint1.openai.azure.com
  - model_name: azure-gpt4-turbo
    litellm_params:
      model: azure/gpt-4-turbo
      api_key: your-api-key-2
      api_version: 2024-06-01
      api_base: https://endpoint2.openai.azure.com

router_settings:
  optional_pre_call_checks: ["responses_api_deployment_check"]
```

#### 2. Use the OpenAI Python SDK to make requests to the LiteLLM Proxy

```python showLineNumbers title="OpenAI Client with Proxy Server"
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",
    api_key="your-api-key"
)

# Initial request
response = client.responses.create(
    model="azure-gpt4-turbo",
    input="Hello, who are you?"
)

response_id = response.id

# Follow-up request - will be automatically routed to the same deployment
follow_up = client.responses.create(
    model="azure-gpt4-turbo",
    input="Tell me more about yourself",
    previous_response_id=response_id  # This ensures routing to the same deployment
)
```

</TabItem>
</Tabs>

#### How It Works

1. When a user makes an initial request to the Responses API, LiteLLM caches which model deployment returned that response (stored in Redis if LiteLLM is connected to Redis).
2. When a subsequent request includes `previous_response_id`, LiteLLM automatically routes it to the same deployment.
3. If the original deployment is unavailable, or if the `previous_response_id` isn't found in the cache, LiteLLM falls back to normal routing.
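Under the hood, the affinity hint also travels inside the response ID itself: this commit rewrites `response.id` into a base64-encoded `resp_...` string that embeds the internal `model_id` of the deployment that served the request (see `litellm/responses/utils.py` below). A minimal debugging sketch, continuing the Python SDK example above and using an internal helper from this commit (not a public, stable API):

```python
# Sketch: inspect which deployment a LiteLLM-managed Responses API ID is pinned to.
# `response` is the object returned by `router.aresponses(...)` in the example above.
from litellm.responses.utils import ResponsesAPIRequestUtils

model_id, provider_response_id = (
    ResponsesAPIRequestUtils._decode_responses_api_response_id(
        response_id=response.id,
    )
)
print(model_id)              # internal id of the deployment that served the request
print(provider_response_id)  # the raw response id returned by the upstream provider
```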

litellm/proxy/proxy_config.yaml

Lines changed: 13 additions & 10 deletions
```diff
@@ -1,13 +1,16 @@
 model_list:
-  - model_name: openai/*
+  - model_name: azure-computer-use-preview
     litellm_params:
-      model: openai/*
-  - model_name: anthropic/*
+      model: azure/computer-use-preview
+      api_key: mock-api-key
+      api_version: mock-api-version
+      api_base: https://mock-endpoint.openai.azure.com
+  - model_name: azure-computer-use-preview
     litellm_params:
-      model: anthropic/*
-  - model_name: gemini/*
-    litellm_params:
-      model: gemini/*
-litellm_settings:
-  drop_params: true
-
+      model: azure/computer-use-preview-2
+      api_key: mock-api-key-2
+      api_version: mock-api-version-2
+      api_base: https://mock-endpoint-2.openai.azure.com
+
+router_settings:
+  optional_pre_call_checks: ["responses_api_deployment_check"]
```
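This is the config used to exercise the new check locally: two deployments share the `azure-computer-use-preview` model name, and `responses_api_deployment_check` pins follow-up requests to whichever deployment served the original response. A hedged sketch of calling a proxy started with this config, mirroring the docs example above (base URL and API key are placeholders):

```python
# Sketch only: mirrors the docs proxy example above; base_url/api_key are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-1234")

# First turn: the router picks one of the two computer-use-preview deployments
response = client.responses.create(
    model="azure-computer-use-preview",
    input="Hello, who are you?",
)

# Second turn: previous_response_id pins the request to the same deployment
follow_up = client.responses.create(
    model="azure-computer-use-preview",
    input="Tell me more about yourself",
    previous_response_id=response.id,
)
```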

litellm/responses/main.py

Lines changed: 14 additions & 0 deletions
```diff
@@ -116,6 +116,13 @@ async def aresponses(
             response = await init_response
         else:
             response = init_response
+
+        # Update the responses_api_response_id with the model_id
+        if isinstance(response, ResponsesAPIResponse):
+            response = ResponsesAPIRequestUtils._update_responses_api_response_id_with_model_id(
+                responses_api_response=response,
+                kwargs=kwargs,
+            )
         return response
     except Exception as e:
         raise litellm.exception_type(
@@ -248,6 +255,13 @@ def responses(
             ),
         )
 
+        # Update the responses_api_response_id with the model_id
+        if isinstance(response, ResponsesAPIResponse):
+            response = ResponsesAPIRequestUtils._update_responses_api_response_id_with_model_id(
+                responses_api_response=response,
+                kwargs=kwargs,
+            )
+
         return response
     except Exception as e:
         raise litellm.exception_type(
```

litellm/responses/utils.py

Lines changed: 65 additions & 2 deletions
```diff
@@ -1,12 +1,15 @@
-from typing import Any, Dict, Union, cast, get_type_hints
+import base64
+from typing import Any, Dict, Optional, Tuple, Union, cast, get_type_hints
 
 import litellm
+from litellm._logging import verbose_logger
 from litellm.llms.base_llm.responses.transformation import BaseResponsesAPIConfig
 from litellm.types.llms.openai import (
     ResponseAPIUsage,
     ResponsesAPIOptionalRequestParams,
+    ResponsesAPIResponse,
 )
-from litellm.types.utils import Usage
+from litellm.types.utils import SpecialEnums, Usage
 
 
 class ResponsesAPIRequestUtils:
@@ -77,6 +80,66 @@ def get_requested_response_api_optional_param(
         }
         return cast(ResponsesAPIOptionalRequestParams, filtered_params)
 
+    @staticmethod
+    def _update_responses_api_response_id_with_model_id(
+        responses_api_response: ResponsesAPIResponse,
+        kwargs: Dict[str, Any],
+    ) -> ResponsesAPIResponse:
+        """Update the responses_api_response_id with the model_id"""
+        litellm_metadata: Dict[str, Any] = kwargs.get("litellm_metadata", {}) or {}
+        model_info: Dict[str, Any] = litellm_metadata.get("model_info", {}) or {}
+        model_id = model_info.get("id")
+        updated_id = ResponsesAPIRequestUtils._build_responses_api_response_id(
+            model_id=model_id,
+            response_id=responses_api_response.id,
+        )
+        responses_api_response.id = updated_id
+        return responses_api_response
+
+    @staticmethod
+    def _build_responses_api_response_id(
+        model_id: Optional[str],
+        response_id: str,
+    ) -> str:
+        """Build the responses_api_response_id"""
+        if model_id is None:
+            return response_id
+        assembled_id: str = str(
+            SpecialEnums.LITELLM_MANAGED_RESPONSE_COMPLETE_STR.value
+        ).format(model_id, response_id)
+        base64_encoded_id: str = base64.b64encode(assembled_id.encode("utf-8")).decode(
+            "utf-8"
+        )
+        return f"resp_{base64_encoded_id}"
+
+    @staticmethod
+    def _decode_responses_api_response_id(
+        response_id: str,
+    ) -> Tuple[Optional[str], str]:
+        """
+        Decode the responses_api_response_id
+
+        Returns:
+            Tuple of model_id, response_id (from upstream provider)
+        """
+        try:
+            # Remove prefix and decode
+            cleaned_id = response_id.replace("resp_", "")
+            decoded_id = base64.b64decode(cleaned_id.encode("utf-8")).decode("utf-8")
+
+            # Parse components using known prefixes
+            if ";" not in decoded_id:
+                return None, response_id
+
+            model_part, response_part = decoded_id.split(";", 1)
+            model_id = model_part.replace("litellm:model_id:", "")
+            decoded_response_id = response_part.replace("response_id:", "")
+
+            return model_id, decoded_response_id
+        except Exception as e:
+            verbose_logger.debug(f"Error decoding response_id '{response_id}': {e}")
+            return None, response_id
+
 
 class ResponseAPILoggingUtils:
     @staticmethod
```
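To make the encoding concrete, here is a small usage sketch of the helpers above. It assumes the `LITELLM_MANAGED_RESPONSE_COMPLETE_STR` template produces the `litellm:model_id:...;response_id:...` layout that `_decode_responses_api_response_id` parses (which this commit implies); the ids are made up:

```python
# Illustrative sketch only -- exercises the new helpers with made-up ids.
from litellm.responses.utils import ResponsesAPIRequestUtils

# Wrap an upstream response id together with the serving deployment's model_id
wrapped_id = ResponsesAPIRequestUtils._build_responses_api_response_id(
    model_id="deployment-123",  # hypothetical router deployment id
    response_id="resp_abc",     # hypothetical id from the upstream provider
)
print(wrapped_id)  # "resp_<base64>" -- still usable as previous_response_id

# Decoding recovers both pieces (assuming the template matches the decoder)
model_id, provider_id = ResponsesAPIRequestUtils._decode_responses_api_response_id(
    response_id=wrapped_id,
)
print(model_id, provider_id)  # expected: deployment-123 resp_abc

# Ids that were never wrapped fall back gracefully to (None, original_id)
print(
    ResponsesAPIRequestUtils._decode_responses_api_response_id(
        response_id="resp_abc123",
    )
)  # expected: (None, 'resp_abc123')
```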

litellm/router.py

Lines changed: 21 additions & 16 deletions
```diff
@@ -98,6 +98,9 @@
 from litellm.router_utils.pre_call_checks.prompt_caching_deployment_check import (
     PromptCachingDeploymentCheck,
 )
+from litellm.router_utils.pre_call_checks.responses_api_deployment_check import (
+    ResponsesApiDeploymentCheck,
+)
 from litellm.router_utils.router_callbacks.track_deployment_metrics import (
     increment_deployment_failures_for_current_minute,
     increment_deployment_successes_for_current_minute,
@@ -339,9 +342,9 @@ def __init__(  # noqa: PLR0915
         )  # names of models under litellm_params. ex. azure/chatgpt-v-2
         self.deployment_latency_map = {}
         ### CACHING ###
-        cache_type: Literal[
-            "local", "redis", "redis-semantic", "s3", "disk"
-        ] = "local"  # default to an in-memory cache
+        cache_type: Literal["local", "redis", "redis-semantic", "s3", "disk"] = (
+            "local"  # default to an in-memory cache
+        )
         redis_cache = None
         cache_config: Dict[str, Any] = {}
 
@@ -562,9 +565,9 @@ def __init__(  # noqa: PLR0915
             )
         )
 
-        self.model_group_retry_policy: Optional[
-            Dict[str, RetryPolicy]
-        ] = model_group_retry_policy
+        self.model_group_retry_policy: Optional[Dict[str, RetryPolicy]] = (
+            model_group_retry_policy
+        )
 
         self.allowed_fails_policy: Optional[AllowedFailsPolicy] = None
         if allowed_fails_policy is not None:
@@ -765,6 +768,8 @@ def add_optional_pre_call_checks(
                     provider_budget_config=self.provider_budget_config,
                     model_list=self.model_list,
                 )
+            elif pre_call_check == "responses_api_deployment_check":
+                _callback = ResponsesApiDeploymentCheck()
             if _callback is not None:
                 litellm.logging_callback_manager.add_litellm_callback(_callback)
 
@@ -3247,11 +3252,11 @@ async def async_function_with_fallbacks(self, *args, **kwargs):  # noqa: PLR0915
 
             if isinstance(e, litellm.ContextWindowExceededError):
                 if context_window_fallbacks is not None:
-                    fallback_model_group: Optional[
-                        List[str]
-                    ] = self._get_fallback_model_group_from_fallbacks(
-                        fallbacks=context_window_fallbacks,
-                        model_group=model_group,
+                    fallback_model_group: Optional[List[str]] = (
+                        self._get_fallback_model_group_from_fallbacks(
+                            fallbacks=context_window_fallbacks,
+                            model_group=model_group,
+                        )
                     )
                     if fallback_model_group is None:
                         raise original_exception
@@ -3283,11 +3288,11 @@ async def async_function_with_fallbacks(self, *args, **kwargs):  # noqa: PLR0915
                     e.message += "\n{}".format(error_message)
             elif isinstance(e, litellm.ContentPolicyViolationError):
                 if content_policy_fallbacks is not None:
-                    fallback_model_group: Optional[
-                        List[str]
-                    ] = self._get_fallback_model_group_from_fallbacks(
-                        fallbacks=content_policy_fallbacks,
-                        model_group=model_group,
+                    fallback_model_group: Optional[List[str]] = (
+                        self._get_fallback_model_group_from_fallbacks(
+                            fallbacks=content_policy_fallbacks,
+                            model_group=model_group,
+                        )
                     )
                     if fallback_model_group is None:
                         raise original_exception
```
litellm/router_utils/pre_call_checks/responses_api_deployment_check.py

Lines changed: 46 additions & 0 deletions

```python
"""
For Responses API, we need routing affinity when a user sends a previous_response_id.

eg. If proxy admins are load balancing between N gpt-4.1-turbo deployments, and a user sends a previous_response_id,
we want to route to the same gpt-4.1-turbo deployment.

This is different from the normal behavior of the router, which does not have routing affinity for previous_response_id.


If previous_response_id is provided, route to the deployment that returned the previous response
"""

from typing import List, Optional

from litellm.integrations.custom_logger import CustomLogger, Span
from litellm.responses.utils import ResponsesAPIRequestUtils
from litellm.types.llms.openai import AllMessageValues


class ResponsesApiDeploymentCheck(CustomLogger):
    async def async_filter_deployments(
        self,
        model: str,
        healthy_deployments: List,
        messages: Optional[List[AllMessageValues]],
        request_kwargs: Optional[dict] = None,
        parent_otel_span: Optional[Span] = None,
    ) -> List[dict]:
        request_kwargs = request_kwargs or {}
        previous_response_id = request_kwargs.get("previous_response_id", None)
        if previous_response_id is None:
            return healthy_deployments

        model_id, response_id = (
            ResponsesAPIRequestUtils._decode_responses_api_response_id(
                response_id=previous_response_id,
            )
        )
        if model_id is None:
            return healthy_deployments

        for deployment in healthy_deployments:
            if deployment["model_info"]["id"] == model_id:
                return [deployment]

        return healthy_deployments
```
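For a sense of how the check behaves, here is a rough, self-contained sketch (not the commit's test file, which isn't shown here): two made-up deployment dicts, an id wrapped via the helper above, and a direct call to `async_filter_deployments`:

```python
# Illustration only: deployment dicts are made up; the check and id helper come from this commit.
import asyncio

from litellm.responses.utils import ResponsesAPIRequestUtils
from litellm.router_utils.pre_call_checks.responses_api_deployment_check import (
    ResponsesApiDeploymentCheck,
)

deployments = [
    {"model_name": "azure-gpt4-turbo", "model_info": {"id": "deployment-1"}},
    {"model_name": "azure-gpt4-turbo", "model_info": {"id": "deployment-2"}},
]

# Pretend deployment-2 served the original response
previous_id = ResponsesAPIRequestUtils._build_responses_api_response_id(
    model_id="deployment-2",
    response_id="resp_original",
)

check = ResponsesApiDeploymentCheck()
filtered = asyncio.run(
    check.async_filter_deployments(
        model="azure-gpt4-turbo",
        healthy_deployments=deployments,
        messages=None,
        request_kwargs={"previous_response_id": previous_id},
    )
)
print([d["model_info"]["id"] for d in filtered])  # expected: ['deployment-2']

# Without previous_response_id (or with an id that isn't LiteLLM-managed),
# the check returns all healthy deployments unchanged.
```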
