Commit 0c2f705

[Feat] Add Responses API - Routing Affinity logic for sessions (#10193)
* test for test_responses_api_routing_with_previous_response_id
* test_responses_api_routing_with_previous_response_id
* add ResponsesApiDeploymentCheck
* ResponsesApiDeploymentCheck
* ResponsesApiDeploymentCheck
* fix ResponsesApiDeploymentCheck
* test_responses_api_routing_with_previous_response_id
* ResponsesApiDeploymentCheck
* test_responses_api_deployment_check.py
* docs routing affinity
* simplify ResponsesApiDeploymentCheck
* test response id
* fix code quality check
1 parent 4eac0f6 commit 0c2f705

File tree

9 files changed (+862, -29 lines)


docs/my-website/docs/response_api.md

Lines changed: 118 additions & 0 deletions
@@ -520,3 +520,121 @@ for event in response:

| `azure_ai` | [See supported parameters here](https://github.com/BerriAI/litellm/blob/f39d9178868662746f159d5ef642c7f34f9bfe5f/litellm/responses/litellm_completion_transformation/transformation.py#L57) |
| All other llm api providers | [See supported parameters here](https://github.com/BerriAI/litellm/blob/f39d9178868662746f159d5ef642c7f34f9bfe5f/litellm/responses/litellm_completion_transformation/transformation.py#L57) |

## Load Balancing with Routing Affinity

When using the Responses API with multiple deployments of the same model (e.g., multiple Azure OpenAI endpoints), LiteLLM provides routing affinity for conversations. This ensures that follow-up requests using a `previous_response_id` are routed to the same deployment that generated the original response.

#### Example Usage

<Tabs>
<TabItem value="python-sdk" label="Python SDK">

```python showLineNumbers title="Python SDK with Routing Affinity"
import litellm

# Set up a router with multiple deployments of the same model
router = litellm.Router(
    model_list=[
        {
            "model_name": "azure-gpt4-turbo",
            "litellm_params": {
                "model": "azure/gpt-4-turbo",
                "api_key": "your-api-key-1",
                "api_version": "2024-06-01",
                "api_base": "https://endpoint1.openai.azure.com",
            },
        },
        {
            "model_name": "azure-gpt4-turbo",
            "litellm_params": {
                "model": "azure/gpt-4-turbo",
                "api_key": "your-api-key-2",
                "api_version": "2024-06-01",
                "api_base": "https://endpoint2.openai.azure.com",
            },
        },
    ],
    optional_pre_call_checks=["responses_api_deployment_check"],
)

# Initial request
response = await router.aresponses(
    model="azure-gpt4-turbo",
    input="Hello, who are you?",
    truncation="auto",
)

# Store the response ID
response_id = response.id

# Follow-up request - will be automatically routed to the same deployment
follow_up = await router.aresponses(
    model="azure-gpt4-turbo",
    input="Tell me more about yourself",
    truncation="auto",
    previous_response_id=response_id,  # This ensures routing to the same deployment
)
```

</TabItem>
<TabItem value="proxy-server" label="Proxy Server">

#### 1. Set up routing affinity in the proxy config.yaml

To enable routing affinity for the Responses API in your LiteLLM proxy, set `optional_pre_call_checks: ["responses_api_deployment_check"]` in your proxy config.yaml.

```yaml showLineNumbers title="config.yaml with Responses API Routing Affinity"
model_list:
  - model_name: azure-gpt4-turbo
    litellm_params:
      model: azure/gpt-4-turbo
      api_key: your-api-key-1
      api_version: 2024-06-01
      api_base: https://endpoint1.openai.azure.com
  - model_name: azure-gpt4-turbo
    litellm_params:
      model: azure/gpt-4-turbo
      api_key: your-api-key-2
      api_version: 2024-06-01
      api_base: https://endpoint2.openai.azure.com

router_settings:
  optional_pre_call_checks: ["responses_api_deployment_check"]
```

#### 2. Use the OpenAI Python SDK to make requests to the LiteLLM Proxy

```python showLineNumbers title="OpenAI Client with Proxy Server"
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",
    api_key="your-api-key"
)

# Initial request
response = client.responses.create(
    model="azure-gpt4-turbo",
    input="Hello, who are you?"
)

response_id = response.id

# Follow-up request - will be automatically routed to the same deployment
follow_up = client.responses.create(
    model="azure-gpt4-turbo",
    input="Tell me more about yourself",
    previous_response_id=response_id  # This ensures routing to the same deployment
)
```

</TabItem>
</Tabs>

#### How It Works

1. When a user makes an initial request to the Responses API, LiteLLM caches which model deployment returned that response (stored in Redis if LiteLLM is connected to Redis).
2. When a subsequent request includes `previous_response_id`, LiteLLM automatically routes it to the same deployment.
3. If the original deployment is unavailable, or if the `previous_response_id` isn't found in the cache, LiteLLM falls back to normal routing.
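Under the hood, the affinity hint also travels inside the response ID itself: this commit rewrites `response.id` into a base64-encoded `resp_...` string that embeds the internal `model_id` of the deployment that served the request (see `litellm/responses/utils.py` below). A minimal debugging sketch, continuing the Python SDK example above and using an internal helper from this commit (not a public, stable API):

```python
# Sketch: inspect which deployment a LiteLLM-managed Responses API ID is pinned to.
# `response` is the object returned by `router.aresponses(...)` in the example above.
from litellm.responses.utils import ResponsesAPIRequestUtils

model_id, provider_response_id = (
    ResponsesAPIRequestUtils._decode_responses_api_response_id(
        response_id=response.id,
    )
)
print(model_id)              # internal id of the deployment that served the request
print(provider_response_id)  # the raw response id returned by the upstream provider
```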

litellm/proxy/proxy_config.yaml

Lines changed: 13 additions & 10 deletions
```diff
@@ -1,13 +1,16 @@
 model_list:
-  - model_name: openai/*
+  - model_name: azure-computer-use-preview
     litellm_params:
-      model: openai/*
-  - model_name: anthropic/*
+      model: azure/computer-use-preview
+      api_key: mock-api-key
+      api_version: mock-api-version
+      api_base: https://mock-endpoint.openai.azure.com
+  - model_name: azure-computer-use-preview
     litellm_params:
-      model: anthropic/*
-  - model_name: gemini/*
-    litellm_params:
-      model: gemini/*
-litellm_settings:
-  drop_params: true
-
+      model: azure/computer-use-preview-2
+      api_key: mock-api-key-2
+      api_version: mock-api-version-2
+      api_base: https://mock-endpoint-2.openai.azure.com
+
+router_settings:
+  optional_pre_call_checks: ["responses_api_deployment_check"]
```
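This is the config used to exercise the new check locally: two deployments share the `azure-computer-use-preview` model name, and `responses_api_deployment_check` pins follow-up requests to whichever deployment served the original response. A hedged sketch of calling a proxy started with this config, mirroring the docs example above (base URL and API key are placeholders):

```python
# Sketch only: mirrors the docs proxy example above; base_url/api_key are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-1234")

# First turn: the router picks one of the two computer-use-preview deployments
response = client.responses.create(
    model="azure-computer-use-preview",
    input="Hello, who are you?",
)

# Second turn: previous_response_id pins the request to the same deployment
follow_up = client.responses.create(
    model="azure-computer-use-preview",
    input="Tell me more about yourself",
    previous_response_id=response.id,
)
```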

litellm/responses/main.py

Lines changed: 14 additions & 0 deletions
```diff
@@ -116,6 +116,13 @@ async def aresponses(
             response = await init_response
         else:
             response = init_response
+
+        # Update the responses_api_response_id with the model_id
+        if isinstance(response, ResponsesAPIResponse):
+            response = ResponsesAPIRequestUtils._update_responses_api_response_id_with_model_id(
+                responses_api_response=response,
+                kwargs=kwargs,
+            )
         return response
     except Exception as e:
         raise litellm.exception_type(
@@ -248,6 +255,13 @@ def responses(
             ),
         )
 
+        # Update the responses_api_response_id with the model_id
+        if isinstance(response, ResponsesAPIResponse):
+            response = ResponsesAPIRequestUtils._update_responses_api_response_id_with_model_id(
+                responses_api_response=response,
+                kwargs=kwargs,
+            )
+
         return response
     except Exception as e:
         raise litellm.exception_type(
```

litellm/responses/utils.py

Lines changed: 65 additions & 2 deletions
```diff
@@ -1,12 +1,15 @@
-from typing import Any, Dict, Union, cast, get_type_hints
+import base64
+from typing import Any, Dict, Optional, Tuple, Union, cast, get_type_hints
 
 import litellm
+from litellm._logging import verbose_logger
 from litellm.llms.base_llm.responses.transformation import BaseResponsesAPIConfig
 from litellm.types.llms.openai import (
     ResponseAPIUsage,
     ResponsesAPIOptionalRequestParams,
+    ResponsesAPIResponse,
 )
-from litellm.types.utils import Usage
+from litellm.types.utils import SpecialEnums, Usage
 
 
 class ResponsesAPIRequestUtils:
@@ -77,6 +80,66 @@ def get_requested_response_api_optional_param(
         }
         return cast(ResponsesAPIOptionalRequestParams, filtered_params)
 
+    @staticmethod
+    def _update_responses_api_response_id_with_model_id(
+        responses_api_response: ResponsesAPIResponse,
+        kwargs: Dict[str, Any],
+    ) -> ResponsesAPIResponse:
+        """Update the responses_api_response_id with the model_id"""
+        litellm_metadata: Dict[str, Any] = kwargs.get("litellm_metadata", {}) or {}
+        model_info: Dict[str, Any] = litellm_metadata.get("model_info", {}) or {}
+        model_id = model_info.get("id")
+        updated_id = ResponsesAPIRequestUtils._build_responses_api_response_id(
+            model_id=model_id,
+            response_id=responses_api_response.id,
+        )
+        responses_api_response.id = updated_id
+        return responses_api_response
+
+    @staticmethod
+    def _build_responses_api_response_id(
+        model_id: Optional[str],
+        response_id: str,
+    ) -> str:
+        """Build the responses_api_response_id"""
+        if model_id is None:
+            return response_id
+        assembled_id: str = str(
+            SpecialEnums.LITELLM_MANAGED_RESPONSE_COMPLETE_STR.value
+        ).format(model_id, response_id)
+        base64_encoded_id: str = base64.b64encode(assembled_id.encode("utf-8")).decode(
+            "utf-8"
+        )
+        return f"resp_{base64_encoded_id}"
+
+    @staticmethod
+    def _decode_responses_api_response_id(
+        response_id: str,
+    ) -> Tuple[Optional[str], str]:
+        """
+        Decode the responses_api_response_id
+
+        Returns:
+            Tuple of model_id, response_id (from upstream provider)
+        """
+        try:
+            # Remove prefix and decode
+            cleaned_id = response_id.replace("resp_", "")
+            decoded_id = base64.b64decode(cleaned_id.encode("utf-8")).decode("utf-8")
+
+            # Parse components using known prefixes
+            if ";" not in decoded_id:
+                return None, response_id
+
+            model_part, response_part = decoded_id.split(";", 1)
+            model_id = model_part.replace("litellm:model_id:", "")
+            decoded_response_id = response_part.replace("response_id:", "")
+
+            return model_id, decoded_response_id
+        except Exception as e:
+            verbose_logger.debug(f"Error decoding response_id '{response_id}': {e}")
+            return None, response_id
+
 
 class ResponseAPILoggingUtils:
     @staticmethod
```
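To make the encoding concrete, here is a small usage sketch of the helpers above. It assumes the `LITELLM_MANAGED_RESPONSE_COMPLETE_STR` template produces the `litellm:model_id:...;response_id:...` layout that `_decode_responses_api_response_id` parses (which this commit implies); the ids are made up:

```python
# Illustrative sketch only -- exercises the new helpers with made-up ids.
from litellm.responses.utils import ResponsesAPIRequestUtils

# Wrap an upstream response id together with the serving deployment's model_id
wrapped_id = ResponsesAPIRequestUtils._build_responses_api_response_id(
    model_id="deployment-123",  # hypothetical router deployment id
    response_id="resp_abc",     # hypothetical id from the upstream provider
)
print(wrapped_id)  # "resp_<base64>" -- still usable as previous_response_id

# Decoding recovers both pieces (assuming the template matches the decoder)
model_id, provider_id = ResponsesAPIRequestUtils._decode_responses_api_response_id(
    response_id=wrapped_id,
)
print(model_id, provider_id)  # expected: deployment-123 resp_abc

# Ids that were never wrapped fall back gracefully to (None, original_id)
print(
    ResponsesAPIRequestUtils._decode_responses_api_response_id(
        response_id="resp_abc123",
    )
)  # expected: (None, 'resp_abc123')
```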

litellm/router.py

Lines changed: 21 additions & 16 deletions
```diff
@@ -98,6 +98,9 @@
 from litellm.router_utils.pre_call_checks.prompt_caching_deployment_check import (
     PromptCachingDeploymentCheck,
 )
+from litellm.router_utils.pre_call_checks.responses_api_deployment_check import (
+    ResponsesApiDeploymentCheck,
+)
 from litellm.router_utils.router_callbacks.track_deployment_metrics import (
     increment_deployment_failures_for_current_minute,
     increment_deployment_successes_for_current_minute,
@@ -339,9 +342,9 @@ def __init__(  # noqa: PLR0915
         )  # names of models under litellm_params. ex. azure/chatgpt-v-2
         self.deployment_latency_map = {}
         ### CACHING ###
-        cache_type: Literal[
-            "local", "redis", "redis-semantic", "s3", "disk"
-        ] = "local"  # default to an in-memory cache
+        cache_type: Literal["local", "redis", "redis-semantic", "s3", "disk"] = (
+            "local"  # default to an in-memory cache
+        )
         redis_cache = None
         cache_config: Dict[str, Any] = {}
 
@@ -562,9 +565,9 @@ def __init__(  # noqa: PLR0915
             )
         )
 
-        self.model_group_retry_policy: Optional[
-            Dict[str, RetryPolicy]
-        ] = model_group_retry_policy
+        self.model_group_retry_policy: Optional[Dict[str, RetryPolicy]] = (
+            model_group_retry_policy
+        )
 
         self.allowed_fails_policy: Optional[AllowedFailsPolicy] = None
         if allowed_fails_policy is not None:
@@ -765,6 +768,8 @@ def add_optional_pre_call_checks(
                     provider_budget_config=self.provider_budget_config,
                     model_list=self.model_list,
                 )
+            elif pre_call_check == "responses_api_deployment_check":
+                _callback = ResponsesApiDeploymentCheck()
             if _callback is not None:
                 litellm.logging_callback_manager.add_litellm_callback(_callback)
 
@@ -3247,11 +3252,11 @@ async def async_function_with_fallbacks(self, *args, **kwargs):  # noqa: PLR0915
 
             if isinstance(e, litellm.ContextWindowExceededError):
                 if context_window_fallbacks is not None:
-                    fallback_model_group: Optional[
-                        List[str]
-                    ] = self._get_fallback_model_group_from_fallbacks(
-                        fallbacks=context_window_fallbacks,
-                        model_group=model_group,
+                    fallback_model_group: Optional[List[str]] = (
+                        self._get_fallback_model_group_from_fallbacks(
+                            fallbacks=context_window_fallbacks,
+                            model_group=model_group,
+                        )
                     )
                     if fallback_model_group is None:
                         raise original_exception
@@ -3283,11 +3288,11 @@ async def async_function_with_fallbacks(self, *args, **kwargs):  # noqa: PLR0915
                     e.message += "\n{}".format(error_message)
             elif isinstance(e, litellm.ContentPolicyViolationError):
                 if content_policy_fallbacks is not None:
-                    fallback_model_group: Optional[
-                        List[str]
-                    ] = self._get_fallback_model_group_from_fallbacks(
-                        fallbacks=content_policy_fallbacks,
-                        model_group=model_group,
+                    fallback_model_group: Optional[List[str]] = (
+                        self._get_fallback_model_group_from_fallbacks(
+                            fallbacks=content_policy_fallbacks,
+                            model_group=model_group,
+                        )
                     )
                     if fallback_model_group is None:
                         raise original_exception
```
litellm/router_utils/pre_call_checks/responses_api_deployment_check.py

Lines changed: 46 additions & 0 deletions

```python
"""
For Responses API, we need routing affinity when a user sends a previous_response_id.

eg. If proxy admins are load balancing between N gpt-4.1-turbo deployments, and a user sends a previous_response_id,
we want to route to the same gpt-4.1-turbo deployment.

This is different from the normal behavior of the router, which does not have routing affinity for previous_response_id.


If previous_response_id is provided, route to the deployment that returned the previous response
"""

from typing import List, Optional

from litellm.integrations.custom_logger import CustomLogger, Span
from litellm.responses.utils import ResponsesAPIRequestUtils
from litellm.types.llms.openai import AllMessageValues


class ResponsesApiDeploymentCheck(CustomLogger):
    async def async_filter_deployments(
        self,
        model: str,
        healthy_deployments: List,
        messages: Optional[List[AllMessageValues]],
        request_kwargs: Optional[dict] = None,
        parent_otel_span: Optional[Span] = None,
    ) -> List[dict]:
        request_kwargs = request_kwargs or {}
        previous_response_id = request_kwargs.get("previous_response_id", None)
        if previous_response_id is None:
            return healthy_deployments

        model_id, response_id = (
            ResponsesAPIRequestUtils._decode_responses_api_response_id(
                response_id=previous_response_id,
            )
        )
        if model_id is None:
            return healthy_deployments

        for deployment in healthy_deployments:
            if deployment["model_info"]["id"] == model_id:
                return [deployment]

        return healthy_deployments
```
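For a sense of how the check behaves, here is a rough, self-contained sketch (not the commit's test file, which isn't shown here): two made-up deployment dicts, an id wrapped via the helper above, and a direct call to `async_filter_deployments`:

```python
# Illustration only: deployment dicts are made up; the check and id helper come from this commit.
import asyncio

from litellm.responses.utils import ResponsesAPIRequestUtils
from litellm.router_utils.pre_call_checks.responses_api_deployment_check import (
    ResponsesApiDeploymentCheck,
)

deployments = [
    {"model_name": "azure-gpt4-turbo", "model_info": {"id": "deployment-1"}},
    {"model_name": "azure-gpt4-turbo", "model_info": {"id": "deployment-2"}},
]

# Pretend deployment-2 served the original response
previous_id = ResponsesAPIRequestUtils._build_responses_api_response_id(
    model_id="deployment-2",
    response_id="resp_original",
)

check = ResponsesApiDeploymentCheck()
filtered = asyncio.run(
    check.async_filter_deployments(
        model="azure-gpt4-turbo",
        healthy_deployments=deployments,
        messages=None,
        request_kwargs={"previous_response_id": previous_id},
    )
)
print([d["model_info"]["id"] for d in filtered])  # expected: ['deployment-2']

# Without previous_response_id (or with an id that isn't LiteLLM-managed),
# the check returns all healthy deployments unchanged.
```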
