The easier way to use Azure AI Inference SDK ✨
Enhanced wrapper that makes Azure AI Inference SDK simple and reliable with automatic retry, JSON validation, and reasoning separation.
✅ Reasoning separation - automatically splits thinking from output (.content and .reasoning)
✅ Automatic retries - never lose requests to transient failures
✅ JSON that works - guaranteed valid JSON or automatic retry
✅ One import - no need for multiple Azure SDK imports
✅ 100% compatible - drop-in replacement for Azure AI Inference SDK
Automatic retries for the errors you actually encounter in production:
🔄 Service overloaded (timeouts) → Auto-retry with backoff
🔄 Rate limits (429) → Smart retry timing
🔄 Azure service hiccups (5xx) → Exponential backoff
🔄 Invalid JSON responses → Re-request clean JSON
🔄 Network timeouts → Multiple quick attempts
Just works. No manual error handling needed.
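For intuition, here is a minimal sketch of what retry with exponential backoff looks like in plain Python. It is not the library's internal code: the `TransientError` class, the `call_with_backoff` helper, and the doubling schedule are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (timeout, 429, 5xx)."""


def call_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying transient failures with doubling delays."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            # Assumed schedule: base, 2 * base, 4 * base, ...
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the schedule can be checked in tests without actually waiting.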
pip install azure-ai-inference-plus
Supports Python 3.11+
from azure_ai_inference_plus import ChatCompletionsClient, SystemMessage, UserMessage
# Uses environment variables: AZURE_AI_ENDPOINT, AZURE_AI_API_KEY
client = ChatCompletionsClient()
response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="What's the capital of France?"),
    ],
    max_tokens=100,
    model="Codestral-2501"
)
print(response.choices[0].message.content)
# "The capital of France is Paris..."
Or with manual credentials (everything from one import!):
from azure_ai_inference_plus import ChatCompletionsClient, SystemMessage, UserMessage, AzureKeyCredential
client = ChatCompletionsClient(
    endpoint="https://your-resource.services.ai.azure.com/models",
    credential=AzureKeyCredential("your-api-key")
)
Game changer for reasoning models like DeepSeek-R1 - automatically separates thinking from output:
response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="What's 2+2? Think step by step."),
    ],
    model="DeepSeek-R1",
    reasoning_tags=["<think>", "</think>"]  # ✨ Auto-separation
)
# Clean output without reasoning clutter
print(response.choices[0].message.content)
# "2 + 2 equals 4."
# Access the reasoning separately
print(response.choices[0].message.reasoning)
# "Let me think about this step by step. 2 + 2 is a basic addition..."
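To make the idea concrete, the separation can be pictured as a small string operation on the raw model output. This is only a sketch under assumptions: `split_reasoning` is a hypothetical helper, not part of the library, and it handles a single tagged block.

```python
from typing import Optional, Tuple


def split_reasoning(
    text: str,
    open_tag: str = "<think>",
    close_tag: str = "</think>",
) -> Tuple[str, Optional[str]]:
    """Return (content, reasoning); reasoning is None when no tagged block exists."""
    start = text.find(open_tag)
    end = text.find(close_tag, start + len(open_tag)) if start != -1 else -1
    if start == -1 or end == -1:
        return text.strip(), None
    # Text between the tags is the reasoning; everything else is the answer.
    reasoning = text[start + len(open_tag):end].strip()
    content = (text[:start] + text[end + len(close_tag):]).strip()
    return content, reasoning


raw = "<think>2 + 2 is basic addition.</think>\n\n2 + 2 equals 4."
content, reasoning = split_reasoning(raw)
print(content)    # 2 + 2 equals 4.
print(reasoning)  # 2 + 2 is basic addition.
```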
No more JSON parsing errors - automatic validation and retry.
Simple JSON (standard models like GPT-4o):
response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant that returns JSON."),
        UserMessage(content="Give me Tokyo info as JSON with keys: name, country, population"),
    ],
    max_tokens=500,
    model="gpt-4o",
    response_format="json_object"  # ✨ Auto-validation + retry
)
# The content is validated (and retried if invalid) before it is returned, so no try/except is needed here
import json
data = json.loads(response.choices[0].message.content) # ✅ Works perfectly
JSON with reasoning models (like DeepSeek-R1):
response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant that returns JSON."),
        UserMessage(content="Give me Paris info as JSON with keys: name, country, population"),
    ],
    max_tokens=2000,  # More tokens needed for reasoning + JSON
    model="DeepSeek-R1",
    response_format="json_object",  # ✨ Clean JSON guaranteed
    reasoning_tags=["<think>", "</think>"]  # Required for reasoning separation
)
# Pure JSON - reasoning automatically stripped
data = json.loads(response.choices[0].message.content) # {"name": "Paris", ...}
# But reasoning is still accessible
thinking = response.choices[0].message.reasoning # "Let me think about Paris..."
Note: JSON responses are automatically cleaned of markdown wrappers (like ```json blocks) for reliable parsing.
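As a rough picture of that cleanup step, the sketch below strips a surrounding markdown code fence and then validates the result with the standard `json` module. The helper names `clean_json_text` and `load_clean_json` are hypothetical, and the library's own cleaning rules may differ.

```python
import json
from typing import Any

FENCE = "`" * 3  # three backticks, built this way so the snippet can sit inside a fenced block


def clean_json_text(text: str) -> str:
    """Remove a surrounding markdown code fence, if there is one."""
    cleaned = text.strip()
    if (
        len(cleaned) >= 2 * len(FENCE)
        and cleaned.startswith(FENCE)
        and cleaned.endswith(FENCE)
    ):
        inner = cleaned[len(FENCE):-len(FENCE)]
        # Drop an optional language label such as "json" on the opening line.
        first_line, newline, rest = inner.partition("\n")
        if newline and first_line.strip().isalnum():
            inner = rest
        cleaned = inner.strip()
    return cleaned


def load_clean_json(text: str) -> Any:
    """Parse cleaned text; raises ValueError (json.JSONDecodeError) if it is not valid JSON."""
    return json.loads(clean_json_text(text))
```

In a retry loop, a `ValueError` from `load_clean_json` would be the signal to ask the model again.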
Built-in retry with exponential backoff - no configuration needed:
# Automatically retries on failures (including timeouts) - just works!
response = client.complete(
    messages=[UserMessage(content="Tell me a joke")],
    model="Phi-4"
)
from azure_ai_inference_plus import RetryConfig
# Override default behavior (with smart timeout strategy)
client = ChatCompletionsClient(
    connection_timeout=100.0,  # Prefer a 100 s timeout with retries over a single 300 s timeout
    retry_config=RetryConfig(max_retries=5, delay_seconds=2.0)
)
Get notified when retries happen - perfect for logging and monitoring:
from azure_ai_inference_plus import RetryConfig
def on_chat_retry(attempt, max_retries, exception, delay):
    print(f"🔄 Chat retry {attempt}/{max_retries}: {type(exception).__name__} - waiting {delay:.1f}s")

def on_json_retry(attempt, max_retries, message):
    print(f"📝 JSON retry {attempt}/{max_retries}: {message}")

# Add callbacks to your retry config
client = ChatCompletionsClient(
    retry_config=RetryConfig(
        max_retries=3,
        on_chat_retry=on_chat_retry,  # Called for general failures
        on_json_retry=on_json_retry  # Called for JSON validation failures
    )
)
# Now you'll see retry notifications:
# 🔄 Chat retry 1/3: HttpResponseError - waiting 1.0s
# 📝 JSON retry 2/3: Retry 2 after JSON validation failed
Why callbacks? The library doesn't print anything by default (clean for production), but callbacks let you add your own logging, metrics, or notifications exactly how you want them.
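For example, the same two callback signatures can feed Python's standard `logging` module and a simple counter instead of `print`. This is a sketch: the logger name `inference_retries` and the `retry_counts` counter are illustrative choices, not something the library provides.

```python
import logging
from collections import Counter

logger = logging.getLogger("inference_retries")  # hypothetical logger name
retry_counts = Counter()  # simple in-process metric, keyed by retry kind


def on_chat_retry(attempt, max_retries, exception, delay):
    """Log general failures (timeouts, 429, 5xx) and count them."""
    retry_counts["chat"] += 1
    logger.warning(
        "chat retry %d/%d after %s; waiting %.1fs",
        attempt, max_retries, type(exception).__name__, delay,
    )


def on_json_retry(attempt, max_retries, message):
    """Log JSON validation failures and count them."""
    retry_counts["json"] += 1
    logger.warning("json retry %d/%d: %s", attempt, max_retries, message)
```

These functions are passed to `RetryConfig` through `on_chat_retry` and `on_json_retry`, exactly like the `print`-based versions above.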
from azure_ai_inference_plus import EmbeddingsClient
client = EmbeddingsClient()
response = client.embed(
    input=["Hello world", "Python is great"],
    model="text-embedding-3-large"
)
Create a .env file:
AZURE_AI_ENDPOINT=https://your-resource.services.ai.azure.com/models
AZURE_AI_API_KEY=your-api-key-here
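A quick pre-flight check can catch a missing variable before the first request. The sketch below only inspects the process environment for the two variable names above; `missing_settings` is a hypothetical helper, and it assumes the values from .env have already been loaded into the environment by whatever mechanism you use.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = ("AZURE_AI_ENDPOINT", "AZURE_AI_API_KEY")


def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
```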
2 simple steps:
- pip install azure-ai-inference-plus
- Change your import:
# Before
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential
# After
from azure_ai_inference_plus import ChatCompletionsClient, SystemMessage, UserMessage, AzureKeyCredential
That's it! Your existing code works unchanged with automatic retries and JSON validation.
from azure_ai_inference_plus import ChatCompletionsClient, AzureKeyCredential
client = ChatCompletionsClient(
    endpoint="https://your-resource.services.ai.azure.com/models",
    credential=AzureKeyCredential("your-api-key")
)
Check out the examples/ directory for complete demonstrations:
- basic_usage.py - Reasoning separation, JSON validation, retry features, and timeout strategy
- embeddings_example.py - Embeddings with retry and credential setup
- callbacks_example.py - Retry callbacks for logging and monitoring
All examples show real-world usage patterns and advanced features.
Contributions are welcome! Whether it's bug fixes, feature additions, or documentation improvements, we appreciate your help in making this project better. For major changes or new features, please open an issue first to discuss what you would like to change.
- langchain-azure-ai-inference-plus - The easier way to use Azure AI Inference SDK with LangChain ✨