-
Hey @VStev, I'm here to assist you with any bugs, questions, or contribution inquiries. The issue you're encountering is due to the LangChain-agent workflow not being configured to handle image inputs, while the native Azure OpenAI library can process images through its specific image analysis capabilities. To resolve this, you need to ensure that the LangChain-agent is properly configured to use the Azure Cognitive Services Image Analysis API. Here are the steps to set up and use the AzureAiServicesToolkit.
Here is the relevant code snippet for setting up and using the toolkit:
import os
from langchain_community.agent_toolkits import AzureAiServicesToolkit
from langchain import hub
from langchain.agents import AgentExecutor, create_structured_chat_agent
from langchain_openai import OpenAI
# Set environment variables
os.environ["AZURE_AI_SERVICES_KEY"] = "your_azure_cogs_key"
os.environ["AZURE_AI_SERVICES_ENDPOINT"] = "your_azure_cogs_endpoint"
os.environ["AZURE_AI_SERVICES_REGION"] = "your_azure_cogs_region"
# Create the toolkit
toolkit = AzureAiServicesToolkit()
# Initialize the agent
llm = OpenAI(temperature=0)
tools = toolkit.get_tools()
prompt = hub.pull("hwchase17/structured-chat-agent")
agent = create_structured_chat_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent, tools=tools, verbose=True, handle_parsing_errors=True
)
# Analyze an image (local or remote)
result = agent_executor.invoke(
    {
        "input": "Analyze this image: https://path_or_url_to_image"
    }
)
print(result)
Ensure that the AZURE_AI_SERVICES_* environment variables are set to your own key, endpoint, and region. Additionally, the AzureChatOpenAI model can process image inputs directly when the image is supplied as base64-encoded data in a HumanMessage:
import os
import base64
import httpx
from langchain_core.messages import HumanMessage
from langchain_openai.chat_models.azure import AzureChatOpenAI
# Set environment variables
os.environ["AZURE_OPENAI_API_KEY"] = "your-api-key"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-endpoint.openai.azure.com/"
# Initialize the AzureChatOpenAI model (an API version is also required,
# e.g. via the OPENAI_API_VERSION environment variable)
llm = AzureChatOpenAI(azure_deployment="your-deployment-name")
# Encode the image data
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")
# Create the message with image content
message = HumanMessage(
    content=[
        {"type": "text", "text": "describe the weather in this image"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        },
    ]
)
# Invoke the model
ai_msg = llm.invoke([message])
print(ai_msg.content)
This example demonstrates how to use the AzureChatOpenAI model to process image inputs directly.
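If the image needs to flow through an agent rather than a bare model call, one option worth trying is a tool-calling agent built on a multimodal chat prompt, so the base64 image reaches the model as an image block instead of being flattened into text. This is only a sketch, assuming a recent langchain version that provides create_tool_calling_agent; it reuses the AzureChatOpenAI llm, the toolkit tools, and the image_data variable from the snippets above, and the prompt wording is a placeholder.
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
# Chat prompt whose human turn carries both a text block and an image_url block;
# {question} and {image_data} are filled in at invoke time.
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant."),
        (
            "human",
            [
                {"type": "text", "text": "{question}"},
                {
                    "type": "image_url",
                    "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
                },
            ],
        ),
        MessagesPlaceholder("agent_scratchpad"),
    ]
)
# Reuse the AzureChatOpenAI `llm` and the toolkit `tools` defined above;
# a tool-calling agent needs a chat model and at least one tool.
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
result = agent_executor.invoke(
    {"question": "describe the weather in this image", "image_data": image_data}
)
print(result["output"])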
-
@dosu I already tried a base64 image; in the best case it still says it cannot view or describe the image.
-
Checked other resources
Commit to Help
Example Code
Description
I'm trying to do multimodal input using GPT-4o. I tested it following the native Azure documentation, and it works.
But when I integrated it into my langchain-agent workflow, the model says it cannot process images. Why is this?
All it says is
I'm sorry for any confusion, but I don't have the capability to view or describe images. However, if you describe the details of the image to me, I'll do my best to help you understand or interpret it!
Meanwhile, using the Azure documentation here produces the desired output (the image is described).
The prompt template I'm using is basically hwchase17's react-chat (sketched below).
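For reference, a minimal sketch of that setup, assuming the hwchase17/react-chat hub prompt; the deployment name and API version are placeholders. Note that react-chat is a plain-text prompt template, so whatever goes into the input variable is formatted into a string:
import os
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import AzureChatOpenAI
os.environ["AZURE_OPENAI_API_KEY"] = "your-api-key"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-endpoint.openai.azure.com/"
os.environ["OPENAI_API_VERSION"] = "2024-02-01"  # placeholder API version
llm = AzureChatOpenAI(azure_deployment="gpt-4o")  # placeholder deployment name
# hwchase17/react-chat is a text PromptTemplate with {input}, {chat_history},
# {tools}, {tool_names} and {agent_scratchpad} slots.
prompt = hub.pull("hwchase17/react-chat")
tools = []  # the tools used in the workflow
agent = create_react_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
# The input is interpolated into the text prompt, so a base64 data URL arrives
# at the model as plain text rather than as an image block, which is why the
# model replies that it cannot view images.
result = agent_executor.invoke(
    {"input": "Describe this image: data:image/jpeg;base64,<...>", "chat_history": ""}
)
print(result["output"])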
System Info
System Information
Package Information
Optional packages not installed
Other Dependencies