How to integrate multi-modal VLMs that use AutoProcessor and AutoModelForImageTextToText into LangChain as an LLM or chat model #31799
Hi @Darrshan-Sankar! I'm Dosu and I'm helping the LangChain team. LangChain's HuggingFacePipeline does not currently support multi-modal models like those using AutoProcessor and AutoModelForImageTextToText out of the box; it expects text-only input and output. To integrate such a vision-language model, you'll need to create a custom wrapper that subclasses either the LLM or chat model base class and overrides the input/output handling to support both image and text inputs. This involves preprocessing inputs with AutoProcessor, invoking the model, and formatting outputs for LangChain's expected schema. The main extensibility points are the `_call` method (for an LLM subclass) and `_generate` (for a chat model subclass).

Here's a minimal outline for a custom LLM wrapper:

```python
from typing import Any, List, Optional

from langchain_core.language_models.llms import LLM


class MultiModalLLM(LLM):
    """Minimal LLM wrapper around a Hugging Face image-text-to-text model."""

    processor: Any  # an AutoProcessor instance
    model: Any      # an AutoModelForImageTextToText instance

    @property
    def _llm_type(self) -> str:
        return "multi-modal-llm"

    def _call(self, prompt: str, stop: Optional[List[str]] = None, image=None, **kwargs) -> str:
        # Preprocess the text prompt and (optional) image
        inputs = self.processor(text=prompt, images=image, return_tensors="pt")
        # Run the model
        outputs = self.model.generate(**inputs)
        # Decode the generated tokens back into text
        return self.processor.decode(outputs[0], skip_special_tokens=True)
```

You'd then use this class in your LangChain chains or agents, passing both the text prompt and the image as needed (see the usage sketch below). If you want chat-style interaction, subclass the chat model base class instead and override `_generate`.

Note: LangChain's message system now supports content blocks (like images) for OpenAI models, but this is not yet generalized to HuggingFace-based multi-modal models, so this custom approach is currently required. Let me know if you want a more detailed example or guidance on integrating this into a specific LangChain workflow. To reply, just mention @dosu.
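For illustration, a usage sketch of the wrapper above might look like the following. The checkpoint name is a placeholder, and it assumes your processor and model accept the `text=` / `images=` call pattern used in `_call`:

```python
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# Placeholder checkpoint; substitute any image-text-to-text model you actually use.
checkpoint = "your-org/your-image-text-to-text-model"

processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForImageTextToText.from_pretrained(checkpoint)

# MultiModalLLM is the wrapper class sketched above.
llm = MultiModalLLM(processor=processor, model=model)

# Extra keyword arguments to invoke() are forwarded to _call,
# so the image can be passed alongside the text prompt.
image = Image.open("example.jpg")
print(llm.invoke("Describe this image.", image=image))
```

If you'd rather expose the model as a chat model, a similarly minimal sketch (assuming plain-text message content and an image passed as a keyword argument) could subclass BaseChatModel and implement `_generate`:

```python
from typing import Any, List, Optional

from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.messages import AIMessage, BaseMessage
from langchain_core.outputs import ChatGeneration, ChatResult


class MultiModalChatModel(BaseChatModel):
    """Sketch of a chat wrapper; handles only the simplest message shape."""

    processor: Any
    model: Any

    @property
    def _llm_type(self) -> str:
        return "multi-modal-chat"

    def _generate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager=None,
        image=None,
        **kwargs,
    ) -> ChatResult:
        # Simplification: use the last message's text as the prompt.
        last = messages[-1]
        prompt = last.content if isinstance(last.content, str) else ""
        inputs = self.processor(text=prompt, images=image, return_tensors="pt")
        outputs = self.model.generate(**inputs)
        reply = self.processor.decode(outputs[0], skip_special_tokens=True)
        return ChatResult(generations=[ChatGeneration(message=AIMessage(content=reply))])
```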
Description
I am trying to integrate multi-modal VLMs that use AutoProcessor and AutoModelForImageTextToText into LangChain as an LLM or chat model. When I looked at HuggingFacePipeline, it seems the wrapper doesn't currently support image-text-to-text models. How can I achieve this with LangChain?