How to integrate multi-modal VLMs that use AutoProcessor and AutoModelForImageTextToText into LangChain as an LLM or chat model #31799
Hi @Darrshan-Sankar! I'm Dosu and I'm helping the LangChain team. LangChain's HuggingFacePipeline does not currently support multi-modal models like those using AutoProcessor and AutoModelForImageTextToText out of the box; it expects text-only input and output. To integrate such a vision-language model, you'll need to create a custom wrapper that subclasses either the LLM or chat model base class and overrides the input/output handling to support both image and text inputs. This involves preprocessing inputs with AutoProcessor, invoking the model, and formatting outputs for LangChain's expected schema. The main extensibility points are the `_call` method (for an LLM subclass) and `_generate` (for a chat model subclass).

Here's a minimal outline for a custom LLM wrapper:

```python
from typing import Any, List, Optional

from langchain_core.language_models.llms import LLM


class MultiModalLLM(LLM):
    """Minimal LLM wrapper around a Hugging Face image-text-to-text model."""

    processor: Any  # an AutoProcessor instance
    model: Any      # an AutoModelForImageTextToText instance

    @property
    def _llm_type(self) -> str:
        return "multi-modal-llm"

    def _call(self, prompt: str, stop: Optional[List[str]] = None, image=None, **kwargs) -> str:
        # Preprocess the text prompt and (optional) image
        inputs = self.processor(text=prompt, images=image, return_tensors="pt")
        # Run the model
        outputs = self.model.generate(**inputs)
        # Decode the generated tokens back into text
        return self.processor.decode(outputs[0], skip_special_tokens=True)
```

You'd then use this class in your LangChain chains or agents, passing both the text prompt and the image as needed (see the usage sketch below). If you want chat-style interaction, subclass the chat model base class instead and override `_generate`.

Note: LangChain's message system now supports content blocks (like images) for OpenAI models, but this is not yet generalized to HuggingFace-based multi-modal models, so this custom approach is currently required. Let me know if you want a more detailed example or guidance on integrating this into a specific LangChain workflow. To reply, just mention @dosu.
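For illustration, a usage sketch of the wrapper above might look like the following. The checkpoint name is a placeholder, and it assumes your processor and model accept the `text=` / `images=` call pattern used in `_call`:

```python
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# Placeholder checkpoint; substitute any image-text-to-text model you actually use.
checkpoint = "your-org/your-image-text-to-text-model"

processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForImageTextToText.from_pretrained(checkpoint)

# MultiModalLLM is the wrapper class sketched above.
llm = MultiModalLLM(processor=processor, model=model)

# Extra keyword arguments to invoke() are forwarded to _call,
# so the image can be passed alongside the text prompt.
image = Image.open("example.jpg")
print(llm.invoke("Describe this image.", image=image))
```

If you'd rather expose the model as a chat model, a similarly minimal sketch (assuming plain-text message content and an image passed as a keyword argument) could subclass BaseChatModel and implement `_generate`:

```python
from typing import Any, List, Optional

from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.messages import AIMessage, BaseMessage
from langchain_core.outputs import ChatGeneration, ChatResult


class MultiModalChatModel(BaseChatModel):
    """Sketch of a chat wrapper; handles only the simplest message shape."""

    processor: Any
    model: Any

    @property
    def _llm_type(self) -> str:
        return "multi-modal-chat"

    def _generate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager=None,
        image=None,
        **kwargs,
    ) -> ChatResult:
        # Simplification: use the last message's text as the prompt.
        last = messages[-1]
        prompt = last.content if isinstance(last.content, str) else ""
        inputs = self.processor(text=prompt, images=image, return_tensors="pt")
        outputs = self.model.generate(**inputs)
        reply = self.processor.decode(outputs[0], skip_special_tokens=True)
        return ChatResult(generations=[ChatGeneration(message=AIMessage(content=reply))])
```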
Description
I am trying to integrate multi-modal VLMs that use AutoProcessor and AutoModelForImageTextToText into LangChain as an LLM or chat model. When I looked at HuggingFacePipeline, it seems the wrapper doesn't currently support image-text-to-text models. How can I achieve this with LangChain?