Replies: 5 comments 4 replies
-
Hello @magaton! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve the issue you're facing.

I found a similar discussion that might be helpful: Dynamic document loader based on file type [1].

To implement a dynamic document loader in LangChain that uses custom parsing methods for binary files (like docx, pptx, pdf) to convert them into markdown, and then utilize the existing markdown-splitting utilities, you can subclass `BaseBlobParser` and pair it with `GenericLoader`.
Here is the code to achieve this:

```python
from typing import Iterator

from langchain_core.document_loaders import BaseBlobParser, Blob
from langchain_core.documents import Document
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_community.document_loaders import (
    Docx2txtLoader,
    UnstructuredWordDocumentLoader,
    AzureAIDocumentIntelligenceLoader,
)


def convert_to_markdown(binary_data: bytes) -> str:
    # Placeholder: implement the actual binary-to-markdown conversion here
    return "Converted markdown content"


def summarize_images_in_markdown(markdown_content: str) -> str:
    # Placeholder: implement the actual image-summarization logic here
    return markdown_content + "\n\n[Image summaries here]"


class CustomBinaryToMarkdownParser(BaseBlobParser):
    """Parser that converts binary files to markdown and summarizes images."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Parse a binary blob into markdown and summarize images."""
        # Convert binary files (docx, pptx, pdf) to markdown
        markdown_content = convert_to_markdown(blob.as_bytes())
        # Summarize images in the markdown content
        summarized_markdown = summarize_images_in_markdown(markdown_content)
        yield Document(
            page_content=summarized_markdown,
            metadata={"source": blob.source},
        )


# Create a custom loader from a FileSystemBlobLoader and the custom parser
blob_loader = FileSystemBlobLoader(path=".", glob="*.pdf", show_progress=True)
loader = GenericLoader(
    blob_loader=blob_loader,
    blob_parser=CustomBinaryToMarkdownParser(),
)

# Load documents and split the generated markdown on headers
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
for doc in loader.lazy_load():
    for split_doc in splitter.split_text(doc.page_content):
        print(split_doc)

# Example of using an existing loader for docx files
docx_loader = Docx2txtLoader("./example_data/fake.docx")
docx_data = docx_loader.load()

# Example of using the Unstructured loader for docx files
unstructured_loader = UnstructuredWordDocumentLoader("./example_data/fake.docx")
unstructured_data = unstructured_loader.load()

# Example of using the Azure AI Document Intelligence loader
azure_loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint="<endpoint>",
    api_key="<key>",
    file_path="<filepath>",
    api_model="prebuilt-layout",
)
azure_documents = azure_loader.load()
```

This code sets up a custom document loader that converts binary blobs to markdown, summarizes the extracted images, and splits the resulting markdown on headers with `MarkdownHeaderTextSplitter`.
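The `convert_to_markdown` placeholder above is where per-format logic lives. One stdlib-only way to structure it is a dispatch table keyed on MIME type (a LangChain `Blob` also carries a `mimetype` attribute you could feed in). The converter bodies below are stubs, and the function name `dispatch_to_markdown` is illustrative; the real implementations would call whatever conversion library you settle on:

```python
from typing import Callable, Dict

# Stub converters; in practice each would call a real conversion library.
def _docx_to_markdown(data: bytes) -> str:
    return "# docx content"

def _pdf_to_markdown(data: bytes) -> str:
    return "# pdf content"

# Dispatch table keyed on MIME type
_CONVERTERS: Dict[str, Callable[[bytes], str]] = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": _docx_to_markdown,
    "application/pdf": _pdf_to_markdown,
}

def dispatch_to_markdown(binary_data: bytes, mimetype: str) -> str:
    """Convert binary data to markdown, dispatching on MIME type."""
    try:
        converter = _CONVERTERS[mimetype]
    except KeyError:
        raise ValueError(f"Unsupported mimetype: {mimetype!r}") from None
    return converter(binary_data)
```

Keeping the table as data (rather than an if/elif chain) makes it easy to register new formats later without touching the parser class.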
-
Thanks dosu! Looks good, but I still need to know which 3rd-party library I should use for the actual conversion to markdown.
-
@dosu, I am not able to use any cloud-based solution, so AzureAIDocumentIntelligenceLoader is out of the picture.
-
Thanks @dosu. What I am missing now is how to combine loaders (like Directory, Confluence, GitHub, SharePoint) with MimeTypeBasedParser.
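For context on what MimeTypeBasedParser does: it is essentially a mapping from MIME type to parser, with an optional fallback, that loaders exposing raw blobs (e.g. `GenericLoader` over a blob loader) can route each blob through. Here is a stdlib-only sketch of that dispatch pattern; the class and method names (`FakeBlob`, `MimeDispatchParser`) are illustrative stand-ins, not the LangChain API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class FakeBlob:
    """Minimal stand-in for a LangChain Blob: raw bytes plus a MIME type."""
    data: bytes
    mimetype: str

class MimeDispatchParser:
    """Route a blob to a handler based on its MIME type (illustrative)."""

    def __init__(
        self,
        handlers: Dict[str, Callable[[FakeBlob], str]],
        fallback: Optional[Callable[[FakeBlob], str]] = None,
    ) -> None:
        self.handlers = handlers
        self.fallback = fallback

    def parse(self, blob: FakeBlob) -> str:
        # Look up a handler for this MIME type, falling back if configured
        handler = self.handlers.get(blob.mimetype, self.fallback)
        if handler is None:
            raise ValueError(f"No handler for mimetype {blob.mimetype!r}")
        return handler(blob)

parser = MimeDispatchParser(
    handlers={"application/pdf": lambda b: "pdf-markdown"},
    fallback=lambda b: "plain-text",
)
```

The per-MIME-type handlers here play the same role as the `handlers` mapping MimeTypeBasedParser takes, so each handler can be one of the format-specific parsers from the answer above.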
-
@dosu, are you sure about your answer?
-
Hello, I am looking for a way to implement a dynamic loader, similar to #10507, where the resource can be anything from local fs, GitHub, Confluence, SharePoint. This is why I would like to preserve the existing LangChain loader implementations, but plug in my own parsing of binary files (docx, pptx, pdf) into markdown.
Also, I would like to summarise extracted images in the generated MD and store them on the FS.
I have checked Unstructured, but I see it as a single pass from file to page_content, without a markdown parsing and splitting step.
I am a bit lost in BlobLoaders and BlobParsers, so I would appreciate an example of how to do that.
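As a rough sketch of the image-summarisation step asked about here: once the markdown is generated, you can walk it for `![alt](path)` image links and append a summary after each one. The `summarize` callback is a placeholder for whatever captioning model or service you would plug in (an assumption, not something specified in the thread):

```python
import re
from typing import Callable

# Matches markdown image links: ![alt text](path)
IMAGE_RE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<path>[^)]+)\)")

def annotate_images(markdown: str, summarize: Callable[[str], str]) -> str:
    """Append a blockquote summary after every markdown image link."""
    def _replace(match: re.Match) -> str:
        path = match.group("path")
        # Keep the original image link, then add the summary below it
        return f"{match.group(0)}\n\n> Image summary: {summarize(path)}"
    return IMAGE_RE.sub(_replace, markdown)

md = "Intro\n\n![diagram](images/arch.png)\n\nOutro"
annotated = annotate_images(md, lambda p: f"summary of {p}")
```

Because the summaries are inserted as plain markdown blockquotes, they survive the later `MarkdownHeaderTextSplitter` pass and end up in the same chunk as the image they describe.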