Replies: 5 comments 4 replies
-
Hello @magaton! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve the issue you're facing.

I found a similar discussion that might be helpful: Dynamic document loader based on file type [1].

To implement a dynamic document loader in LangChain that uses custom parsing methods for binary files (like docx, pptx, pdf) to convert them into markdown, and then utilize the existing markdown-splitting utilities, you can subclass `BaseBlobParser` and pair it with `GenericLoader`.
Here is the code to achieve this:

```python
from typing import Iterator

from langchain_core.document_loaders import BaseBlobParser, Blob
from langchain_core.documents import Document
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_community.document_loaders import (
    Docx2txtLoader,
    UnstructuredWordDocumentLoader,
    AzureAIDocumentIntelligenceLoader,
)


def convert_to_markdown(binary_data: bytes) -> str:
    # Placeholder: implement the actual binary-to-markdown conversion here
    return "Converted markdown content"


def summarize_images_in_markdown(markdown_content: str) -> str:
    # Placeholder: implement the actual image-summarization logic here
    return markdown_content + "\n\n[Image summaries here]"


class CustomBinaryToMarkdownParser(BaseBlobParser):
    """Parser that converts binary files to markdown and summarizes images."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Parse a binary blob into markdown and summarize images."""
        # Convert binary files (docx, pptx, pdf) to markdown
        markdown_content = convert_to_markdown(blob.as_bytes())
        # Summarize images in the markdown content
        summarized_markdown = summarize_images_in_markdown(markdown_content)
        yield Document(
            page_content=summarized_markdown,
            metadata={"source": blob.source},
        )


# Create a custom loader from a FileSystemBlobLoader and the custom parser
blob_loader = FileSystemBlobLoader(path=".", glob="*.pdf", show_progress=True)
loader = GenericLoader(
    blob_loader=blob_loader,
    blob_parser=CustomBinaryToMarkdownParser(),
)

# Load documents and split the generated markdown on headers
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
for doc in loader.lazy_load():
    for split_doc in splitter.split_text(doc.page_content):
        print(split_doc)

# Example of using an existing loader for docx files
docx_loader = Docx2txtLoader("./example_data/fake.docx")
docx_data = docx_loader.load()

# Example of using the Unstructured loader for docx files
unstructured_loader = UnstructuredWordDocumentLoader("./example_data/fake.docx")
unstructured_data = unstructured_loader.load()

# Example of using the Azure AI Document Intelligence loader
azure_loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint="<endpoint>",
    api_key="<key>",
    file_path="<filepath>",
    api_model="prebuilt-layout",
)
azure_documents = azure_loader.load()
```

This code sets up a custom document loader that converts binary blobs to markdown, summarizes the extracted images, and splits the resulting markdown on headers with `MarkdownHeaderTextSplitter`.
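The `convert_to_markdown` placeholder above is where per-format logic lives. One stdlib-only way to structure it is a dispatch table keyed on MIME type (a LangChain `Blob` also carries a `mimetype` attribute you could feed in). The converter bodies below are stubs, and the function name `dispatch_to_markdown` is illustrative; the real implementations would call whatever conversion library you settle on:

```python
from typing import Callable, Dict

# Stub converters; in practice each would call a real conversion library.
def _docx_to_markdown(data: bytes) -> str:
    return "# docx content"

def _pdf_to_markdown(data: bytes) -> str:
    return "# pdf content"

# Dispatch table keyed on MIME type
_CONVERTERS: Dict[str, Callable[[bytes], str]] = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": _docx_to_markdown,
    "application/pdf": _pdf_to_markdown,
}

def dispatch_to_markdown(binary_data: bytes, mimetype: str) -> str:
    """Convert binary data to markdown, dispatching on MIME type."""
    try:
        converter = _CONVERTERS[mimetype]
    except KeyError:
        raise ValueError(f"Unsupported mimetype: {mimetype!r}") from None
    return converter(binary_data)
```

Keeping the table as data (rather than an if/elif chain) makes it easy to register new formats later without touching the parser class.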
-
Thanks dosu! Looks good, but I still need to know which 3rd-party library I should use for the actual conversion to markdown.
-
@dosu, I am not able to use any cloud-based solution, so AzureAIDocumentIntelligenceLoader is out of the picture.
-
Thanks @dosu. What I am missing now is how to combine loaders (like Directory, Confluence, GitHub, SharePoint) with MimeTypeBasedParser.
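For context on what MimeTypeBasedParser does: it is essentially a mapping from MIME type to parser, with an optional fallback, that loaders exposing raw blobs (e.g. `GenericLoader` over a blob loader) can route each blob through. Here is a stdlib-only sketch of that dispatch pattern; the class and method names (`FakeBlob`, `MimeDispatchParser`) are illustrative stand-ins, not the LangChain API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class FakeBlob:
    """Minimal stand-in for a LangChain Blob: raw bytes plus a MIME type."""
    data: bytes
    mimetype: str

class MimeDispatchParser:
    """Route a blob to a handler based on its MIME type (illustrative)."""

    def __init__(
        self,
        handlers: Dict[str, Callable[[FakeBlob], str]],
        fallback: Optional[Callable[[FakeBlob], str]] = None,
    ) -> None:
        self.handlers = handlers
        self.fallback = fallback

    def parse(self, blob: FakeBlob) -> str:
        # Look up a handler for this MIME type, falling back if configured
        handler = self.handlers.get(blob.mimetype, self.fallback)
        if handler is None:
            raise ValueError(f"No handler for mimetype {blob.mimetype!r}")
        return handler(blob)

parser = MimeDispatchParser(
    handlers={"application/pdf": lambda b: "pdf-markdown"},
    fallback=lambda b: "plain-text",
)
```

The per-MIME-type handlers here play the same role as the `handlers` mapping MimeTypeBasedParser takes, so each handler can be one of the format-specific parsers from the answer above.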
-
@dosu, are you sure about your answer?
-
Hello, I am looking for a way to implement a dynamic loader, similar to #10507, where the resource can be anything from local fs, GitHub, Confluence, SharePoint. This is why I would like to preserve the existing LangChain loader implementations, but plug in my own parsing of binary files (docx, pptx, pdf) into markdown.
Also, I would like to summarise extracted images in the generated MD and store them on the FS.
I have checked Unstructured, but I see it as a single pass from file to page_content, without a markdown parsing and splitting step.
I am a bit lost in BlobLoaders and BlobParsers, so I would appreciate an example of how to do that.
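As a rough sketch of the image-summarisation step asked about here: once the markdown is generated, you can walk it for `![alt](path)` image links and append a summary after each one. The `summarize` callback is a placeholder for whatever captioning model or service you would plug in (an assumption, not something specified in the thread):

```python
import re
from typing import Callable

# Matches markdown image links: ![alt text](path)
IMAGE_RE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<path>[^)]+)\)")

def annotate_images(markdown: str, summarize: Callable[[str], str]) -> str:
    """Append a blockquote summary after every markdown image link."""
    def _replace(match: re.Match) -> str:
        path = match.group("path")
        # Keep the original image link, then add the summary below it
        return f"{match.group(0)}\n\n> Image summary: {summarize(path)}"
    return IMAGE_RE.sub(_replace, markdown)

md = "Intro\n\n![diagram](images/arch.png)\n\nOutro"
annotated = annotate_images(md, lambda p: f"summary of {p}")
```

Because the summaries are inserted as plain markdown blockquotes, they survive the later `MarkdownHeaderTextSplitter` pass and end up in the same chunk as the image they describe.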