Image Extraction From PyPDF & PyMuPDF Loader #30509

m0han22 · 2025-03-27T05:38:40Z

Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Example Code

import os

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders.parsers import TesseractBlobParser
from langchain_community.document_loaders.parsers import RapidOCRBlobParser

# Placeholder: the original post does not show how pdf_folder_path is defined.
pdf_folder_path = 'pdfs'

for file in os.listdir(pdf_folder_path):
    if file.endswith('.pdf'):
        pdf_path = os.path.join(pdf_folder_path, file)
        print(f'Processing...{file}')
        # OCR the images embedded in each page and merge the text into the page content
        loader = PyMuPDFLoader(
            pdf_path,
            mode='page',
            images_parser=TesseractBlobParser(),
        )
        print('Extracting the elements...')
        pdf_elements = loader.load()  # raises the TypeError shown below

Description

I am trying to extract images using the LangChain PyPDFLoader and PyMuPDFLoader. As described in the documentation, I tried both TesseractBlobParser and RapidOCRBlobParser as the images_parser. Both fail with the error "TypeError: Cannot handle this data type: (1, 1, 1), |u1".

Is this a bug, or is there an issue in my code?

The full traceback is below.

`
KeyError                                  Traceback (most recent call last)
File C:\anconda3\envs\omr\Lib\site-packages\PIL\Image.py:3311, in fromarray(obj, mode)
   3310 try:
-> 3311     mode, rawmode = _fromarray_typemap[typekey]
   3312 except KeyError as e:

KeyError: ((1, 1, 1), '|u1')

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[5], line 24
     18 loader = PyMuPDFLoader(
     19     pdf_path,
     20     mode='page',
     21     images_parser=TesseractBlobParser(),
     22 )
     23 print('Extracting the elements...')
---> 24 pdf_elements = loader.load()
     25 print(f'Chunking....{file}')
     26 chunked_elements = text_splitter.split_documents(pdf_elements)

File C:\anconda3\envs\omr\Lib\site-packages\langchain_community\document_loaders\pdf.py:859, in PyMuPDFLoader.load(self, **kwargs)
    858 def load(self, **kwargs: Any) -> list[Document]:
--> 859     return list(self._lazy_load(**kwargs))

File C:\anconda3\envs\omr\Lib\site-packages\langchain_community\document_loaders\pdf.py:856, in PyMuPDFLoader._lazy_load(self, **kwargs)
    854 else:
    855     blob = Blob.from_path(self.file_path)  # type: ignore[attr-defined]
--> 856 yield from parser._lazy_parse(blob, text_kwargs=kwargs)

File C:\anconda3\envs\omr\Lib\site-packages\langchain_community\document_loaders\parsers\pdf.py:996, in PyMuPDFParser._lazy_parse(self, blob, text_kwargs)
    994 full_content = []
    995 for page in doc:
--> 996     all_text = self._get_page_content(doc, page, text_kwargs).strip()
    997     if self.mode == "page":
    998         yield Document(
    999             page_content=all_text,
   1000             metadata=_validate_metadata(
   1001                 doc_metadata | {"page": page.number}
   1002             ),
   1003         )

File C:\anconda3\envs\omr\Lib\site-packages\langchain_community\document_loaders\parsers\pdf.py:1031, in PyMuPDFParser._get_page_content(self, doc, page, text_kwargs)
   1019 """Get the text of the page using PyMuPDF and RapidOCR and issue a warning
   1020 if it is empty.
   1021 
   (...)   1028     str: The text content of the page.
   1029 """
   1030 text_from_page = page.get_text(**{**self.text_kwargs, **text_kwargs})
-> 1031 images_from_page = self._extract_images_from_page(doc, page)
   1032 tables_from_page = self._extract_tables_from_page(page)
   1033 extras = []

File C:\anconda3\envs\omr\Lib\site-packages\langchain_community\document_loaders\parsers\pdf.py:1104, in PyMuPDFParser._extract_images_from_page(self, doc, page)
   1100         numpy.save(image_bytes, image)
   1101         blob = Blob.from_data(
   1102             image_bytes.getvalue(), mime_type="application/x-npy"
   1103         )
-> 1104         image_text = next(self.images_parser.lazy_parse(blob)).page_content
   1106         images.append(
   1107             _format_inner_image(blob, image_text, self.images_inner_format)
   1108         )
   1109 return _FORMAT_IMAGE_STR.format(
   1110     image_text=_JOIN_IMAGES.join(filter(None, images))
   1111 )

File C:\anconda3\envs\omr\Lib\site-packages\langchain_community\document_loaders\parsers\images.py:56, in BaseImageBlobParser.lazy_parse(self, blob)
     54 with blob.as_bytes_io() as buf:
     55     if blob.mimetype == "application/x-npy":
---> 56         img = Img.fromarray(numpy.load(buf))
     57     else:
     58         img = Img.open(buf)

File C:\anconda3\envs\omr\Lib\site-packages\PIL\Image.py:3315, in fromarray(obj, mode)
   3313         typekey_shape, typestr = typekey
   3314         msg = f"Cannot handle this data type: {typekey_shape}, {typestr}"
-> 3315         raise TypeError(msg) from e
   3316 else:
   3317     rawmode = mode

TypeError: Cannot handle this data type: (1, 1, 1), |u1
`
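The final TypeError can probably be reproduced outside LangChain. The sketch below assumes the failing image is a single-channel (grayscale) uint8 array with a trailing channel axis of length 1, which is what the key `(1, 1, 1), |u1` suggests; this has not been confirmed against the actual PDF. `to_pil` is a hypothetical helper, not part of LangChain or Pillow, that drops the trailing axis before calling `Image.fromarray`.

```python
import numpy as np
from PIL import Image


def to_pil(array: np.ndarray) -> Image.Image:
    """Convert an HxW or HxWxC uint8 array to a PIL image.

    Hypothetical helper: drops a trailing channel axis of length 1,
    so an (H, W, 1) grayscale array is passed on as (H, W).
    """
    if array.ndim == 3 and array.shape[2] == 1:
        array = array[:, :, 0]
    return Image.fromarray(array)


# Assumed shape of the problematic image: height 4, width 6, one channel.
gray = np.zeros((4, 6, 1), dtype=np.uint8)

try:
    Image.fromarray(gray)
except TypeError as exc:
    # Expected to hit the same "Cannot handle this data type" path as the traceback.
    print(f"direct conversion failed: {exc}")

img = to_pil(gray)
print(img.mode, img.size)
```

If this matches the failure, the issue would lie in how the (H, W, 1) array is handed to `Image.fromarray` rather than in the loader arguments used in the example code.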

System Info

System Information

OS: Windows
OS Version: 10.0.19045
Python Version: 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]

Package Information

langchain_core: 0.3.46
langchain: 0.3.21
langchain_community: 0.3.20
langsmith: 0.3.18
langchain_openai: 0.3.9
langchain_postgres: 0.0.13
langchain_text_splitters: 0.3.7

Optional packages not installed

langserve

Other Dependencies

aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
httpx: 0.28.1
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-azure-ai;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.45: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.7: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.21: Installed. No version info available.
langsmith-pyo3: Installed. No version info available.
langsmith<0.4,>=0.1.125: Installed. No version info available.
langsmith<0.4,>=0.1.17: Installed. No version info available.
numpy: 1.26.4
numpy<3,>=1.26.2: Installed. No version info available.
openai-agents: Installed. No version info available.
openai<2.0.0,>=1.66.3: Installed. No version info available.
opentelemetry-api: Installed. No version info available.
opentelemetry-exporter-otlp-proto-http: Installed. No version info available.
opentelemetry-sdk: Installed. No version info available.
orjson: 3.10.15
packaging: 24.2
packaging<25,>=23.2: Installed. No version info available.
pgvector: 0.3.6
psycopg: 3.2.6
psycopg-pool: 3.2.6
pydantic: 2.10.6
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.5.2;: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic<3.0.0,>=2.7.4;: Installed. No version info available.
pytest: Installed. No version info available.
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests-toolbelt: 1.0.0
requests<3,>=2: Installed. No version info available.
rich: Installed. No version info available.
sqlalchemy: 2.0.39
SQLAlchemy<3,>=1.4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
tiktoken<1,>=0.7: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
zstandard: 0.23.0

Replies: 0 comments
