I am trying to extract images from PDFs using LangChain's PyPDFLoader and PyMuPDFLoader. As described in the documentation, I tried both TesseractBlobParser and RapidOCR as the image parser. Both fail with the same error: "TypeError: Cannot handle this data type: (1, 1, 1), |u1".
Is this a bug, or is there an issue in my code?
The full traceback is below.
```
KeyError Traceback (most recent call last)
File C:\anconda3\envs\omr\Lib\site-packages\PIL\Image.py:3311, in fromarray(obj, mode)
3310 try:
-> 3311 mode, rawmode = _fromarray_typemap[typekey]
3312 except KeyError as e:
KeyError: ((1, 1, 1), '|u1')

The above exception was the direct cause of the following exception:

TypeError Traceback (most recent call last)
Cell In[5], line 24
18 loader = PyMuPDFLoader(
19 pdf_path,
20 mode='page',
21 images_parser=TesseractBlobParser(),
22 )
23 print('Extracting the elements...')
---> 24 pdf_elements = loader.load()
25 print(f'Chunking....{file}')
26 chunked_elements = text_splitter.split_documents(pdf_elements)

File C:\anconda3\envs\omr\Lib\site-packages\langchain_community\document_loaders\pdf.py:859, in PyMuPDFLoader.load(self, **kwargs)
858 def load(self, **kwargs: Any) -> list[Document]:
--> 859 return list(self._lazy_load(**kwargs))

File C:\anconda3\envs\omr\Lib\site-packages\langchain_community\document_loaders\pdf.py:856, in PyMuPDFLoader._lazy_load(self, **kwargs)
854 else:
855 blob = Blob.from_path(self.file_path) # type: ignore[attr-defined]
--> 856 yield from parser._lazy_parse(blob, text_kwargs=kwargs)

File C:\anconda3\envs\omr\Lib\site-packages\langchain_community\document_loaders\parsers\pdf.py:996, in PyMuPDFParser._lazy_parse(self, blob, text_kwargs)
994 full_content = []
995 for page in doc:
--> 996 all_text = self._get_page_content(doc, page, text_kwargs).strip()
997 if self.mode == "page":
998 yield Document(
999 page_content=all_text,
1000 metadata=_validate_metadata(
1001 doc_metadata | {"page": page.number}
1002 ),
1003 )

File C:\anconda3\envs\omr\Lib\site-packages\langchain_community\document_loaders\parsers\pdf.py:1031, in PyMuPDFParser._get_page_content(self, doc, page, text_kwargs)
1019 """Get the text of the page using PyMuPDF and RapidOCR and issue a warning
1020 if it is empty.
1021
(...) 1028 str: The text content of the page.
1029 """
1030 text_from_page = page.get_text(**{**self.text_kwargs, **text_kwargs})
-> 1031 images_from_page = self._extract_images_from_page(doc, page)
1032 tables_from_page = self._extract_tables_from_page(page)
1033 extras = []

File C:\anconda3\envs\omr\Lib\site-packages\langchain_community\document_loaders\parsers\pdf.py:1104, in PyMuPDFParser._extract_images_from_page(self, doc, page)
1100 numpy.save(image_bytes, image)
1101 blob = Blob.from_data(
1102 image_bytes.getvalue(), mime_type="application/x-npy"
1103 )
-> 1104 image_text = next(self.images_parser.lazy_parse(blob)).page_content
1106 images.append(
1107 _format_inner_image(blob, image_text, self.images_inner_format)
1108 )
1109 return _FORMAT_IMAGE_STR.format(
1110 image_text=_JOIN_IMAGES.join(filter(None, images))
1111 )

File C:\anconda3\envs\omr\Lib\site-packages\langchain_community\document_loaders\parsers\images.py:56, in BaseImageBlobParser.lazy_parse(self, blob)
54 with blob.as_bytes_io() as buf:
55 if blob.mimetype == "application/x-npy":
---> 56 img = Img.fromarray(numpy.load(buf))
57 else:
58 img = Img.open(buf)

File C:\anconda3\envs\omr\Lib\site-packages\PIL\Image.py:3315, in fromarray(obj, mode)
3313 typekey_shape, typestr = typekey
3314 msg = f"Cannot handle this data type: {typekey_shape}, {typestr}"
-> 3315 raise TypeError(msg) from e
3316 else:
3317 rawmode = mode
TypeError: Cannot handle this data type: (1, 1, 1), |u1
```
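
One possible reading of the traceback, not a confirmed diagnosis: the key `((1, 1, 1), '|u1')` is what Pillow's `Image.fromarray` builds for a `uint8` array of shape `(height, width, 1)`, for example a grayscale image stored with an explicit channel axis. Pillow has typemap entries for 2-D arrays and for 2, 3, or 4 channels, but not for a single trailing channel. The sketch below reproduces the failing call using only NumPy and Pillow, outside LangChain, and shows one way to avoid it by dropping the trailing axis. The helper name `to_pil_image` and the array sizes are made up for illustration. Whether the PDF in question really contains such an image is an assumption.

```python
import numpy as np
from PIL import Image


def to_pil_image(array: np.ndarray) -> Image.Image:
    """Convert an image array to a PIL image, tolerating a trailing
    single-channel axis. (Hypothetical helper, not part of LangChain.)"""
    if array.ndim == 3 and array.shape[2] == 1:
        # (H, W, 1) -> (H, W); Pillow maps a 2-D uint8 array to mode "L".
        array = array[:, :, 0]
    return Image.fromarray(array)


# A grayscale image with an explicit channel axis: shape (4, 6, 1).
gray = np.zeros((4, 6, 1), dtype=np.uint8)

try:
    # The same call that fails at images.py:56 in the traceback.
    Image.fromarray(gray)
except TypeError as exc:
    print("fromarray failed:", exc)

# Dropping the trailing axis first gives Pillow a shape it can map.
img = to_pil_image(gray)
print(img.mode, img.size)
```

If this matches what happens inside the loader, the user code is probably fine and the array handed to the image parser would need the same kind of reshaping before `Image.fromarray` is called.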
System Info
System Information
OS: Windows
OS Version: 10.0.19045
Python Version: 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
httpx: 0.28.1
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-azure-ai;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.45: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.7: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.21: Installed. No version info available.
langsmith-pyo3: Installed. No version info available.
langsmith<0.4,>=0.1.125: Installed. No version info available.
langsmith<0.4,>=0.1.17: Installed. No version info available.
numpy: 1.26.4
numpy<3,>=1.26.2: Installed. No version info available.
openai-agents: Installed. No version info available.
openai<2.0.0,>=1.66.3: Installed. No version info available.
opentelemetry-api: Installed. No version info available.
opentelemetry-exporter-otlp-proto-http: Installed. No version info available.
opentelemetry-sdk: Installed. No version info available.
orjson: 3.10.15
packaging: 24.2
packaging<25,>=23.2: Installed. No version info available.
pgvector: 0.3.6
psycopg: 3.2.6
psycopg-pool: 3.2.6
pydantic: 2.10.6
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.5.2;: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic<3.0.0,>=2.7.4;: Installed. No version info available.
pytest: Installed. No version info available.
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests-toolbelt: 1.0.0
requests<3,>=2: Installed. No version info available.
rich: Installed. No version info available.
sqlalchemy: 2.0.39
SQLAlchemy<3,>=1.4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
tiktoken<1,>=0.7: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
zstandard: 0.23.0