Problem running pdf converter #4548
I resolved the issue by setting multiprocessing to False.
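A minimal sketch of that workaround, assuming the `multiprocessing` keyword that appears in the `convert()` signature in the traceback. The helper name `convert_single_process` is made up for illustration and is not part of Haystack:

```python
def convert_single_process(converter, file_path, meta=None):
    """Call converter.convert() with multiprocessing disabled.

    `converter` is expected to be a PDFToTextConverter-like object whose
    convert() accepts a `multiprocessing` keyword, as the traceback shows
    for Haystack 1.15.0.
    """
    # With multiprocessing=False the pages should be read in the current
    # process, so the worker pool that raised the ValueError is not used.
    return converter.convert(file_path=file_path, meta=meta, multiprocessing=False)


# Usage with Haystack installed (not executed here):
# from haystack.nodes import PDFToTextConverter
# converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
# doc_pdf = convert_single_process(
#     converter,
#     "data/Introduction to mobile network engineering _ GSM, 3G-WCDMA, LTE.pdf",
# )[0]
```

The trade-off is speed: the conversion will likely run slower on large PDFs because it no longer runs in parallel.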
I am trying to use the PDF-to-text converter. I installed Haystack 1.15.0 as suggested in discussion #4531, but now I'm getting the following error:
ValueError Traceback (most recent call last)
Cell In [32], line 4
1 from haystack.nodes import PDFToTextConverter
3 converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
----> 4 doc_pdf = converter.convert(file_path="data/Introduction to mobile network engineering _ GSM, 3G-WCDMA, LTE.pdf", meta=None)[0]
File C:\Program Files\python\lib\site-packages\haystack\nodes\file_converter\pdf.py:171, in PDFToTextConverter.convert(self, file_path, meta, remove_numeric_tables, valid_languages, encoding, id_hash_keys, start_page, end_page, keep_physical_layout, sort_by_position, ocr, ocr_language, multiprocessing)
168 raise ValueError("The ocr parameter must be either 'auto' or 'full'.")
169 self._check_tessdata()
--> 171 pages = self._read_pdf(
172 file_path,
173 sort_by_position=sort_by_position,
174 start_page=start_page,
175 end_page=end_page,
176 ocr=ocr,
177 ocr_language=ocr_language,
178 multiprocessing=multiprocessing,
179 )
181 cleaned_pages = []
182 for page in pages:
File C:\Program Files\python\lib\site-packages\haystack\nodes\file_converter\pdf.py:299, in PDFToTextConverter._read_pdf(self, file_path, ocr_language, sort_by_position, start_page, end_page, ocr, multiprocessing)
296 parts = divide(cpu, page_list)
297 pages_mp = [(i, file_path, parts, sort_by_position, ocr, ocr_language) for i in range(cpu)]
--> 299 with ProcessPoolExecutor(max_workers=cpu) as pool:
300 results = pool.map(self._get_text_parallel, pages_mp)
301 for page in results:
File C:\Program Files\python\lib\concurrent\futures\process.py:609, in ProcessPoolExecutor.__init__(self, max_workers, mp_context, initializer, initargs)
606 raise ValueError("max_workers must be greater than 0")
607 elif (sys.platform == 'win32' and
608 max_workers > _MAX_WINDOWS_WORKERS):
--> 609 raise ValueError(
610 f"max_workers must be <= {_MAX_WINDOWS_WORKERS}")
612 self._max_workers = max_workers
614 if mp_context is None:
ValueError: max_workers must be <= 61
I need help fixing the issue. Thank you.
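For context, the last frame of the traceback shows that on Windows `ProcessPoolExecutor` rejects any `max_workers` above 61. The error therefore suggests that the `cpu` value passed in was larger than 61, which would be the case if it is derived from the logical CPU count on a machine with many cores (an assumption; the traceback does not show how `cpu` is computed). The sketch below shows one way to cap a worker count before creating a pool. `capped_worker_count` is a hypothetical helper, not part of Haystack or the standard library:

```python
import os
import sys

# Limit quoted in the error message ("max_workers must be <= 61").
WINDOWS_MAX_WORKERS = 61


def capped_worker_count(requested=None):
    """Return a worker count that ProcessPoolExecutor accepts on this platform."""
    # Fall back to the logical CPU count when no explicit value is given.
    count = requested if requested is not None else (os.cpu_count() or 1)
    # ProcessPoolExecutor also rejects values below 1.
    count = max(1, count)
    # Only Windows enforces the upper bound of 61 workers.
    if sys.platform == "win32":
        count = min(count, WINDOWS_MAX_WORKERS)
    return count
```

A value produced this way can be passed as `max_workers` to `concurrent.futures.ProcessPoolExecutor` in your own code. It does not change what Haystack does internally, so it is not a fix for the converter itself.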