Open
Description
When I run the script on a CSV of around 1,600 lines exported from my Zotero library, it fails partway through with the following error:
File "/opt/conda/lib/python3.12/multiprocessing/pool.py", line 774, in get
raise self._value
TypeError: invalid length: 8
This does not happen when I run the script on a 200- or 300-line CSV. Any idea what I could do to solve it?
This is the full traceback leading to the error:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/conda/lib/python3.12/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
^^^^^^^^^^^^^^^^
File "/home/onyxia/work/citation_map/analyze_papers.py", line 155, in article_worker
pdf_result, text, pdf_log = process_pdf(metadata)
^^^^^^^^^^^^^^^^^^^^^
File "/home/onyxia/work/citation_map/analyze_papers.py", line 114, in process_pdf
original_page_count, pages = pdf_to_text_list(first_pdf)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/onyxia/work/citation_map/analyze_papers.py", line 34, in pdf_to_text_list
pages = layout_scanner.get_pages(file_loc, images_folder=None) # you can try os.path.abspath("output/imgs")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/onyxia/work/citation_map/layout_scanner.py", line 212, in get_pages
return with_pdf(pdf_doc, _parse_pages, pdf_pwd, *tuple([images_folder]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/onyxia/work/citation_map/layout_scanner.py", line 35, in with_pdf
result = fn(doc, *args)
^^^^^^^^^^^^^^
File "/home/onyxia/work/citation_map/layout_scanner.py", line 201, in _parse_pages
for i, page in enumerate(PDFPage.create_pages(doc)):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/pdfminer/pdfpage.py", line 101, in create_pages
yield klass(document, objid, tree)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/pdfminer/pdfpage.py", line 56, in __init__
self.mediabox = resolve1(self.attrs['MediaBox'])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/pdfminer/pdftypes.py", line 80, in resolve1
x = x.resolve(default=default)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/pdfminer/pdftypes.py", line 67, in resolve
return self.doc.getobj(self.objid)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/pdfminer/pdfdocument.py", line 668, in getobj
(strmid, index, genno) = xref.get_pos(objid)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/pdfminer/pdfdocument.py", line 277, in get_pos
f2 = nunpack(ent[self.fl1:self.fl1+self.fl2])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/pdfminer/utils.py", line 183, in nunpack
raise TypeError('invalid length: %d' % l)
TypeError: invalid length: 8
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/onyxia/work/citation_map/analyze_papers.py", line 241, in <module>
result = pool.map(list_worker, list(titles_dict.items()), chunksize=5)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/multiprocessing/pool.py", line 367, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/multiprocessing/pool.py", line 774, in get
raise self._value
TypeError: invalid length: 8
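The traceback suggests that a single PDF (one whose cross-reference data pdfminer's `nunpack` rejects) aborts the whole `pool.map` call, which would explain why smaller CSVs that happen not to include that file run fine. One possible workaround, sketched below under assumptions, is to catch exceptions per item so a bad PDF is recorded and skipped instead of stopping the run. The names `safe_call`, `fake_process_pdf`, and `run_all` are hypothetical and not part of citation_map; `fake_process_pdf` only stands in for the real `process_pdf`.

```python
import traceback


def safe_call(func, item):
    """Run func(item) and never raise.

    Returns (item, result, None) on success,
    or (item, None, formatted_traceback) on failure.
    """
    try:
        return item, func(item), None
    except Exception:
        # Keep the traceback text so the failing file can be inspected later.
        return item, None, traceback.format_exc()


def fake_process_pdf(path):
    # Hypothetical stand-in for the real process_pdf: it fails on one
    # file the same way the traceback above does.
    if path.endswith("bad.pdf"):
        raise TypeError("invalid length: 8")
    return f"text of {path}"


def run_all(paths):
    """Process every path, separating successes from failures."""
    results = [safe_call(fake_process_pdf, p) for p in paths]
    ok = {p: r for p, r, err in results if err is None}
    failed = {p: err for p, r, err in results if err is not None}
    return ok, failed


if __name__ == "__main__":
    ok, failed = run_all(["a.pdf", "bad.pdf", "c.pdf"])
    print("processed:", sorted(ok))
    print("failed:", sorted(failed))
```

With a process pool, the same wrapper could presumably be passed as `pool.map(functools.partial(safe_call, process_pdf), items, chunksize=5)`, since a partial over top-level functions can be pickled. The list of failed paths then points to the PDF that triggers the error, which can be checked or re-saved separately.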
staadecker