Open
Description
I noticed an unusual Grobid error while processing HAL. It affects just one file.
07/11/2023 15:59:44 INFO /home/issa/ISSA-2/data/hal/pdf_cache/tel-03125685v1.pdf already exists
07/11/2023 16:00:14 ERROR descriptor 'strip' requires a 'str' object but received a 'NoneType'
Traceback (most recent call last):
  File "./extract_text_from_pdf.py", line 522, in download_and_process_all
    process_pdf(f_pdf, f_json, pdf_content=pdf_content)
  File "./extract_text_from_pdf.py", line 412, in process_pdf
    pdf_dict = xml_to_dict(paper_id, xml)
  File "./extract_text_from_pdf.py", line 360, in xml_to_dict
    body = [{'text': get_all_text_as_one(root, xml_path, sep=cfg.MERGE_SEPARATOR)}]
  File "./extract_text_from_pdf.py", line 119, in get_all_text_as_one
    text_list = get_all_text_as_list(root, element_name_or_path)
  File "./extract_text_from_pdf.py", line 102, in get_all_text_as_list
    text_list = [t for t in list(map(str.strip, text_list)) if t]
TypeError: descriptor 'strip' requires a 'str' object but received a 'NoneType'
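The failing line maps `str.strip` over `text_list`, which raises as soon as one entry is `None`. Below is a minimal sketch of a None-tolerant variant of that filter. The helper name `strip_non_empty` is hypothetical and not part of `extract_text_from_pdf.py`. How `text_list` is built upstream is not shown in the traceback, so this only guards the stripping step.

```python
def strip_non_empty(text_list):
    """Strip every string and drop None or blank entries.

    Hypothetical helper: a possible replacement for
    [t for t in list(map(str.strip, text_list)) if t]
    """
    # Skip None entries first, so str.strip is only called on real strings.
    stripped = (t.strip() for t in text_list if t is not None)
    return [t for t in stripped if t]


# Reproduce the failure mode from the traceback with made-up data.
error_message = None
try:
    list(map(str.strip, ["some text ", None]))
except TypeError as exc:
    error_message = str(exc)

print(error_message)
print(strip_non_empty(["some text ", None, "", "   ", "more text\n"]))
```

Silently dropping `None` entries keeps the pipeline running, but it may also hide an extraction problem in this particular file, so it might be worth logging when an entry is skipped.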
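The traceback shows that the list passed to `str.strip` contains at least one `None` entry. To figure out which element produces it, we need to check the XML in /home/issa/ISSA-2/data/hal/dataset-2-0/20231107/xml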
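One way to inspect the XML could be to list the elements whose text is `None`. The sketch below assumes the standard-library `xml.etree.ElementTree` parser, while the script may use a different one such as lxml. The function name `find_textless_elements`, the sample document, and the `.//p` path are all made up for illustration and are not taken from the Grobid output.

```python
import xml.etree.ElementTree as ET


def find_textless_elements(xml_string, path):
    """Return the elements matching `path` whose .text is None.

    Hypothetical helper for debugging. In ElementTree, .text is None
    for empty elements and for elements that start with a child element.
    """
    root = ET.fromstring(xml_string)
    return [el for el in root.findall(path) if el.text is None]


# Made-up sample: one normal paragraph, one empty, one with only a child.
sample = "<doc><p>some text</p><p/><p><hi>nested</hi></p></doc>"

for el in find_textless_elements(sample, ".//p"):
    # Print the serialized element so it can be located in the file.
    print(ET.tostring(el, encoding="unicode"))
```

For the real file, the XML content would be read from disk and the path replaced with the one passed to `get_all_text_as_list`.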