Description
2025-01-10 10:33:21,663 - DownloadModel - INFO: D:\Dev_env\Anaconda3_202406\envs\audit\Lib\site-packages\rapid_layout\models\layout_cdla.onnx already exists
2025-01-10 10:33:21,910 - rapid_layout - INFO: pp_layout_cdla contains ['text', 'title', 'figure', 'figure_caption', 'table', 'table_caption', 'header', 'footer', 'reference', 'equation']
0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\Dev\Audit\RapidDoc\demo.py", line 13, in <module>
    result = pdf_parser(pdf_path)
  File "D:\Dev\Audit\RapidDoc\rapid_doc\main.py", line 74, in __call__
    txt_boxes, txts = self.run_direct_extract(i, img_width)
  File "D:\Dev\Audit\RapidDoc\rapid_doc\main.py", line 105, in run_direct_extract
    txt_boxes, txts = self.pdf_extracter.extract_page_text(page_num, img_width)
ValueError: too many values to unpack (expected 2)
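After reading the source, I found that self.pdf_extracter.extract_page_text(page_num, img_width) ends with return np.array(boxes), so it returns a single value while the caller unpacks two. After changing it to return np.array(boxes), self.texts, a different error appears: score = list(map(lambda x: float(x[1]), select_text_score))
ValueError: could not convert string to float: '民'. Even if I set every score to 0, the resulting data is still unusable.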
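The two errors above can be reproduced in isolation. The following is a minimal sketch, not RapidDoc's code: the function names and sample data are hypothetical stand-ins, and plain lists replace np.array(boxes). It also assumes, as one plausible reading of the second error, that the downstream code expects (text, score) pairs but receives bare strings, so x[1] picks the second character of the text.

```python
def extract_page_text_single():
    """Hypothetical stand-in for a function that returns only the boxes."""
    return [[0, 0, 10, 10], [0, 20, 10, 30], [0, 40, 10, 50]]


def extract_page_text_pair():
    """Hypothetical stand-in that returns boxes and texts as a 2-tuple."""
    boxes = [[0, 0, 10, 10], [0, 20, 10, 30], [0, 40, 10, 50]]
    texts = ["人民", "币", "元"]
    return boxes, texts


def try_unpack(func):
    """Unpack the result into two names, as the caller in the traceback does."""
    try:
        txt_boxes, txts = func()
    except ValueError as exc:
        return f"ValueError: {exc}"
    return "ok"


def to_scores(select_text_score):
    """Same expression as in the second error: expects (text, score) items."""
    return list(map(lambda x: float(x[1]), select_text_score))


# Three boxes cannot be unpacked into two names, so this reports
# a message that begins with "too many values to unpack".
print(try_unpack(extract_page_text_single))

# Returning a 2-tuple satisfies the unpacking.
print(try_unpack(extract_page_text_pair))

# With (text, score) pairs the conversion works.
print(to_scores([("人民", 0.98)]))

# With bare strings, x[1] is the second character, so float() fails on '民'.
try:
    to_scores(["人民"])
except ValueError as exc:
    print(f"ValueError: {exc}")
```

If that reading is right, returning self.texts alone is not enough: each text would also need a score next to it in whatever shape the downstream code expects.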
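I then checked the document used in the demo and found that the check in rapid_doc\main.py for whether a page is a scan gives inconsistent results, so I commented out the following snippet:
# Assume there is no case where one part is scanned and another part can be extracted directly
# if self.is_extract(page):
#     img_width = img.shape[1]
#     txt_boxes, txts = self.run_direct_extract(i, img_width)
# else:
#     txt_boxes, txts = self.run_ocr_extract(img)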
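That resolved the error, but the tables are still not extracted. As far as I can tell, the criterion for whether a page can be extracted directly is whether 100 characters can be extracted from it. Isn't that check too crude? And how should extract_page_text(page_num, img_width) be fixed?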
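To make the question about the 100-character criterion concrete, here is a minimal sketch of a slightly stricter per-page check. It is not RapidDoc's implementation: the function name, the thresholds and the "share of usable characters" rule are all assumptions made for illustration, and the input is simply whatever text was already extracted from one page.

```python
def choose_extraction_mode(page_text: str,
                           min_chars: int = 100,
                           min_valid_ratio: float = 0.8) -> str:
    """Return "direct" or "ocr" for one page (hypothetical heuristic).

    min_chars mirrors the 100-character criterion described above;
    min_valid_ratio is an assumed extra guard against garbled text layers.
    """
    # Ignore whitespace so that layout padding does not inflate the count.
    stripped = "".join(page_text.split())
    if len(stripped) < min_chars:
        return "ocr"

    # Count characters that are printable and are not the Unicode
    # replacement character, which often marks a broken text layer.
    valid = sum(1 for ch in stripped if ch.isprintable() and ch != "\ufffd")
    return "direct" if valid / len(stripped) >= min_valid_ratio else "ocr"


print(choose_extraction_mode(""))               # too little text
print(choose_extraction_mode("表" * 150))        # enough clean text
print(choose_extraction_mode("\ufffd" * 150))   # enough text, but garbled
```

Deciding per page this way would still not handle a page that mixes a scanned region with extractable text, which is the case the commented-out snippet assumes does not occur.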