Description
Follow-up for #195
In #256 we added a method for table structure detection. We need to optimize this method in two ways:
- We need to investigate the current issues with identifying multiple boreholes on the same page. We should look at possible ways to extend the table detection logic so that it identifies multiple table-like borehole structures within these files (without reducing overall accuracy), or look for a specific method that handles cases where multiple boreholes without a clear box structure exist on the same page.
Currently the following files result in worse performance due to the new method:
- 268124232-bp.pdf (worse results; multiple actual boreholes in the same detected table structure)
- 675245009-bp.pdf (worse results; multiple actual boreholes in the same detected table structure)
- 681249142-bp.pdf (worse results; multiple actual boreholes in the same detected table structure)
- 684252058-bp.pdf (KPI improves because only a single borehole is detected, which is correct, but some good material descriptions are no longer extracted as a consequence)
- 690251013-bp.pdf (incorrect borehole removed because two boreholes are in the same table-structure)
The aim is to improve accuracy on the above-mentioned files within the scope of the new methods we use.
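One possible direction for the multiple-borehole case, sketched here only as an illustration: if several depth columns are found inside a single detected table structure, the table could be split into one vertical strip per depth column. The function name `split_table_by_depth_columns`, the tuple-based rectangles, and the assumption that boreholes sit side by side with each depth column to the left of its material descriptions are all hypothetical and not taken from the existing code.

```python
def split_table_by_depth_columns(table, depth_columns):
    """Split one table rectangle into one strip per depth column.

    table: (x0, y0, x1, y1) bounding box of the detected table structure.
    depth_columns: list of (x0, x1) horizontal extents of depth columns
        found inside that table (hypothetical input format).

    Assumption: boreholes are laid out side by side, and each borehole
    starts at the left edge of its own depth column.
    """
    columns = sorted(depth_columns)
    if len(columns) <= 1:
        # Zero or one depth column: nothing to split.
        return [table]

    x0, y0, x1, y1 = table
    # Cut at the left edge of every depth column except the first one.
    cuts = [x0] + [col[0] for col in columns[1:]] + [x1]
    return [(left, y0, right, y1) for left, right in zip(cuts, cuts[1:])]


if __name__ == "__main__":
    strips = split_table_by_depth_columns((0, 0, 200, 100), [(110, 120), (10, 20)])
    print(strips)
```

Whether such a split helps on the files listed above would need to be checked against the benchmark, since stacked (rather than side-by-side) boreholes would need a different rule.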
- After the introduction of table detection, we need to investigate potential redundancies in the `extract.py` file to see whether any of the previously used checks are no longer needed. The main aim of the table detection is to filter down to at most one borehole structure per table structure, which should reduce the need for filtering material description pairs. The aim is to validate the current extraction logic and simplify the steps where possible without impacting the overall accuracy.
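As a reference point when reviewing the existing checks, the "at most one borehole structure per table structure" rule can be sketched as follows. This is a minimal illustration, not the logic in `extract.py`: the `Rect` and `Candidate` classes, the `score` field, and the function name `one_borehole_per_table` are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Rect") -> bool:
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and self.x1 >= other.x1 and self.y1 >= other.y1)


@dataclass(frozen=True)
class Candidate:
    # Hypothetical stand-in for a detected borehole structure.
    name: str
    bbox: Rect
    score: float  # assumed quality score; higher is better


def one_borehole_per_table(tables, candidates):
    """Keep the best-scoring candidate per table; leave others untouched."""
    kept = []
    assigned = set()
    for table in tables:
        inside = [c for c in candidates if table.contains(c.bbox)]
        assigned.update(inside)
        if inside:
            best = max(inside, key=lambda c: c.score)
            if best not in kept:  # guard against overlapping tables
                kept.append(best)
    # Candidates that lie in no table are not affected by this rule.
    kept.extend(c for c in candidates if c not in assigned)
    return kept


if __name__ == "__main__":
    tables = [Rect(0, 0, 100, 100)]
    candidates = [
        Candidate("a", Rect(10, 10, 40, 90), 0.9),
        Candidate("b", Rect(50, 10, 90, 90), 0.4),
        Candidate("c", Rect(200, 0, 250, 50), 0.5),
    ]
    print([c.name for c in one_borehole_per_table(tables, candidates)])
```

Any older check that only exists to remove duplicate structures inside the same region would be a candidate for removal once a rule of this kind is in place, provided the benchmark accuracy does not drop.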