Follow-up for table-like structures

Follow-up for https://github.com/swisstopo/swissgeol-boreholes-dataextraction/issues/195

In https://github.com/swisstopo/swissgeol-boreholes-dataextraction/pull/256 we added a method for table structure detection. This method needs to be optimized in two ways:

1. Investigate the current issues with identifying multiple boreholes on the same page. We should either extend the table detection logic so that it identifies multiple table-like borehole structures within these files (without reducing overall accuracy), or find a dedicated method for cases where multiple boreholes without a clear box structure exist on the same page.

   Currently, the new method leads to worse results for the following files:

    - 268124232-bp.pdf (worse results; multiple actual boreholes in the same detected table structure)
    - 675245009-bp.pdf (worse results; multiple actual boreholes in the same detected table structure)
    - 681249142-bp.pdf (worse results; multiple actual boreholes in the same detected table structure)
    - 684252058-bp.pdf (KPI improves because only a single borehole is detected (correct), but as a consequence some good material descriptions are no longer extracted)
    - 690251013-bp.pdf (incorrect borehole removed because two boreholes are in the same table structure)

   The aim is to improve accuracy on the above-mentioned files within the scope of the new methods we use.

2. Now that table detection has been introduced, investigate potential redundancies in the `extract.py` file and check whether any of the previously used checks are no longer needed. The main aim of the table detection is to filter down to at most one borehole structure per table structure, which should reduce the need for filtering material description pairs.

    The aim is to validate the current extraction logic and simplify the steps where possible without impacting the overall accuracy.
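
To make the failure mode from point 1 concrete, here is a minimal sketch of a "keep at most one borehole per table structure" filter. It is not the code from the pull request: `Rect`, `BoreholeCandidate`, `keep_best_per_table` and the `score` field are hypothetical names introduced only for illustration, and the containment test is an assumed simplification of how candidates are assigned to a table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding box in page coordinates (hypothetical helper)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Rect") -> bool:
        # True if `other` lies fully inside this rectangle.
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and self.x1 >= other.x1 and self.y1 >= other.y1)


@dataclass(frozen=True)
class BoreholeCandidate:
    """A detected borehole structure (hypothetical; `score` is an assumed quality measure)."""
    name: str
    bbox: Rect
    score: float


def keep_best_per_table(tables, candidates):
    """Keep only the highest-scoring candidate inside each table structure.

    Candidates that fall inside no table are passed through unchanged.
    """
    best = {}        # table index -> best candidate seen so far
    unassigned = []  # candidates outside every table
    for cand in candidates:
        idx = next((i for i, t in enumerate(tables) if t.contains(cand.bbox)), None)
        if idx is None:
            unassigned.append(cand)
        elif idx not in best or cand.score > best[idx].score:
            best[idx] = cand
    return [best[i] for i in sorted(best)] + unassigned


# One detected table structure that actually holds two real boreholes (A and B),
# plus a third candidate (C) outside any table.
table = Rect(0, 0, 100, 100)
a = BoreholeCandidate("A", Rect(5, 5, 45, 95), score=0.9)
b = BoreholeCandidate("B", Rect(55, 5, 95, 95), score=0.8)
c = BoreholeCandidate("C", Rect(120, 0, 160, 100), score=0.5)

kept = keep_best_per_table([table], [a, b, c])
print([cand.name for cand in kept])
```

In this sketch, borehole B is dropped even though it is a real borehole, because it shares a table structure with A. This mirrors the regression described for the files listed in point 1, and suggests that any redundancy cleanup in `extract.py` should be checked against pages where one table structure contains several boreholes.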


