Follow-up for table-like structures #258

@stijnvermeeren-swisstopo

Description

Follow-up for #195

With #256, we added a method for table structure detection. We now need to optimize this method in two ways:

  1. We need to investigate the current issues with identifying multiple boreholes on the same page. We should look at ways to extend the table detection logic so that it identifies multiple table-like borehole structures within these files (without reducing overall accuracy), or look for a dedicated method that handles cases where multiple boreholes without a clear box structure exist on the same page.

    Currently the following files result in worse performance due to the new method:

    • 268124232-bp.pdf (worse results; multiple actual boreholes in the same detected table structure)
    • 675245009-bp.pdf (worse results; multiple actual boreholes in the same detected table structure)
    • 681249142-bp.pdf (worse results; multiple actual boreholes in the same detected table structure)
    • 684252058-bp.pdf (KPI improves as only a single borehole is detected (correct), but some good material descriptions are not extracted as a consequence)
    • 690251013-bp.pdf (incorrect borehole removed because two boreholes are in the same table-structure)

    The aim is to improve accuracy on the above-mentioned files within the scope of the new methods we use.

  2. After the introduction of table detection, we need to investigate potential redundancies in the extract.py file to see whether any of the previously used checks are no longer needed. The main aim of the table detection is to keep at most one borehole structure per table structure, which should reduce the need for filtering material description pairs.

    The aim is to validate the current extraction logic and simplify the steps where possible without impacting the overall accuracy.
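
To make both points concrete, here is a minimal sketch of the "at most one borehole per table structure" rule and of a check that flags the problematic case from point 1, where one detected table structure contains several borehole candidates. It is not taken from the repository: `Rect`, `BoreholeCandidate`, the `score` field and all function names are hypothetical, and containment of bounding boxes is only an assumed criterion for "belongs to this table".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page coordinates (hypothetical helper)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Rect") -> bool:
        return (self.x0 <= other.x0 and self.y0 <= other.y0
                and self.x1 >= other.x1 and self.y1 >= other.y1)


@dataclass(frozen=True)
class BoreholeCandidate:
    """A detected borehole structure (hypothetical; score is an assumed confidence)."""
    name: str
    rect: Rect
    score: float


def group_by_table(tables, candidates):
    """Map table index -> candidates fully inside that table.

    Candidates that lie in no table are collected under the key None.
    """
    groups = defaultdict(list)
    for cand in candidates:
        idx = next((i for i, t in enumerate(tables) if t.contains(cand.rect)), None)
        groups[idx].append(cand)
    return dict(groups)


def tables_with_multiple_candidates(groups):
    """Indices of tables holding more than one candidate (the case from point 1)."""
    return sorted(i for i, c in groups.items() if i is not None and len(c) > 1)


def keep_one_per_table(groups):
    """Keep the best-scoring candidate per table; leave table-less candidates untouched."""
    kept = []
    for idx, cands in groups.items():
        if idx is None:
            kept.extend(cands)
        else:
            kept.append(max(cands, key=lambda c: c.score))
    return kept


tables = [Rect(0, 0, 100, 100), Rect(200, 0, 300, 100)]
candidates = [
    BoreholeCandidate("a", Rect(5, 5, 45, 95), 0.9),
    BoreholeCandidate("b", Rect(55, 5, 95, 95), 0.4),
    BoreholeCandidate("c", Rect(210, 5, 290, 95), 0.7),
    BoreholeCandidate("d", Rect(400, 5, 450, 95), 0.5),
]
groups = group_by_table(tables, candidates)
flagged = tables_with_multiple_candidates(groups)
kept_names = [c.name for c in keep_one_per_table(groups)]
```

In this toy input, candidates "a" and "b" share one table, so the one-per-table rule drops "b" even if it is a genuine second borehole. Flagging such tables before filtering would be one way to decide where a splitting step is needed, rather than discarding a valid borehole.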
