Replies: 2 comments 8 replies
-
@stestagg I agree, for that reason we have added the ability to propagate the "raw" output from the parsers for pdf (https://github.com/docling-project/docling/blob/main/docling/datamodel/pipeline_options.py#L425). In this way, you can get access to the low level data at conversion.
-
I have about 300 tables from PDFs with missing content (outlined in yellow). Here's a sample of a few different failure modes if that helps:
-
Thanks for releasing Docling, I think it's a great tool, and the quality of the pipelines is quite impressive.
EDIT
The thing that prompted me to raise this was the table structure model missing out text that was then omitted from the output.
I think this is a problem that can be addressed in multiple ways:
Personally, I feel 2 is more important, while doing 1 is always a worthy exercise :).
Interestingly, I found that something similar is already happening at the layout stage:
docling/docling/utils/layout_postprocessor.py
Lines 607 to 614 in 4ab7e9d
Unfortunately, clusters can't be used for this directly, as the tableformer model splits up clusters, and the code uses bboxes to approximate the underlying cells that tableformer has identified.
However, the clusters are useful because, for example, on this page:
With the chunks that have been missed from the table highlighted in yellow, the clusters have good structural information about which word cells belong together.
I have a local patch that adds IDs to all pdf cell resource objects, but now, having understood the code better, I think the index might be usable to do something similar. It then reconciles the clusters with the cell IDs that weren't picked up by the tableformer process, and puts them in a list of unassigned clusters on the table (this is the bit that assigns the result):
These new clusters are then picked up by the readingorder model and included as before.
This way, all the content is always included in the output in some form.
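The reconciliation step described above boils down to a set difference over cell IDs: everything the parser produced, minus everything tableformer claimed. A minimal sketch, with all function and variable names being hypothetical rather than Docling's actual API:

```python
# Hypothetical sketch of the "find unassigned cells" reconciliation.
# all_cell_ids: IDs of every parsed pdf cell on the page.
# assigned_by_table: for each table, the set of cell IDs tableformer picked up.
def find_unassigned(all_cell_ids: set[int],
                    assigned_by_table: dict[str, set[int]]) -> set[int]:
    # Union the IDs claimed by every table, then subtract from the full set.
    assigned = set().union(*assigned_by_table.values()) if assigned_by_table else set()
    return all_cell_ids - assigned

all_ids = {1, 2, 3, 4, 5}
picked_up = {"table_0": {1, 2}, "table_1": {4}}
assert find_unassigned(all_ids, picked_up) == {3, 5}
```

The leftover IDs are what would end up in the table's list of unassigned clusters, so nothing is silently dropped.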
Not having a reliable way to link between objects that are super/sub sets of each other makes this code harder, and harder to get right, which is what I was getting at before; but I can see workarounds are possible using other data.
It's possible that this should now turn into an issue, not sure.
Old Content
In my somewhat opinionated perspective, silently dropping content is arguably one of the biggest problems a parser can have. Having worked in this space for a bit, I have a personal maxim that I design against, which I think might be useful to consider here:
What this means in practice for a structural parser/modeller is that the parsers have to assign a unique ID to each parsed element, and then every time an element is split into, or joined out of, a container, that container tracks the IDs/source objects it holds, ideally recursively.
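To make the idea concrete, here is a hedged sketch of recursive ID tracking; the class and field names are illustrative only, not Docling's actual data model:

```python
# Illustrative sketch: every parsed element gets a unique ID, and containers
# recursively expose the IDs of everything they hold.
import itertools
from dataclasses import dataclass, field

_ids = itertools.count()

@dataclass
class Element:
    text: str
    id: int = field(default_factory=lambda: next(_ids))

@dataclass
class Container:
    children: list = field(default_factory=list)

    @property
    def source_ids(self) -> set[int]:
        """Recursively collect the IDs of every element this container holds."""
        ids: set[int] = set()
        for child in self.children:
            if isinstance(child, Container):
                ids |= child.source_ids
            else:
                ids.add(child.id)
        return ids

words = [Element("hello"), Element("world")]
line = Container(children=words)          # words aggregated into a line
block = Container(children=[line, Element("!")])  # line aggregated into a block
# The block can always answer "which source elements do I contain?"
assert block.source_ids == {w.id for w in words} | {block.children[1].id}
```

With this in place, "did anything get dropped?" becomes a cheap set comparison at any level of aggregation.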
For docling, this seems like a somewhat pedantic extension of the provenance object, but one that allows for a number of things:
Benefits
Challenges
Case Study
This suggestion came about from my trying to implement a lighter version of this just for table processing, where text does get silently skipped in some cases.
If the TableFormer model (even in high quality mode) gets the bboxes/cell alignment wrong, text spans can be omitted. In my opinion, no model will ever guarantee to never do this, so the goal was to identify when TextCells weren't included in any table cell, and output those in a separate attribute, grouped by the layout cluster groupings.
Implementing this naively isn't too hard: give each TextCell a unique ID, pass that to the `multi_table_predict` predictor so the returned responses have cell IDs embedded, then do a set difference against the `tcell` inputs.
Unfortunately, the `tcell` input is often a flattened list of the input cells, not a representation of the table clusters being passed in, because the table clusters contain layout-grouped objects, aggregates that don't track the objects they aggregate (case 1 where tracking the underlyings would help: you could just get the char/word cells from the line cells). So the code has to go back to the parser and find the words by bounding box, which isn't ideal, although in practice it seems to work. You're then left with a bunch of individual word cells that would need to be grouped into clusters to form useful sentences.
Now we have clusters in scope, in the `table_cluster` object, and that is recursive, so you can reconstruct the sentence groupings nicely from that. However, the TableCell and other objects in those clusters have no way to be tied back to the char/word cells, as they don't track the provenance of the underlying objects either (case 2 where this would help).
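The bounding-box fallback mentioned above could look roughly like this. This is a sketch under the assumption that a word "belongs" to a table cell when its centre point falls inside the cell's bbox; all names are hypothetical:

```python
# Hypothetical sketch of the bbox-approximation fallback: flag word cells
# whose centre falls inside no predicted table cell. Bboxes are (x0, y0, x1, y1).
def center(b: tuple) -> tuple:
    return ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)

def contains(cell: tuple, pt: tuple) -> bool:
    return cell[0] <= pt[0] <= cell[2] and cell[1] <= pt[1] <= cell[3]

def uncovered_words(word_bboxes: list, table_cell_bboxes: list) -> list:
    """Return indices of word cells not covered by any table cell bbox."""
    missed = []
    for i, wb in enumerate(word_bboxes):
        if not any(contains(cb, center(wb)) for cb in table_cell_bboxes):
            missed.append(i)
    return missed

words = [(0, 0, 10, 5), (20, 0, 30, 5), (50, 0, 60, 5)]
cells = [(0, 0, 35, 10)]   # one predicted table cell covering the first two words
assert uncovered_words(words, cells) == [2]
```

Centre-point containment is obviously an approximation; the point of the whole proposal is that ID provenance would make this geometric guesswork unnecessary.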