Replies: 2 comments 8 replies
-
@stestagg I agree, for that reason we have added the ability to propagate the "raw" output from the parsers for pdf (https://github.com/docling-project/docling/blob/main/docling/datamodel/pipeline_options.py#L425). In this way, you can get access to the low level data at conversion.
-
I have about 300 tables from PDFs with missing content (outlined in yellow). Here's a sample of a few different failure modes if that helps:
-
Thanks for releasing Docling, I think it's a great tool, and the quality of the pipelines is quite impressive.
EDIT
The thing that prompted me to raise this was the table structure model missing out text that was then omitted from the output.
I think this is a problem that can be addressed in multiple ways:
Personally, I feel 2 is more important, while doing 1 is always a worthy exercise :).
Interestingly, I found that something similar is already happening at the layout stage:
docling/docling/utils/layout_postprocessor.py
Lines 607 to 614 in 4ab7e9d
Unfortunately, clusters can't be used for this directly, as the tableformer model splits up clusters, and the code uses bboxes to approximate the underlying cells that tableformer has identified.
However, the clusters are useful because, for example, on this page:
With the chunks that have been missed from the table highlighted in yellow, the clusters have good structural information about which word cells belong together.
I have a local patch that adds IDs to all pdf cell resource objects, but now, having understood the code better, I think the index might be usable to do something similar. It then reconciles the clusters with the cell IDs that weren't picked up by the tableformer process, and puts them in a list of unassigned clusters on the table (this is the bit that assigns the result):
These new clusters are then picked up by the readingorder model and included as before.
This way, all the content is always included in the output in some form.
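The reconciliation step described above boils down to a set difference over cell IDs: everything the parser produced, minus everything tableformer claimed. A minimal sketch, with all function and variable names being hypothetical rather than Docling's actual API:

```python
# Hypothetical sketch of the "find unassigned cells" reconciliation.
# all_cell_ids: IDs of every parsed pdf cell on the page.
# assigned_by_table: for each table, the set of cell IDs tableformer picked up.
def find_unassigned(all_cell_ids: set[int],
                    assigned_by_table: dict[str, set[int]]) -> set[int]:
    # Union the IDs claimed by every table, then subtract from the full set.
    assigned = set().union(*assigned_by_table.values()) if assigned_by_table else set()
    return all_cell_ids - assigned

all_ids = {1, 2, 3, 4, 5}
picked_up = {"table_0": {1, 2}, "table_1": {4}}
assert find_unassigned(all_ids, picked_up) == {3, 5}
```

The leftover IDs are what would end up in the table's list of unassigned clusters, so nothing is silently dropped.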
Not having a reliable way to link between objects that are super/sub sets of each other makes this code harder, and harder to get right, which is what I was getting at before; but I can see workarounds are possible using other data.
It's possible that this should now turn into an issue, not sure.
Old Content
In my somewhat opinionated perspective, silently dropping content is arguably one of the biggest problems a parser can have. Having worked in this space for a bit, I have a personal maxim that I design against, which I think might be useful to consider here:
What this means in practice for a structural parser/modeller is that the parsers have to assign a unique ID to each parsed element, and then every time an element is split into, or joined out of, a container, that container tracks the IDs/source objects it holds, ideally recursively.
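To make the idea concrete, here is a hedged sketch of recursive ID tracking; the class and field names are illustrative only, not Docling's actual data model:

```python
# Illustrative sketch: every parsed element gets a unique ID, and containers
# recursively expose the IDs of everything they hold.
import itertools
from dataclasses import dataclass, field

_ids = itertools.count()

@dataclass
class Element:
    text: str
    id: int = field(default_factory=lambda: next(_ids))

@dataclass
class Container:
    children: list = field(default_factory=list)

    @property
    def source_ids(self) -> set[int]:
        """Recursively collect the IDs of every element this container holds."""
        ids: set[int] = set()
        for child in self.children:
            if isinstance(child, Container):
                ids |= child.source_ids
            else:
                ids.add(child.id)
        return ids

words = [Element("hello"), Element("world")]
line = Container(children=words)          # words aggregated into a line
block = Container(children=[line, Element("!")])  # line aggregated into a block
# The block can always answer "which source elements do I contain?"
assert block.source_ids == {w.id for w in words} | {block.children[1].id}
```

With this in place, "did anything get dropped?" becomes a cheap set comparison at any level of aggregation.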
For docling, this seems like a somewhat pedantic extension of the provenance object, but one that allows for a number of things:
Benefits
Challenges
Case Study
This suggestion came about from my trying to implement a lighter version of this just for table processing, where text does get silently skipped in some cases.
If the TableFormer model (even in high quality mode) gets the bboxes/cell alignment wrong, text spans can be omitted. In my opinion, no model will ever guarantee to never do this, so the goal was to identify when TextCells weren't included in any table cell, and output those in a separate attribute, grouped by the layout cluster groupings.
Implementing this naively isn't too hard: give each TextCell a unique ID, pass that to the `multi_table_predict` predictor so the returned responses have cell IDs embedded, then do a set difference against the `tcell` inputs.
Unfortunately, the `tcell` input is often a flattened list of the input cells, not a representation of the table clusters being passed in, because the table clusters contain layout-grouped objects, aggregates that don't track the objects they aggregate (case 1 where tracking the underlyings would help: you could just get the char/word cells from the line cells). So the code has to go back to the parser and find the words by bounding box, which isn't ideal, although in practice it seems to work. You're then left with a bunch of individual word cells that would need to be grouped into clusters to form useful sentences.
Now we have clusters in scope, in the `table_cluster` object, and that is recursive, so you can reconstruct the sentence groupings nicely from that. However, the TableCell and other objects in those clusters have no way to be tied back to the char/word cells, as they don't track the provenance of the underlying objects either (case 2 where this would help).
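The bounding-box fallback mentioned above could look roughly like this. This is a sketch under the assumption that a word "belongs" to a table cell when its centre point falls inside the cell's bbox; all names are hypothetical:

```python
# Hypothetical sketch of the bbox-approximation fallback: flag word cells
# whose centre falls inside no predicted table cell. Bboxes are (x0, y0, x1, y1).
def center(b: tuple) -> tuple:
    return ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)

def contains(cell: tuple, pt: tuple) -> bool:
    return cell[0] <= pt[0] <= cell[2] and cell[1] <= pt[1] <= cell[3]

def uncovered_words(word_bboxes: list, table_cell_bboxes: list) -> list:
    """Return indices of word cells not covered by any table cell bbox."""
    missed = []
    for i, wb in enumerate(word_bboxes):
        if not any(contains(cb, center(wb)) for cb in table_cell_bboxes):
            missed.append(i)
    return missed

words = [(0, 0, 10, 5), (20, 0, 30, 5), (50, 0, 60, 5)]
cells = [(0, 0, 35, 10)]   # one predicted table cell covering the first two words
assert uncovered_words(words, cells) == [2]
```

Centre-point containment is obviously an approximation; the point of the whole proposal is that ID provenance would make this geometric guesswork unnecessary.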