Heading from pdf #1170

roninio · 2025-03-15T13:07:04Z

I have a PDF document containing headings of various sizes and styles. After converting the PDF to a Markdown file using Docling, I've noticed that all headings are uniformly converted to level-2 headings (##), regardless of their original size or importance in the PDF.

I would like to know how to properly configure Docling, or if there's an alternative method, to accurately represent the original heading hierarchy from the PDF in the resulting Markdown file. Specifically, I need the Markdown headings to reflect the relative size and importance of the headings in the original PDF (e.g., larger headings should become #, smaller headings ###, etc.).

Could you please provide information on how to achieve this? Thank you.

kaumnen · 2025-04-21T20:51:36Z

PyMuPDF has a get_toc [0] method that returns a table of contents

a quick example of returned ToC with simple=True (more info in the linked docs):

[
    [1, "Heading 1", 22],
    [2, "Heading 2", 33],
    [3, "Heading 3", 44],
    [4, "Another heading 4", 55]
]

element format:
[<heading level>, <heading text>, <page it's on>]

what you can do is:

1. convert the pdf and .export_to_markdown() [1]
2. open the same pdf file with PyMuPDF, extract ToC
3. go through the markdown, check and update (add or remove) # where needed based on the <heading level> from the ToC

[0]: https://pymupdf.readthedocs.io/en/latest/document.html#Document.get_toc

[1]: https://docling-project.github.io/docling/reference/docling_document/#docling_core.types.doc.DoclingDocument.export_to_markdown
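
The steps above can be sketched roughly as follows. This is only a minimal sketch, not a tested recipe: the helper names `apply_toc_levels` and `convert_with_toc_levels` are made up for illustration, headings are matched to ToC entries by whitespace- and case-normalized title text (an assumption that will miss headings whose text differs between the PDF outline and the Markdown), and it only helps when the PDF actually carries an outline, since `get_toc` returns an empty list otherwise.

```python
import re

# ATX heading: 1-6 '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def _normalize(text):
    """Collapse whitespace and ignore case so titles compare loosely."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def apply_toc_levels(markdown, toc):
    """Rewrite heading levels in `markdown` using [level, title, page] ToC entries.

    Headings whose text is not found in the ToC are left unchanged.
    """
    levels = {}
    for entry in toc:
        level, title = entry[0], entry[1]
        # Keep the first level seen if a title occurs more than once.
        levels.setdefault(_normalize(title), level)

    out = []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level = levels.get(_normalize(match.group(2)))
            if level is not None:
                # Markdown supports at most six heading levels.
                line = "#" * min(level, 6) + " " + match.group(2)
        out.append(line)
    return "\n".join(out)


def convert_with_toc_levels(pdf_path):
    """Hypothetical glue: Docling for the Markdown, PyMuPDF for the ToC."""
    # Imported lazily so apply_toc_levels works without either library.
    from docling.document_converter import DocumentConverter
    import pymupdf

    markdown = DocumentConverter().convert(pdf_path).document.export_to_markdown()
    doc = pymupdf.open(pdf_path)
    toc = doc.get_toc(simple=True)
    doc.close()
    return apply_toc_levels(markdown, toc)
```

Matching on title text is the fragile part; if titles repeat or differ slightly, the page number in each ToC entry could be used as an extra hint, but that is left out here.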

PeterStaar-IBM · 2025-04-22T06:54:52Z

Maintainer

@roninio Note that docling-parse can also provide the TOC and populate it (see here: https://github.com/docling-project/docling-core/blob/763e1364ff0b95388696ccd3d69f150718012a3a/docling_core/types/doc/page.py#L463).

We plan to propagate this info and use it to improve the heading tree.

JohannKaspar May 25, 2025

Looking forward to this feature!
