PyMuPDF has a get_toc [0] method that returns the document's table of contents. With simple=True, each entry has the format [lvl, title, page], where lvl is the hierarchy level starting at 1 (more info in the linked docs). You can use these levels to set the heading depth in the Markdown output.
[0]: https://pymupdf.readthedocs.io/en/latest/document.html#Document.get_toc
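One way to act on this, as a minimal sketch rather than anything Docling does itself: read the outline with get_toc and rewrite the exported Markdown so that each heading whose text matches a ToC title gets the ToC level. The helper names `load_toc` and `apply_toc_levels` are made up for this example, and matching headings by normalized title is an assumption that may miss headings whose extracted text differs from the outline. PDFs without an embedded outline return an empty list, in which case nothing changes.

```python
import re


def load_toc(pdf_path):
    """Return the PDF outline as [lvl, title, page] entries via PyMuPDF."""
    # Imported lazily so apply_toc_levels() can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return doc.get_toc(simple=True)


def _normalize(title):
    # Collapse whitespace and ignore case so small extraction differences still match.
    return re.sub(r"\s+", " ", title).strip().lower()


def apply_toc_levels(markdown, toc):
    """Rewrite ATX headings whose text matches a ToC title to use the ToC level."""
    levels = {}
    for lvl, title, *_rest in toc:
        # If a title appears more than once, the first occurrence wins.
        levels.setdefault(_normalize(title), lvl)

    out = []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*?)\s*$", line)
        if m:
            lvl = levels.get(_normalize(m.group(2)))
            if lvl is not None:
                # Markdown supports at most six heading levels.
                line = "#" * min(lvl, 6) + " " + m.group(2)
        out.append(line)
    return "\n".join(out)
```

Typical use would be `fixed = apply_toc_levels(markdown_text, load_toc("input.pdf"))`, where `markdown_text` is the Markdown exported by Docling. Headings that have no counterpart in the outline are left at whatever level the converter assigned.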
-
@roninio Note that docling-parse can also provide the TOC and populate it (see here: https://github.com/docling-project/docling-core/blob/763e1364ff0b95388696ccd3d69f150718012a3a/docling_core/types/doc/page.py#L463). We plan to propagate this info and use it to improve the heading tree.
-
I have a PDF document containing headings of various sizes and styles. After converting the PDF to a Markdown file using Docling, I've noticed that all headings are converted to level-2 headings (##), regardless of their original size or importance in the PDF.
I would like to know how to configure Docling, or whether there is an alternative method, so that the resulting Markdown file preserves the original heading hierarchy from the PDF. Specifically, I need the Markdown headings to reflect the relative size and importance of the headings in the original PDF (e.g., larger headings should become #, smaller headings ###, etc.).
Could you please provide information on how to achieve this? Thank you.