Commit a31bac3

jamie-lemon authored and JorjMcKie committed
Adds docs updates for Layout and some tidy-up.
1 parent 026467a commit a31bac3

9 files changed

+545
-233
lines changed

docs/about.rst

Lines changed: 17 additions & 1 deletion
@@ -80,6 +80,21 @@ Here are current results, grouped by task:
8080

8181
For more detail regarding the methodology for these performance timings see: :ref:`Performance Comparison Methodology<Appendix4>`.
8282

83+
84+
85+
Architecture
86+
----------------------
87+
88+
89+
|PyMuPDF| follows a four-layer system architecture with clear separation between components:
90+
91+
- **Python API Layer** - User-facing classes and functions
92+
- **Python Wrapper Layer** - High-level abstractions and utilities
93+
- **Binding Layer** - SWIG-generated Python-C bridges
94+
- **MuPDF Core Layer** - Native C/C++ implementation
95+
96+
.. include:: version.rst
97+
8398
.. _About_License:
8499

85100
License and Copyright
@@ -110,6 +125,7 @@ License and Copyright
110125
:title:`Artifex`, the :title:`Artifex` logo, :title:`MuPDF`, and the :title:`MuPDF` logo are registered trademarks of :title:`Artifex Software Inc.`
111126

112127

113-
.. include:: version.rst
128+
129+
114130

115131
.. include:: footer.rst

docs/header.rst

Lines changed: 17 additions & 1 deletion
@@ -36,10 +36,22 @@
3636

3737
<cite>PyMuPDF4LLM</cite>
3838

39+
.. |PyMuPDF Layout| raw:: html
40+
41+
<cite>PyMuPDF Layout</cite>
42+
3943
.. |Markdown| raw:: html
4044

4145
<cite>Markdown</cite>
4246

47+
.. |JSON| raw:: html
48+
49+
<cite>JSON</cite>
50+
51+
.. |TXT| raw:: html
52+
53+
<cite>TXT</cite>
54+
4355
.. |MuPDF| raw:: html
4456

4557
<cite>MuPDF</cite>
@@ -52,6 +64,10 @@
5264

5365
<cite>AGPL</cite>
5466

67+
.. |PyPI| raw:: html
68+
69+
<cite>PyPI</cite>
70+
5571
.. raw:: html
5672

5773
<style>
@@ -132,7 +148,7 @@
132148
</a>
133149
</div>
134150
<div class="forumLink" style="display:flex;align-items:baseline;margin-top: -5px;margin-left:10px;">
135-
<div id="forumCTAText"></div><a id="winking-cow-link" style="font-weight: bold;" href="https://forum.pymupdf.com">Try our forum! <img alt="MuPDF Forum link logo" id="winking-cow-image" src="https://pymupdf.readthedocs.io/en/latest/_static/forum-logo.gif" width=38px height=auto /></a>
151+
<div id="forumCTAText"></div><a id="winking-cow-link" style="font-weight: bold;" href="https://forum.mupdf.com">Try our forum! <img alt="MuPDF Forum link logo" id="winking-cow-image" src="https://pymupdf.readthedocs.io/en/latest/_static/forum-logo.gif" width=38px height=auto /></a>
136152
</div>
137153
</div>
138154

docs/index.rst

Lines changed: 9 additions & 1 deletion
@@ -23,6 +23,12 @@ Welcome to |PyMuPDF|
2323

2424
|PyMuPDF| is hosted on `GitHub <https://github.com/pymupdf/PyMuPDF>`_ and registered on `PyPI <https://pypi.org/project/PyMuPDF/>`_.
2525

26+
.. if we want a Github repo badge
27+
28+
.. raw:: html
29+
30+
<a href="https://trendshift.io/repositories/11536" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11536" alt="pymupdf%2FPyMuPDF | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
31+
2632
2733
----
2834

@@ -35,8 +41,10 @@ This documentation covers all versions up to |version|.
3541
:maxdepth: 1
3642

3743
about.rst
44+
pymupdf-layout/index.rst
3845
pymupdf4llm/index.rst
39-
pymupdf-pro.rst
46+
pymupdf-pro/index.rst
47+
4048

4149

4250
.. toctree::

docs/pymupdf-layout/index.rst

Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
1+
2+
.. include:: ../header.rst
3+
4+
.. _pymupdf-layout:
5+
6+
7+
PyMuPDF Layout
8+
===========================================================================
9+
10+
11+
|PyMuPDF Layout| is a lightweight layout analysis extension for |PyMuPDF| that turns PDFs into clean, structured data with minimal setup. It’s fast, accurate, and efficient without any GPU requirement.
12+
13+
It is an optional, but recommended, addition to the |PyMuPDF| library, especially if you need to extract structured data more accurately and with better semantic information.
14+
15+
16+
Installing
17+
----------------------------------
18+
19+
Install from |PyPI| with::

    pip install pymupdf-layout
23+
24+
25+
Using
26+
----------------------------------
27+
28+
29+
In a nutshell, |PyMuPDF Layout| detects the layout to extract, but we need |PyMuPDF4LLM| for the API. This provides us with options to extract document content as |Markdown|, |JSON| or |TXT|.
30+
31+
Let's set up the Python environment and open a PDF, then move on to the semantic data extraction.
32+
33+
Register packages and open a PDF
34+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
35+
36+
First up let's import the libraries and open a sample document::

    import pymupdf.layout
    import pymupdf4llm
    doc = pymupdf.open("sample.pdf")
41+
42+
Note that, in the above code, |PyMuPDF Layout| must be imported as shown, before importing |PyMuPDF4LLM|, to activate |PyMuPDF|'s layout feature and make it available to |PyMuPDF4LLM|.
43+
44+
Omitting the first line would cause execution of standard |PyMuPDF4LLM| - without the layout feature!
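
The ordering requirement can be checked at runtime. The following is a minimal sketch and not part of any |PyMuPDF| API: the helper name ``layout_import_order_ok`` is hypothetical, and it only assumes that an imported module is listed in ``sys.modules`` under its dotted name.

.. code-block:: python

    import sys


    def layout_import_order_ok(modules=None):
        """Return True if layout is loaded and pymupdf4llm is not yet imported.

        Hypothetical helper: it inspects module names only and does not
        query PyMuPDF itself.
        """
        modules = sys.modules if modules is None else modules
        layout_loaded = "pymupdf.layout" in modules
        llm_loaded = "pymupdf4llm" in modules
        # Safe case: layout is present and pymupdf4llm has not been imported yet.
        return layout_loaded and not llm_loaded

Calling it immediately before ``import pymupdf4llm`` gives an early warning if the layout import was forgotten or placed too late.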
45+
46+
Extract the structured data
47+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
48+
49+
We've activated the |PyMuPDF Layout| library and loaded a document. Next, let's extract the structured data. This works like a super-charged version of standard |PyMuPDF4LLM|, with ``Layout`` working behind the scenes and combining heuristics with machine learning for better extraction results.
50+
51+
Extract as Markdown
52+
""""""""""""""""""""""""
53+
54+
.. code-block:: python

    md = pymupdf4llm.to_markdown(doc)
57+
58+
59+
Extract as JSON
60+
"""""""""""""""""
61+
62+
.. code-block:: python
63+
64+
json = pymupdf4llm.to_json(doc)
65+
66+
67+
Extract as TXT
68+
"""""""""""""""""
69+
70+
.. code-block:: python
71+
72+
txt = pymupdf4llm.to_text(doc)
73+
74+
.. note::

   Please refer to the full :ref:`PyMuPDF4LLM API <pymupdf4llm-api>` for more.
77+
78+
Finally we can save the output to an external file as follows::

    from pathlib import Path
    suffix = ".md"  # or ".json" or ".txt"
    Path(doc.name).with_suffix(suffix).write_bytes(md.encode())
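
The same pattern can be wrapped in a small function that handles all three formats. This is a sketch under stated assumptions: ``save_output`` is a hypothetical helper name, and UTF-8 is chosen explicitly here rather than relying on a default.

.. code-block:: python

    from pathlib import Path


    def save_output(source_name, content, suffix):
        """Write extracted content next to the source file, swapping the suffix."""
        if suffix not in {".md", ".json", ".txt"}:
            raise ValueError(f"unexpected suffix: {suffix!r}")
        target = Path(source_name).with_suffix(suffix)
        # Encode explicitly so the bytes on disk do not depend on the platform.
        target.write_bytes(content.encode("utf-8"))
        return target

For example, ``save_output(doc.name, md, ".md")`` would write the Markdown string from the earlier step beside the input PDF.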
83+
84+
85+
Headers & Footers
86+
~~~~~~~~~~~~~~~~~~~~~~~
87+
88+
89+
Many PDFs carry header and footer information on each page, which you may or may not want to include. This information is often repetitive and not needed: the same logo, document title, or page number is rarely important when extracting the document content.
90+
91+
|PyMuPDF Layout| is trained to detect these typical document elements and is able to omit them.
92+
93+
So in this case we can adjust our API calls to ignore these elements as follows::

    md = pymupdf4llm.to_markdown(doc, header=False, footer=False)
    txt = pymupdf4llm.to_text(doc, header=False, footer=False)
97+
98+
99+
.. note::

   Page ``header`` / ``footer`` exclusion is not applicable to JSON output, as it aims to always represent all data for the included pages. Please refer to the full :ref:`PyMuPDF4LLM API <pymupdf4llm-api>` for more.
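
When the output format is chosen at runtime, the difference between JSON and the other formats can be kept in one place. The sketch below is illustrative only: ``extraction_kwargs`` and its format names are hypothetical, and it simply leaves out ``header`` / ``footer`` for JSON, where they are not applicable.

.. code-block:: python

    def extraction_kwargs(fmt, keep_header=True, keep_footer=True):
        """Build keyword arguments for a pymupdf4llm extraction call.

        fmt is "markdown", "text" or "json" (names chosen for this sketch).
        """
        if fmt not in ("markdown", "text", "json"):
            raise ValueError(f"unknown format: {fmt!r}")
        if fmt == "json":
            # Header/footer exclusion does not apply to JSON output.
            return {}
        return {"header": keep_header, "footer": keep_footer}

The result can then be unpacked into the call, for example ``pymupdf4llm.to_text(doc, **extraction_kwargs("text", keep_header=False, keep_footer=False))``.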
102+
103+
Extending Capability
104+
----------------------------------
105+
106+
107+
Using with Pro
108+
~~~~~~~~~~~~~~~~~
109+
110+
We can extend |PyMuPDF Layout| to work with |PyMuPDF Pro|, which allows Office documents to be provided as input files. All we have to do is import |PyMuPDF Pro| alongside |PyMuPDF Layout| and unlock it before converting any documents::

    import pymupdf.layout
    import pymupdf.pro
    import pymupdf4llm
    pymupdf.pro.unlock()
116+
117+
Now we can happily load Office files and convert them as follows::

    md = pymupdf4llm.to_markdown("sample.docx")
120+
121+
122+
123+
OCR support
124+
~~~~~~~~~~~~~~~~~
125+
126+
The new layout-sensitive PyMuPDF4LLM version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.
127+
128+
If Tesseract is not installed on your platform, no OCR is attempted.
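
To find out up front whether OCR is likely to be attempted, a rough availability check can be sketched with the standard library alone. This is a heuristic built on assumptions, not the detection logic used by |PyMuPDF|: it looks for a ``tesseract`` executable on ``PATH`` and for the ``TESSDATA_PREFIX`` environment variable, and the function name is made up for illustration.

.. code-block:: python

    import os
    import shutil


    def tesseract_probably_available():
        """Heuristic: True if a tesseract binary or TESSDATA_PREFIX is found."""
        has_binary = shutil.which("tesseract") is not None
        has_tessdata = bool(os.environ.get("TESSDATA_PREFIX"))
        return has_binary or has_tessdata

A ``False`` result suggests that scanned pages will be processed without OCR, so their text content may be missing from the output.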
129+
130+
131+
132+
----
133+
134+
.. _pymupdf_layout_and_pymupdf4llm_api:
135+
136+
PyMuPDF Layout and parameter caveats
137+
--------------------------------------
138+
139+
140+
|PyMuPDF Layout| uses |PyMuPDF4LLM| for its interface. However, if you have imported ``Layout`` then the following caveats apply to the method parameters:
141+
142+
143+
+-------------------+-------------+---------+---------+----------------------------------+
144+
| Parameter | to_markdown | to_text | to_json | Comments |
145+
+===================+=============+=========+=========+==================================+
146+
| doc | ✔️ | ✔️ | ✔️ | |
147+
+-------------------+-------------+---------+---------+----------------------------------+
148+
| header | ✔️ | ✔️ | ignored | **new:** replaces ``margins`` |
149+
+-------------------+-------------+---------+---------+----------------------------------+
150+
| footer | ✔️ | ✔️ | ignored | **new:** replaces ``margins`` |
151+
+-------------------+-------------+---------+---------+----------------------------------+
152+
| detect_bg_color |||| |
153+
+-------------------+-------------+---------+---------+----------------------------------+
154+
| dpi | ✔️ | ✔️ | ✔️ | |
155+
+-------------------+-------------+---------+---------+----------------------------------+
156+
| embed_images | ✔️ | ✔️ | ✔️ | |
157+
+-------------------+-------------+---------+---------+----------------------------------+
158+
| extract_words | later | later | later | postponed |
159+
+-------------------+-------------+---------+---------+----------------------------------+
160+
| filename | ✔️ | ✔️ | ✔️ | |
161+
+-------------------+-------------+---------+---------+----------------------------------+
162+
| fontsize_limit |||| obsolete |
163+
+-------------------+-------------+---------+---------+----------------------------------+
164+
| force_text |||| text in pictures is always |
165+
| | | | | ignored |
166+
+-------------------+-------------+---------+---------+----------------------------------+
167+
| graphics_limit |||| obsolete |
168+
+-------------------+-------------+---------+---------+----------------------------------+
169+
| hdr_info |||| obsolete |
170+
+-------------------+-------------+---------+---------+----------------------------------+
171+
| ignore_alpha |||| |
172+
+-------------------+-------------+---------+---------+----------------------------------+
173+
| ignore_code | ✔️ | ✔️ | ✔️ | |
174+
+-------------------+-------------+---------+---------+----------------------------------+
175+
| ignore_graphics |||| obsolete |
176+
+-------------------+-------------+---------+---------+----------------------------------+
177+
| ignore_images |||| obsolete |
178+
+-------------------+-------------+---------+---------+----------------------------------+
179+
| image_format | ✔️ | ✔️ | ✔️ | |
180+
+-------------------+-------------+---------+---------+----------------------------------+
181+
| image_path | ✔️ | ✔️ | ✔️ | |
182+
+-------------------+-------------+---------+---------+----------------------------------+
183+
| image_size_limit |||| obsolete |
184+
+-------------------+-------------+---------+---------+----------------------------------+
185+
| margins |||| obsolete |
186+
+-------------------+-------------+---------+---------+----------------------------------+
187+
| page_chunks | later | later | later | postponed |
188+
+-------------------+-------------+---------+---------+----------------------------------+
189+
| page_height | later | later | later | postponed |
190+
+-------------------+-------------+---------+---------+----------------------------------+
191+
| page_separators | later | later | later | postponed |
192+
+-------------------+-------------+---------+---------+----------------------------------+
193+
| page_width | later | later | later | postponed |
194+
+-------------------+-------------+---------+---------+----------------------------------+
195+
| pages | ✔️ | ✔️ | ✔️ | |
196+
+-------------------+-------------+---------+---------+----------------------------------+
197+
| show_progress | later | later | later | postponed |
198+
+-------------------+-------------+---------+---------+----------------------------------+
199+
| table_strategy |||| obsolete |
200+
+-------------------+-------------+---------+---------+----------------------------------+
201+
| use_glyphs |||| always show &#xfffd; |
202+
+-------------------+-------------+---------+---------+----------------------------------+
203+
| write_images | ✔️ | ✔️ | ✔️ | |
204+
+-------------------+-------------+---------+---------+----------------------------------+
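
Code written for standard |PyMuPDF4LLM| may still pass parameters that the table lists as obsolete or postponed. One way to spot them is sketched below. The names ``LAYOUT_SUPPORTED`` and ``split_layout_kwargs`` are hypothetical, and the set simply transcribes the rows marked ✔️ in the table (``header`` and ``footer`` are ignored by ``to_json``).

.. code-block:: python

    # Parameters marked as supported in the caveats table.
    LAYOUT_SUPPORTED = frozenset({
        "doc", "header", "footer", "dpi", "embed_images", "filename",
        "ignore_code", "image_format", "image_path", "pages", "write_images",
    })


    def split_layout_kwargs(kwargs):
        """Return (kept, dropped): supported arguments and the names left out."""
        kept = {k: v for k, v in kwargs.items() if k in LAYOUT_SUPPORTED}
        dropped = sorted(k for k in kwargs if k not in LAYOUT_SUPPORTED)
        return kept, dropped

Logging the ``dropped`` names during a migration makes it visible which settings no longer have an effect once ``Layout`` is imported.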
205+
206+
207+
208+
209+
210+
211+
.. include:: ../footer.rst
