|
| 1 | + |
| 2 | +.. include:: ../header.rst |
| 3 | + |
| 4 | +.. _pymupdf-layout: |
| 5 | +
|
| 6 | +
|
| 7 | +PyMuPDF Layout |
| 8 | +=========================================================================== |
| 9 | + |
| 10 | + |
| 11 | +|PyMuPDF Layout| is a lightweight layout analysis extension for |PyMuPDF| that turns PDFs into clean, structured data with minimal setup. It is fast, accurate, and efficient, and it does not require a GPU. |
| 12 | + |
| 13 | +It is an optional but recommended addition to the |PyMuPDF| library, especially if you need to extract structured data more accurately and with better semantic information. |
| 14 | + |
| 15 | + |
| 16 | +Installing |
| 17 | +---------------------------------- |
| 18 | + |
| 19 | +Install from |PyPI| with:: |
| 20 | + |
| 21 | + |
| 22 | + pip install pymupdf-layout |
| 23 | + |
| 24 | + |
| 25 | +Using |
| 26 | +---------------------------------- |
| 27 | + |
| 28 | + |
| 29 | +In a nutshell, |PyMuPDF Layout| detects the page layout, while |PyMuPDF4LLM| provides the API. Together they give us options to extract document content as |Markdown|, |JSON| or |TXT|. |
| 30 | + |
| 31 | +Let's set up the Python environment and open a PDF, then move on to the semantic data extraction. |
| 32 | + |
| 33 | +Register packages and open a PDF |
| 34 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 35 | + |
| 36 | +First, let's import the libraries and open a sample document:: |
| 37 | + |
| 38 | + import pymupdf.layout |
| 39 | + import pymupdf4llm |
| 40 | + doc = pymupdf.open("sample.pdf") |
| 41 | + |
| 42 | +Note that, in the code above, |PyMuPDF Layout| must be imported exactly as shown, and before |PyMuPDF4LLM|, to activate |PyMuPDF|'s layout feature and make it available to |PyMuPDF4LLM|. |
| 43 | + |
| 44 | +Omitting the first line causes standard |PyMuPDF4LLM| to run, without the layout feature! |
| 45 | + |
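Because a missing layout import silently falls back to standard behaviour, it can help to check up front whether the layout module is installed. The following is a minimal sketch using only the standard library; the helper name ``layout_available`` is hypothetical and not part of any PyMuPDF API.

```python
import importlib.util


def layout_available() -> bool:
    """Return True if the ``pymupdf.layout`` module can be located."""
    try:
        # find_spec imports the parent package (pymupdf) to resolve the
        # submodule, so a missing PyMuPDF raises ImportError here.
        return importlib.util.find_spec("pymupdf.layout") is not None
    except ImportError:
        return False


if layout_available():
    import pymupdf.layout  # noqa: F401  (must precede importing pymupdf4llm)
else:
    print("pymupdf-layout not found; standard PyMuPDF4LLM behaviour applies")
```

Running this check before importing |PyMuPDF4LLM| keeps the required import order intact while making the fallback visible.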
| 46 | +Extract the structured data |
| 47 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 48 | + |
| 49 | +We've activated the |PyMuPDF Layout| library and loaded a document. Next, let's extract the structured data. This works like a super-charged version of standard |PyMuPDF4LLM|, with ``Layout`` working behind the scenes and combining heuristics with machine learning for better extraction results. |
| 50 | + |
| 51 | +Extract as Markdown |
| 52 | +"""""""""""""""""""""""" |
| 53 | + |
| 54 | +.. code-block:: python |
| 55 | +
|
| 56 | + md = pymupdf4llm.to_markdown(doc) |
| 57 | +
|
| 58 | +
|
| 59 | +Extract as JSON |
| 60 | +""""""""""""""""" |
| 61 | + |
| 62 | +.. code-block:: python |
| 63 | +
|
| 64 | + json_text = pymupdf4llm.to_json(doc) |
| 65 | +
|
| 66 | +
|
| 67 | +Extract as TXT |
| 68 | +""""""""""""""""" |
| 69 | + |
| 70 | +.. code-block:: python |
| 71 | +
|
| 72 | + txt = pymupdf4llm.to_text(doc) |
| 73 | +
|
| 74 | +.. note:: |
| 75 | + |
| 76 | + Please refer to the full :ref:`PyMuPDF4LLM API <pymupdf4llm-api>` for more. |
| 77 | + |
| 78 | +Finally, we can save the output to an external file as follows:: |
| 79 | + |
| 80 | + from pathlib import Path |
| 81 | + suffix = ".md" # or ".json" or ".txt" |
| 82 | + Path(doc.name).with_suffix(suffix).write_bytes(md.encode()) |
| 83 | + |
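The save step above can be wrapped in a small reusable function. This is a sketch only: ``save_output`` is a hypothetical helper, and the set of accepted suffixes simply mirrors the three output formats described in this section.

```python
from pathlib import Path


def save_output(source_name: str, text: str, suffix: str = ".md") -> Path:
    """Write ``text`` next to ``source_name``, swapping the file suffix."""
    if suffix not in {".md", ".json", ".txt"}:
        raise ValueError(f"unexpected suffix: {suffix!r}")
    target = Path(source_name).with_suffix(suffix)
    # Explicit UTF-8 so the result does not depend on the platform default.
    target.write_text(text, encoding="utf-8")
    return target
```

With the variables from the earlier snippets, ``save_output(doc.name, md)`` would write ``sample.md`` alongside ``sample.pdf``.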
| 84 | + |
| 85 | +Headers & Footers |
| 86 | +~~~~~~~~~~~~~~~~~~~~~~~ |
| 87 | + |
| 88 | + |
| 89 | +Many PDFs carry header and footer information on every page, which you may or may not want to include. This information is often repetitive and not needed (e.g. the same logo, document title or page number is rarely important when extracting the document content). |
| 90 | + |
| 91 | +|PyMuPDF Layout| is trained to detect these typical document elements and is able to omit them. |
| 92 | + |
| 93 | +So in this case we can adjust our API calls to ignore these elements as follows:: |
| 94 | + |
| 95 | + md = pymupdf4llm.to_markdown(doc, header=False, footer=False) |
| 96 | + txt = pymupdf4llm.to_text(doc, header=False, footer=False) |
| 97 | + |
| 98 | + |
| 99 | +.. note:: |
| 100 | + |
| 101 | + Page ``header`` / ``footer`` exclusion is not applicable to JSON output, as it aims to always represent all data for the included pages. Please refer to the full :ref:`PyMuPDF4LLM API <pymupdf4llm-api>` for more. |
| 102 | + |
| 103 | +Extending Capability |
| 104 | +---------------------------------- |
| 105 | + |
| 106 | + |
| 107 | +Using with Pro |
| 108 | +~~~~~~~~~~~~~~~~~ |
| 109 | + |
| 110 | +We can extend |PyMuPDF Layout| to work with |PyMuPDF Pro|, which adds Office documents as supported input files. In this case, all we have to do is import |PyMuPDF Pro| alongside |PyMuPDF Layout| and unlock it before converting any documents:: |
| 111 | + |
| 112 | + import pymupdf.layout |
| 113 | + import pymupdf.pro |
| 114 | + import pymupdf4llm |
| 115 | + pymupdf.pro.unlock() |
| 116 | + |
| 117 | +Now we can happily load Office files and convert them as follows:: |
| 118 | + |
| 119 | + md = pymupdf4llm.to_markdown("sample.docx") |
| 120 | + |
| 121 | + |
| 122 | + |
| 123 | +OCR support |
| 124 | +~~~~~~~~~~~~~~~~~ |
| 125 | + |
| 126 | +The new layout-sensitive version of |PyMuPDF4LLM| also evaluates whether a page would benefit from OCR. If its heuristics conclude that it would, the built-in Tesseract-OCR module is invoked automatically, and its results are then handled like normal page content. |
| 127 | + |
| 128 | +If Tesseract is not installed on your platform, no OCR is attempted. |
| 129 | + |
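Since OCR is skipped silently when Tesseract is absent, a quick environment check can explain why scanned pages come back empty. The sketch below is a rough heuristic using only the standard library; it is an assumption about how a Tesseract installation typically shows up (a ``TESSDATA_PREFIX`` variable or a ``tesseract`` executable on ``PATH``), not a description of how |PyMuPDF4LLM| itself decides.

```python
import os
import shutil


def tesseract_probably_available() -> bool:
    """Heuristic: is a Tesseract installation likely to be present?"""
    # TESSDATA_PREFIX conventionally points at Tesseract's language data.
    if os.environ.get("TESSDATA_PREFIX"):
        return True
    # Otherwise look for the tesseract executable on PATH.
    return shutil.which("tesseract") is not None


if not tesseract_probably_available():
    print("Tesseract not detected; OCR will probably not be attempted")
```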
| 130 | + |
| 131 | + |
| 132 | +---- |
| 133 | + |
| 134 | +.. _pymupdf_layout_and_pymupdf4llm_api: |
| 135 | + |
| 136 | +PyMuPDF Layout and parameter caveats |
| 137 | +-------------------------------------- |
| 138 | + |
| 139 | + |
| 140 | +|PyMuPDF Layout| uses |PyMuPDF4LLM| for its interface. However, if you have imported ``Layout`` then the following caveats apply to the method parameters: |
| 141 | + |
| 142 | + |
| 143 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 144 | +| Parameter | to_markdown | to_text | to_json | Comments | |
| 145 | ++===================+=============+=========+=========+==================================+ |
| 146 | +| doc | ✔️ | ✔️ | ✔️ | | |
| 147 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 148 | +| header | ✔️ | ✔️ | ignored | **new:** replaces ``margins`` | |
| 149 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 150 | +| footer | ✔️ | ✔️ | ignored | **new:** replaces ``margins`` | |
| 151 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 152 | +| detect_bg_color | ❌ | ❌ | ❌ | | |
| 153 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 154 | +| dpi | ✔️ | ✔️ | ✔️ | | |
| 155 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 156 | +| embed_images | ✔️ | ✔️ | ✔️ | | |
| 157 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 158 | +| extract_words | later | later | later | postponed | |
| 159 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 160 | +| filename | ✔️ | ✔️ | ✔️ | | |
| 161 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 162 | +| fontsize_limit | ❌ | ❌ | ❌ | obsolete | |
| 163 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 164 | +| force_text | ❌ | ❌ | ❌ | text in pictures is always | |
| 165 | +| | | | | ignored | |
| 166 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 167 | +| graphics_limit | ❌ | ❌ | ❌ | obsolete | |
| 168 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 169 | +| hdr_info | ❌ | ❌ | ❌ | obsolete | |
| 170 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 171 | +| ignore_alpha | ❌ | ❌ | ❌ | | |
| 172 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 173 | +| ignore_code | ✔️ | ✔️ | ✔️ | | |
| 174 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 175 | +| ignore_graphics | ❌ | ❌ | ❌ | obsolete | |
| 176 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 177 | +| ignore_images | ❌ | ❌ | ❌ | obsolete | |
| 178 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 179 | +| image_format | ✔️ | ✔️ | ✔️ | | |
| 180 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 181 | +| image_path | ✔️ | ✔️ | ✔️ | | |
| 182 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 183 | +| image_size_limit | ❌ | ❌ | ❌ | obsolete | |
| 184 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 185 | +| margins | ❌ | ❌ | ❌ | obsolete | |
| 186 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 187 | +| page_chunks | later | later | later | postponed | |
| 188 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 189 | +| page_height | later | later | later | postponed | |
| 190 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 191 | +| page_separators | later | later | later | postponed | |
| 192 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 193 | +| page_width | later | later | later | postponed | |
| 194 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 195 | +| pages | ✔️ | ✔️ | ✔️ | | |
| 196 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 197 | +| show_progress | later | later | later | postponed | |
| 198 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 199 | +| table_strategy | ❌ | ❌ | ❌ | obsolete | |
| 200 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 201 | +| use_glyphs | ❌ | ❌ | ❌ | always show � | |
| 202 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 203 | +| write_images | ✔️ | ✔️ | ✔️ | | |
| 204 | ++-------------------+-------------+---------+---------+----------------------------------+ |
| 205 | + |
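The table above can be turned into a small pre-flight check that warns about keyword arguments which have no effect once ``Layout`` is imported. This is a sketch under stated assumptions: ``filter_params`` is a hypothetical helper, and the three sets are transcribed by hand from the table, so they need updating whenever the table changes.

```python
import warnings

# Parameter status with Layout imported, transcribed from the table above.
SUPPORTED = {
    "doc", "header", "footer", "dpi", "embed_images", "filename",
    "ignore_code", "image_format", "image_path", "pages", "write_images",
}
POSTPONED = {
    "extract_words", "page_chunks", "page_height", "page_separators",
    "page_width", "show_progress",
}
NOT_AVAILABLE = {
    "detect_bg_color", "fontsize_limit", "force_text", "graphics_limit",
    "hdr_info", "ignore_alpha", "ignore_graphics", "ignore_images",
    "image_size_limit", "margins", "table_strategy", "use_glyphs",
}
IGNORED_BY_JSON = {"header", "footer"}


def filter_params(method: str, **kwargs) -> dict:
    """Return the kwargs that take effect for ``method``; warn about the rest."""
    accepted = {}
    for name, value in kwargs.items():
        if method == "to_json" and name in IGNORED_BY_JSON:
            warnings.warn(f"{name!r} is ignored by to_json", stacklevel=2)
        elif name in SUPPORTED:
            accepted[name] = value
        elif name in POSTPONED:
            warnings.warn(f"{name!r} is postponed with Layout", stacklevel=2)
        elif name in NOT_AVAILABLE:
            warnings.warn(f"{name!r} is not available with Layout", stacklevel=2)
        else:
            warnings.warn(f"{name!r} is not listed in the table", stacklevel=2)
    return accepted
```

The returned dictionary could then be passed on, for example as ``pymupdf4llm.to_markdown(doc, **filter_params("to_markdown", pages=[0], header=False))``.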
| 206 | + |
| 207 | + |
| 208 | + |
| 209 | + |
| 210 | + |
| 211 | +.. include:: ../footer.rst |