pymupdf
diff --git a/‎docs/about.rst‎
Lines changed: 17 additions & 1 deletion b/‎docs/about.rst‎
diff --git a/‎docs/header.rst‎
Lines changed: 17 additions & 1 deletion b/‎docs/header.rst‎
diff --git a/‎docs/index.rst‎
Lines changed: 9 additions & 1 deletion b/‎docs/index.rst‎
diff --git a/‎docs/pymupdf-layout/index.rst‎
Lines changed: 211 additions & 0 deletions b/‎docs/pymupdf-layout/index.rst‎
@@ -80,6 +80,21 @@ Here are current results, grouped by task:
 
    For more detail regarding the methodology for these performance timings see: :ref:`Performance Comparison Methodology<Appendix4>`.
 
+
+
+Architecture
+----------------------
+
+
+|PyMuPDF| follows a four-layer system architecture with clear separation between components:
+
+- **Python API Layer** - User-facing classes and functions
+- **Python Wrapper Layer** - High-level abstractions and utilities
+- **Binding Layer** - SWIG-generated Python-C bridges
+- **MuPDF Core Layer** - Native C/C++ implementation
+
+.. include:: version.rst
+
 .. _About_License:
 
 License and Copyright
@@ -110,6 +125,7 @@ License and Copyright
 :title:`Artifex`, the :title:`Artifex` logo, :title:`MuPDF`, and the :title:`MuPDF` logo are registered trademarks of :title:`Artifex Software Inc.`
 
 
-.. include:: version.rst
+
+
 
 .. include:: footer.rst
@@ -36,10 +36,22 @@
 
     <cite>PyMuPDF4LLM</cite>
 
+.. |PyMuPDF Layout| raw:: html
+
+    <cite>PyMuPDF Layout</cite>
+
 .. |Markdown| raw:: html
 
     <cite>Markdown</cite>
 
+.. |JSON| raw:: html
+
+    <cite>JSON</cite>
+
+.. |TXT| raw:: html
+
+    <cite>TXT</cite>
+
 .. |MuPDF| raw:: html
 
     <cite>MuPDF</cite>
@@ -52,6 +64,10 @@
 
     <cite>AGPL</cite>
 
+.. |PyPI| raw:: html
+
+    <cite>PyPI</cite>
+
 .. raw:: html
 
     <style>
@@ -132,7 +148,7 @@
             </a>
         </div>
         <div class="forumLink" style="display:flex;align-items:baseline;margin-top: -5px;margin-left:10px;">
-            <div id="forumCTAText"></div><a id="winking-cow-link" style="font-weight: bold;" href="https://forum.pymupdf.com">Try our forum! <img alt="MuPDF Forum link logo" id="winking-cow-image" src="https://pymupdf.readthedocs.io/en/latest/_static/forum-logo.gif" width=38px height=auto /></a>
+            <div id="forumCTAText"></div><a id="winking-cow-link" style="font-weight: bold;" href="https://forum.mupdf.com">Try our forum! <img alt="MuPDF Forum link logo" id="winking-cow-image" src="https://pymupdf.readthedocs.io/en/latest/_static/forum-logo.gif" width=38px height=auto /></a>
         </div>
     </div>
 
 
@@ -23,6 +23,12 @@ Welcome to |PyMuPDF|
 
 |PyMuPDF| is hosted on `GitHub <https://github.com/pymupdf/PyMuPDF>`_ and registered on `PyPI <https://pypi.org/project/PyMuPDF/>`_.
 
+.. if we want a Github repo badge
+
+   .. raw:: html
+
+      <a href="https://trendshift.io/repositories/11536" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11536" alt="pymupdf%2FPyMuPDF | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
+
 
 ----
 
@@ -35,8 +41,10 @@ This documentation covers all versions up to |version|.
    :maxdepth: 1
 
    about.rst
+   pymupdf-layout/index.rst
    pymupdf4llm/index.rst
-   pymupdf-pro.rst
+   pymupdf-pro/index.rst
+
 
 
 .. toctree::
 
@@ -0,0 +1,211 @@
+
+.. include:: ../header.rst
+
+.. _pymupdf-layout:
+
+
+PyMuPDF Layout
+===========================================================================
+
+
+|PyMuPDF Layout| is a lightweight layout analysis extension for |PyMuPDF| that turns PDFs into clean, structured data with minimal setup. It’s fast, accurate, and efficient without any GPU requirement.
+
+It is an optional but recommended addition to the |PyMuPDF| library, especially if you need to extract structured data more accurately and with better semantic information.
+
+
+Installing
+----------------------------------
+
+Install from |PyPI| with::
+
+
+    pip install pymupdf-layout
+
+
+Using
+----------------------------------
+
+
+In a nutshell, |PyMuPDF Layout| detects the layout to extract, while |PyMuPDF4LLM| provides the API. Together they let us extract document content as |Markdown|, |JSON| or |TXT|.
+
+Let's set up the Python environment and open a PDF, then move on to the semantic data extraction.
+
+Register packages and open a PDF
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First up let's import the libraries and open a sample document::
+
+    import pymupdf.layout
+    import pymupdf4llm
+    doc = pymupdf.open("sample.pdf")
+
+Note that |PyMuPDF Layout| must be imported as shown, before |PyMuPDF4LLM|, to activate |PyMuPDF|'s layout feature and make it available to |PyMuPDF4LLM|.
+
+Omitting the first line runs standard |PyMuPDF4LLM|, without the layout feature!
+
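+The import-order requirement can be checked at runtime. The following is a minimal diagnostic sketch, not part of either library: ``layout_imported_first`` is a hypothetical helper, and it assumes that the key order of ``sys.modules`` reflects the order in which modules were first imported.
+

```python
import sys


def layout_imported_first(loaded=None):
    """Return True if pymupdf.layout was loaded before pymupdf4llm."""
    # sys.modules is a dict, so its key order follows first-import order
    # (assumption: nobody has removed and re-added entries by hand).
    names = list(sys.modules if loaded is None else loaded)
    if "pymupdf4llm" not in names:
        return True  # pymupdf4llm not imported yet, so the order is still open
    if "pymupdf.layout" not in names:
        return False  # pymupdf4llm was imported without the layout module
    return names.index("pymupdf.layout") < names.index("pymupdf4llm")
```

+
+Calling ``layout_imported_first()`` with no argument inspects the running interpreter; passing a list of module names makes the check easy to unit test.
+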
+Extract the structured data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We've activated the |PyMuPDF Layout| library and loaded a document. Next, let's extract the structured data. This is like a super-charged version of standard |PyMuPDF4LLM|, with ``Layout`` working behind the scenes and combining heuristics with machine learning for better extraction results.
+
+Extract as Markdown
+""""""""""""""""""""""""
+
+.. code-block:: python
+
+    md = pymupdf4llm.to_markdown(doc)
+
+
+Extract as JSON
+"""""""""""""""""
+
+.. code-block:: python
+
+    json = pymupdf4llm.to_json(doc)
+
+
+Extract as TXT
+"""""""""""""""""
+
+.. code-block:: python
+
+    txt = pymupdf4llm.to_text(doc)
+
+.. note::
+
+    Please refer to the full :ref:`PyMuPDF4LLM API <pymupdf4llm-api>` for more.
+
+Finally we can save the output to an external file as follows::
+
+    from pathlib import Path
+    suffix = ".md" # or ".json" or ".txt"
+    Path(doc.name).with_suffix(suffix).write_bytes(md.encode())
+
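+The save step can be wrapped in a small helper that picks the suffix from the output kind. This is only a sketch: ``save_output`` and the ``SUFFIXES`` mapping are hypothetical names, the placeholder string stands in for real extraction output, and the temporary folder is used purely for demonstration.
+

```python
import tempfile
from pathlib import Path

# Assumed mapping from output kind to file suffix (mirrors the comment above).
SUFFIXES = {"markdown": ".md", "json": ".json", "text": ".txt"}


def save_output(source_name, content, kind, out_dir=None):
    """Write extracted text next to the source file, or into out_dir."""
    target = Path(source_name).with_suffix(SUFFIXES[kind])
    if out_dir is not None:
        target = Path(out_dir) / target.name
    # Explicit UTF-8 so the result does not depend on the platform default.
    target.write_text(content, encoding="utf-8")
    return target


# Demonstration with placeholder content instead of real extraction output.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_output("sample.pdf", "# Title\n", "markdown", out_dir=tmp)
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")
```

+
+In real use, ``content`` would be the string returned by one of the extraction calls and ``source_name`` would be ``doc.name``.
+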
+
+Headers & Footers
+~~~~~~~~~~~~~~~~~~~~~~~
+
+
+Many PDFs carry header and footer information on each page, which you may or may not want to include. This information is often repetitive and not needed (e.g. the same logo, document title or page number rarely matters when extracting the document content).
+
+|PyMuPDF Layout| is trained to detect these typical document elements and can omit them.
+
+In this case we can adjust our API calls to ignore these elements as follows::
+
+    md = pymupdf4llm.to_markdown(doc, header=False, footer=False)
+    txt = pymupdf4llm.to_text(doc, header=False, footer=False)
+
+
+.. note::
+
+    Page ``header`` / ``footer`` exclusion is not applicable to JSON output, which aims to always represent all data for the included pages. Please refer to the full :ref:`PyMuPDF4LLM API <pymupdf4llm-api>` for more.
+
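+When one code path serves several output kinds, the header and footer switches can be prepared per kind. A minimal sketch follows, assuming the kind names ``"markdown"``, ``"text"`` and ``"json"``; ``header_footer_kwargs`` is a hypothetical helper and not part of the |PyMuPDF4LLM| API.
+

```python
def header_footer_kwargs(kind, keep_header=True, keep_footer=True):
    """Build header/footer keyword arguments for one output kind."""
    if kind == "json":
        # JSON output represents all page data, so no switches are passed.
        return {}
    if kind not in ("markdown", "text"):
        raise ValueError(f"unknown output kind: {kind!r}")
    return {"header": keep_header, "footer": keep_footer}
```

+
+The returned dictionary would then be unpacked into the call, for example ``pymupdf4llm.to_markdown(doc, **header_footer_kwargs("markdown", False, False))``.
+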
+Extending Capability
+----------------------------------
+
+
+Using with Pro
+~~~~~~~~~~~~~~~~~
+
+We can extend |PyMuPDF Layout| to work with |PyMuPDF Pro|, which adds Office documents as supported input files. All we have to do is add the import for |PyMuPDF Pro| and unlock it, as shown below::
+
+    import pymupdf.layout
+    import pymupdf.pro
+    import pymupdf4llm
+    pymupdf.pro.unlock()
+
+Now we can happily load Office files and convert them as follows::
+
+    md = pymupdf4llm.to_markdown("sample.docx")
+
+
+
+OCR support
+~~~~~~~~~~~~~~~~~
+
+The new layout-sensitive |PyMuPDF4LLM| version also evaluates whether a page would benefit from OCR. If its heuristics conclude that it would, the built-in Tesseract-OCR module is invoked automatically, and its results are then handled like normal page content.
+
+If Tesseract is not installed on your platform, no OCR is attempted.
+
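+To find out in advance whether OCR is likely to be available, one can look for signs of a Tesseract installation. The sketch below is an assumption-laden diagnostic: it checks for a ``tesseract`` executable on ``PATH`` and for the ``TESSDATA_PREFIX`` environment variable, which are common conventions of Tesseract installs. How the library itself locates Tesseract is not described here, and ``tesseract_hints`` is a hypothetical helper.
+

```python
import os
import shutil


def tesseract_hints():
    """Collect hints about whether Tesseract is present on this machine."""
    return {
        # Full path of a tesseract executable found on PATH, or None.
        "executable": shutil.which("tesseract"),
        # Language data folder, if the environment variable is set.
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),
    }


hints = tesseract_hints()
# True if at least one hint was found; this does not prove OCR will run.
tesseract_found = any(hints.values())
```

+
+A ``False`` result only suggests that pages needing OCR may be processed without it.
+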
+
+
+----
+
+.. _pymupdf_layout_and_pymupdf4llm_api:
+
+PyMuPDF Layout and parameter caveats
+--------------------------------------
+
+
+|PyMuPDF Layout| uses |PyMuPDF4LLM| for its interface. However, if you have imported ``Layout`` then the following caveats apply to the method parameters:
+
+
++-------------------+-------------+---------+---------+----------------------------------+
+| Parameter         | to_markdown | to_text | to_json | Comments                         |
++===================+=============+=========+=========+==================================+
+| doc               | ✔️          | ✔️      | ✔️      |                                  |
++-------------------+-------------+---------+---------+----------------------------------+
+| header            | ✔️          | ✔️      | ignored | **new:** replaces ``margins``    |
++-------------------+-------------+---------+---------+----------------------------------+
+| footer            | ✔️          | ✔️      | ignored | **new:** replaces ``margins``    |
++-------------------+-------------+---------+---------+----------------------------------+
+| detect_bg_color   | ❌          | ❌      | ❌      |                                  |
++-------------------+-------------+---------+---------+----------------------------------+
+| dpi               | ✔️          | ✔️      | ✔️      |                                  |
++-------------------+-------------+---------+---------+----------------------------------+
+| embed_images      | ✔️          | ✔️      | ✔️      |                                  |
++-------------------+-------------+---------+---------+----------------------------------+
+| extract_words     | later       | later   | later   | postponed                        |
++-------------------+-------------+---------+---------+----------------------------------+
+| filename          | ✔️          | ✔️      | ✔️      |                                  |
++-------------------+-------------+---------+---------+----------------------------------+
+| fontsize_limit    | ❌          | ❌      | ❌      | obsolete                         |
++-------------------+-------------+---------+---------+----------------------------------+
+| force_text        | ❌          | ❌      | ❌      | text in pictures is always       |
+|                   |             |         |         | ignored                          |
++-------------------+-------------+---------+---------+----------------------------------+
+| graphics_limit    | ❌          | ❌      | ❌      | obsolete                         |
++-------------------+-------------+---------+---------+----------------------------------+
+| hdr_info          | ❌          | ❌      | ❌      | obsolete                         |
++-------------------+-------------+---------+---------+----------------------------------+
+| ignore_alpha      | ❌          | ❌      | ❌      |                                  |
++-------------------+-------------+---------+---------+----------------------------------+
+| ignore_code       | ✔️          | ✔️      | ✔️      |                                  |
++-------------------+-------------+---------+---------+----------------------------------+
+| ignore_graphics   | ❌          | ❌      | ❌      | obsolete                         |
++-------------------+-------------+---------+---------+----------------------------------+
+| ignore_images     | ❌          | ❌      | ❌      | obsolete                         |
++-------------------+-------------+---------+---------+----------------------------------+
+| image_format      | ✔️          | ✔️      | ✔️      |                                  |
++-------------------+-------------+---------+---------+----------------------------------+
+| image_path        | ✔️          | ✔️      | ✔️      |                                  |
++-------------------+-------------+---------+---------+----------------------------------+
+| image_size_limit  | ❌          | ❌      | ❌      | obsolete                         |
++-------------------+-------------+---------+---------+----------------------------------+
+| margins           | ❌          | ❌      | ❌      | obsolete                         |
++-------------------+-------------+---------+---------+----------------------------------+
+| page_chunks       | later       | later   | later   | postponed                        |
++-------------------+-------------+---------+---------+----------------------------------+
+| page_height       | later       | later   | later   | postponed                        |
++-------------------+-------------+---------+---------+----------------------------------+
+| page_separators   | later       | later   | later   | postponed                        |
++-------------------+-------------+---------+---------+----------------------------------+
+| page_width        | later       | later   | later   | postponed                        |
++-------------------+-------------+---------+---------+----------------------------------+
+| pages             | ✔️          | ✔️      | ✔️      |                                  |
++-------------------+-------------+---------+---------+----------------------------------+
+| show_progress     | later       | later   | later   | postponed                        |
++-------------------+-------------+---------+---------+----------------------------------+
+| table_strategy    | ❌          | ❌      | ❌      | obsolete                         |
++-------------------+-------------+---------+---------+----------------------------------+
+| use_glyphs        | ❌          | ❌      | ❌      | always show U+FFFD               |
++-------------------+-------------+---------+---------+----------------------------------+
+| write_images      | ✔️          | ✔️      | ✔️      |                                  |
++-------------------+-------------+---------+---------+----------------------------------+
+
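+Code that was written for standard |PyMuPDF4LLM| may pass parameters that no longer apply once ``Layout`` is imported. As a sketch, the rows marked as supported for ``to_markdown`` and ``to_text`` in the table above can be turned into a filter. ``LAYOUT_SUPPORTED`` and ``split_layout_kwargs`` are hypothetical names, and how the library reacts to unsupported parameters is not stated here.
+

```python
# Keyword parameters marked as supported in the table above
# (``doc`` is left out because it is passed positionally).
LAYOUT_SUPPORTED = {
    "header", "footer", "dpi", "embed_images", "filename", "ignore_code",
    "image_format", "image_path", "pages", "write_images",
}


def split_layout_kwargs(kwargs):
    """Split keyword arguments into (supported, unsupported) dicts."""
    supported = {k: v for k, v in kwargs.items() if k in LAYOUT_SUPPORTED}
    unsupported = {k: v for k, v in kwargs.items() if k not in LAYOUT_SUPPORTED}
    return supported, unsupported
```

+
+For example, ``split_layout_kwargs({"pages": [0], "margins": 0})`` separates ``pages`` from the obsolete ``margins``, so the caller can log or drop the latter before calling the extraction function.
+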
+
+
+
+
+
+.. include:: ../footer.rst