Expanding the size of graphics (negative margin) #292

PK109 · 2025-07-11T08:24:10Z

PK109
Jul 11, 2025

I have PDF documentation that contains plenty of tables, graphics, images, etc.
I am using the to_markdown() method to process it, and it works very well.

md_text = pymupdf4llm.to_markdown(
    doc,
    pages=range(28, doc.page_count),
    show_progress=True,
    margins=(50, 75),
    hdr_info=my_headers,
    write_images=True,
    image_format="JPG",
    image_path=image_path,
    force_text=False,
)

There is one thing I would like to change, if possible.
Some graphics are cropped. This mostly affects graphs.
This is what the extracted graphics look like:

and here is how they look in the PDF:

It would be sufficient if I could expand the view in some way.
In another discussion I found this sentence:

The library does a geometrical analysis and assumes that any singular draws probably belong together if they are not further apart than 3 points.

Can that distance be exposed as a parameter? Or can we apply a fixed margin parameter with a negative value for graphics extraction?

JorjMcKie · 2025-07-12T09:32:44Z

JorjMcKie
Jul 12, 2025
Maintainer

Thank you for the post and your appreciation of PyMuPDF4LLM!

I understand your problem. You can reproduce the graphics detection in PyMuPDF directly: it is the method Page.cluster_drawings(). There you will find options to change the x and y tolerances.

However, I'm afraid that increasing them beyond the default 3 will not get you what you want: including surrounding text. You did not include the file itself, but if any of the visible text pieces were indeed graphics, they would have been included.
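To make the role of the tolerances concrete, here is a minimal pure-Python sketch of tolerance-based clustering. It is an illustration only, not PyMuPDF's actual implementation: the names near, union and cluster are hypothetical, and rectangles are plain (x0, y0, x1, y1) tuples. It also shows why a larger tolerance cannot pull in text: only the rectangles of drawings are ever passed in.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def near(a: Rect, b: Rect, x_tol: float, y_tol: float) -> bool:
    """True if b touches a after a is grown by the tolerances."""
    return not (
        b[0] > a[2] + x_tol
        or b[2] < a[0] - x_tol
        or b[1] > a[3] + y_tol
        or b[3] < a[1] - y_tol
    )


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def cluster(rects: List[Rect], x_tol: float = 3, y_tol: float = 3) -> List[Rect]:
    """Repeatedly merge rectangles that are closer than the tolerances."""
    clusters = list(rects)
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if near(clusters[i], clusters[j], x_tol, y_tol):
                    clusters[i] = union(clusters[i], clusters[j])
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters


# Three drawing rectangles: gaps of 2 and 10 points between neighbours.
drawings = [(0, 0, 10, 10), (12, 0, 20, 10), (30, 0, 40, 10)]
default_clusters = cluster(drawings)           # default tolerance of 3 points
wide_clusters = cluster(drawings, x_tol=12)    # larger horizontal tolerance
```

With the default tolerance the first two rectangles merge and the third stays separate; with the larger x tolerance all three end up in one cluster. Text near the graphic is unaffected in both cases because it was never part of the input.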

So what you are really asking for is an additional margin around the final identified clusters, right?
Something like a parameter expand=(l, t, r, b) (left, top, right, bottom).
If we did that, however, it would come with no additional plausibility checks. In other words, we cannot verify whether the expanded cluster rectangles are cutting through text or not ...
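As a rough illustration of what such an expand=(l, t, r, b) margin could mean, here is a small sketch. The function expand_rect is hypothetical and not part of pymupdf4llm; rectangles are plain (x0, y0, x1, y1) tuples, and the only safeguard assumed here is clamping to the page rectangle. As noted above, nothing checks whether the enlarged rectangle cuts through text.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def expand_rect(rect: Rect, expand: Tuple[float, float, float, float], page: Rect) -> Rect:
    """Grow rect by (left, top, right, bottom) margins, clamped to the page."""
    left, top, right, bottom = expand
    x0, y0, x1, y1 = rect
    return (
        max(page[0], x0 - left),
        max(page[1], y0 - top),
        min(page[2], x1 + right),
        min(page[3], y1 + bottom),
    )


page_rect = (0, 0, 595, 842)  # assumed A4 page size in points
grown = expand_rect((100, 200, 300, 400), (10, 20, 10, 5), page_rect)
clamped = expand_rect((5, 5, 590, 840), (10, 10, 10, 10), page_rect)
```

A rectangle computed this way could then be used as the clip when rendering a cluster, at the cost of possibly including unrelated content.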

1 reply

PK109 Jul 14, 2025
Author

I have tested this method, and it seems to behave differently outside of to_markdown().
Since my PDF document is available online, please have a look at this script, which downloads the file, extracts the graphics and produces JPG files.

import pathlib

import pymupdf
import requests

doc_url = "https://dl.mitsubishielectric.com/dl/fa/document/manual/robot/bfp-a3447/bfp-a3447q.pdf"
file_path = pathlib.Path("data/manual.pdf")
file_path.parent.mkdir(parents=True, exist_ok=True)  # make sure "data/" exists
response = requests.get(doc_url)
response.raise_for_status()  # stop here if the download failed
file_path.write_bytes(response.content)

doc = pymupdf.open(file_path)
page = doc[28]
# render every detected drawing cluster at 2x zoom
for index, drawing in enumerate(page.cluster_drawings()):
    page.get_pixmap(matrix=pymupdf.Matrix(2, 2), clip=drawing).save(f"graphics_{index}.jpg")

The images extracted by this script are much closer to the expected result. From my perspective, they are just fine.

I suppose some method is trimming the view.
I see that refine_boxes() changes the Rect size, but it seems to expand, not shrink them.

JorjMcKie · 2025-07-14T18:18:56Z

JorjMcKie
Jul 14, 2025
Maintainer

Thanks for the file.
This is easy to explain: pymupdf4llm invests a lot of effort in understanding the overall page layout. In the course of that, it ignores everything whose color equals the background color, both text and graphics.
E.g. the top left of the figure comes out as bordered by the red rectangle, but the reason for this is the white rectangles, which I have bordered here with slim green lines:

In pymupdf4llm we detect "white" as the background color and hence ignore any white elements in all subsequent processing.

That we cannot give up!
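The effect can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: each drawing is reduced to a bounding rectangle plus a single RGB color, the background is pure white, and the helper names bbox and visible_bbox are hypothetical. How pymupdf4llm actually compares colors is not shown in this thread.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Color = Tuple[float, float, float]        # RGB components in the range 0..1

WHITE: Color = (1.0, 1.0, 1.0)
BLACK: Color = (0.0, 0.0, 0.0)


def bbox(rects: List[Rect]) -> Optional[Rect]:
    """Smallest rectangle containing all rects, or None for an empty list."""
    if not rects:
        return None
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


def visible_bbox(drawings: List[Tuple[Rect, Color]], background: Color = WHITE) -> Optional[Rect]:
    """Bounding box of the drawings whose color differs from the background."""
    return bbox([rect for rect, color in drawings if color != background])


# A black graph body plus a white rectangle sticking out at the top left.
drawings = [
    ((100, 100, 200, 200), BLACK),
    ((60, 80, 100, 120), WHITE),
]
full = bbox([rect for rect, _ in drawings])  # every drawing counts
trimmed = visible_bbox(drawings)             # background-colored drawings ignored
```

Here the box over all drawings reaches out to the white rectangle, while the box over the visible drawings is smaller. That matches the difference between the direct clustering result and the cropped image described in this thread.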

JorjMcKie · 2025-07-14T22:36:24Z

JorjMcKie
Jul 14, 2025
Maintainer

What you could do as an immediate help is to port the clustering results over to each page before pymupdf4llm deals with it, like this:

import pathlib

import pymupdf
import pymupdf4llm

doc = pymupdf.open("test.pdf")
myheaders = pymupdf4llm.IdentifyHeaders(doc)  # identify headers once instead of per page
md = ""
for page in doc:
    clusters = page.cluster_drawings()  # bounding boxes of grouped vector graphics
    for bb in clusters:
        page.draw_rect(bb, width=0.2)  # put extra border around detected graphics
    md += pymupdf4llm.to_markdown(
        doc, pages=[page.number], hdr_info=myheaders, write_images=True
    )
pathlib.Path("test1.md").write_bytes(md.encode())  # write UTF-8 regardless of platform default

1 reply

PK109 Jul 16, 2025
Author

This is a great improvement.
Clever idea to modify the pages before converting them to markdown.
Thank you for your support!
