Thank you for the post and your appreciation of PyMuPDF4LLM! I understand your problem. You can reproduce the graphics detection in PyMuPDF directly, it is method However, I'm afraid that increasing them beyond the default 3 will not get you what you want: including surrounding text. You did not include the file itself, but if any of the visible text pieces were indeed graphics, they would have been included. So what you are really asking for is an additional margin around the final identified clusters, right?
As an immediate workaround, you could draw the clustering results onto each page before pymupdf4llm processes it, like this:

    import pathlib

    import pymupdf
    import pymupdf4llm

    doc = pymupdf.open("test.pdf")
    # Identify headers once for the whole document to avoid that effort per page.
    myheaders = pymupdf4llm.IdentifyHeaders(doc)
    md = ""
    for page in doc:
        clusters = page.cluster_drawings()
        for bb in clusters:
            # Put an extra border around each detected graphics cluster.
            page.draw_rect(bb, width=0.2)
        # Convert this page only after its borders have been drawn.
        md += pymupdf4llm.to_markdown(
            doc, pages=[page.number], hdr_info=myheaders, write_images=True
        )
    pathlib.Path("test1.md").write_text(md, encoding="utf-8")
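
If an additional margin around each identified cluster is what is wanted, the padding step itself can be sketched without PyMuPDF. The helper name `pad_rect`, the margin of 10 points and the A4-sized page bounds are assumptions made for illustration; nothing here is part of PyMuPDF or pymupdf4llm. Rectangles are plain `(x0, y0, x1, y1)` tuples.

```python
def pad_rect(rect, margin, bounds=None):
    """Grow rect by margin on every side; optionally clip it to bounds.

    Hypothetical helper: rect and bounds are (x0, y0, x1, y1) sequences.
    """
    x0, y0, x1, y1 = rect
    x0, y0, x1, y1 = x0 - margin, y0 - margin, x1 + margin, y1 + margin
    if bounds is not None:
        bx0, by0, bx1, by1 = bounds
        # Keep the padded rectangle inside the page area.
        x0, y0 = max(x0, bx0), max(y0, by0)
        x1, y1 = min(x1, bx1), min(y1, by1)
    return (x0, y0, x1, y1)


page_bounds = (0, 0, 595, 842)   # assumed page size in points
cluster = (100, 700, 590, 840)   # assumed cluster bounding box
padded = pad_rect(cluster, 10, page_bounds)
print(padded)
```

Since PyMuPDF rectangles unpack like 4-tuples, something like `page.draw_rect(pad_rect(bb, 10, page.rect), width=0.2)` should be usable in the loop above. Whether pymupdf4llm then picks up the enlarged area, including the surrounding text, has not been tested here.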