Replies: 5 comments, 3 replies
-
As mentioned on Discord, documents with a highly complex layout like the above require radically simplifying assumptions to become reasonably accessible.
-
I am experimenting with additional parameters:

import pymupdf, pymupdf4llm, pathlib

filename = "test.pdf"
doc = pymupdf.open(filename)
md = pymupdf4llm.to_markdown(
    doc,
    # write_images=True,
    force_text=True,
    show_progress=True,
    ignore_images=True,
    ignore_graphics=True,
)
pathlib.Path(doc.name).with_suffix(".md").write_text(md)

This delivers the following output, which is probably close to the desired one:
-
This file, pars pro toto, is a good example showing that probably every layout analysis is bound to fail at some point:
-
As you can see, at least some of the tables come out quite nicely. If you have read the package's API documentation, then you know that its header identification is also quite simple. In essence, it classifies the full document's text using a statistical analysis of font size frequencies:
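To make the idea of font-size-based header detection concrete, here is a minimal sketch in plain Python. It is not the package's actual implementation: the function name `header_prefixes`, the `(font_size, text)` input format, and the rule "the most frequent size is body text, every larger size is a header" are assumptions chosen for illustration.

```python
from collections import Counter


def header_prefixes(spans, max_levels=6):
    """Map font sizes to Markdown header prefixes.

    spans: iterable of (font_size, text) pairs (hypothetical input format).
    max_levels: illustrative cap on the number of header levels.
    """
    # Count characters per rounded font size across the whole document.
    counts = Counter()
    for size, text in spans:
        counts[round(size)] += len(text.strip())
    if not counts:
        return {}

    # Assumption: the size carrying the most characters is the body text.
    body_size = max(counts, key=counts.get)

    # Every size larger than the body size becomes a header level,
    # largest first, limited to max_levels.
    larger = sorted((s for s in counts if s > body_size), reverse=True)
    return {s: "#" * (i + 1) + " " for i, s in enumerate(larger[:max_levels])}


spans = [
    (24.0, "Title"),
    (16.2, "Section"),
    (10.0, "body " * 50),
    (8.0, "footnote"),
]
prefixes = header_prefixes(spans)
```

With this input, sizes 24 and 16 are treated as headers, while sizes 10 (body) and 8 (smaller than body) get no prefix. A heuristic like this misfires as soon as a document uses large fonts for something other than headings, which is one reason complex layouts are hard to convert.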
-
I have made a few more changes. As for header identification, I have introduced a new parameter that limits the number of accepted header levels to something smaller than 6.

import pymupdf, pymupdf4llm, pathlib

filename = "test.pdf"
doc = pymupdf.open(filename)
# pre-process header identification and only accept 4 header levels
hdr_info = pymupdf4llm.IdentifyHeaders(doc, max_levels=4)
md = pymupdf4llm.to_markdown(
    doc,
    # write_images=True,
    force_text=True,  # irrelevant because of the settings below
    hdr_info=hdr_info,  # use the prepared header info object
    show_progress=True,  # progress bar
    ignore_images=True,  # ignore all images
    ignore_graphics=True,  # ignore all drawings
)
pathlib.Path(doc.name).with_suffix(".md").write_text(md)

You would need the wheel. And this is the result:
-
Blaenau Gwent LAEP - Technical Report.pdf