Issues with Table Extraction in Multi-Column PDF #4293
Unanswered
chayennemosk
asked this question in
Looking for help
Replies: 0 comments
Hi everyone,
I'm trying to extract text and tables from a multi-column PDF that was likely generated from PowerPoint. Since the layout is a bit different from standard PDFs, I decided to use pymupdf4llm for structured extraction.
Right now, I’m using:
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("financial-management-strategic-planning-budgeting.pdf")
It gets most of the text right, but I am having trouble with tables.
One example of the issue is slide 4. Instead of a properly structured table, I get:
Summary of key takeaways
The table structure is not correctly populated, and other tables produce merged/misaligned outputs, for example:
• [Multi-year allocations ] • [Long-term clarity ] of funding on funding, to
The data is incorrectly formatted and difficult to parse.
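To help narrow this down, here is a minimal sketch for inspecting the tables on a single slide directly with PyMuPDF, outside of pymupdf4llm. It assumes a recent PyMuPDF release that can be imported as `pymupdf` and that provides `Page.find_tables()` with a `strategy` argument. The helper names `rows_to_markdown` and `tables_on_page` are hypothetical and exist only for this illustration. It is a diagnostic aid, not a fix.

```python
def rows_to_markdown(rows):
    """Render a list of row lists as a Markdown pipe table.

    The first row is treated as the header. None cells become empty strings.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Flatten line breaks inside cells and pad short rows to full width.
        cells = ["" if c is None else str(c).replace("\n", " ").strip() for c in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


def tables_on_page(pdf_path, page_index, strategy="lines"):
    """Return each table detected on one page as a Markdown string.

    Assumption: PyMuPDF is installed. It is imported lazily so the pure
    helper above can be used without it.
    """
    import pymupdf

    doc = pymupdf.open(pdf_path)
    try:
        page = doc[page_index]  # page_index is 0-based
        finder = page.find_tables(strategy=strategy)
        return [rows_to_markdown(table.extract()) for table in finder.tables]
    finally:
        doc.close()
```

For slide 4 this would be called as `tables_on_page("financial-management-strategic-planning-budgeting.pdf", 3)`, since page indices start at 0. Comparing `strategy="lines"` with `strategy="text"` may show whether the slide's tables are drawn with ruling lines at all or are only visually aligned text boxes.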
Questions:
I've attached the PDF.
Many thanks in advance!
financial-management-strategic-planning-budgeting.pdf