How do I extract information from a PDF that contains tables, columns, and hierarchical tables? #264

harshithgowdakc · 2025-04-27T11:45:29Z

harshithgowdakc
Apr 27, 2025

Hi, I am trying to extract information from a PDF that contains tables, columns, and hierarchical tables. Can you help me with how to do this?

JorjMcKie · 2025-04-27T12:07:39Z

JorjMcKie
Apr 27, 2025
Maintainer

Sorry, I do not understand at all what you mean. Please reword your post.

0 replies

Harshith9898 · 2025-04-29T14:29:41Z

Harshith9898
Apr 29, 2025

I'm trying to extract information from a PDF that contains tables, columns, and hierarchical structures. I'm having trouble preserving the layout and structure during extraction.

1 reply

JorjMcKie Apr 29, 2025
Maintainer

Please be specific and attach one (small) of your problem files.
Also explain what you mean by "preserving layout & structure" across all those different object types.

Harshith9898 · 2025-04-29T16:21:02Z

Harshith9898
Apr 29, 2025

When I try to extract text from the PDF, I get output like this:

"Software Specifications
Operating System
3rd-Party Integration
Server-Side Database
Suggested Browser for Client-Side
Windows7/8/10/11
Notification / Messages
Line / WhatsApp / Amazon SNS / SMS"

But I want it to be extracted like this:

"3rd-Party Integration Notification / Messages Line / WhatsApp / Amazon SNS / SMS"

Let me know if you'd like help figuring out how to achieve this clean extraction too. As the next step, I want to chunk the extracted text for use with an LLM and RAG.

0 replies

JorjMcKie · 2025-04-29T16:50:44Z

JorjMcKie
Apr 29, 2025
Maintainer

I am running out of ideas for how to say this so that it reaches you:
Please provide the PDF example and the code you tried!
A picture alone does not help at all.

0 replies

JorjMcKie · 2025-04-30T09:09:47Z

JorjMcKie
Apr 30, 2025
Maintainer

Sorry, now that you shared your code, I realize that you are not using PyMuPDF or PyMuPDF4LLM at all, but other packages.
You are posting in the wrong repository.
Please submit a problem at pypdf's repository.

0 replies

harshithgowdakc · 2025-04-30T09:28:34Z

harshithgowdakc
Apr 30, 2025
Author

I am using PyMuPDF as well. Here is my code:

import pymupdf4llm
from langchain.text_splitter import MarkdownTextSplitter

# Get the MD text
md_text = pymupdf4llm.to_markdown("/content/ZKBio CVSecurity_V6.4.0_R_Datasheet_202411.pdf")  # get markdown for all pages

splitter = MarkdownTextSplitter(chunk_size=1200, chunk_overlap=200)

splitter.create_documents([md_text])
ZKBio CVSecurity_V6.4.0_R_Datasheet_202411.pdf
The output I am getting looks like this:

Document(metadata={}, page_content="-----\n\n|Col1|Col2|Col3|\n|---|---|---|\n|Windows7/8/10/11, Windows Server 2008/2012/2016/2019/20|22|Notification / Messages|\n|||GIS Map|\n|||Microsoft Active Directory|\n|||Intrusion Alarm Integration|\n|PostgreSQL, Oracle11g/1 19c /21c, SQL Server 2008 014/2016/2017/2019/202|2c/18c/ /2012/2 2|API|\n|IE 11 or later, Chrome 33 or later, Safari 6.1.3 or later Edge|||\n|ISO/IEC 27001:2013, ISO/IEC 27701:2019, ISO9001, ISO20000.|||\n|Unlimited (depending on the performance of the server & network)|||\n|300,000|||\n|300,000|||\n|300,000|||\n|300,000|||\n|300,000|||\n|5,000|||\n|Unlimited (depending on the performance of the server & network)|||\n|2,000|||\n|100,000|||\n|1024|||\n|50||Elevator Destination Control System Integration|\n|Unlimited (depending on the device's configuration)|||\n|||Data Protection|\n\n\n\n\n\n\n\n-----"),

0 replies

harshithgowdakc · 2025-05-01T10:37:22Z

harshithgowdakc
May 1, 2025
Author

Please help me.

3 replies

JorjMcKie May 1, 2025
Maintainer

Sorry, there is no way that PyMuPDF4LLM can handle this page correctly. The table is extremely complex plus the page contains background / watermark images which together simply make this impossible.
You will have to develop your own logic.
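
As one possible starting point for such custom logic, the sketch below groups words into visual rows by their vertical position, so that a row label and the values next to it end up on one line. It assumes word tuples in the shape returned by `page.get_text("words")`, i.e. `(x0, y0, x1, y1, text, ...)`. The helper names `group_words_into_rows` and `rows_from_pdf` and the `y_tolerance` value are hypothetical, and merged or multi-line cells will most likely need extra handling.

```python
def group_words_into_rows(words, y_tolerance=3.0):
    """Group word tuples (x0, y0, x1, y1, text, ...) into rows of text.

    A word joins the current row when its top edge (y0) lies within
    y_tolerance points of the first word of that row.
    """
    rows = []  # list of (row_y0, [word tuples])
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(word[1] - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append(word)
        else:
            rows.append((word[1], [word]))
    # Within each row, order words left to right and join them.
    return [
        " ".join(w[4] for w in sorted(row_words, key=lambda w: w[0]))
        for _, row_words in rows
    ]


def rows_from_pdf(pdf_path, page_number=0, y_tolerance=3.0):
    """Hypothetical wrapper: read one page with PyMuPDF and return its rows."""
    import pymupdf  # imported lazily so the grouping helper works without it

    with pymupdf.open(pdf_path) as doc:
        words = doc[page_number].get_text("words")
    return group_words_into_rows(words, y_tolerance)
```

Each returned string is one visual row, which may be a more convenient unit for chunking than the raw extraction order. The tolerance probably needs tuning per document.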

JorjMcKie May 1, 2025
Maintainer

Try to use conventional text extraction, maybe in some layout preserving way like Page.get_text(sort=True).
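
A minimal sketch of that suggestion, assuming a recent PyMuPDF version where `import pymupdf` works and a PDF path supplied by the caller. `Page.get_text(sort=True)` orders the extracted text by vertical, then horizontal position. The helper names `tidy_lines` and `sorted_page_texts` are hypothetical, and the tidying step is only an optional cleanup before chunking.

```python
def tidy_lines(text):
    """Strip each line and drop empty ones (optional cleanup before chunking)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def sorted_page_texts(pdf_path):
    """Return one list of tidied lines per page, in sorted reading order."""
    import pymupdf  # lazy import: only needed when a PDF is actually read

    with pymupdf.open(pdf_path) as doc:
        # sort=True orders text by vertical, then horizontal position
        return [tidy_lines(page.get_text(sort=True)) for page in doc]
```

Calling `sorted_page_texts(...)` with the path to the datasheet would give one list of lines per page. Whether the resulting order matches the table rows depends on how the page is laid out.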

Harshith9898 May 1, 2025

OK, thanks for that.
