Replies: 1 comment
-
About PDF headers and footers: if you only have a few PDFs, it is worth editing them manually to remove this kind of data. If your PDFs change every week or month, that will get painful.

If all your PDFs are structured exactly the same, I think you can set offsets when extracting the text from them, so you can hardcode the offsets: pymupdf/PyMuPDF#2259 (comment)

Alternatively, detect them based on positioning or on word similarity between slides: https://medium.com/@hussainshahbazkhawaja/paper-implementation-header-and-footer-extraction-by-page-association-3a499b2552ae

If I remember correctly, if you want your PDFs to be sliced by page and not by chunks, you can use

For the last part, maybe you can add a configuration dictionary that says which slides should not be used. That is mostly manual, though. Otherwise you might try to find text that is common to all the licensing information and similar pages ("Thank you", an email address, a phone number) and exclude a page when you find it. But that is pretty dangerous, since it can also remove other pages of your PDF.
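To make the two ideas above concrete, here is a minimal sketch, not the method from the linked article. It assumes the text is already extracted into one list of lines per page (for example by splitting the output of PyMuPDF's `page.get_text()`), and it only compares lines at the same position near the top and bottom of each page. All names here (`strip_repeated_edges`, `keep_page`, the `depth`, `min_share` and `threshold` parameters, the config keys) are made up for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(line):
    """Lowercase and mask digits so 'Page 3' and 'Page 4' compare equal."""
    return re.sub(r"\d+", "#", line).strip().lower()


def _has(page, pos):
    """True if `pos` (possibly negative) is a valid index into `page`."""
    return pos < len(page) if pos >= 0 else -pos <= len(page)


def strip_repeated_edges(pages, depth=2, min_share=0.6, threshold=0.9):
    """Drop top/bottom lines that recur, nearly unchanged, on most other pages.

    pages: list of pages, each a list of text lines (assumed input shape).
    depth: how many lines from the top and from the bottom to inspect.
    min_share: fraction of the other pages that must carry a similar line.
    threshold: difflib similarity ratio above which two lines count as equal.
    """
    cleaned = []
    for i, page in enumerate(pages):
        others = pages[:i] + pages[i + 1:]
        drop = set()
        for pos in list(range(depth)) + list(range(-depth, 0)):
            if not others or not _has(page, pos):
                continue
            line = normalize(page[pos])
            if not line:
                continue  # leave blank lines alone
            hits = sum(
                1
                for other in others
                if _has(other, pos)
                and SequenceMatcher(None, line, normalize(other[pos])).ratio()
                >= threshold
            )
            if hits / len(others) >= min_share:
                drop.add(pos % len(page))
        cleaned.append([text for k, text in enumerate(page) if k not in drop])
    return cleaned


def keep_page(index, lines, config):
    """Return False for pages excluded by index or by a text pattern.

    config is a hypothetical dictionary with two keys:
      "pages":    set of zero-based page indexes to skip,
      "patterns": list of regexes; a match anywhere on the page skips it.
    """
    if index in config["pages"]:
        return False
    text = " ".join(lines).lower()
    return not any(re.search(pattern, text) for pattern in config["patterns"])
```

Usage would look like `cleaned = strip_repeated_edges(pages)` followed by `[p for i, p in enumerate(cleaned) if keep_page(i, p, config)]`. The pattern list is where the risk mentioned above sits: a pattern like `thank you` will also drop a content page that happens to contain that phrase, so explicit page indexes are the safer option when the documents are stable.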
-
I built a conversational knowledge base from books in PDF format (and some plain text). To make this work, I needed to add some ad-hoc heuristics and book-specific information to
Now I need to make this automatic.
LLMs can probably do part of the work, and I'm also looking at libraries for text segmentation such as Deep Tiling.
Maybe some of you have already attempted this sort of thing?