Retrieval from books #10950

nilsdacke · 2023-09-22T18:13:14Z

I built a conversational knowledge base from books in PDF format (plus some plain text). To make this work, I had to add some ad-hoc heuristics and book-specific information to:

Remove page headers and footers
Make text chunks span page boundaries but not chapter boundaries
Remove text not part of the book proper (licensing information and the like)

Now I need to make this automatic.
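
As a rough sketch of how the second heuristic might be automated: join the page texts into one stream of lines, cut that stream at chapter headings, and only then pack paragraphs into chunks. Everything here is an assumption for illustration: pages are taken to be plain strings with headers and footers already removed, and `CHAPTER_RE`, `split_into_chapters`, `chunk_chapter`, `chunk_book`, and `max_chars` are hypothetical names. Real books will need a heading detector tuned to their layout.

```python
import re

# Assumption: chapters start with a line like "Chapter 3" or "CHAPTER IV".
# This pattern is only illustrative; many books use other conventions.
CHAPTER_RE = re.compile(r"^\s*chapter\s+(\d+|[ivxlcdm]+)\b", re.IGNORECASE)


def split_into_chapters(pages):
    """Join page texts into one line stream and cut it at chapter headings."""
    chapters, current = [], []
    for page in pages:
        for line in page.splitlines():
            # Start a new chapter when a heading appears after some content.
            if CHAPTER_RE.match(line) and current:
                chapters.append(current)
                current = []
            current.append(line)
    if current:
        chapters.append(current)
    texts = ["\n".join(lines).strip() for lines in chapters]
    return [text for text in texts if text]


def chunk_chapter(text, max_chars=1000):
    """Greedily pack the paragraphs of one chapter into chunks.

    A single paragraph longer than max_chars is kept whole, so chunks
    are only roughly bounded by max_chars.
    """
    chunks, current = [], ""
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def chunk_book(pages, max_chars=1000):
    """Chunk a whole book; no chunk ever mixes text from two chapters."""
    chunks = []
    for index, chapter in enumerate(split_into_chapters(pages)):
        for chunk in chunk_chapter(chapter, max_chars):
            chunks.append({"chapter_index": index, "text": chunk})
    return chunks
```

Because pages are merged before chunking, a paragraph that continues on the next page stays in one chunk, while the chapter cut guarantees that a chunk never straddles two chapters.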

LLMs can probably do part of the work, and I'm also looking at text segmentation libraries such as Deep Tiling.

Maybe some of you have already attempted this sort of thing?

Uranium2 · 2023-09-23T11:56:38Z

Regarding PDF headers and footers: if you only have a few PDFs, it is worth editing them manually to remove this kind of data. If your PDFs change every few weeks or months, that could become painful.

If all your PDFs are structured exactly the same, I think you can extract the text using fixed offsets, so you can hardcode them.
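
A minimal sketch of the fixed-offset idea, assuming every page uses the same layout. It works on text blocks given as tuples starting with `(x0, y0, x1, y1, text)`, with y growing downward from the top of the page, which is the shape PyMuPDF's `page.get_text("blocks")` returns. The function name `strip_margins` and the 60-point margins are made up for illustration and would have to be measured on the actual PDFs.

```python
def strip_margins(blocks, page_height, top=60.0, bottom=60.0):
    """Keep only the text blocks that lie fully between the header band
    (the first `top` points) and the footer band (the last `bottom` points).
    """
    body = []
    for block in blocks:
        y0, y1, text = block[1], block[3], block[4]
        if y0 >= top and y1 <= page_height - bottom:
            body.append(text.strip())
    return "\n".join(body)
```

With PyMuPDF, the inputs could come from `page.get_text("blocks")` and `page.rect.height` for each page of the opened document.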
Otherwise, there might be multiple strategies to detect and remove headers/footers:

pymupdf/PyMuPDF#2259 (comment)

Based on positioning, or based on word similarity between pages: https://medium.com/@hussainshahbazkhawaja/paper-implementation-header-and-footer-extraction-by-page-association-3a499b2552ae
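
To illustrate the similarity idea, here is a simplified sketch. It is not the exact algorithm from the linked article: it only compares the first and last lines of each page with the lines at the same position on nearby pages, after masking digits so that changing page numbers still match. The names `strip_repeated_lines`, `window`, `edge`, and `threshold` are hypothetical, and the default values are guesses.

```python
import re
from difflib import SequenceMatcher


def _normalize(line):
    """Mask digit runs so that 'Page 12' and 'Page 13' compare as equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def _similar(a, b, threshold):
    ratio = SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()
    return ratio >= threshold


def strip_repeated_lines(pages, window=2, edge=1, threshold=0.9):
    """Drop top/bottom lines that recur on neighbouring pages.

    pages: list of pages, each page being a list of text lines.
    window: how many pages before and after to compare against.
    edge: how many lines at the top and at the bottom to examine.
    """
    cleaned = []
    for i, lines in enumerate(pages):
        lo, hi = max(0, i - window), min(len(pages), i + window + 1)
        neighbours = [pages[j] for j in range(lo, hi) if j != i]
        drop = set()
        for k in range(min(edge, len(lines))):
            for pos in (k, -1 - k):  # k-th line from the top, then from the bottom
                line = lines[pos]
                if not line.strip():
                    continue
                if any(len(other) > k and _similar(line, other[pos], threshold)
                       for other in neighbours):
                    drop.add(pos % len(lines))
        cleaned.append([text for n, text in enumerate(lines) if n not in drop])
    return cleaned
```

A window of two pages is meant to cope with books that alternate between two running headers on left and right pages. A body line that happens to repeat at the same position on a nearby page would also be dropped, so the output is worth spot-checking.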

If I remember correctly, if you want your PDFs to be sliced by page and not by chunks, you can use RecursiveCharacterTextSplitter() without any arguments. Otherwise, you might use fitz (PyMuPDF) to split your documents page by page and then apply the text splitter.

For the last part, maybe you can add a configuration dictionary that says which pages should not be used. That is mostly manual, though. Alternatively, you could look for text that is common to licensing pages and similar material ("Thank you", an email address, a phone number) and exclude any page where you find it. But that is pretty dangerous, since it can remove other pages of your PDF.
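
A minimal sketch of both ideas together. The file name, page numbers, and marker patterns are invented for illustration. Because phrase matching can hit pages that belong to the book, this version only flags suspicious pages for manual review instead of deleting them; only the pages listed in the configuration are actually skipped.

```python
import re

# Hypothetical per-book configuration: 1-based page numbers to skip.
EXCLUDE_PAGES = {
    "example_book.pdf": {1, 2, 3, 250},
}

# Illustrative markers that often show up in front or back matter.
MARKER_PATTERNS = [
    re.compile(r"all rights reserved", re.IGNORECASE),
    re.compile(r"\bISBN\b"),
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # e-mail address
]


def filter_pages(book, pages, min_markers=2):
    """Return (kept, flagged).

    kept: texts of all pages not excluded by the configuration.
    flagged: 1-based numbers of kept pages matching at least
    `min_markers` marker patterns, for a human to review.
    """
    skip = EXCLUDE_PAGES.get(book, set())
    kept, flagged = [], []
    for number, text in enumerate(pages, start=1):
        if number in skip:
            continue
        hits = sum(1 for pattern in MARKER_PATTERNS if pattern.search(text))
        if hits >= min_markers:
            flagged.append(number)
        kept.append(text)
    return kept, flagged
```

Requiring two or more markers per page is one way to reduce false positives, for example a novel that merely mentions an e-mail address in its story.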
