Introducing chunking strategies to the library #275
-
I would say this is probably a separate package. Rather than implementing it from scratch, it could probably be a wrapper for one of the Python libraries. The issue (I think) with generic chunking libraries is that they tend to split on length, paragraphs, sentences, headings, etc., which can be quite arbitrary, and documents that have been OCRed or converted to Markdown tend to be imperfectly formatted. I'd encourage folk to go for something semantic where possible. That is much easier when working in a tightly defined domain (i.e. only processing one or two types of documents, e.g. contracts or invoices).

To answer your question on the other thread: the way we do semantic chunking is with LLMs. We use a mix of LLMs, some acting as "thinkers" (deciding where to draw the lines, within set parameters), others acting as "workers", and others acting as "quality control". The thinkers and quality controllers tend to be more capable models, and we are really careful about how many output tokens we use, to keep costs reasonable. The workers are low-cost models which, given decent instructions and a quality check, are very capable of doing the bulk of the work. Some workers are generic and work through the document top to bottom; others extract specific information.
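To make the thinker / worker / quality-control split a bit more concrete, here is a rough sketch of how the roles could be wired together. This is not the actual pipeline described above, only one possible shape for it. Every name here (`semantic_chunk`, `Chunk`, `heading_thinker`, `first_line_worker`, `non_empty_check`) is hypothetical, and the three stand-in functions are plain Python so the example runs without any model; in practice each role would wrap an LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Chunk:
    text: str    # the raw text of the chunk
    output: str  # whatever the worker produced for this chunk


# Hypothetical role signatures (assumptions for illustration only).
Thinker = Callable[[list[str]], list[int]]  # line indices where a new chunk starts
Worker = Callable[[str], str]               # processes one chunk of text
QualityCheck = Callable[[Chunk], bool]      # accepts or rejects a worker result


def semantic_chunk(
    document: str,
    thinker: Thinker,
    worker: Worker,
    quality_check: QualityCheck,
    max_retries: int = 1,
) -> list[Chunk]:
    """Split a document at thinker-chosen boundaries and process each part."""
    lines = document.splitlines()
    if not lines:
        return []
    # Always start a chunk at line 0 and ignore out-of-range boundaries.
    starts = sorted({0, *(i for i in thinker(lines) if 0 < i < len(lines))})
    ends = starts[1:] + [len(lines)]
    chunks: list[Chunk] = []
    for start, end in zip(starts, ends):
        text = "\n".join(lines[start:end])
        chunk = Chunk(text, worker(text))
        retries = 0
        # Re-run the worker if quality control rejects the result.
        while not quality_check(chunk) and retries < max_retries:
            chunk = Chunk(text, worker(text))
            retries += 1
        chunks.append(chunk)
    return chunks


# Stand-ins for the three roles, so the sketch runs without an LLM.
def heading_thinker(lines: list[str]) -> list[int]:
    return [i for i, line in enumerate(lines) if line.startswith("#")]


def first_line_worker(text: str) -> str:
    return text.splitlines()[0]


def non_empty_check(chunk: Chunk) -> bool:
    return bool(chunk.output.strip())


doc = "# A\nalpha\n# B\nbeta"
result = semantic_chunk(doc, heading_thinker, first_line_worker, non_empty_check)
```

Passing the roles in as callables means an expensive model can be used for the thinker and quality check while a cheaper one does the per-chunk work, which is the cost trade-off mentioned above.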
-
A lot of the functionality of any question-answering system built on LLMs depends on including the right context. This approach is known as RAG (retrieval-augmented generation) and typically relies on a vector DB or a vector-compatible SQL database.
The context is usually stored by chunking longer documents into smaller parts. Some of the different ways to do this are explained in articles like this: https://js.langchain.com/docs/concepts/text_splitters/.
What do you think of including this in the library, or do you see it as out of scope? Maybe it would be better as a separate package?
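For reference, the simplest length-based strategy looks roughly like the sketch below: fixed-size character windows with some overlap, so that text cut at a boundary still appears whole in one of the neighbouring chunks. The function name `chunk_text` and its parameters are made up for illustration and are not taken from any existing library.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

For example, `chunk_text("abcdefghij", chunk_size=4, overlap=1)` returns `["abcd", "defg", "ghij"]`. Each chunk would then be embedded and stored in the vector database.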