What methods do you recommend for chunk generation in RAG applications using local agents with OLLAMA? #31965
Example Code

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # ideal chunk size (tunable)
    chunk_overlap=100  # overlap to preserve context
)
```

Description

Hello community 👋

I'm working on a RAG (Retrieval-Augmented Generation) application and am testing different strategies for segmenting my documents into chunks before embedding them.

I've used LangChain's RecursiveCharacterTextSplitter because it's fast and simple, but I've noticed that it can cut sentences in half or split context suboptimally. I've also seen people using embedding-based methods (like semantic chunking) to preserve context without arbitrary cuts.

🔍 My questions:

- What splitting methods are you currently using?
- Any recommendations for long texts with many paragraphs (e.g., scientific articles, contracts, etc.)?
- Is it worth using semantic_chunker even with the extra cost of embeddings?

If you could share your experiences or best practices, I'd be very grateful! 🙏

Cheers,
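Since the question mentions semantic_chunker: for context, here is a minimal sketch of wiring LangChain's experimental SemanticChunker to a local Ollama embedding model. Exact package names depend on your LangChain version, and `nomic-embed-text`, `article.txt`, and the threshold values are placeholder assumptions, not recommendations from this thread.

```python
# Sketch: semantic chunking with local Ollama embeddings.
# Assumes `langchain-experimental` and `langchain-ollama` are installed
# and an Ollama server is running with an embedding model pulled,
# e.g. `ollama pull nomic-embed-text`.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import OllamaEmbeddings

with open("article.txt", encoding="utf-8") as f:  # hypothetical input file
    long_text = f.read()

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Split where the embedding distance between adjacent sentence groups
# crosses a percentile threshold, instead of at a fixed character count.
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)

docs = semantic_splitter.create_documents([long_text])
for doc in docs[:3]:
    print(len(doc.page_content), doc.page_content[:80])
```

The trade-off is roughly what the question anticipates: the splitter embeds sentences at indexing time to find breakpoints, so it costs extra embedding passes, but for long contracts or scientific articles the more coherent chunks are often worth it.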
Replies: 1 comment
Hi Gilson, great question. I've been through a very similar struggle.

I also started with LangChain's RecursiveCharacterTextSplitter, but noticed the same issue you mentioned: it often cuts sentences awkwardly and breaks the semantic flow.

Eventually, I moved toward a different approach: instead of cutting based on token or character counts, I segment based on semantic tension, aiming to keep each chunk internally coherent in meaning. This allows:

- Longer chunks with dense, focused meaning (especially useful for contracts, whitepapers, or scientific texts)
- Chunks that can be reused across different tasks without losing context
- Dynamic overlap driven by the shift in meaning, not a fixed length

I ended up building a semantic chunking module where each chunk tries to stay within a "resonant zone" of context. Think of it as cutting at natural "semantic resting points" rather than arbitrary character counts.

If you're curious, I'd be happy to share a demo or outline the logic. It's a bit unconventional (I use a concept called ΔS = 0.5 as a balance point), but it works surprisingly well, especially for embedding-heavy RAG setups.

Let me know if you'd like more detail.
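For readers wanting something concrete: one plausible reading of the idea above is to embed adjacent sentences and start a new chunk wherever their cosine similarity drops below a balance threshold. To be clear, ΔS = 0.5 is the commenter's own concept; the 0.5 threshold, sentence regex, and helper below are my illustrative guesses, not their actual module.

```python
# Illustrative sketch only: cut at points where the similarity between
# neighboring sentences drops, i.e. where the "semantic tension" rises.
import re
import numpy as np
from langchain_ollama import OllamaEmbeddings

def semantic_tension_chunks(text: str, threshold: float = 0.5) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    embedder = OllamaEmbeddings(model="nomic-embed-text")
    vectors = np.array(embedder.embed_documents(sentences))

    # Cosine similarity between each sentence and the next one.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = (unit[:-1] * unit[1:]).sum(axis=1)

    chunks, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], sims):
        if sim < threshold:  # meaning shifted: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

Sentence-to-sentence similarity is noisy, so in practice you would likely smooth the series or compare windows of a few sentences before thresholding, and derive the overlap from the sentences around each cut rather than a fixed count.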