What methods do you recommend for chunk generation in RAG applications using local agents with OLLAMA? #31965
Example Code

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # ideal chunk size (tunable)
    chunk_overlap=100  # overlap to preserve context
)
```

Description

Hello community 👋

I'm working on a RAG (Retrieval-Augmented Generation) application and am testing different strategies for segmenting my documents into chunks before embedding them.

I've used LangChain's RecursiveCharacterTextSplitter because it's fast and simple, but I've noticed that it can cut sentences in half or split context suboptimally. I've also seen people using embedding-based methods (like semantic chunking) to preserve context without arbitrary cuts.

🔍 My questions:

- What splitting methods are you currently using?
- Any recommendations for long texts with many paragraphs (e.g., scientific articles, contracts, etc.)?
- Is it worth using semantic_chunker even with the extra cost of embeddings?

If you could share your experiences or best practices, I'd be very grateful! 🙏

Cheers,
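Since the question mentions semantic_chunker: for context, here is a minimal sketch of wiring LangChain's experimental SemanticChunker to a local Ollama embedding model. Exact package names depend on your LangChain version, and `nomic-embed-text`, `article.txt`, and the threshold values are placeholder assumptions, not recommendations from this thread.

```python
# Sketch: semantic chunking with local Ollama embeddings.
# Assumes `langchain-experimental` and `langchain-ollama` are installed
# and an Ollama server is running with an embedding model pulled,
# e.g. `ollama pull nomic-embed-text`.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import OllamaEmbeddings

with open("article.txt", encoding="utf-8") as f:  # hypothetical input file
    long_text = f.read()

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Split where the embedding distance between adjacent sentence groups
# crosses a percentile threshold, instead of at a fixed character count.
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)

docs = semantic_splitter.create_documents([long_text])
for doc in docs[:3]:
    print(len(doc.page_content), doc.page_content[:80])
```

The trade-off is roughly what the question anticipates: the splitter embeds sentences at indexing time to find breakpoints, so it costs extra embedding passes, but for long contracts or scientific articles the more coherent chunks are often worth it.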
Replies: 1 comment
Hi Gilson, great question. I've been through a very similar struggle.

I also started with LangChain's RecursiveCharacterTextSplitter, but noticed the same issue you mentioned: it often cuts sentences awkwardly and breaks the semantic flow.

Eventually, I moved toward a different approach: instead of cutting based on token or character counts, I segment based on semantic tension, aiming to keep each chunk internally coherent in meaning. This allows:

- Longer chunks with dense, focused meaning (especially useful for contracts, whitepapers, or scientific texts)
- Chunks that can be reused across different tasks without losing context
- Dynamic overlap driven by the shift in meaning, not a fixed length

I ended up building a semantic chunking module where each chunk tries to stay within a "resonant zone" of context. Think of it as cutting at natural "semantic resting points" rather than arbitrary character counts.

If you're curious, I'd be happy to share a demo or outline the logic. It's a bit unconventional (I use a concept called ΔS = 0.5 as a balance point), but it works surprisingly well, especially for embedding-heavy RAG setups.

Let me know if you'd like more detail.
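For readers wanting something concrete: one plausible reading of the idea above is to embed adjacent sentences and start a new chunk wherever their cosine similarity drops below a balance threshold. To be clear, ΔS = 0.5 is the commenter's own concept; the 0.5 threshold, sentence regex, and helper below are my illustrative guesses, not their actual module.

```python
# Illustrative sketch only: cut at points where the similarity between
# neighboring sentences drops, i.e. where the "semantic tension" rises.
import re
import numpy as np
from langchain_ollama import OllamaEmbeddings

def semantic_tension_chunks(text: str, threshold: float = 0.5) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    embedder = OllamaEmbeddings(model="nomic-embed-text")
    vectors = np.array(embedder.embed_documents(sentences))

    # Cosine similarity between each sentence and the next one.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = (unit[:-1] * unit[1:]).sum(axis=1)

    chunks, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], sims):
        if sim < threshold:  # meaning shifted: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

Sentence-to-sentence similarity is noisy, so in practice you would likely smooth the series or compare windows of a few sentences before thresholding, and derive the overlap from the sentences around each cut rather than a fixed count.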