[Feature Request] Layout-Aware Chunker for Large Document Handling in LangChain #31454
prajwal10001
announced in
Ideas
Community, we need your feedback on this.
Feature request
Advanced Semantic Chunker for LangChain Document Loaders
LangChain already includes several document loaders (e.g., PyPDFLoader, PDFPlumberLoader, UnstructuredPDFLoader) and chunkers (e.g., RecursiveCharacterTextSplitter), but lacks a semantic, format-aware chunking module.
This feature would introduce a new utility (e.g., SemanticChunker, LayoutAwareChunker) that intelligently splits documents based on content structure such as headings, paragraphs, and sections — rather than naive character counts. It would support token-aware chunking, optional metadata enrichment (e.g., page number, section titles), and integration with common document formats like PDFs and Markdown.
This module would be especially helpful when working with large documents that need to be passed to LLMs with strict token limits (e.g., GPT-3.5 16k, Claude 200k, LLaMA 8k), enabling better retrieval and summarization performance.
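To make the token pressure concrete, here is a back-of-envelope sketch. The per-page character count and the characters-per-token ratio are rough assumptions for English text, not measurements:

```python
# Back-of-envelope token math (heuristic: ~4 characters per token for English).
# A 1,000-page PDF at an assumed ~3,000 characters per page comes to roughly
# 750k tokens -- far beyond a 16k-token context window, so splitting is
# unavoidable, and the only question is where the split boundaries fall.
CHARS_PER_TOKEN = 4       # rough heuristic, varies by tokenizer
PAGES = 1000
CHARS_PER_PAGE = 3000     # assumed average for a text-heavy page

est_tokens = PAGES * CHARS_PER_PAGE // CHARS_PER_TOKEN
context_window = 16_000   # e.g. GPT-3.5 16k

print(est_tokens)                    # 750000
print(est_tokens // context_window)  # 46 -- dozens of chunks even with no overlap
```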
Motivation
When parsing PDFs or extracting data from them in a single pass, we repeatedly hit token limits. Our organization uses GPT-3.5 with a 16,000-token limit but needs to handle PDFs of 1,000+ pages, so as a workaround we chunk the document and pass the chunks to the model.
This leads to:
Token overflow errors
Loss of semantic continuity
Inaccurate retrieval results in RAG setups
Developers often hack together custom logic to parse sections or insert manual separators, which could be abstracted into a clean, well-tested LangChain utility.
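For illustration, here is a sketch of the kind of hand-rolled logic this refers to: a naive heading-aware splitter for Markdown. The function name and return shape are hypothetical, not a LangChain API:

```python
import re


def split_on_markdown_headings(text: str) -> list[dict]:
    """Naive heading-aware splitter of the kind developers hand-roll today.

    Splits Markdown on '#'-style headings and tags each chunk with its
    section title. A sketch only: it ignores nesting depth, code fences
    containing '#', and token budgets.
    """
    chunks: list[dict] = []
    current_title, buf = "preamble", []
    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)$", line)
        if m:
            if buf:  # flush the section accumulated so far
                chunks.append({"section": current_title,
                               "text": "\n".join(buf).strip()})
            current_title, buf = m.group(2), []
        else:
            buf.append(line)
    if buf:  # flush the final section
        chunks.append({"section": current_title, "text": "\n".join(buf).strip()})
    return chunks


doc = "# Intro\nHello.\n## Scope\nDetails here."
print(split_on_markdown_headings(doc))
```

Every project reinvents some variant of this, which is exactly the duplication a well-tested LangChain utility would remove.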
A robust chunker would:
Improve retrieval accuracy for long documents
Enable safer use of large models with fixed context limits
Align LangChain better with enterprise use cases involving contracts, manuals, academic papers, and other structured content
Proposal (If applicable)
Implement SemanticChunker class using heading-aware and paragraph-aware logic (optionally regex- or NLP-based).
Extend support for formats like PDF (via pdfplumber, PyMuPDF), Markdown, and HTML.
Add configurable overlap, max estimated token length, and metadata tagging per chunk.
Offer a fallback to the existing RecursiveCharacterTextSplitter.
Write unit tests and examples (e.g., loading a 30-page report, chunking it semantically, feeding into a RAG system).
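The proposal above could be sketched roughly as follows. Everything here is hypothetical (the class name, parameters, and the chars-per-token heuristic are assumptions for discussion, not an existing or planned LangChain API):

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


class SemanticChunker:
    """Hypothetical sketch of the proposed chunker.

    Splits on blank-line paragraph boundaries, packs paragraphs into chunks
    under an estimated token budget, carries a configurable paragraph overlap
    between chunks, and tags each chunk with metadata. A real implementation
    would add heading awareness and fall back to
    RecursiveCharacterTextSplitter for paragraphs that alone exceed the budget.
    """

    def __init__(self, max_tokens: int = 512, overlap_paragraphs: int = 1,
                 chars_per_token: int = 4):
        # chars_per_token is a rough heuristic; a tokenizer would be exact.
        self.max_chars = max_tokens * chars_per_token
        self.overlap = overlap_paragraphs

    def split(self, text: str, source: str = "unknown") -> list[Chunk]:
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
        chunks: list[Chunk] = []
        window: list[str] = []
        for para in paragraphs:
            if window and sum(len(p) for p in window) + len(para) > self.max_chars:
                chunks.append(self._emit(window, source, len(chunks)))
                # keep the trailing paragraphs as overlap for continuity
                window = window[-self.overlap:] if self.overlap else []
            window.append(para)
        if window:
            chunks.append(self._emit(window, source, len(chunks)))
        return chunks

    def _emit(self, paras: list[str], source: str, index: int) -> Chunk:
        return Chunk("\n\n".join(paras),
                     {"source": source, "chunk_index": index})
```

With `chars_per_token=1` and a tiny budget this is easy to exercise: three short paragraphs that cannot share a budget come back as three chunks, each carrying its `chunk_index` and `source` metadata for downstream RAG filtering.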
I’m open to contributing this as a community PR and iterating based on feedback from maintainers.