[Feature Request] Layout-Aware Chunker for Large Document Handling in LangChain #31454
prajwal10001
announced in
Ideas
Community, we need your feedback on this.
Feature request
Advanced Semantic Chunker for LangChain Document Loaders
LangChain already includes several document loaders (e.g., PyPDFLoader, PDFPlumberLoader, UnstructuredPDFLoader) and chunkers (e.g., RecursiveCharacterTextSplitter), but lacks a semantic, format-aware chunking module.
This feature would introduce a new utility (e.g., SemanticChunker, LayoutAwareChunker) that intelligently splits documents based on content structure such as headings, paragraphs, and sections — rather than naive character counts. It would support token-aware chunking, optional metadata enrichment (e.g., page number, section titles), and integration with common document formats like PDFs and Markdown.
This module would be especially helpful when working with large documents that need to be passed to LLMs with strict token limits (e.g., GPT-3.5 16k, Claude 200k, LLaMA 8k), enabling better retrieval and summarization performance.
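To make the token pressure concrete, here is a back-of-envelope sketch. The per-page character count and the characters-per-token ratio are rough assumptions for English text, not measurements:

```python
# Back-of-envelope token math (heuristic: ~4 characters per token for English).
# A 1,000-page PDF at an assumed ~3,000 characters per page comes to roughly
# 750k tokens -- far beyond a 16k-token context window, so splitting is
# unavoidable, and the only question is where the split boundaries fall.
CHARS_PER_TOKEN = 4       # rough heuristic, varies by tokenizer
PAGES = 1000
CHARS_PER_PAGE = 3000     # assumed average for a text-heavy page

est_tokens = PAGES * CHARS_PER_PAGE // CHARS_PER_TOKEN
context_window = 16_000   # e.g. GPT-3.5 16k

print(est_tokens)                    # 750000
print(est_tokens // context_window)  # 46 -- dozens of chunks even with no overlap
```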
Motivation
When parsing PDFs or extracting data from them in a single pass, we repeatedly hit token limits. Our organization uses GPT-3.5 with a 16,000-token limit but needs to handle PDFs of 1,000+ pages, so as a workaround we chunk the document and pass the chunks to the model.
This leads to:
Token overflow errors
Loss of semantic continuity
Inaccurate retrieval results in RAG setups
Developers often hack together custom logic to parse sections or insert manual separators, which could be abstracted into a clean, well-tested LangChain utility.
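For illustration, here is a sketch of the kind of hand-rolled logic this refers to: a naive heading-aware splitter for Markdown. The function name and return shape are hypothetical, not a LangChain API:

```python
import re


def split_on_markdown_headings(text: str) -> list[dict]:
    """Naive heading-aware splitter of the kind developers hand-roll today.

    Splits Markdown on '#'-style headings and tags each chunk with its
    section title. A sketch only: it ignores nesting depth, code fences
    containing '#', and token budgets.
    """
    chunks: list[dict] = []
    current_title, buf = "preamble", []
    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)$", line)
        if m:
            if buf:  # flush the section accumulated so far
                chunks.append({"section": current_title,
                               "text": "\n".join(buf).strip()})
            current_title, buf = m.group(2), []
        else:
            buf.append(line)
    if buf:  # flush the final section
        chunks.append({"section": current_title, "text": "\n".join(buf).strip()})
    return chunks


doc = "# Intro\nHello.\n## Scope\nDetails here."
print(split_on_markdown_headings(doc))
```

Every project reinvents some variant of this, which is exactly the duplication a well-tested LangChain utility would remove.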
A robust chunker would:
Improve retrieval accuracy for long documents
Enable safer use of large models with fixed context limits
Align LangChain better with enterprise use cases involving contracts, manuals, academic papers, and other structured content
Proposal (If applicable)
Implement SemanticChunker class using heading-aware and paragraph-aware logic (optionally regex- or NLP-based).
Extend support for formats like PDF (via pdfplumber, PyMuPDF), Markdown, and HTML.
Add configurable overlap, max estimated token length, and metadata tagging per chunk.
Offer a fallback to the existing RecursiveCharacterTextSplitter.
Write unit tests and examples (e.g., loading a 30-page report, chunking it semantically, feeding into a RAG system).
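The proposal above could be sketched roughly as follows. Everything here is hypothetical (the class name, parameters, and the chars-per-token heuristic are assumptions for discussion, not an existing or planned LangChain API):

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


class SemanticChunker:
    """Hypothetical sketch of the proposed chunker.

    Splits on blank-line paragraph boundaries, packs paragraphs into chunks
    under an estimated token budget, carries a configurable paragraph overlap
    between chunks, and tags each chunk with metadata. A real implementation
    would add heading awareness and fall back to
    RecursiveCharacterTextSplitter for paragraphs that alone exceed the budget.
    """

    def __init__(self, max_tokens: int = 512, overlap_paragraphs: int = 1,
                 chars_per_token: int = 4):
        # chars_per_token is a rough heuristic; a tokenizer would be exact.
        self.max_chars = max_tokens * chars_per_token
        self.overlap = overlap_paragraphs

    def split(self, text: str, source: str = "unknown") -> list[Chunk]:
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
        chunks: list[Chunk] = []
        window: list[str] = []
        for para in paragraphs:
            if window and sum(len(p) for p in window) + len(para) > self.max_chars:
                chunks.append(self._emit(window, source, len(chunks)))
                # keep the trailing paragraphs as overlap for continuity
                window = window[-self.overlap:] if self.overlap else []
            window.append(para)
        if window:
            chunks.append(self._emit(window, source, len(chunks)))
        return chunks

    def _emit(self, paras: list[str], source: str, index: int) -> Chunk:
        return Chunk("\n\n".join(paras),
                     {"source": source, "chunk_index": index})
```

With `chars_per_token=1` and a tiny budget this is easy to exercise: three short paragraphs that cannot share a budget come back as three chunks, each carrying its `chunk_index` and `source` metadata for downstream RAG filtering.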
I’m open to contributing this as a community PR and iterating based on feedback from maintainers.