MarkdownTextSplitter vs. MarkdownHeaderTextSplitter for chunking #30540
Replies: 1 comment
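Hey @4MIR2000, I'm here to help you out with any questions or issues you might have. Let's tackle this together! 😊

MarkdownTextSplitter is a RecursiveCharacterTextSplitter preconfigured with Markdown-aware separators, so it accepts chunk_size and chunk_overlap directly. If you want to split text based on specific headers and possibly retain header information in metadata, MarkdownHeaderTextSplitter is the better fit. Regarding chunk_size and overlap: MarkdownHeaderTextSplitter does not take those parameters itself, but you can pass its output to a second splitter:

# MD splits
from langchain_text_splitters import MarkdownHeaderTextSplitter
# Example header levels (adjust to your document); markdown_document is your Markdown string
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)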
# Char-level splits
from langchain_text_splitters import RecursiveCharacterTextSplitter
chunk_size = 250
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
# Split
splits = text_splitter.split_documents(md_header_splits)

This method allows you to first split the document by headers and then refine the chunks by size and overlap using another splitter [1].
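To make the two-stage idea concrete without needing LangChain installed, here is a rough standard-library sketch of what "split by headers, then refine by size and overlap" means. It is not LangChain's implementation: the function names (`split_by_headers`, `split_by_size`, `two_stage_split`) and the `h1`/`h2` metadata keys are made up for illustration, the size stage uses plain fixed-width character windows rather than the separator-based logic of RecursiveCharacterTextSplitter, and header lines are dropped from the chunk text instead of kept.

```python
import re


def split_by_headers(markdown_text, header_levels=(1, 2)):
    """Stage 1: cut the text at headers and record the active headers as metadata."""
    sections = []
    active = {}   # header level -> title currently in effect
    buffer = []   # body lines collected since the last header

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            metadata = {f"h{lvl}": title for lvl, title in sorted(active.items())}
            sections.append({"text": text, "metadata": metadata})
        buffer.clear()

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match and len(match.group(1)) in header_levels:
            flush()
            level = len(match.group(1))
            # A new header replaces itself and any deeper headers seen earlier.
            for stale in [lvl for lvl in active if lvl >= level]:
                del active[stale]
            active[level] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return sections


def split_by_size(text, chunk_size, chunk_overlap):
    """Stage 2 (simplified): fixed-width character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window always contributes new characters.
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


def two_stage_split(markdown_text, chunk_size=250, chunk_overlap=30):
    """Run both stages; every size-limited chunk inherits its section's metadata."""
    chunks = []
    for section in split_by_headers(markdown_text):
        for piece in split_by_size(section["text"], chunk_size, chunk_overlap):
            chunks.append({"text": piece, "metadata": dict(section["metadata"])})
    return chunks


sample = "# Guide\n\nIntro text.\n\n## Setup\n\n" + "x" * 60
chunks = two_stage_split(sample, chunk_size=25, chunk_overlap=5)
for chunk in chunks:
    print(len(chunk["text"]), chunk["metadata"])
```

The point to take away is that the header metadata is attached in the first stage and simply copied onto every smaller chunk in the second, which is why chaining the two splitters keeps the header context. This sketch also treats `#` lines inside code fences as headers, so it is only meant as an illustration.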
Hi all,
Do you know what the difference between MarkdownTextSplitter and MarkdownHeaderTextSplitter is? Which one would you choose?
Can you also define chunk_size and overlap in MarkdownHeaderTextSplitter?
Thank you in advance