Hi, while trying to split text using the `CharacterTextSplitter.from_huggingface_tokenizer()` method, no tokenizer seems to work; even the example given in the docs doesn't work. I have tried tokenizers from different models, but none of them actually split the text.
```python
from transformers import GPT2TokenizerFast
from langchain.text_splitter import CharacterTextSplitter

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text_splitter_gpt = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=200, chunk_overlap=20
)

# page_text_list is a list of page strings extracted earlier
file_text_splits = []
for page_text in page_text_list:
    page_text_splits = []
    chunks = text_splitter_gpt.split_text(page_text)
    for chunk in chunks:
        page_text_splits.append(chunk)
    file_text_splits.append(page_text_splits)

print(len(file_text_splits))
print(file_text_splits[0])
```
```python
from transformers import LlamaTokenizer
from langchain.text_splitter import CharacterTextSplitter

tokenizer_llama = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b", legacy=True)
text_splitter_llama = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer_llama, chunk_size=50, chunk_overlap=20
)

file_text_splits = []
for page_text in page_text_list:
    page_text_splits = []
    chunks = text_splitter_llama.split_text(page_text)
    for chunk in chunks:
        page_text_splits.append(chunk)
    file_text_splits.append(page_text_splits)

print(len(file_text_splits))
print(file_text_splits[0])
```
Both of these return the text as is; no changes are made to it at all. Please help.
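
In case it is relevant: from what I can tell (this is a guess, not confirmed), `CharacterTextSplitter` only ever cuts on its `separator`, which defaults to `"\n\n"`; the tokenizer passed to `from_huggingface_tokenizer` only changes how chunk *length* is measured, not where the text is cut. So a page with no blank lines would come back as a single oversized chunk. A minimal sketch of the workaround I would expect to help, using `RecursiveCharacterTextSplitter` (which falls back through `["\n\n", "\n", " ", ""]` until chunks fit):

```python
from transformers import GPT2TokenizerFast
from langchain.text_splitter import RecursiveCharacterTextSplitter

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Same token-based length function as before, but this splitter keeps
# falling back to finer separators until every chunk fits chunk_size.
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=200, chunk_overlap=20
)

sample = "one long run of prose with no blank lines " * 50  # hypothetical input
chunks = text_splitter.split_text(sample)
print(len(chunks))  # more than 1 chunk now
```

Passing an explicit `separator=" "` to `CharacterTextSplitter.from_huggingface_tokenizer(...)` should have a similar effect, at the cost of only ever splitting on spaces.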