Hi, while trying to split text using the `CharacterTextSplitter.from_huggingface_tokenizer()` method, no tokenizer seems to work; even the example given in the docs doesn't work. I have tried tokenizers from different models, but none of them actually split the text.
```python
from transformers import GPT2TokenizerFast
from langchain.text_splitter import CharacterTextSplitter

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text_splitter_gpt = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=200, chunk_overlap=20
)

# page_text_list is a list of page strings extracted earlier
file_text_splits = []
for page_text in page_text_list:
    page_text_splits = []
    chunks = text_splitter_gpt.split_text(page_text)
    for chunk in chunks:
        page_text_splits.append(chunk)
    file_text_splits.append(page_text_splits)

print(len(file_text_splits))
print(file_text_splits[0])
```
```python
from transformers import LlamaTokenizer
from langchain.text_splitter import CharacterTextSplitter

tokenizer_llama = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b", legacy=True)
text_splitter_llama = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer_llama, chunk_size=50, chunk_overlap=20
)

file_text_splits = []
for page_text in page_text_list:
    page_text_splits = []
    chunks = text_splitter_llama.split_text(page_text)
    for chunk in chunks:
        page_text_splits.append(chunk)
    file_text_splits.append(page_text_splits)

print(len(file_text_splits))
print(file_text_splits[0])
```
Both of these return the text as is; no changes are made to it at all. Please help.
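
In case it is relevant: from what I can tell (this is a guess, not confirmed), `CharacterTextSplitter` only ever cuts on its `separator`, which defaults to `"\n\n"`; the tokenizer passed to `from_huggingface_tokenizer` only changes how chunk *length* is measured, not where the text is cut. So a page with no blank lines would come back as a single oversized chunk. A minimal sketch of the workaround I would expect to help, using `RecursiveCharacterTextSplitter` (which falls back through `["\n\n", "\n", " ", ""]` until chunks fit):

```python
from transformers import GPT2TokenizerFast
from langchain.text_splitter import RecursiveCharacterTextSplitter

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Same token-based length function as before, but this splitter keeps
# falling back to finer separators until every chunk fits chunk_size.
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=200, chunk_overlap=20
)

sample = "one long run of prose with no blank lines " * 50  # hypothetical input
chunks = text_splitter.split_text(sample)
print(len(chunks))  # more than 1 chunk now
```

Passing an explicit `separator=" "` to `CharacterTextSplitter.from_huggingface_tokenizer(...)` should have a similar effect, at the cost of only ever splitting on spaces.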