-
Hello,

I set `model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"` and then try to embed the split chunks of my document and store them in Chroma. I want the embeddings to be generated with that model, but when I try, I get the error:

```
AttributeError: 'str' object has no attribute 'embed_documents'
```

Is there any way to tell Chroma to use the "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" model to generate and store the embeddings via the `Chroma.from_documents` method (or any other method), or is there another way to do it?
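For context, the error happens because `Chroma.from_documents` calls `embed_documents()` on whatever object is passed as `embedding`, and a bare model-name string has no such method. A minimal illustration (plain Python, no LangChain needed):

```python
# Passing the model name string itself as `embedding` fails because
# Chroma calls embed_documents() on the object it receives, and a
# plain Python str has no such attribute.
model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
print(hasattr(model_name, "embed_documents"))  # False
```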
-
Answer generated by a 🤖

Answer

I understand that you're trying to use the SentenceTransformer model "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" to generate embeddings for your documents and store them in Chroma DB. However, you're encountering an error because you're passing a string to the `embedding` parameter, which expects an object implementing the embeddings interface.

To resolve this issue, you can create a custom class that inherits from the `Embeddings` base class:

```python
from typing import List

from langchain.embeddings.base import Embeddings
from sentence_transformers import SentenceTransformer


class SentenceTransformerEmbeddings(Embeddings):
    def __init__(self, model_name: str):
        self.model = SentenceTransformer(model_name)

    def embed_documents(self, documents: List[str]) -> List[List[float]]:
        # encode() returns a numpy array; convert to plain lists for Chroma.
        return self.model.encode(documents).tolist()

    def embed_query(self, query: str) -> List[float]:
        return self.model.encode([query])[0].tolist()
```

After creating this class, you can use it to generate embeddings for your documents:

```python
from langchain.vectorstores import Chroma

model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
embedding = SentenceTransformerEmbeddings(model_name)

vectorStore = Chroma.from_documents(
    documents=pages_splitted,
    collection_name="dcd_store",
    embedding=embedding,
    persist_directory=persist_directory,
)
```

This approach should allow you to use the SentenceTransformer model to generate embeddings for your documents and store them in Chroma DB. Please note that this is one potential solution; there might be other ways to achieve the same result. If you find this solution helpful and believe it could benefit other users, I encourage you to make a pull request to update the LangChain documentation so that other users facing the same issue can easily find it.

Sources

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
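The essential point is that `Chroma.from_documents` only needs an object exposing `embed_documents` and `embed_query`; any duck-typed implementation satisfies it. A toy sketch with a hypothetical `DummyEmbeddings` class (not from any library, for illustration of the contract only):

```python
from typing import List


class DummyEmbeddings:
    """Hypothetical stand-in: any object with these two methods satisfies
    the interface that Chroma expects from the `embedding` argument."""

    def embed_documents(self, documents: List[str]) -> List[List[float]]:
        # Toy 2-d "embedding": character count and word count, for
        # illustration only -- a real model returns dense vectors.
        return [[float(len(d)), float(len(d.split()))] for d in documents]

    def embed_query(self, query: str) -> List[float]:
        return self.embed_documents([query])[0]


emb = DummyEmbeddings()
print(emb.embed_query("hello world"))  # [11.0, 2.0]
```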
-
Hello, best regards
-
Hello,

```python
my_model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
```
-
I slightly changed the code by looking at the LangChain code and just used:

```python
from sentence_transformers import SentenceTransformer


class SentenceTransformerMy(object):
    encode_kwargs = dict()
    """Keyword arguments to pass when calling `encode()`.
    See also the Sentence Transformers documentation:
    https://sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode"""

    multi_process: bool = False
    """Run encode() on multiple GPUs."""

    show_progress: bool = False
    """Whether to show a progress bar."""

    def __init__(self, model_path, **kwargs):
        self.client = SentenceTransformer(model_path, **kwargs)

    def embed_documents(self, texts):
        """Compute doc embeddings using a HuggingFace transformer model.

        Args:
            texts: The list of texts to embed.

        Returns:
            List of embeddings, one for each text.
        """
        texts = list(map(lambda x: x.replace("\n", " "), texts))
        if self.multi_process:
            pool = self.client.start_multi_process_pool()
            embeddings = self.client.encode_multi_process(texts, pool)
            SentenceTransformer.stop_multi_process_pool(pool)
        else:
            embeddings = self.client.encode(
                texts, show_progress_bar=self.show_progress, **self.encode_kwargs
            )
        return embeddings.tolist()

    def embed_query(self, text: str):
        """Compute query embeddings using a HuggingFace transformer model.

        Args:
            text: The text to embed.

        Returns:
            Embeddings for the text.
        """
        return self.embed_documents([text])[0]
```

and use it like this:

```python
# You can load from a local embedding model path.
embedding_model = SentenceTransformerMy("/path/to/your/embedding_model", device="cuda")

# Test the embedding model.
embedding_model.embed_documents(["what your name?"])

# It can also be used with LangChain's Chroma.
vecs = Chroma.from_documents(documents=docs, embedding=embedding_model, persist_directory="./chroma_db")
```
Hello,

Following the LangChain documentation it was easier:
https://python.langchain.com/docs/modules/data_connection/text_embedding/integrations/sentence_transformers

```python
from langchain.embeddings import SentenceTransformerEmbeddings

my_model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
embedding_model = SentenceTransformerEmbeddings(model_name=my_model_name)
```

Regards