-
Hello,

I set `model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"` and then try to embed the split chunks of my document and store them in Chroma. I want the embeddings to be generated with that model, but when I try, I get the error:

```
AttributeError: 'str' object has no attribute 'embed_documents'
```

Is there any way to tell Chroma to use the "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" model to generate and store the embeddings via the `Chroma.from_documents` method (or any other method), or is there another way to do it?
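For context, the error happens because `Chroma.from_documents` calls `embed_documents()` on whatever object is passed as `embedding`, and a bare model-name string has no such method. A minimal illustration (plain Python, no LangChain needed):

```python
# Passing the model name string itself as `embedding` fails because
# Chroma calls embed_documents() on the object it receives, and a
# plain Python str has no such attribute.
model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
print(hasattr(model_name, "embed_documents"))  # False
```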
-
Answer generated by a 🤖

Answer

I understand that you're trying to use the SentenceTransformer model "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" to generate embeddings for your documents and store them in Chroma DB. However, you're encountering an error because you're passing a string to the `embedding` parameter, which expects an object implementing the embeddings interface.

To resolve this issue, you can create a custom class that inherits from the `Embeddings` base class:

```python
from typing import List

from langchain.embeddings.base import Embeddings
from sentence_transformers import SentenceTransformer


class SentenceTransformerEmbeddings(Embeddings):
    def __init__(self, model_name: str):
        self.model = SentenceTransformer(model_name)

    def embed_documents(self, documents: List[str]) -> List[List[float]]:
        # encode() returns a numpy array; convert to plain lists for Chroma.
        return self.model.encode(documents).tolist()

    def embed_query(self, query: str) -> List[float]:
        return self.model.encode([query])[0].tolist()
```

After creating this class, you can use it to generate embeddings for your documents:

```python
from langchain.vectorstores import Chroma

model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
embedding = SentenceTransformerEmbeddings(model_name)

vectorStore = Chroma.from_documents(
    documents=pages_splitted,
    collection_name="dcd_store",
    embedding=embedding,
    persist_directory=persist_directory,
)
```

This approach should allow you to use the SentenceTransformer model to generate embeddings for your documents and store them in Chroma DB. Please note that this is one potential solution; there might be other ways to achieve the same result. If you find this solution helpful and believe it could benefit other users, I encourage you to make a pull request to update the LangChain documentation so that other users facing the same issue can easily find it.

Sources

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
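The essential point is that `Chroma.from_documents` only needs an object exposing `embed_documents` and `embed_query`; any duck-typed implementation satisfies it. A toy sketch with a hypothetical `DummyEmbeddings` class (not from any library, for illustration of the contract only):

```python
from typing import List


class DummyEmbeddings:
    """Hypothetical stand-in: any object with these two methods satisfies
    the interface that Chroma expects from the `embedding` argument."""

    def embed_documents(self, documents: List[str]) -> List[List[float]]:
        # Toy 2-d "embedding": character count and word count, for
        # illustration only -- a real model returns dense vectors.
        return [[float(len(d)), float(len(d.split()))] for d in documents]

    def embed_query(self, query: str) -> List[float]:
        return self.embed_documents([query])[0]


emb = DummyEmbeddings()
print(emb.embed_query("hello world"))  # [11.0, 2.0]
```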
-
Hello, best regards
-
Hello,

```python
my_model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
```
-
I slightly changed the code by looking at the LangChain code and just used:

```python
from sentence_transformers import SentenceTransformer


class SentenceTransformerMy(object):
    encode_kwargs = dict()
    """Keyword arguments to pass when calling `encode()`.
    See also the Sentence Transformers documentation:
    https://sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode"""

    multi_process: bool = False
    """Run encode() on multiple GPUs."""

    show_progress: bool = False
    """Whether to show a progress bar."""

    def __init__(self, model_path, **kwargs):
        self.client = SentenceTransformer(model_path, **kwargs)

    def embed_documents(self, texts):
        """Compute doc embeddings using a HuggingFace transformer model.

        Args:
            texts: The list of texts to embed.

        Returns:
            List of embeddings, one for each text.
        """
        texts = list(map(lambda x: x.replace("\n", " "), texts))
        if self.multi_process:
            pool = self.client.start_multi_process_pool()
            embeddings = self.client.encode_multi_process(texts, pool)
            SentenceTransformer.stop_multi_process_pool(pool)
        else:
            embeddings = self.client.encode(
                texts, show_progress_bar=self.show_progress, **self.encode_kwargs
            )
        return embeddings.tolist()

    def embed_query(self, text: str):
        """Compute query embeddings using a HuggingFace transformer model.

        Args:
            text: The text to embed.

        Returns:
            Embeddings for the text.
        """
        return self.embed_documents([text])[0]
```

and use it like this:

```python
# You can load from a local embedding model path.
embedding_model = SentenceTransformerMy("/path/to/your/embedding_model", device="cuda")

# Test the embedding model.
embedding_model.embed_documents(["what your name?"])

# It can also be used with LangChain's Chroma.
vecs = Chroma.from_documents(documents=docs, embedding=embedding_model, persist_directory="./chroma_db")
```
Hello,

Following the LangChain documentation it was easier:
https://python.langchain.com/docs/modules/data_connection/text_embedding/integrations/sentence_transformers

```python
from langchain.embeddings import SentenceTransformerEmbeddings

my_model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
embedding_model = SentenceTransformerEmbeddings(model_name=my_model_name)
```

Regards