How to take an existing list of embeddings and documents and add it to a vectorstore? #5341
Replies: 7 comments 6 replies
-
Here's how it can be done with Chroma:
import chromadb
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="my_collection")
collection.add(
documents=["This is a document", "This is another document"],
metadatas=[{"source": "my_source"}, {"source": "my_source"}],
embeddings=[[1,2,3],[4,5,6]],
ids=["id1", "id2"]
)
from langchain.vectorstores import Chroma
langchainChroma = Chroma(client=chroma_client, collection_name="my_collection")
print(langchainChroma._collection.count())
# 2
reference: chroma-core/chroma#626 (comment)
-
I'd like to extend this question, because it does not work for me. First I did this:
from langchain.vectorstores import Chroma
import chromadb
client_settings = chromadb.config.Settings(
chroma_db_impl="duckdb+parquet",
persist_directory="db",
anonymized_telemetry=False,
)
vector_db = Chroma(
collection_name="my_collection",
persist_directory='db',
client_settings=client_settings,
embedding_function=embeddings_model
)
chroma_client = chromadb.Client(settings=client_settings)
chroma_client.list_collections()
This always returns
collection = chroma_client.get_or_create_collection("my_collection",
embedding_function=embeddings_model)
If there is nothing in the db then I'll loop through the documents and fill it:
collection.add(
embeddings=embeddings_batch,
documents=documents_batch,
metadatas=metadatas_batch,
ids=ids_batch
)
Then do a
The problem starts with langchain.
from langchain import LlamaCpp
from langchain.chains import RetrievalQAWithSourcesChain
retriever_llm = LlamaCpp(model_path=RETRIEVE_MODEL,
temperature=TEMPERATURE,
n_ctx=N_CTX,
use_mlock=USE_MLOCK,
n_batch=N_BATCH)
chain = RetrievalQAWithSourcesChain.from_chain_type(
llm=retriever_llm,
chain_type="stuff",
retriever=vector_db.as_retriever(),
)
loads the model and everything looks ok, but this does not work:
How do I use this correctly? I read through all the documentation, but I seem to be missing something essential :/
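For the "loop through the documents and fill it" step above, a minimal sketch of the batching could look like this. The helper names iter_batches and fill_collection are hypothetical, not part of Chroma or langchain. The only assumption about the collection is that it exposes add(ids=..., embeddings=..., documents=..., metadatas=...), as in the snippets in this thread.

```python
from typing import Any, Dict, Iterator, List, Sequence


def iter_batches(
    ids: Sequence[str],
    embeddings: Sequence[List[float]],
    documents: Sequence[str],
    metadatas: Sequence[Dict[str, Any]],
    batch_size: int = 100,
) -> Iterator[Dict[str, list]]:
    """Yield keyword-argument dicts for collection.add(), batch_size rows at a time."""
    n = len(ids)
    if not (len(embeddings) == len(documents) == len(metadatas) == n):
        raise ValueError("ids, embeddings, documents and metadatas must have equal length")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, n, batch_size):
        end = start + batch_size
        yield {
            "ids": list(ids[start:end]),
            "embeddings": list(embeddings[start:end]),
            "documents": list(documents[start:end]),
            "metadatas": list(metadatas[start:end]),
        }


def fill_collection(collection, ids, embeddings, documents, metadatas, batch_size=100) -> int:
    """Add precomputed rows to `collection` in batches; return the number of rows added."""
    added = 0
    for batch in iter_batches(ids, embeddings, documents, metadatas, batch_size):
        # Hypothetical usage: `collection` is whatever get_or_create_collection returned.
        collection.add(**batch)
        added += len(batch["ids"])
    return added
```

Checking the four lists for equal length up front turns a mismatch into an error before anything is written, which makes it easier to tell a data problem from a langchain wiring problem.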
-
I checked the Chroma docs, and I think I'm doing it right. This is the output of
{'documents': ['... the SIN and COS array. def', '...on the production software because of'],
'embeddings': [[0.003863372141495347, ...], [0.011034797877073288, ...]],
'metadatas': [{'title': 'my cozy title 1', 'id': '1', 'source': 'https://page.html'}, {'title': 'my cozy title 2', 'id': '2', 'source': 'https://page_2.html'}],
'ids': ['2940cdfb-...', 'b9e07b0b-...'],
}
The Chroma docs suggest:
collection.add(
documents=["doc1", "doc2", "doc3", ...],
embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
ids=["id1", "id2", "id3", ...]
)
-
Hi all, I am researching this topic and want to ask: how do you keep track of the ids when you add new documents, so that you avoid id collisions?
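One possible approach, offered only as a sketch: derive each id from the document content, so the same document always maps to the same id and re-adding it cannot create a second id for the same content. The function name stable_id is hypothetical; it uses only the Python standard library and assumes the metadata is JSON-serializable.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def stable_id(text: str, metadata: Optional[Dict[str, Any]] = None) -> str:
    """Return a deterministic id derived from the text and its metadata."""
    # sort_keys makes the id independent of the order of the metadata keys.
    payload = json.dumps(
        {"text": text, "metadata": metadata or {}},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


documents = ["This is a document", "This is another document"]
metadatas = [{"source": "my_source"}, {"source": "my_source"}]
ids = [stable_id(doc, meta) for doc, meta in zip(documents, metadatas)]
```

If duplicates are intentional and every row should get its own id regardless of content, random ids from uuid.uuid4() are the usual alternative.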
-
Any update on this issue? Any solution for FAISS?
-
Use add_embeddings instead of add_documents; this works fine for me. texts, metadatas and embeddings are lists, so create them first (reuse the embeddings).
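A rough sketch of what this can look like with FAISS, assuming a langchain version whose FAISS vectorstore provides from_embeddings and add_embeddings (both take an iterable of (text, embedding) pairs). The helper names below are hypothetical, and embedding_function stands for whatever embeddings object you already use; it is still needed to embed queries at search time.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple


def pair_texts_with_embeddings(
    texts: Sequence[str], embeddings: Sequence[Sequence[float]]
) -> List[Tuple[str, List[float]]]:
    """Zip texts with their precomputed vectors, checking lengths and dimensions."""
    if len(texts) != len(embeddings):
        raise ValueError("texts and embeddings must have equal length")
    if len({len(vector) for vector in embeddings}) > 1:
        raise ValueError("all embeddings must have the same dimension")
    return [(text, list(vector)) for text, vector in zip(texts, embeddings)]


def build_faiss_store(texts, embeddings, metadatas, embedding_function):
    """Create a FAISS store from precomputed vectors without re-embedding the texts."""
    # Imported lazily so the pairing helper works without langchain/faiss installed.
    from langchain.vectorstores import FAISS

    pairs = pair_texts_with_embeddings(texts, embeddings)
    return FAISS.from_embeddings(
        text_embeddings=pairs, embedding=embedding_function, metadatas=metadatas
    )


def extend_faiss_store(
    store, texts, embeddings, metadatas: Optional[List[Dict[str, Any]]] = None
):
    """Append more precomputed vectors to an existing store."""
    pairs = pair_texts_with_embeddings(texts, embeddings)
    return store.add_embeddings(text_embeddings=pairs, metadatas=metadatas)
```

Building the store with from_embeddings means the index is created from the first batch of real vectors, so there is no need to start from an empty from_documents([]) call.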
-
Any updates? Same issue.
-
Hi,
I have a bunch of embeddings I do not want to pay for computing again. I would rather just manually add them along with their corresponding documents to the vectorstore of my choice (in this case ChromaDB).
I do not see a sanctioned way to do this. I searched for whether there were any other databases I could use to add just the embeddings (lists of lists) and only atlas and FAISS popped up in the search results.
Here's what I did:
Any suggestions here? I would work on a PR but don't really know the right place to start.
I mean, even if it's a simple instruction notebook it might be helpful, but I'm just wondering whether this is not really a use case? I would imagine there are plenty of companies that have been managing embeddings and would like to migrate them without re-computing them, and langchain could probably fill in that use case.
Thanks,
Steven
ps. For those wondering why I didn't just use
faiss_vectorstore = from_documents([], embedding=embedding_function)
and then use the add_embeddings
method (which doesn't seem so bad), it's because it relies on seeing one embedding in order to create the index variable (see here).