How to take an existing list of embeddings and documents and add it to a vectorstore? #5341

startakovsky · 2023-05-27T12:13:52Z


Hi,

I have a bunch of embeddings I do not want to pay for computing again. I would rather just manually add them along with their corresponding documents to the vectorstore of my choice (in this case ChromaDB).

I do not see a sanctioned way to do this. I searched for other vectorstores that would let me add just the embeddings (lists of lists), and only Atlas and FAISS turned up in the search results.

Here's what I did:

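from langchain.vectorstores import FAISS

# texts, embeddings, metadatas: parallel lists computed earlier.
# Seed the index with the first text (this re-embeds one item only).
vectorstore = FAISS.from_texts([texts[0]], embedding=embedding_function, metadatas=[metadatas[0]])
# Add the remaining precomputed (text, embedding) pairs as-is.
_ = vectorstore.add_embeddings(text_embeddings=list(zip(texts[1:], embeddings[1:])), metadatas=metadatas[1:])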

Any suggestions here? I would work on a PR but don't really know the right place to start.

I mean, even if it's a simple instruction notebook it might be helpful, but I'm just wondering whether this is not really a use case? I would imagine there are plenty of companies that have been managing embeddings and would like to migrate them without re-computing them, and langchain could probably fill in that use case.

Thanks,
Steven

P.S. For those wondering why I didn't just use faiss_vectorstore = FAISS.from_documents([], embedding=embedding_function) and then the add_embeddings method (which doesn't seem so bad): it relies on seeing at least one embedding in order to create the index variable (see here).
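
A possible way to avoid seeding the index with a re-computed embedding, sketched under the assumption that the installed LangChain version provides the FAISS.from_embeddings class method (it takes (text, embedding) pairs plus the embedding object used later for queries). The helper names pair_texts_with_embeddings and build_faiss_from_precomputed are hypothetical, and the FAISS import is deferred so the validation helper runs without LangChain installed.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple


def pair_texts_with_embeddings(
    texts: Sequence[str], embeddings: Sequence[Sequence[float]]
) -> List[Tuple[str, List[float]]]:
    """Zip texts with precomputed vectors, checking that they line up."""
    if len(texts) != len(embeddings):
        raise ValueError(f"{len(texts)} texts but {len(embeddings)} embeddings")
    dims = {len(vector) for vector in embeddings}
    if len(dims) > 1:
        raise ValueError(f"embeddings have mixed dimensions: {sorted(dims)}")
    return [(text, list(vector)) for text, vector in zip(texts, embeddings)]


def build_faiss_from_precomputed(
    texts: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    embedding_function: Any,
    metadatas: Optional[List[Dict[str, Any]]] = None,
):
    """Build a FAISS store without re-embedding the documents.

    Assumption: FAISS.from_embeddings exists in the installed LangChain version.
    """
    from langchain.vectorstores import FAISS  # deferred: optional dependency

    pairs = pair_texts_with_embeddings(texts, embeddings)
    return FAISS.from_embeddings(pairs, embedding_function, metadatas=metadatas)
```

The embedding function is still passed in because the store needs it to embed queries at search time; it is not called on the documents here.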

startakovsky · 2023-05-27T20:09:21Z

Author

Here's how it can be done with Chroma:

import chromadb
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="my_collection")

collection.add(
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    embeddings=[[1,2,3],[4,5,6]],
    ids=["id1", "id2"]
)

from langchain.vectorstores import Chroma
langchainChroma = Chroma(client=chroma_client, collection_name="my_collection") 

print(langchainChroma._collection.count())
# 2

reference: chroma-core/chroma#626 (comment)

catbears · 2023-06-28T14:00:33Z

I'd like to extend this question, because it does not work for me.

First I did this:

from langchain.vectorstores import Chroma
import chromadb

client_settings = chromadb.config.Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="db",
    anonymized_telemetry=False,
)

vector_db = Chroma(
    collection_name="my_collection",
    persist_directory='db',
    client_settings=client_settings,
    embedding_function=embeddings_model
)
chroma_client = chromadb.Client(settings=client_settings)
chroma_client.list_collections()

This always returns [] even if there is a collection.
Then I explicitly get-or-create it:

collection = chroma_client.get_or_create_collection("my_collection",
                                                    embedding_function=embeddings_model)

Calling chroma_client.list_collections() shows the collection now.
If I have already filled it, I can call collection.peek() and see some entries.

If there is nothing in the db then I'll loop through the documents and fill it:

collection.add(
    embeddings=embeddings_batch,
    documents=documents_batch,
    metadatas=metadatas_batch,
    ids=ids_batch
)

Then I call vector_db.persist() and the files appear in the DB directory. collection.peek() shows documents, and collection.count() returns 112648, which is the number of entries I fed the db. So far so good.
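
The fill loop described above could look roughly like this. It is a sketch that assumes four parallel lists; the helpers iter_batches and fill_collection are hypothetical names, and the default batch_size of 1000 is an arbitrary illustrative value, not a documented Chroma limit.

```python
from typing import Any, Dict, Iterator, List, Sequence, Tuple

Batch = Tuple[List[Sequence[float]], List[str], List[Dict[str, Any]], List[str]]


def iter_batches(
    embeddings: Sequence[Sequence[float]],
    documents: Sequence[str],
    metadatas: Sequence[Dict[str, Any]],
    ids: Sequence[str],
    batch_size: int = 1000,
) -> Iterator[Batch]:
    """Yield aligned slices of the four parallel lists."""
    n = len(ids)
    if not (len(embeddings) == len(documents) == len(metadatas) == n):
        raise ValueError("all four lists must have the same length")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, n, batch_size):
        end = start + batch_size
        yield (
            list(embeddings[start:end]),
            list(documents[start:end]),
            list(metadatas[start:end]),
            list(ids[start:end]),
        )


def fill_collection(collection, embeddings, documents, metadatas, ids, batch_size=1000):
    """Call collection.add once per batch, using the keyword names shown above."""
    for emb, docs, metas, batch_ids in iter_batches(
        embeddings, documents, metadatas, ids, batch_size
    ):
        collection.add(
            embeddings=emb, documents=docs, metadatas=metas, ids=batch_ids
        )
```

Checking the list lengths up front catches a misaligned batch before anything is written to the collection.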

The problem starts with langchain.

from langchain import LlamaCpp
from langchain.chains import RetrievalQAWithSourcesChain

retriever_llm = LlamaCpp(model_path=RETRIEVE_MODEL,
                         temperature=TEMPERATURE,
                         n_ctx=N_CTX,
                         use_mlock=USE_MLOCK,
                         n_batch=N_BATCH)

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=retriever_llm,
    chain_type="stuff",
    retriever=vector_db.as_retriever(),
)

This loads the model and everything looks OK, but this does not work:
chain.retriever()

How do I use this correctly? I read through all the documentation, but I seem to be missing something essential :/

4 replies

startakovsky Jun 29, 2023
Author

@jeffchuber ^

jeffchuber Jun 29, 2023

@catbears are you sure your collection is called "my_collection"? I suspect it is not.

catbears Jun 30, 2023

@jeffchuber Yes it is. What I missed was that every time I instantiated a client while another one was running, it would brick the db. Once you know that, it becomes obvious why everything was still on disk and accessible a moment ago, but isn't anymore.
I was working with Jupyter notebooks, so after storing the data in the db, I fired up a second notebook and tried to load it from there.

If I got that wrong and it's all sunshine and no accidental bricking anymore, please correct me.

jeffchuber Jul 3, 2023

@catbears yes, that is a common enough problem that we added a special section in the docs just for it! https://docs.trychroma.com/troubleshooting#your-index-resets-back-to-just-a-few-number-of-records

catbears · 2023-06-29T06:01:44Z


I checked the Chroma docs, and I think I'm doing it right. This is the output of collection.peek() after generating, without loading from disk.

{'documents': ['... the SIN and COS array. def', '...on the production software because of'],
 'embeddings': [[0.003863372141495347, ...], [0.011034797877073288, ...]],
 'metadatas': [{'title': 'my cozy title 1', 'id': '1', 'source': 'https://page.html'}, {'title': 'my cozy title 2', 'id': '2', 'source': 'https://page_2.html'}],
 'ids': ['2940cdfb-...', 'b9e07b0b-...']}

The Chroma docs suggest:

collection.add(
    documents=["doc1", "doc2", "doc3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ids=["id1", "id2", "id3", ...]
)

MinhPham123456789 · 2024-05-22T12:52:26Z

Hi all, I am researching this topic and want to ask: how do you keep track of the IDs when you add new documents, so as to avoid ID collisions?
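
One possible approach, not taken from this thread: derive each ID from the document's source and content, so that re-adding the same document produces the same ID instead of a fresh one. This is a sketch; the helpers make_doc_id and dedupe_by_id and the namespace URL are hypothetical.

```python
import uuid
from typing import Dict, Sequence, Tuple

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/my-vectorstore")


def make_doc_id(text: str, source: str = "") -> str:
    """Return a stable ID: the same (source, text) always maps to the same UUID."""
    return str(uuid.uuid5(NAMESPACE, f"{source}\n{text}"))


def dedupe_by_id(
    texts: Sequence[str], sources: Sequence[str]
) -> Dict[str, Tuple[str, str]]:
    """Map each stable ID to the first (text, source) pair that produced it."""
    unique: Dict[str, Tuple[str, str]] = {}
    for text, source in zip(texts, sources):
        unique.setdefault(make_doc_id(text, source), (text, source))
    return unique
```

The resulting IDs can be passed as the ids argument when adding documents. How a given store reacts to an ID that already exists (reject, ignore, or overwrite) depends on the store and its version, so that behavior is worth checking before relying on it.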

Garviljain · 2024-06-05T11:15:31Z

from langchain.vectorstores import FAISS

# Read a file, split it into chunks, and store it in db1.
db1 = FAISS.from_documents(new_chunks, embeddings)

# Read another file, split it into chunks, and store it in db2.
db2 = FAISS.from_documents(new_chunks2, embeddings)

# Now merge db2 into db1 (db1 is modified in place).
db1.merge_from(db2)

With the above functionality, embeddings of multiple documents can be appended to an existing vector store (FAISS).

HorstA · 2024-07-30T15:15:05Z


Use add_embeddings instead of add_documents.

This works fine for me.

texts, metadatas, and embeddings are lists; create them first (reuse the embeddings).

ids = vectorstore.add_embeddings(
    texts=texts, metadatas=metadatas, embeddings=embeddings
)

JoyboyBrian · 2024-07-31T01:01:24Z

Any updates? I have the same issue.
