Bring your own embeddings with InMemoryDocumentStore() #4663

davidgibsonp · 2023-04-13T17:05:24Z

Is it possible to use pre-computed embeddings with the InMemoryDocumentStore() instead of having to run .update_embeddings() every time? My use case: I already have sentence embeddings for my documents and do not want to re-compute them while prototyping.

I see it is possible with FAISS #1085 but I do not want to use FAISS or the other document store modules because of added dependencies.

Thanks!

anakin87 · 2023-04-13T20:21:21Z

anakin87
Apr 13, 2023
Maintainer

Hello @davidgibsonp!

When you call the write_documents method with a list of Document objects, any embedding already contained in a Document object is written to the document store along with it.

An example:

from haystack import Document
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever

# embeddings generation (you should skip this part)
texts = ["Test document 1", "Test document 2", "Test document 3"]
docs = [Document(content=text) for text in texts]
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
)
document_store.update_embeddings(retriever)

# you can start from here
docs_w_embeddings = document_store.get_all_documents(return_embedding=True)
new_document_store = InMemoryDocumentStore()
new_document_store.write_documents(docs_w_embeddings)

Now, if you type print(new_document_store.get_all_documents(return_embedding=True)[0].embedding),
you get the embedding:
array([-1.67953715e-01, -4.70052332e-01, -2.77143747e-01, -1.40942782e-01, ...], dtype=float32)

Don't hesitate to ask for any clarification.

3 replies

davidgibsonp Apr 14, 2023
Author

Thanks for the response! This does work, but to clarify: I have already computed the embeddings elsewhere, and they are stored in a table. I do not want to compute the embeddings using the Haystack retriever. I just want to select the content and embedding from the table and add them to the InMemoryDocumentStore.

But your example did get me on the right path! Does this design pattern make sense?

import pandas as pd
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.pipelines import DocumentSearchPipeline
from sentence_transformers import SentenceTransformer

# separate process to create table with embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["Test document 1", "Test document 2", "Test document 3"]
embeddings = model.encode(texts)

table_with_embeddings = pd.DataFrame(
    {"content": texts, "embedding": embeddings.tolist()}
)


# select from table and build in memory search pipeline without re-computing embeddings
docs = []
for i, row in table_with_embeddings.iterrows():
    docs.append(Document(content=row["content"], embedding=row["embedding"]))


document_store = InMemoryDocumentStore()
document_store.write_documents(docs)

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)

pipe = DocumentSearchPipeline(retriever=retriever)

pipe.run(query="Test query")
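
One check that may be worth adding to this pattern: the stored vectors and the query vectors need to come from the same model, so at the very least every stored embedding should have the length that the retriever's model produces. The sketch below is plain Python with no Haystack dependency. The helper name `check_embedding_dims` and the toy 3-dimensional vectors are assumptions made for illustration; it is not a Haystack function.

```python
from typing import Iterable, Sequence


def check_embedding_dims(embeddings: Iterable[Sequence[float]], expected_dim: int) -> int:
    """Return how many embeddings were checked; raise ValueError on the first mismatch.

    Hypothetical helper for illustration only, not part of Haystack.
    """
    count = 0
    for position, vector in enumerate(embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"embedding at position {position} has length {len(vector)}, "
                f"expected {expected_dim}"
            )
        count += 1
    return count


# Toy 3-dimensional vectors stand in for the embedding column of the table.
rows = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
checked = check_embedding_dims(rows, expected_dim=3)
print(checked)
```

With the table from the snippet above, this could be called as check_embedding_dims(table_with_embeddings["embedding"], expected_dim=384) before building the Document objects, since all-MiniLM-L6-v2 produces 384-dimensional vectors. A matching length does not prove the vectors came from the same model, so it only catches the most obvious mix-ups.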

Answer selected by davidgibsonp

anakin87 Apr 14, 2023
Maintainer

In my opinion, it should work. Please let me know if it does.
Please also remember that InMemoryDocumentStore is good for local experiments with few documents, but it is not meant for production use cases.

davidgibsonp Apr 14, 2023
Author

It appears to work thus far!

And yes, I am using it for testing. But I have already computed embeddings (both from SBERT and OpenAI) and don't want to have to wait and/or pay to re-embed them each time... Thanks for the help!

prashants975 · 2023-05-25T17:31:47Z

prashants975
May 25, 2023

Hi,
Is it possible to provide embeddings for the query as well? I want to load them from my local machine and pass them on to the pipelines.

2 replies

anakin87 May 25, 2023
Maintainer

@prashants975 what do you mean specifically?
I would suggest opening another discussion and explaining your use case in detail.

davidgibsonp May 25, 2023
Author

> Hi Is it possible to give embeddings for the query also ? I want to load them from my local and pass them on to the pipelines.

I think InMemoryDocumentStore.query_by_embedding is what you are looking for.
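
As a rough illustration of what querying by a pre-computed embedding involves: the query vector is compared against every stored document vector, and the closest documents are returned. The sketch below is plain Python and is not Haystack's implementation. `rank_by_embedding`, `cosine_similarity`, and the toy 2-dimensional vectors are hypothetical, and cosine similarity is used only as one common choice (the similarity configured on the document store may differ). In Haystack 1.x, the real method takes the query vector through its `query_emb` argument.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_embedding(
    query_emb: Sequence[float],
    docs: List[Tuple[str, Sequence[float]]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every (content, embedding) pair against the query and keep the top_k."""
    scored = [(content, cosine_similarity(query_emb, emb)) for content, emb in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors: both the stored embeddings and the query embedding are pre-computed.
stored_docs = [
    ("Test document 1", [1.0, 0.0]),
    ("Test document 2", [0.0, 1.0]),
    ("Test document 3", [1.0, 1.0]),
]
query_embedding = [1.0, 0.0]
results = rank_by_embedding(query_embedding, stored_docs, top_k=2)
print([content for content, _ in results])
```

No embedding model is loaded anywhere in this flow, which is the point of the question: if the query vector already exists, ranking only needs the stored vectors and a similarity function.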
