Replies: 4 comments 6 replies
-
To persist and load back a keyword index at scale, you can use Elasticsearch. It handles large datasets efficiently and avoids the limitations and potential issues associated with using pickle for persistence [1].
-
What if I don't want to use Elasticsearch?
-
The bottleneck is in the from_texts method: building texts_processed and then vectorizer = BM25Okapi(texts_processed, **bm25_params). You can persist the vectorizer to avoid repeatedly processing the texts.
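A minimal stdlib-only sketch of that idea, caching the expensive preprocessing step on disk so it runs only once (the cache path, tokenizer, and helper name are assumptions; the fitted BM25Okapi object itself can be pickled the same way):

```python
import pickle
from pathlib import Path

CACHE = Path("texts_processed.pkl")  # assumed cache location

def tokenize(text: str) -> list[str]:
    # Stand-in for the retriever's real preprocess_func.
    return text.lower().split()

def load_or_process(texts: list[str]) -> list[list[str]]:
    """Return the tokenized corpus, processing texts only on the first call."""
    if CACHE.exists():
        # Subsequent calls skip tokenization and read the cached result.
        with CACHE.open("rb") as f:
            return pickle.load(f)
    texts_processed = [tokenize(t) for t in texts]
    with CACHE.open("wb") as f:
        pickle.dump(texts_processed, f)
    return texts_processed
```

The same pattern applies one level up: pickle the constructed BM25Okapi vectorizer once and reload it at query time, instead of re-running from_texts on every start.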
-
Hi, I opened a PR supporting an alternative, persistable implementation based on bm25s: #28123.
-
Checked other resources
Commit to Help
Example Code
Description
I am working on a hybrid search implementation and use this code for keyword search:
However, this does not scale to large datasets because the index is persisted with pickle. What is the right way to persist the keyword index?
System Info
langchain==0.2.15
langchain-aws==0.1.7
langchain-community==0.2.10
langchain-core==0.2.36
langchain-google-vertexai==1.0.8
langchain-milvus==0.1.4
langchain-openai==0.1.23
langchain-text-splitters==0.2.2