Building an Index that supports a hybrid search comprising of Full text(BM-25) and vector searches #9837

blackhawk-616 · 2024-01-04T11:19:29Z

blackhawk-616
Jan 4, 2024

I am trying to build an index that should support a hybrid search mechanism consisting of both BM25 and vector searches.
I would like to know how I can do this with llama-index in particular. I am using OpenAI Embeddings using an Azure Deployment of the ada embedding model. The documentation only shows how to load documents to and index which is already created. Further it shows only the vector search implementation. Any help is appreciated.

logan-markewich · 2024-01-07T17:16:19Z

logan-markewich
Jan 7, 2024
Maintainer

One tricky part with BM25 is that you need to persist the nodes somewhere. Usually, this is done with a docstore (i.e using Redis, Mongodb, or save to disk), or you can simply serialize/pickle the nodes.

Then, you can combine BM25 with a vector index retriever

https://docs.llamaindex.ai/en/stable/examples/retrievers/reciprocal_rerank_fusion.html#create-a-hybrid-fusion-retriever

0 replies

lukevrobbins · 2024-01-09T02:38:29Z

lukevrobbins
Jan 9, 2024

@blackhawk-616
For azure deployment service context, use
from llama_index.llms import AzureOpenAI
from llama_index.embeddings import AzureOpenAIEmbedding

and llm = AzureOpenAI() found here: link

For the hybrid search, use a QueryFusionRetriver: link above

Once you create your documents either with a loader or with the Document class, you can use the node parser to create nodes.
Something like this example:

from llama_index import SimpleDirectoryReader, VectorStoreIndex, ServiceContext
from llama_index.text_splitter import SentenceSplitter

documents = SimpleDirectoryReader("./data").load_data()

text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
service_context = ServiceContext.from_defaults(text_splitter=text_splitter)

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

or

print("Creating Nodes")
extractors = [
    KeywordExtractor(keywords=5, llm = llm) # This would be where you use the above AzureOpenAI class for llm
]
pipeline = IngestionPipeline(transformations=extractors)
nodes = pipeline.run(documents = documents)

print("Creating index")
index = VectorStoreIndex(nodes, service_context=service_context)
index.storage_context.persist("index/")

This sets up the docstore within the index and persists it so you can do:

from llama_index.retrievers import BM25Retriever

vector_retriever = index.as_retriever(similarity_top_k=2)

bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=2
)

and finally

from llama_index.retrievers import QueryFusionRetriever

retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=2,
    num_queries=4,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
    # query_gen_prompt="...",  # we could override the query generation prompt here
)

The only thing the docs example linked above is missing is replacing the llm calls with an AzureOpenAI class for your deployment, and creating the nodes explicitly if that is something you want to do.

Hopefully that helped

0 replies

blackhawk-616 · 2024-01-09T11:46:31Z

blackhawk-616
Jan 9, 2024
Author

Thanks @logan-markewich @lukevrobbins In fact I was looking for a more production grade solution involving ElasticSearch as the DB for storing both text as well as embeddings. Both solutions mentioned here uses local DBs which cannot be used in production. Could you suggest some example which involves ElasticSearch as DB.

Note: ElasticSearch has the RRF implemented internally. I am looking for a way in which llama-index can actually make use of this inbuild capability rather than implementing it using llama-index.

1 reply

logan-markewich Jan 10, 2024
Maintainer

We don't have an elastic search docstore integration right now (I guess we probably could?) You options are redis, mongodb, firestore, or dynamodb for docstore/indexstore.

Foe the vector store, we of course have a bunch of other stuff

blackhawk-616 · 2024-01-10T05:03:58Z

blackhawk-616
Jan 10, 2024
Author

@logan-markewich
This documentation shows how to ingest and query with the ElasticSearch DB- https://docs.llamaindex.ai/en/latest/examples/vector_stores/Elasticsearch_demo.html#

During ingestion both text and embedding must be stored and during retrieval RRF should be used to get the most similar chunks of data. Is there a way in llama-index to support the hybrid retrieval in the way mentioned here?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Building an Index that supports a hybrid search comprising of Full text(BM-25) and vector searches #9837

Uh oh!

{{title}}

Uh oh!

Replies: 4 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

Building an Index that supports a hybrid search comprising of Full text(BM-25) and vector searches #9837

Uh oh!

blackhawk-616 Jan 4, 2024

Replies: 4 comments · 1 reply

Uh oh!

logan-markewich Jan 7, 2024 Maintainer

Uh oh!

Uh oh!

lukevrobbins Jan 9, 2024

Uh oh!

Uh oh!

blackhawk-616 Jan 9, 2024 Author

Uh oh!

logan-markewich Jan 10, 2024 Maintainer

Uh oh!

Uh oh!

blackhawk-616 Jan 10, 2024 Author

blackhawk-616
Jan 4, 2024

Replies: 4 comments 1 reply

logan-markewich
Jan 7, 2024
Maintainer

lukevrobbins
Jan 9, 2024

blackhawk-616
Jan 9, 2024
Author

logan-markewich Jan 10, 2024
Maintainer

blackhawk-616
Jan 10, 2024
Author