Building an index that supports hybrid search comprising full-text (BM25) and vector searches #9837
Replies: 4 comments 1 reply
-
One tricky part with BM25 is that you need to persist the nodes somewhere. Usually this is done with a docstore (e.g. using Redis, MongoDB, or saving to disk), or you can simply serialize/pickle the nodes. Then you can combine BM25 with a vector index retriever.
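For example, here is a minimal sketch of persisting parsed nodes in a simple on-disk docstore and rebuilding the BM25 retriever from it later (the `./data` and `./storage` paths are just placeholders):

```python
from llama_index import SimpleDirectoryReader, StorageContext
from llama_index.node_parser import SentenceSplitter
from llama_index.retrievers import BM25Retriever
from llama_index.storage.docstore import SimpleDocumentStore

# parse documents into nodes and stash them in a docstore
documents = SimpleDirectoryReader("./data").load_data()
nodes = SentenceSplitter().get_nodes_from_documents(documents)
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)

# persist the docstore to disk so the nodes survive restarts
StorageContext.from_defaults(docstore=docstore).persist("./storage")

# later: reload the docstore and build BM25 on top of it
docstore = SimpleDocumentStore.from_persist_dir("./storage")
bm25_retriever = BM25Retriever.from_defaults(docstore=docstore, similarity_top_k=2)
```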
-
@blackhawk-616 You can set up the LLM with `llm = AzureOpenAI()`, found here: link. For the hybrid search, use a QueryFusionRetriever: link above.

Once you create your documents, either with a loader or with the Document class, you can use the node parser to create nodes:

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex, ServiceContext
from llama_index.text_splitter import SentenceSplitter

documents = SimpleDirectoryReader("./data").load_data()

text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
service_context = ServiceContext.from_defaults(text_splitter=text_splitter)

index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)
```

or

```python
from llama_index.extractors import KeywordExtractor
from llama_index.ingestion import IngestionPipeline

print("Creating nodes")
extractors = [
    KeywordExtractor(keywords=5, llm=llm)  # this is where you use the AzureOpenAI class from above
]
# run the text splitter first so the documents are actually split into nodes
pipeline = IngestionPipeline(transformations=[text_splitter, *extractors])
nodes = pipeline.run(documents=documents)

print("Creating index")
index = VectorStoreIndex(nodes, service_context=service_context)
index.storage_context.persist("index/")
```

This sets up the docstore within the index and persists it, so you can do:

```python
from llama_index.retrievers import BM25Retriever

vector_retriever = index.as_retriever(similarity_top_k=2)

bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore, similarity_top_k=2
)
```

and finally:

```python
from llama_index.retrievers import QueryFusionRetriever

retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    similarity_top_k=2,
    num_queries=4,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
    # query_gen_prompt="...",  # we could override the query generation prompt here
)
```

The only thing the docs example linked above is missing is replacing the LLM calls with an AzureOpenAI class for your deployment, and creating the nodes explicitly if that is something you want to do. Hopefully that helped! A quick way to sanity-check the fused retriever is sketched below.
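You can call the retriever directly or wrap it in a query engine (the query string here is just an example):

```python
from llama_index.query_engine import RetrieverQueryEngine

# retrieve fused results directly
results = retriever.retrieve("What did the author do growing up?")
for node_with_score in results:
    print(node_with_score.score, node_with_score.node.get_content()[:100])

# or wrap the retriever in a query engine for end-to-end Q&A
query_engine = RetrieverQueryEngine.from_args(retriever, service_context=service_context)
response = query_engine.query("What did the author do growing up?")
print(response)
```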
-
Thanks @logan-markewich @lukevrobbins. In fact, I was looking for a more production-grade solution involving Elasticsearch as the database for storing both the text and the embeddings. Both solutions mentioned here use local storage, which cannot be used in production. Could you suggest an example that uses Elasticsearch as the DB? Note: Elasticsearch has RRF implemented internally. I am looking for a way for llama-index to make use of this built-in capability rather than reimplementing RRF in llama-index. A rough sketch of what I have in mind is below.
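(Index name and URL are placeholders; I am assuming the hybrid query mode of llama-index's ElasticsearchStore maps to Elasticsearch's native RRF, but that is exactly what I would like confirmed.)

```python
from llama_index import VectorStoreIndex, StorageContext
from llama_index.vector_stores import ElasticsearchStore

# store both the node text and the embeddings in Elasticsearch
es_store = ElasticsearchStore(
    index_name="my_hybrid_index",    # placeholder index name
    es_url="http://localhost:9200",  # placeholder endpoint
)
storage_context = StorageContext.from_defaults(vector_store=es_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, service_context=service_context
)

# ideally, hybrid mode would let Elasticsearch fuse BM25 and kNN results with its own RRF
retriever = index.as_retriever(vector_store_query_mode="hybrid", similarity_top_k=2)
```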
-
@logan-markewich During ingestion, both the text and the embedding must be stored, and during retrieval, RRF should be used to get the most similar chunks of data. Is there a way in llama-index to support hybrid retrieval as described here?
-
I am trying to build an index that supports a hybrid search mechanism consisting of both BM25 and vector searches.
I would like to know how I can do this with llama-index in particular. I am using OpenAI embeddings via an Azure deployment of the ada embedding model. The documentation only shows how to load documents into an index that has already been created, and it only shows the vector search implementation. Any help is appreciated. For context, my embedding setup looks roughly like the sketch below.
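(Deployment names, endpoint, and API version are placeholders for my actual Azure resources.)

```python
from llama_index import ServiceContext
from llama_index.embeddings import AzureOpenAIEmbedding
from llama_index.llms import AzureOpenAI

llm = AzureOpenAI(
    model="gpt-35-turbo",
    deployment_name="my-gpt-deployment",  # placeholder
    api_key="<api-key>",
    azure_endpoint="https://<resource>.openai.azure.com/",
    api_version="2023-07-01-preview",
)
embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="my-ada-deployment",  # placeholder
    api_key="<api-key>",
    azure_endpoint="https://<resource>.openai.azure.com/",
    api_version="2023-07-01-preview",
)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
```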