Semantic document search using dates to narrow down the search #2759
Hi, I want to run a semantic search over 80+ million news articles, each of which has a publishing date. Before running the semantic search, I want to narrow down the articles and search only within a given date range. I managed to use Elasticsearch to run a semantic search over 7 million news articles using just an input text as the query, and I got pretty solid results. However, with the whole dataset written to the document store, it doesn't make sense to search all 80 million articles when I know the date range of what I'm looking for! Each row of my dataset has these features: Topic, URL, Date, Context. I converted the dataframe into Haystack's DocumentStore format; the result is a dictionary as below for each news article:
I'm using
To be clear, I have a sentence with a date, and I want to find the most relevant news articles for that sentence. Since I have both the sentence's date and the articles' publishing dates, I want to narrow the articles down to the given date (maybe a few days before and after) and then run the semantic search. Here's the example of using filters for I can use
Thanks in advance for the time you will kindly dedicate to helping me out with this issue.
P.S. I wanted to use FAISS, but building the index for a dataset of this size took forever. I came up with the index below based on this article guide
Replies: 1 comment 2 replies
Hi @Squishy-33, thank you for explaining your setup in such detail. Your `BM25Retriever` can handle any filters via the `filters` parameter in its `retrieve` method: haystack/haystack/nodes/retriever/sparse.py, Line 114 in a2905d0
However, note that the search won't really be a semantic search but only a keyword-based search if you use `ElasticsearchDocumentStore` with a `BM25Retriever`. For semantic search you would need to use an `EmbeddingRetriever`, and with your number of documents a DocumentStore other than `ElasticsearchDocumentStore` makes more sense (FAISS or Milvus). The indexing takes much longer tha…
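As a rough illustration of the date-window filter discussed in this thread, here is a minimal sketch that builds a filter dictionary for a few days before and after a given date. It assumes the publishing date is stored in a metadata field called `date` as an ISO 8601 string (`YYYY-MM-DD`) and that the installed Haystack version accepts the `$gte`/`$lte` comparison operators in `filters`. The helper name `date_window_filter` is hypothetical, not part of Haystack.

```python
from datetime import date, timedelta


def date_window_filter(center: date, days_before: int = 3,
                       days_after: int = 3, field: str = "date") -> dict:
    """Build a range filter covering a window of days around `center`.

    `field` is the (assumed) name of the metadata field holding the
    publishing date; adjust it to match your own documents.
    """
    start = center - timedelta(days=days_before)
    end = center + timedelta(days=days_after)
    # ISO 8601 strings (YYYY-MM-DD) sort in chronological order.
    return {field: {"$gte": start.isoformat(), "$lte": end.isoformat()}}


filters = date_window_filter(date(2022, 6, 15), days_before=2, days_after=2)
print(filters)
# {'date': {'$gte': '2022-06-13', '$lte': '2022-06-17'}}

# Assumed usage with an already configured retriever (needs a running
# document store, so it is left commented out here):
# docs = retriever.retrieve(query="your sentence", filters=filters, top_k=10)
```

The same dictionary should work with any retriever whose `retrieve` method takes a `filters` argument, so the date restriction is applied before ranking regardless of whether the ranking is keyword-based or embedding-based.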