With the ever-growing body of knowledge on the internet, finding the most relevant information becomes increasingly difficult, because large collections of documents and texts are hard to navigate. Information Retrieval (IR) is the process of retrieving documents from a collection to satisfy a user’s information need.
In recent years, researchers have proposed combining sparse and dense representations to leverage both exact term matching and semantic similarity. Hybrid approaches combine the scores or representations of sparse and dense models in various ways, for example by weighted interpolation of their scores. It has also been proposed that a hybrid model can be used to trace the connections made by dense models, revealing possible biases and making it possible to improve the semantic links.
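As an illustration of weighted interpolation, the following is a minimal sketch rather than the method of any particular system. It assumes each retriever returns a dictionary of per-document scores for one query, that scores are min-max normalized before mixing (one common choice among several), and that a document missing from one retriever contributes zero from that side. The names `min_max_normalize`, `hybrid_scores`, `rank`, and the weight `alpha` are hypothetical.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so sparse and dense scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: no ranking signal, so map everything to 0.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    alpha: float = 0.5,
) -> Dict[str, float]:
    """Interpolate: alpha * sparse + (1 - alpha) * dense, per document.

    Assumption: a document absent from one retriever's results gets a
    normalized score of 0 from that retriever.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    s = min_max_normalize(sparse)
    d = min_max_normalize(dense)
    docs = set(s) | set(d)
    return {
        doc: alpha * s.get(doc, 0.0) + (1.0 - alpha) * d.get(doc, 0.0)
        for doc in docs
    }


def rank(scores: Dict[str, float]) -> List[str]:
    """Return document ids by descending score, ties broken by id."""
    return sorted(scores, key=lambda doc: (-scores[doc], doc))
```

Setting `alpha` to 1 recovers the sparse ranking and setting it to 0 recovers the dense ranking, so the same function covers all three configurations being compared.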
We analyze the responses of different IR techniques, specifically sparse search, dense search, and their hybridization, to evaluate which performs best across various categories of queries. Using content analysis, we identify different categories of queries, such as keywords versus sentences and questions versus descriptions, among others. Then, using data analysis tools such as NumPy, scikit-learn, and pandas, we perform a high-level evaluation and comparison of the performance of different sparse, dense, and hybrid models, such as BM25, SPLADE, and BERT, among others.
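The per-category comparison can be sketched as a simple aggregation. This is only an assumed layout, not the evaluation pipeline itself: it supposes that each query has already been labeled with a category and scored per model with some effectiveness metric, and it uses plain Python where a pandas group-by would serve equally well. The names `mean_by_category` and `best_model_per_category` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One row per (query, model): (query category, model name, metric value).
Row = Tuple[str, str, float]


def mean_by_category(rows: Iterable[Row]) -> Dict[str, Dict[str, float]]:
    """Average the metric for every (category, model) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # (category, model) -> [sum, count]
    for category, model, score in rows:
        cell = totals[(category, model)]
        cell[0] += score
        cell[1] += 1
    table: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (category, model), (total, count) in totals.items():
        table[category][model] = total / count
    return dict(table)


def best_model_per_category(table: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Pick the model with the highest mean in each category.

    Ties go to the alphabetically first model name.
    """
    return {
        category: max(sorted(models), key=models.get)
        for category, models in table.items()
    }
```

Keeping the table keyed by category first makes it straightforward to read off which family of models is preferable for, say, keyword queries as opposed to full-sentence questions.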