Semantic Search Engine for Arxiv Research Papers

This repository contains a semantic search application for ArXiv research papers. It uses a pre-trained DistilBERT model to generate vector embeddings and a Qdrant vector database to store them, retrieves the most relevant research papers for a user query, and displays the search results.

Pipeline

Data acquisition:
- We used the ArXiv-10 dataset from HuggingFace, which contains 100k rows of research papers across 10 disciplines (computer science, astrophysics, quantum physics, statistics, ...). Each row holds the id of the paper, its abstract and its label (discipline/domain).
Embedding Creation:
- The DistilBERT model was chosen because of limited compute power (no GPU). It encodes the dataset, more specifically the abstract column, into dense vector representations of dimension 768. This process was done in batches because of the size of the dataset.
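
The batched encoding step could look roughly like the sketch below. It is not the repository's actual code: the checkpoint name `distilbert-base-uncased`, the mean-pooling strategy, the batch size and the helper names (`batched`, `encode_in_batches`, `make_distilbert_encoder`) are all assumptions made for illustration.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(
    texts: Sequence[str],
    encode_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 32,  # assumed value, tune to the available memory
) -> List[Vector]:
    """Encode `texts` so the model only sees one batch at a time."""
    vectors: List[Vector] = []
    for batch in batched(texts, batch_size):
        vectors.extend(encode_batch(batch))
    return vectors


def make_distilbert_encoder(
    model_name: str = "distilbert-base-uncased",  # assumed checkpoint
) -> Callable[[Sequence[str]], List[Vector]]:
    """Build a CPU encoder; needs `torch` and `transformers` installed."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    def encode_batch(batch: Sequence[str]) -> List[Vector]:
        inputs = tokenizer(list(batch), padding=True, truncation=True,
                           max_length=512, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (batch, tokens, 768)
        # Mean pooling over non-padding tokens (assumed pooling strategy).
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return pooled.tolist()

    return encode_batch
```

With the dependencies installed, `encode_in_batches(abstracts, make_distilbert_encoder())` would return one 768-dimensional vector per abstract. Keeping the encoder behind a plain callable also makes the batching logic easy to test without loading the model.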
Vector Database:
- Qdrant Cloud was used to store the pre-computed embeddings of the research papers.
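
As a rough sketch of how pre-computed embeddings might be stored, the snippet below pairs each vector with an id and a payload and upserts them through the `qdrant-client` package. The helper names `build_points` and `upload_points` are hypothetical, and using integer row indices as point ids is an assumption (Qdrant point ids must be unsigned integers or UUIDs, so an arXiv id string would have to live in the payload instead).

```python
from typing import Any, Dict, List, Sequence


def build_points(
    ids: Sequence[int],
    vectors: Sequence[Sequence[float]],
    payloads: Sequence[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Pair each embedding with its id and payload, checking lengths match."""
    if not (len(ids) == len(vectors) == len(payloads)):
        raise ValueError("ids, vectors and payloads must have the same length")
    return [
        {"id": i, "vector": list(v), "payload": p}
        for i, v, p in zip(ids, vectors, payloads)
    ]


def upload_points(url: str, api_key: str, collection_name: str,
                  points: List[Dict[str, Any]], dim: int = 768) -> None:
    """Create a cosine-distance collection if needed and upsert the points.

    Needs `qdrant-client`; `collection_exists` is only available in
    recent releases of that package.
    """
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams

    client = QdrantClient(url=url, api_key=api_key)
    if not client.collection_exists(collection_name):
        client.create_collection(
            collection_name=collection_name,
            vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
        )
    client.upsert(
        collection_name=collection_name,
        points=[PointStruct(**point) for point in points],
    )
```

For a dataset of this size, the points would likely need to be upserted in chunks rather than in a single call.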
Semantic Search:
- A user enters a query into the Streamlit app and can choose how many search results to display.
- The query is encoded into a vector using DistilBERT.
- The vector is sent to the Qdrant Cloud instance. Using Qdrant's built-in semantic search with cosine similarity as the similarity metric, we retrieve the top-k most similar PointStructs and their payloads.
- The payloads of the retrieved PointStructs, including the title and the text of the abstract, are displayed together with the similarity score in Streamlit cards within the interface.
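
To make the ranking step concrete, here is a small pure-Python sketch of cosine-similarity top-k retrieval. It only illustrates what the metric computes: Qdrant itself relies on an approximate nearest-neighbour index rather than a brute-force scan, and the names `cosine_similarity` and `top_k` as well as the point layout are assumptions for this example.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return dot(a, b) / (|a| * |b|), or 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    points: List[Dict[str, Any]],
    k: int,
) -> List[Tuple[float, Dict[str, Any]]]:
    """Score every stored point against the query and keep the k best."""
    scored = [
        (cosine_similarity(query, point["vector"]), point["payload"])
        for point in points
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-dimensional example; real embeddings would have 768 dimensions.
points = [
    {"vector": [1.0, 0.0], "payload": {"title": "A"}},
    {"vector": [0.0, 1.0], "payload": {"title": "B"}},
    {"vector": [1.0, 1.0], "payload": {"title": "C"}},
]
results = top_k([1.0, 0.0], points, k=2)
```

In the toy example, the point whose vector matches the query direction ranks first, followed by the point at a 45 degree angle to it. Each result carries both the score and the payload, which is the information the interface displays.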

Tech Stack

Python + PyTorch : For the implementation.
HuggingFace (Dataset + Transformers) : For data acquisition and generating vector embeddings.
Qdrant Cloud / Docker : As a vector database for storing and indexing the data, and for searching and retrieving the search results.
Streamlit : For building the user interface.
Box / PyYAML : For managing configuration files.

Installation and Setup

Prerequisites

Python 3.8+
A Qdrant Cloud account
Docker

Steps

Clone the Repository:

git clone https://github.com/Wissemamr/arxiv_semantic_search_engine.git

Create a Virtual Environment:

python3 -m venv venv
source venv/bin/activate

Install Dependencies:

pip install -r requirements.txt

With Docker

Pull the latest Qdrant image from Docker Hub

docker pull qdrant/qdrant

Run the service

docker run -p 6333:6333 -p 6334:6334 \
   -v $(pwd)/qdrant_storage:/qdrant/storage:z \
   qdrant/qdrant

Set Up Configuration: Create a config.yaml file in the root directory with the following structure:

qdrant:
  url: "<qdrant-cloud-url>"
  api_key: "<api-key>"
  collection_name: "arxiv-collection"
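
A minimal sketch of how this file might be read with PyYAML and Box is shown below. The function names `validate_config` and `load_config` are hypothetical and not taken from the repository. The validation step simply checks that the three keys shown above are present and non-empty.

```python
from typing import Any, Dict

REQUIRED_QDRANT_KEYS = ("url", "api_key", "collection_name")


def validate_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Raise ValueError if the `qdrant` section or one of its keys is missing."""
    qdrant = config.get("qdrant")
    if not isinstance(qdrant, dict):
        raise ValueError("config must contain a 'qdrant' mapping")
    missing = [key for key in REQUIRED_QDRANT_KEYS if not qdrant.get(key)]
    if missing:
        raise ValueError(f"missing qdrant settings: {', '.join(missing)}")
    return config


def load_config(path: str = "config.yaml"):
    """Load and validate the YAML file; needs `pyyaml` and `python-box`."""
    import yaml
    from box import Box

    with open(path, "r", encoding="utf-8") as handle:
        raw = yaml.safe_load(handle) or {}  # an empty file parses to None
    return Box(validate_config(raw))
```

Wrapping the parsed dictionary in `Box` allows attribute-style access, for example `config.qdrant.collection_name`.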

Run the app:
```
cd src
streamlit run app.py
```
Access the app:
- Open a browser and navigate to http://localhost:8501.
Perform search:
- Enter your search query in the search bar, use the slider to select the number of search results to return, and click Search.

Demonstration

Future Improvements

Use a richer dataset or add more columns to the current dataset such as publication year, the list of keywords, citation link and the link to access the document directly.
Enable pagination for large result sets.
Integrate additional metadata filters (e.g., year, author).
Deploy the app to a cloud platform for public access.
