This project enables efficient similarity-based book recommendations using vector embeddings and a local database. It combines Sentence Transformers, Qdrant vector database, and the OpenAI API (or a local LLM) to process, store, and query book data, with a focus on Retrieval-Augmented Generation (RAG) to provide additional context to the LLM.
- Load and preprocess book data from a CSV file.
- Encode book metadata into vector embeddings using
SentenceTransformer
. - Store embeddings in an in-memory Qdrant database for similarity searches.
- Perform similarity searches based on user prompts.
- Use RAG to send user prompts along with relevant context retrieved from the vector database to an LLM, enabling the generation of informed and accurate responses.
search-books/
├── .env # Environment variables (e.g., OpenAI API key)
├── .gitignore # Ignored files and directories
├── books-dataset.csv # Dataset with book details (modified from Kaggle)
├── requirements.txt # Python dependencies
├── search-books.ipynb # Jupyter Notebook with step-by-step workflow
├── search-books.py # Python script version of the project
└── openai_response.json # Saved response from OpenAI (if generated)
Before running the project, ensure you have the following installed:
- Python 3.8 or higher
- A virtual environment for Python dependencies
- Access to the OpenAI API or a compatible local LLM
- The dataset used in this project was downloaded from Kaggle: Goodreads Books Dataset
- The original dataset was modified to reduce processing time by:
- Removing many rows.
- Adding a few new rows with custom concepts to test the RAG system and ensure the LLM uses the context provided by the vector database rather than its pre-trained knowledge.
- Example of the first row in the dataset:
bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher 1,Harry Potter and the Half-Blood Prince (Harry Potter #6),J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,eng,652,2095690,27591,9/16/2006,Scholastic Inc.
- Clone the repository:
git clone https://github.com/<your-username>/search-books.git cd search-books
- Create a virtual environment and activate it:
python -m venv myenv source myenv/bin/activate # On Windows: myenv\Scripts�ctivate
- Install dependencies:
pip install -r requirements.txt
- Set up the
.env
file with your OpenAI API key (if using OpenAI):OPENAI_API_KEY=your-openai-api-key
- Run the Jupyter Notebook:
Open
search-books.ipynb
in a Jupyter environment and execute cells sequentially to process the data, encode embeddings, and perform searches. - Run the Python Script:
Alternatively, execute the Python script directly:
python search-books.py
- Loads book data from
books-dataset.csv
. - Preprocesses and structures data for encoding.
- Encodes book metadata (
title
,authors
) into vector embeddings usingSentenceTransformer
.
- Stores vector embeddings in an in-memory Qdrant database.
- Enables efficient similarity searches.
- Accepts user prompts to find books with similar characteristics.
- Retrieves top recommendations using cosine similarity from the vector database.
- Sends the retrieved context along with the user prompt to an LLM to generate informed responses.
Show me books related to Quantum Bioinformatics for expert users.
[
{
"bookID": 22874,
"title": "Learn Quantum Bioinformatics Basics",
"authors": "Dr. Evelyn Harper",
"average_rating": 3.99,
"isbn": "0553297600",
"isbn13": 9780553297607,
"language_code": "en-US",
"num_pages": 480,
"ratings_count": 5020,
"text_reviews_count": 101,
"publication_date": "7/21/2010",
"publisher": "BZY Media"
},
{
"bookID": 22874,
"title": "Advanced Quantum Bioinformatics",
"authors": "Dr. Evelyn Harper",
"average_rating": 3.99,
"isbn": "0553297600",
"isbn13": 9780553297607,
"language_code": "en-US",
"num_pages": 480,
"ratings_count": 5020,
"text_reviews_count": 101,
"publication_date": "7/21/2010",
"publisher": "BZY Media"
},
{
"bookID": 22874,
"title": "Quantum Bioinformatics in a Nutshell",
"authors": "Dr. Adrian Sinclair",
"average_rating": 3.99,
"isbn": "0553297600",
"isbn13": 9780553297607,
"language_code": "en-US",
"num_pages": 480,
"ratings_count": 5020,
"text_reviews_count": 101,
"publication_date": "7/21/2010",
"publisher": "NewX Media"
}
]
After retrieving relevant data from the vector database, a local LLM was used to generate a detailed recommendation based on the user prompt and retrieved context:
For expert users, I recommend the following top-rated book on Quantum Bioinformatics:
**"Advanced Quantum Bioinformatics" by Dr. Evelyn Harper**
This book is an in-depth guide that covers the latest advancements in Quantum Bioinformatics. It provides a comprehensive overview of the subject, including the principles of quantum mechanics, quantum computing, and their applications in bioinformatics.
**Rating:** 3.99/5
**Number of Reviews:** 5020
**Publication Date:** 7/21/2010
**Publisher:** BZY Media
This book is suitable for expert users who want to delve deeper into the subject and stay up-to-date with the latest developments in Quantum Bioinformatics.
This project is based on the Coursera guided project "Introduction to Retrieval Augmented Generation (RAG)" by Alfredo Deza. During the course, we built an end-to-end RAG system using open-source tools such as Pandas
, SentenceTransformers
, and Qdrant
. The implementation was extended to work with both OpenAI’s GPT-4 and a local LLM downloaded from Hugging Face.
- Model: LLaMA 3.2 3B Instruct
- Size: 2.62 GB
- License: LLaMA 3.2
- File: Llama-3.2-3B-Instruct.Q6_K.llamafile
The local LLM was used for testing to ensure the application works seamlessly with both cloud-based and local AI models.
- Missing API Key: Ensure the
.env
file is correctly configured with your OpenAI API key if using OpenAI. - Deprecation Warnings: Some methods in Qdrant (e.g.,
recreate_collection
) may show deprecation warnings. Follow the latest Qdrant documentation for updates.
The required dependencies are listed in requirements.txt
. Major dependencies include:
pandas
(Data processing)numpy
(Numerical operations)sentence-transformers
(Vector embeddings)qdrant-client
(In-memory vector database)openai
(Optional for conversational AI)
- Currently uses an in-memory Qdrant database. For production, consider a persistent deployment.
- Designed for text-based similarity searches, not advanced analytics.
This project is licensed under the MIT License. See the LICENSE
file for more details.
Special thanks to:
- The Sentence Transformers team for their excellent pre-trained models.
- The Qdrant team for their powerful vector search capabilities.
- OpenAI for their state-of-the-art conversational AI models.
- Hugging Face for providing accessible local LLM models.
- Kaggle for the original Goodreads Books Dataset.