Search Books with Vector Embeddings

This project enables efficient similarity-based book recommendations using vector embeddings and a local database. It combines Sentence Transformers, Qdrant vector database, and the OpenAI API (or a local LLM) to process, store, and query book data, with a focus on Retrieval-Augmented Generation (RAG) to provide additional context to the LLM.

Features

Load and preprocess book data from a CSV file.
Encode book metadata into vector embeddings using SentenceTransformer.
Store embeddings in an in-memory Qdrant database for similarity searches.
Perform similarity searches based on user prompts.
Use RAG to send user prompts along with relevant context retrieved from the vector database to an LLM, enabling the generation of informed and accurate responses.

Project Structure

search-books/
├── .env                  # Environment variables (e.g., OpenAI API key)
├── .gitignore            # Ignored files and directories
├── books-dataset.csv     # Dataset with book details (modified from Kaggle)
├── requirements.txt      # Python dependencies
├── search-books.ipynb    # Jupyter Notebook with step-by-step workflow
├── search-books.py       # Python script version of the project
└── openai_response.json  # Saved response from OpenAI (if generated)

Prerequisites

Before running the project, ensure you have the following installed:

Python 3.8 or higher
A virtual environment for Python dependencies
Access to the OpenAI API or a compatible local LLM

Dataset Information

The dataset used in this project was downloaded from Kaggle: Goodreads Books Dataset
The original dataset was modified to reduce processing time by:
- Removing many rows.
- Adding a few new rows with custom concepts to test the RAG system and ensure the LLM uses the context provided by the vector database rather than its pre-trained knowledge.

Example of the first row in the dataset:

bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
1,Harry Potter and the Half-Blood Prince (Harry Potter  #6),J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,eng,652,2095690,27591,9/16/2006,Scholastic Inc.

Setup Instructions

Clone the repository:

git clone https://github.com/<your-username>/search-books.git
cd search-books

Create a virtual environment and activate it:

python -m venv myenv
source myenv/bin/activate   # On Windows: myenv\Scripts�ctivate

Install dependencies:
```
pip install -r requirements.txt
```
Set up the .env file with your OpenAI API key (if using OpenAI):
```
OPENAI_API_KEY=your-openai-api-key
```

How to Run

Run the Jupyter Notebook: Open search-books.ipynb in a Jupyter environment and execute cells sequentially to process the data, encode embeddings, and perform searches.
Run the Python Script: Alternatively, execute the Python script directly:
```
python search-books.py
```

Key Functionality

1. Loading Data

Loads book data from books-dataset.csv.
Preprocesses and structures data for encoding.

2. Vector Encoding

Encodes book metadata (title, authors) into vector embeddings using SentenceTransformer.

3. Qdrant Database

Stores vector embeddings in an in-memory Qdrant database.
Enables efficient similarity searches.

4. Retrieval-Augmented Generation (RAG)

Accepts user prompts to find books with similar characteristics.
Retrieves top recommendations using cosine similarity from the vector database.
Sends the retrieved context along with the user prompt to an LLM to generate informed responses.

Example Usage

User Prompt

Show me books related to Quantum Bioinformatics for expert users.

Example Output

[
  {
    "bookID": 22874,
    "title": "Learn Quantum Bioinformatics Basics",
    "authors": "Dr. Evelyn Harper",
    "average_rating": 3.99,
    "isbn": "0553297600",
    "isbn13": 9780553297607,
    "language_code": "en-US",
    "num_pages": 480,
    "ratings_count": 5020,
    "text_reviews_count": 101,
    "publication_date": "7/21/2010",
    "publisher": "BZY Media"
  },
  {
    "bookID": 22874,
    "title": "Advanced Quantum Bioinformatics",
    "authors": "Dr. Evelyn Harper",
    "average_rating": 3.99,
    "isbn": "0553297600",
    "isbn13": 9780553297607,
    "language_code": "en-US",
    "num_pages": 480,
    "ratings_count": 5020,
    "text_reviews_count": 101,
    "publication_date": "7/21/2010",
    "publisher": "BZY Media"
  },
  {
    "bookID": 22874,
    "title": "Quantum Bioinformatics in a Nutshell",
    "authors": "Dr. Adrian Sinclair",
    "average_rating": 3.99,
    "isbn": "0553297600",
    "isbn13": 9780553297607,
    "language_code": "en-US",
    "num_pages": 480,
    "ratings_count": 5020,
    "text_reviews_count": 101,
    "publication_date": "7/21/2010",
    "publisher": "NewX Media"
  }
]

LLM Response Example (Local LLaMA Model)

After retrieving relevant data from the vector database, a local LLM was used to generate a detailed recommendation based on the user prompt and retrieved context:

Example Response

For expert users, I recommend the following top-rated book on Quantum Bioinformatics:

**"Advanced Quantum Bioinformatics" by Dr. Evelyn Harper**

This book is an in-depth guide that covers the latest advancements in Quantum Bioinformatics. It provides a comprehensive overview of the subject, including the principles of quantum mechanics, quantum computing, and their applications in bioinformatics.

**Rating:** 3.99/5
**Number of Reviews:** 5020
**Publication Date:** 7/21/2010
**Publisher:** BZY Media

This book is suitable for expert users who want to delve deeper into the subject and stay up-to-date with the latest developments in Quantum Bioinformatics.

Additional Information

This project is based on the Coursera guided project "Introduction to Retrieval Augmented Generation (RAG)" by Alfredo Deza. During the course, we built an end-to-end RAG system using open-source tools such as Pandas, SentenceTransformers, and Qdrant. The implementation was extended to work with both OpenAI’s GPT-4 and a local LLM downloaded from Hugging Face.

Local LLM Details

Model: LLaMA 3.2 3B Instruct
Size: 2.62 GB
License: LLaMA 3.2
File: Llama-3.2-3B-Instruct.Q6_K.llamafile

The local LLM was used for testing to ensure the application works seamlessly with both cloud-based and local AI models.

Troubleshooting

Missing API Key: Ensure the .env file is correctly configured with your OpenAI API key if using OpenAI.
Deprecation Warnings: Some methods in Qdrant (e.g., recreate_collection) may show deprecation warnings. Follow the latest Qdrant documentation for updates.

Dependencies

The required dependencies are listed in requirements.txt. Major dependencies include:

pandas (Data processing)
numpy (Numerical operations)
sentence-transformers (Vector embeddings)
qdrant-client (In-memory vector database)
openai (Optional for conversational AI)

Limitations

Currently uses an in-memory Qdrant database. For production, consider a persistent deployment.
Designed for text-based similarity searches, not advanced analytics.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Acknowledgments

Special thanks to:

The Sentence Transformers team for their excellent pre-trained models.
The Qdrant team for their powerful vector search capabilities.
OpenAI for their state-of-the-art conversational AI models.
Hugging Face for providing accessible local LLM models.
Kaggle for the original Goodreads Books Dataset.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Search Books with Vector Embeddings

Features

Project Structure

Prerequisites

Dataset Information

Setup Instructions

How to Run

Key Functionality

1. Loading Data

2. Vector Encoding

3. Qdrant Database

4. Retrieval-Augmented Generation (RAG)

Example Usage

User Prompt

Example Output

LLM Response Example (Local LLaMA Model)

Example Response

Additional Information

Local LLM Details

Troubleshooting

Dependencies

Limitations

License

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.gitignore		.gitignore
README.md		README.md
books-dataset.csv		books-dataset.csv
openai_response.json		openai_response.json
requirements.txt		requirements.txt
search-books.ipynb		search-books.ipynb
search-books.py		search-books.py

khaledjamal/rag-powered-book-search

Folders and files

Latest commit

History

Repository files navigation

Search Books with Vector Embeddings

Features

Project Structure

Prerequisites

Dataset Information

Setup Instructions

How to Run

Key Functionality

1. Loading Data

2. Vector Encoding

3. Qdrant Database

4. Retrieval-Augmented Generation (RAG)

Example Usage

User Prompt

Example Output

LLM Response Example (Local LLaMA Model)

Example Response

Additional Information

Local LLM Details

Troubleshooting

Dependencies

Limitations

License

Acknowledgments

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages