SemanticSearch

A FastAPI-based semantic search service for book summaries, built on the FAISS vector similarity search library and Sentence Transformers embeddings.

Features

Semantic search across book summaries using embeddings
Fast vector similarity search with FAISS
RESTful API with FastAPI
Document processing pipeline for PDFs and DOCX files
Automatic text summarization with BART
Page-level search to find the most relevant page for a query
Automatic parameter optimization based on system capabilities
Scalable architecture for processing millions of books
Parallel processing with multiprocessing
Pagination support for large result sets
Memory-efficient processing with batching

Project Structure

├── main.py                 # FastAPI application entry point
├── Preprocessing/          # Data preparation scripts
│   ├── faiss_indexer.py    # Creates FAISS index from book summaries
│   ├── generate_summaries.py # Generates summaries from books
│   ├── all-MiniLM-L6-v2/   # Sentence transformer model
│   └── bart-large-cnn/     # Text summarization model
├── README.md               # Main documentation
├── ADD_NEW_BOOKS.md        # Guide for adding new books to an existing index
├── AUTO_PARAMS.md          # Documentation for automatic parameter optimization
├── LARGE_SCALE.md          # Guide for processing large datasets
└── requirements.txt        # Python dependencies

Setup and Installation

Clone the repository
Create a virtual environment: python -m venv env
Activate the environment:
- Windows: env\Scripts\activate
- Linux/Mac: source env/bin/activate
Install dependencies: pip install -r requirements.txt

Documentation

ADD_NEW_BOOKS.md - Instructions for adding new books to an existing index
AUTO_PARAMS.md - Details on automatic parameter optimization
LARGE_SCALE.md - Guide for processing large datasets

Usage

Preprocessing

Place your PDF/DOCX documents in the Preprocessing/Books/ directory
Generate summaries with automatic parameter optimization:
```
python Preprocessing/generate_summaries.py
```
The script detects your system's capabilities and chooses its parameters to match (see AUTO_PARAMS.md for details).

You can also manually specify parameters:
```
python Preprocessing/generate_summaries.py --batch-size 100 --workers 4 --no-auto
```
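
How the scripts actually choose their defaults is covered in AUTO_PARAMS.md, not in this README. Purely as an illustration of the idea, a heuristic of this kind might derive values for `--workers` and `--batch-size` from the CPU count and available memory. The function name `suggest_parameters`, the `memory_gb` argument, and every threshold below are hypothetical and are not taken from the project's code.

```python
import os


def suggest_parameters(cpu_count=None, memory_gb=8.0):
    """Hypothetical heuristic, NOT the project's real logic.

    Leaves one core free for the OS and scales the batch size with
    memory, clamped to the range 10..1000.
    """
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    workers = max(1, cpus - 1)
    batch_size = max(10, min(1000, int(memory_gb * 25)))
    return {"workers": workers, "batch_size": batch_size}


if __name__ == "__main__":
    params = suggest_parameters(cpu_count=5, memory_gb=4.0)
    # Build the manual command shown in this README from the suggested values.
    print(
        "python Preprocessing/generate_summaries.py "
        f"--batch-size {params['batch_size']} "
        f"--workers {params['workers']} --no-auto"
    )
```

Passing `--no-auto` with explicit values, as in the command above, is the way to override whatever the automatic mode would pick.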

Create FAISS index with automatic parameter optimization:

python Preprocessing/faiss_indexer.py

Or manually specify parameters:

python Preprocessing/faiss_indexer.py --batch-size 1000 --use-ivf 10000 --nprobe 16 --no-auto
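
The `--use-ivf` and `--nprobe` flags suggest an inverted-file (IVF) index. In such an index, vectors are grouped into lists around centroids, and a query scans only the `nprobe` lists whose centroids are closest to it. The toy sketch below shows that speed/recall trade-off in pure Python. It is not FAISS code and not this project's indexer, and it assumes (the README does not confirm this) that the value given to `--use-ivf` is the number of lists. The function names and sample vectors are made up for illustration.

```python
import math


def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists


def search_ivf(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in lists[c]]
    candidates.sort(key=lambda vid: math.dist(query, vectors[vid]))
    return candidates[:k]


# Toy data: two clusters plus one vector (id 4) lying between them.
VECTORS = [(0, 0), (0, 1), (10, 10), (10, 11), (4, 4)]
CENTROIDS = [(0, 0), (10, 10)]  # fixed by hand; a real index learns these
LISTS = build_ivf(VECTORS, CENTROIDS)

if __name__ == "__main__":
    query = (6, 6)
    # Probing one list misses the true nearest neighbour (id 4).
    print(search_ivf(query, VECTORS, CENTROIDS, LISTS, k=1, nprobe=1))  # [2]
    # Probing both lists finds it, at the cost of scanning more vectors.
    print(search_ivf(query, VECTORS, CENTROIDS, LISTS, k=1, nprobe=2))  # [4]
```

A larger `nprobe` generally improves recall and slows each query, so it is worth tuning against your own data rather than copying the value shown above.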

Running the API

Development Mode

uvicorn main:app --reload

Production Mode

uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4

Using Docker

docker-compose up -d

The API is served at http://localhost:8000, with interactive API documentation at http://localhost:8000/docs.

API Endpoints

POST /search - Search for books by semantic similarity
- Request body: {"query": "your search query", "top_k": 5, "max_score": 1.0, "page": 1, "page_size": 10}
- Response includes the most relevant page from each document matching the query
GET /health - Health check endpoint
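
As a minimal client sketch, the request body above can be sent with Python's standard library alone. It assumes the API is running at the default address given in this README. The helper names `build_search_request` and `search` are illustrative, and because the response schema is not documented here, the decoded JSON is returned without interpretation.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/search"  # default address from this README


def build_search_request(query, top_k=5, max_score=1.0, page=1,
                         page_size=10, url=API_URL):
    """Build a POST /search request with the documented body fields."""
    payload = {
        "query": query,
        "top_k": top_k,
        "max_score": max_score,
        "page": page,
        "page_size": page_size,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, **kwargs):
    """Send the request and decode the JSON reply (needs a running server)."""
    request = build_search_request(query, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only works while the service is up, e.g. via `uvicorn main:app --reload`.
    print(search("a story about a lighthouse keeper", top_k=3))
```

Building the request separately from sending it keeps the payload easy to inspect or log before any network call is made.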

License

MIT
