This project processes and indexes text documents using FAISS for efficient similarity search. It embeds both entire documents and individual sentences, so queries can target specific information within any document.
- Detects file encoding and extracts text from PDFs and other text files.
- Cleans and tokenizes text into sentences.
- Generates and stores embeddings for documents and sentences.
- Uses FAISS for efficient similarity searches.
- Stores document metadata and embeddings in SQLite.
- Python 3.10+
- NOTE: This project uses the faiss-gpu package, which may not work if your GPU does not support CUDA. In that case, consider switching to the faiss-cpu package instead.
- Clone the repository:
git clone https://github.com/yourusername/yourproject.git
cd yourproject
- Install dependencies:
make install-all
Use parse_doc(file_path) to extract and clean text from a document.
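The cleaning and sentence-splitting step can be sketched roughly as follows. This is a minimal illustration, not the project's actual parse_doc: the helper name clean_and_split and its regex-based splitting rule are assumptions, and the real implementation may use a dedicated tokenizer.

```python
import re


def clean_and_split(raw_text: str) -> list[str]:
    """Hypothetical helper: normalize whitespace, then split into sentences."""
    # Collapse runs of whitespace, including stray newlines from PDF extraction.
    text = re.sub(r"\s+", " ", raw_text).strip()
    if not text:
        return []
    # Split after '.', '!' or '?' when followed by whitespace.
    # This naive rule mis-splits abbreviations such as "e.g. this".
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


sentences = clean_and_split("FAISS  indexes vectors.\nSQLite stores\nmetadata! Is it fast?")
print(sentences)
```

Each returned sentence could then be embedded on its own, which is what makes sentence-level search possible.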
Call save_embedding(doc_id, doc_name, embedding, text, sentence_embeddings) to store document embeddings in the database and FAISS index.
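To make the SQLite side of this concrete, here is a small sketch of how an embedding might be stored as a BLOB and read back. The table layout, column names, and the to_blob/from_blob helpers are assumptions for illustration only; the schema used by save_embedding may differ.

```python
import sqlite3
from array import array


def to_blob(vector: list[float]) -> bytes:
    """Pack a vector as float32 bytes (hypothetical helper)."""
    return array("f", vector).tobytes()


def from_blob(blob: bytes) -> list[float]:
    """Unpack float32 bytes back into a list of floats (hypothetical helper)."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


# In-memory database for illustration; the schema below is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "doc_id INTEGER PRIMARY KEY, doc_name TEXT, text TEXT, embedding BLOB)"
)
conn.execute(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    (1, "example.txt", "Hello world.", to_blob([0.5, -1.0, 2.0])),
)
conn.commit()

# Read the row back and restore the vector from its byte representation.
doc_name, blob = conn.execute(
    "SELECT doc_name, embedding FROM documents WHERE doc_id = ?", (1,)
).fetchone()
restored = from_blob(blob)
print(doc_name, restored)
```

Storing vectors as float32 keeps the database copy in the same precision that FAISS works with, so the index could be rebuilt from SQLite if needed.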
Use FAISS to search for similar documents:
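# Encode the query, then fetch the 5 nearest documents.
query_embedding = model.encode("your query text")
# D holds the distances and I the index positions of the matches.
D, I = doc_index.search(query_embedding.reshape(1, -1), k=5)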
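To show what D and I contain, the sketch below reproduces the search in plain Python by brute force. It assumes doc_index is a flat L2 index (for example faiss.IndexFlatL2), which reports squared Euclidean distances; the document does not state the index type, and the function name brute_force_search is hypothetical.

```python
def brute_force_search(query, vectors, k):
    """Return (distances, indices) of the k nearest vectors by squared L2 distance."""
    scored = sorted(
        (sum((q - v) ** 2 for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    top = scored[:k]
    # Wrap in an outer list to mirror the one-row-per-query shape of D and I.
    return [[d for d, _ in top]], [[i for _, i in top]]


stored = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
D, I = brute_force_search([1.0, 1.0], stored, k=2)
print(D)  # [[1.0, 2.0]]
print(I)  # [[1, 0]]
```

The positions in I refer to the order in which vectors were added to the index, so they need to be mapped back to document IDs (for instance through the SQLite metadata) before results are shown to a user.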
- Navigate to the backend directory:
cd backend
- Start the backend server:
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
OR
- Run with the make file:
make run-backend
- Navigate to the frontend directory:
cd frontend
- Build the frontend:
npm run build
- Preview the frontend:
npm run preview
- In the root directory, run:
make start