- Overview
- Stack
- Usage
- Docker Container Creation
- REST API
- Different Modes
- Streamlit App
- Video Demonstration
- Possible Improvements
This is a Retrieval Augmented Generation (RAG) system for querying information from documents. The focus is on contractual information, but the documents can generally be of any type.
It uses FAISS (Facebook AI Similarity Search) as its vector store and the Google Gemini API for inference.
The indexed data is chunked for better throughput and hashed so that duplicates are rejected.
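The deduplication can be pictured as a content hash per chunk. Below is a minimal sketch of the idea; the hashing scheme and helper names are illustrative, not necessarily what this project uses.

```python
# Minimal sketch of content-hash deduplication for text chunks.
# The function and variable names here are illustrative only.
import hashlib

def chunk_hash(chunk: str) -> str:
    # Normalize whitespace so trivially different copies hash identically.
    normalized = " ".join(chunk.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen_hashes: set[str] = set()

def accept_chunk(chunk: str) -> bool:
    """Return True only the first time a given chunk's content is seen."""
    h = chunk_hash(chunk)
    if h in seen_hashes:
        return False  # duplicate: skip re-embedding and re-indexing
    seen_hashes.add(h)
    return True
```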
The system is served through a Streamlit frontend and has three modes of use:

- Local: This mode allows you to query documents directly on your local machine without any user authentication or session tracking. Ideal for single-user use or quick testing of the vector store.
- User-based: In this mode, the system tracks individual users, allowing personalized document queries and maintaining user-specific data across sessions.
- Session-based: This mode enables temporary sessions where queries and data are isolated per session, useful for short-lived or anonymous interactions without persistent user accounts.
- PDF Text Extraction: Extracts text from PDFs using PyMuPDF.
- Semantic Search: Queries contracts using natural language with FAISS-based vector search.
- Multilingual Support: Handles English and Croatian queries with automatic language detection.
- Flexible Storage Modes: Supports Global, User, and Session-based data storage.
- Web Interface: Streamlit app for easy querying and data status visualization.
- API Services: FastAPI endpoints for uploading PDFs and querying documents.
- Robust Error Handling: Comprehensive error management and logging.
- Session Management: Automatic cleanup of temporary session data.
- Professional Responses: Formal, structured answers with context for contract-related queries.
- Modular design: A distributed microservices design to facilitate MLOps and deployment.
- Deduplication of embeddings: Embeddings are hashed and are not recomputed if already present.
- Caching of embeddings: Embeddings are cached to speed up retrieval and improve the user experience; a sketch of the cache pattern follows this list.
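A minimal sketch of the Redis-backed cache pattern, assuming the redis-py client and a local Redis instance; the key scheme and the embed_fn callable are illustrative, not the project's actual code.

```python
# Minimal sketch of a Redis-backed embedding cache keyed by content hash.
# Assumes redis-py and a local Redis instance; key names and the embed_fn
# callable are illustrative assumptions.
import hashlib
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def cached_embedding(text: str, embed_fn) -> list[float]:
    key = "emb:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)   # cache hit: skip the embedding API call
    vector = embed_fn(text)      # cache miss: compute once, then store
    r.set(key, json.dumps(vector))
    return vector
```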
- Python: Core programming language.
- Streamlit: Web interface framework.
- FastAPI: Asynchronous API framework.
- FAISS: Vector storage and search.
- PyMuPDF: PDF text extraction.
- LangChain: Text chunking.
- Google Generative AI: Text embedding and content generation.
- Lingua: Language detection.
- NLTK: Stopword processing. (In progress; automatic labeling of metadata during chunking is needed first.)
- Uvicorn: ASGI server for FastAPI.
- JSON/FAISS: Data storage formats.
- Bash: Deployment script.
- Redis: Caching of embeddings.
Important!
This project was designed for a Linux-like environment; you may use either Linux or WSL (Windows Subsystem for Linux) on Windows.
Link for WSL tutorial
You will also need Docker.
Docker download link
- cd into your desired folder and download the project
git clone https://github.com/MortalWombat-repo/Document_QA_with_FAISS.git
- cd into the folder
cd Document_QA_with_FAISS
This project uses the Google Gemini API for inference. To use this project, you must supply your own API key.
You can create your own API key at this link.
Add it to the .env file:
nano .env
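A minimal sketch of reading the key at runtime, assuming python-dotenv and a variable named GOOGLE_API_KEY (check the project's .env for the exact name it uses):

```python
# Sketch of loading the API key from .env, assuming python-dotenv and a
# variable named GOOGLE_API_KEY -- both are assumptions, not project specifics.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file")
```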
- To build a Docker container run:
docker build -t my-app .
- To run a container with all of the exposed ports run:
docker run -it -p 8000:8000 -p 8001:8001 -p 8002:8002 my-app
Do NOT omit the -p flags, and do not change the ports unless you also change them in the .py files that Uvicorn serves.
OR
If you wish to run a Redis-cached service to speed up queries, you will need a multi-container setup.
- For that, run Docker Compose with this command:
docker compose up --build
The REST API is implemented with FastAPI and served by Uvicorn, a production-ready ASGI server.
For session- and user-based modes, use the functionality in upload.py (which builds on core.py) by sending a POST request with the file to be vectorized to a listening server, either with a direct curl request from the terminal or with one of the curl-like scripts in the testing folder.
After vectorization, you can query either via a POST request or from the Streamlit frontend.
- cd into testing
cd testing
- run either test_upload_manual.py or test_upload_session.py
python test_upload_manual.py
python test_upload_session.py
Alternatively, run it through curl. For user/manual mode:
curl -X POST http://localhost:8001/upload \
-F "files=@YOUR_FILES_ABSOLUTE_PATH" \
-F "user_id=123" \
-F "session=false"
Or, for session-based mode:
curl -X POST http://localhost:8001/upload \
-F "files=@YOUR_FILES_ABSOLUTE_PATH \
-F "session=true"
- run a query with test_ask.py
python test_ask.py
or run from curl
curl -X POST http://localhost:8002/ask \
-H "Content-Type: application/json" \
-d '{
"user_query": "The invoices must contain which information?",
"user_language": "en",
"top_k": 1,
"user_id": "123",
"session_id": "manual"
}'
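The same query can also be issued from Python; the payload below matches the curl example above.

```python
# The /ask query issued with the requests library; the JSON payload matches
# the curl example above.
import requests

payload = {
    "user_query": "The invoices must contain which information?",
    "user_language": "en",
    "top_k": 1,
    "user_id": "123",
    "session_id": "manual",
}
response = requests.post("http://localhost:8002/ask", json=payload)
print(response.json())
```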
Next steps vary; you will have different options based on the mode you opt for.
- Local: The use case is as follows: first add your .pdf files to the files folder, then run python create_vector_store.py to populate the vector_store folder. After that, either run streamlit run app.py and choose the Local option, or build a Docker container and use it that way. A simplified sketch of this indexing flow follows the list.
- User-based: For this mode you will need to build your own Docker container and use the custom REST API to populate the vector store. Populate the database as demonstrated in the REST API section and copy the user ID; you will need it for authentication in Streamlit.
- Session-based: This mode follows the same logic as the user-based option. You need to record your session ID for authentication in Streamlit.
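As referenced above, here is a simplified sketch of the local indexing flow using the stack listed earlier (PDF extraction with PyMuPDF, chunking with LangChain, embedding with Google Generative AI, indexing with FAISS). It is not the project's create_vector_store.py; the model name, paths, and chunk sizes are assumptions.

```python
# Simplified sketch of the local indexing flow (PDF -> chunks -> embeddings
# -> FAISS). Not the project's create_vector_store.py; model name, paths,
# and chunk sizes are assumptions.
import os

import faiss                      # vector index
import fitz                       # PyMuPDF: PDF text extraction
import numpy as np
import google.generativeai as genai
# Import path may vary by LangChain version (langchain_text_splitters in newer releases).
from langchain.text_splitter import RecursiveCharacterTextSplitter

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

vectors = []
for name in os.listdir("files"):
    if not name.endswith(".pdf"):
        continue
    doc = fitz.open(os.path.join("files", name))
    text = "".join(page.get_text() for page in doc)
    for chunk in splitter.split_text(text):
        emb = genai.embed_content(model="models/embedding-001", content=chunk)
        vectors.append(emb["embedding"])

assert vectors, "no PDF text found in the files folder"
index = faiss.IndexFlatL2(len(vectors[0]))
index.add(np.asarray(vectors, dtype="float32"))
faiss.write_index(index, "vector_store/index.faiss")
```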
You have a fully functional Streamlit frontend whether run in Docker or outside of it, as explained earlier. To access Streamlit on your localhost, either look for a link in the terminal output when first running your Docker container or enter the following into your browser:
http://localhost:8000
Unfortunately, although it is an active issue being worked on, Streamlit still has no HTTPS support. One might try running nginx as a reverse proxy to remedy that, but that is complex and not recommended for this setup.
After cloning the repository, cd-ing into the project, and generating your own API key as described, build the container.
You can either run create_vector_store.py or populate the vector store yourself using POST requests, either from the prepared scripts in the testing folder or manually through curl, as mentioned.
We will demonstrate both methods.
Method 1: running python create_vector_store.py.
Method 2: uploading via a POST request and inspecting the created files in Docker. A running container is needed for this process.
As mentioned, there are three modes you can choose from.
Global/Local mode takes data from the vector_store folder that you populate with the create_vector_store.py script.
For User- and Session-based modes, you will need to authorize in Streamlit by copying your user_id or session_id.
The video is too large to embed.
- adding metadata when chunking and labeling each chunk
- using said metadata to represent chunks as a whole document, enabling easier keyword visualization and whole-document operations such as summarizing and paraphrasing
- a labeler of key terms and ranking of documents by importance (possibly with the Hugging Face Transformers library for sentiment analysis)