This project implements a Retrieval-Augmented Generation (RAG) chatbot that lets users upload PDF documents, ask questions about their content, and receive answers grounded in the uploaded text. It uses Cohere for embeddings and text generation, Pinecone for vector storage and retrieval, and Streamlit for the user interface.
- PDF Processing: Extracts text from uploaded PDF documents and splits it into manageable chunks for embedding and storage.
- Embedding and Retrieval: Uses Cohere's embeddings for encoding document chunks and Pinecone for scalable vector similarity search.
- Question Answering: Leverages Cohere's language models to generate accurate responses by retrieving and analyzing relevant document chunks.
- Interactive Interface: Provides a simple and intuitive interface using Streamlit for uploading documents, entering queries, and viewing results.
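To make the chunking step in the PDF Processing feature more concrete, here is a minimal sketch of splitting extracted text into overlapping character windows. The function name `split_into_chunks` and the `chunk_size` and `overlap` values are assumptions made for illustration; the project's own splitter in vectorstore.py may work differently (for example, by sentence or by token).

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character windows of at most chunk_size,
    where each window shares `overlap` characters with the previous one.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Small example with tiny sizes so the overlap is easy to see.
example = split_into_chunks("abcdefghij", chunk_size=4, overlap=1)
print(example)  # ['abcd', 'defg', 'ghij']
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, which tends to help retrieval at the cost of storing some text twice.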
Follow the steps below to run and interact with the project:
Clone the repository to your local system using the following command:
git clone https://github.com/VivekChauhan05/RAG_Document_Question_Answering.git
cd RAG_Document_Question_Answering
Create a virtual environment and activate it to isolate project dependencies:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Install the required Python libraries using the provided requirements.txt file:
pip install -r requirements.txt
Get your API keys for:
- Cohere: Sign up at Cohere to obtain an API key.
- Pinecone: Sign up at Pinecone to obtain an API key.
These keys will be entered via the Streamlit interface when running the app.
Launch the Streamlit application:
cd src
streamlit run app.py
Once the application is running, open your browser and navigate to the URL provided by Streamlit, typically http://localhost:8501.
Use the interface to upload a PDF file containing the content you want to query.
Enter your question in the query box. The chatbot will:
- Retrieve relevant chunks of text from the uploaded document.
- Generate a precise and context-aware response.
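The retrieve-then-generate flow described above can be sketched without any external services. In the sketch below, a plain in-memory list stands in for the Pinecone index and hand-written two-dimensional vectors stand in for Cohere embeddings. The names `cosine_similarity`, `top_k_chunks`, and `build_prompt`, as well as the prompt wording, are assumptions for illustration and are not taken from the project's code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec: list[float],
                 indexed: list[tuple[str, list[float]]],
                 k: int = 2) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to the query.

    `indexed` is a toy stand-in for a vector database: (text, vector) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Toy index: in the real app these vectors would come from an embedding model.
indexed = [
    ("Cats are mammals.", [1.0, 0.0]),
    ("Stocks fell on Monday.", [0.0, 1.0]),
    ("Kittens are young cats.", [0.9, 0.1]),
]
retrieved = top_k_chunks([1.0, 0.0], indexed, k=2)
print(build_prompt("What are kittens?", retrieved))
```

In the application itself, the resulting prompt would be passed to a language model for the final answer; that call is omitted here because it needs an API key.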
├── app.py # Main application file with Streamlit interface
├── vectorstore.py # Handles PDF processing, embedding, and retrieval
├── chatbot.py # Handles user interaction and response generation
├── requirements.txt # Project dependencies
└── README.md # Project documentation
- Add support for multi-language documents.
- Enhance the UI with multi-document support and export options for chat history.
- Enable deployment to cloud platforms for wider accessibility.
- Integrate additional vector databases for broader compatibility.
🚀 We warmly welcome contributions to enhance this project! Whether it's fixing bugs, adding new features, or improving documentation, your efforts will help make this project better for everyone. Let's collaborate and build something amazing together! 🌟✨
This project is licensed under the Apache License. See the LICENSE file for more details.
- Cohere AI for their powerful embedding and language models. 🧠✨
- Pinecone for scalable vector search infrastructure. 🔍⚡
- Streamlit for making it easy to build interactive data apps. 📊🎉