A lightweight Retrieval-Augmented Generation (RAG) chatbot for processing and querying Word documents, Excel files, and PDFs, built with Python 3.9, the OpenAI API, LangChain, and Chroma.
- Pronunciation: "DEH-byew-see"
- Alternative: "The Bus"
- Multiple Document Formats: Processes PDF, Word, and Excel files
- Persistent Document Bases: Save and manage multiple document collections
- Conversation History: Supports contextual follow-up questions
- Citation Support: Answers include citations to source documents
- Streamlit UI: Clean interface for document upload and chatting
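The retrieval and citation features can be pictured with a toy example. The sketch below is not this project's code: it swaps OpenAI embeddings and Chroma for a bag-of-words vector and an in-memory list, and the names `embed`, `cosine`, and `retrieve` are hypothetical. It only illustrates the general idea of ranking chunks by similarity and keeping each chunk's source so an answer can cite it.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, k=2):
    """Return up to k chunks most similar to the question, each with its source."""
    q = embed(question)
    scored = [(cosine(q, embed(c["text"])), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop chunks that share no words with the question.
    return [c for score, c in scored[:k] if score > 0]

# Hypothetical chunks; in a real pipeline these come from the uploaded files.
chunks = [
    {"source": "report.pdf", "text": "Revenue grew 12 percent in the third quarter."},
    {"source": "budget.xlsx", "text": "Travel budget for the marketing team."},
    {"source": "notes.docx", "text": "Meeting notes about office relocation."},
]

hits = retrieve("How much did revenue grow in the third quarter?", chunks, k=1)
for hit in hits:
    print(f"{hit['text']} [source: {hit['source']}]")
```

In a real RAG pipeline the retrieved chunks and their sources would be passed to the language model, which is then asked to cite them in its answer.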
- Clone this repository:
  git clone https://github.com/yourusername/dbuse.git
  cd dbuse
- Create and activate a conda environment:
  conda env create -f environment.yml
  conda activate rag_chatbot
- Set up your OpenAI API key:
  export OPENAI_API_KEY="your-openai-api-key"
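If you start the chatbot from your own script, it may help to check for the key up front instead of failing on the first API call. A minimal sketch, assuming the key is read from the `OPENAI_API_KEY` environment variable as in the step above; the helper name `get_openai_api_key` is hypothetical and not part of this repository.

```python
import os

def get_openai_api_key(env=None):
    """Return the OpenAI API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before starting the app."
        )
    return key

# Demonstration with an explicit mapping so the example runs without a real key.
print(get_openai_api_key({"OPENAI_API_KEY": "your-openai-api-key"}))
```

Called with no arguments, the helper reads `os.environ`, and its result could be passed to the constructor as `openai_api_key=get_openai_api_key()`.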
Start the app with the launcher:
python run_app.py
Or run Streamlit directly:
streamlit run app.py
To use the chatbot from your own Python code:
from rag_chatbot import RAGChatbot
# Initialize the chatbot
chatbot = RAGChatbot(openai_api_key="your-api-key")
# Load documents
chatbot.load_documents(file_paths=["document.pdf", "spreadsheet.xlsx"])
# Ask questions
answer = chatbot.ask("What information is in these documents?")
print(answer)
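Follow-up questions can be sent one after another on the same chatbot object. The sketch below relies only on the `ask` method shown above and assumes that conversation history is kept on the instance between calls, which this README does not spell out. `ask_all` is a hypothetical helper, and `EchoBot` is a stand-in so the example runs without an API key.

```python
def ask_all(chatbot, questions):
    """Ask each question in order on one chatbot and collect (question, answer) pairs."""
    return [(q, chatbot.ask(q)) for q in questions]

class EchoBot:
    """Stand-in exposing ask(); it records each question it receives."""

    def __init__(self):
        self.history = []

    def ask(self, question):
        self.history.append(question)
        return f"answer {len(self.history)} to: {question}"

transcript = ask_all(
    EchoBot(),
    ["What is the total budget?", "How does that compare to last year?"],
)
for question, answer in transcript:
    print(f"Q: {question}\nA: {answer}")
```

With the real class, `EchoBot()` would be replaced by a `RAGChatbot` instance that has already loaded its documents.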
dbuse/
├── utils/
│ ├── document_processor.py # Document text extraction
│ ├── vector_store.py # ChromaDB embeddings manager
│ ├── document_base_manager.py # Persistent document bases
│ └── prompt_loader.py # YAML prompt templates
├── prompts/
│ ├── query_rewriter.yaml # Rewrite contextual questions
│ └── qa_system.yaml # Q&A with citations
├── diagrams/ # Mermaid diagrams
├── test/ # Test scripts
├── app.py # Streamlit UI
├── rag_chatbot.py # Core RAG implementation
├── run_app.py # App launcher
├── demo-notebook.ipynb # Demo notebook
└── environment.yml # Dependencies
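The layout suggests that `utils/document_processor.py` extracts text per file type. As an illustration only, one way to route files by extension before extraction is sketched below. The extension mapping and the function name `group_by_type` are assumptions, not the repository's actual code.

```python
from pathlib import Path

# Assumed mapping from file extension to document kind.
SUPPORTED = {".pdf": "pdf", ".docx": "word", ".xlsx": "excel"}

def group_by_type(file_paths):
    """Split paths into supported files grouped by kind, plus a list of rejects."""
    grouped, rejected = {}, []
    for raw in file_paths:
        # Compare extensions case-insensitively so "FILE.PDF" is accepted.
        kind = SUPPORTED.get(Path(raw).suffix.lower())
        if kind is None:
            rejected.append(raw)
        else:
            grouped.setdefault(kind, []).append(raw)
    return grouped, rejected

grouped, rejected = group_by_type(["document.pdf", "spreadsheet.XLSX", "notes.txt"])
print(grouped)
print(rejected)
```

Reporting rejected files separately lets a caller warn about unsupported uploads without aborting the whole batch.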
Potential enhancements for future versions:
- Support for more document formats (e.g., HTML, Markdown, CSV, TXT)
- Support for OCR and captioning of standalone images and images embedded in documents
- Support for Jupyter notebook (.ipynb) tutorials
- Integration with additional language models beyond OpenAI, such as Claude
  - Claude DBUSE has a nice ring to it
- Improved handling of tables and structured data from Excel files
This project is licensed under the MIT License - see the LICENSE file for details.
This project builds on the following libraries and services:
- LangChain for document processing and the RAG pipeline
- Chroma for vector database functionality
- Streamlit for the user interface
- OpenAI for language models and embeddings
For questions or feedback, please open an issue on the GitHub repository or contact the project maintainer.