Document Base Unified Search and Extraction (DBUSE)

A lightweight Retrieval-Augmented Generation (RAG) chatbot for processing and querying Word documents, Excel files, and PDFs, built with Python 3.9, the OpenAI API, LangChain, and Chroma.

Pronunciation: "DEH-byew-see"
Alternative: "The Bus"

Features

Multiple Document Formats: Processes PDF, Word, and Excel files
Persistent Document Bases: Save and manage multiple document collections
Conversation History: Supports contextual follow-up questions
Citation Support: Answers include citations to source documents
Streamlit UI: Clean interface for document upload and chatting
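
To make the retrieval and citation features above concrete, here is a minimal, self-contained sketch of the general RAG flow: score stored chunks against a question, keep the best matches, and build a prompt that asks the model to cite them. It is an illustration only, not DBUSE's implementation. The project uses OpenAI embeddings and Chroma, whereas this sketch swaps in a toy bag-of-words similarity so it runs without any API key. The names `tokenize`, `cosine`, `retrieve`, and `build_prompt`, and the sample chunks, are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return up to k (source, text) chunks that share vocabulary with the question."""
    query = Counter(tokenize(question))
    scored = [
        (cosine(query, Counter(tokenize(text))), source, text)
        for source, text in chunks
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(source, text) for score, source, text in scored[:k] if score > 0]


def build_prompt(question, hits):
    """Number each retrieved chunk so the model can cite it as [n]."""
    context = "\n".join(
        f"[{i}] ({source}) {text}" for i, (source, text) in enumerate(hits, 1)
    )
    return (
        "Answer using only the context and cite sources as [n].\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Hypothetical chunks standing in for text extracted from uploaded files.
chunks = [
    ("report.pdf", "Quarterly revenue grew because of new enterprise contracts."),
    ("budget.xlsx", "Travel budget for the sales team is listed per region."),
    ("notes.docx", "The offsite is scheduled for early spring."),
]

question = "What drove revenue growth?"
hits = retrieve(question, chunks, k=1)
print(build_prompt(question, hits))
```

In a real pipeline the similarity step would be replaced by a vector-store query, but the shape stays the same: retrieved chunks carry their source name so the answer can point back to it.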

Installation

Clone this repository:

git clone https://github.com/keenlychuang/dbuse-proto.git
cd dbuse-proto

Create and activate a conda environment:

conda env create -f environment.yml
conda activate rag_chatbot

Set up your OpenAI API key:

export OPENAI_API_KEY="your-openai-api-key"
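
A missing key usually surfaces later as an opaque authentication error, so it can help to check for it up front. The helper below is a small sketch of such a check; `require_api_key` is a hypothetical name and is not part of DBUSE.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise with a readable message.

    `env` defaults to the process environment; passing a plain dict makes
    the function easy to exercise without touching real settings.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the app.")
    return key
```

Calling `require_api_key()` at startup either returns the exported key or stops with a message that names the missing variable.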

Usage

Streamlit App

python run_app.py

Or directly:

streamlit run app.py

Python API

import os

from rag_chatbot import RAGChatbot

# Initialize the chatbot with the key exported during installation
chatbot = RAGChatbot(openai_api_key=os.environ["OPENAI_API_KEY"])

# Load documents
chatbot.load_documents(file_paths=["document.pdf", "spreadsheet.xlsx"])

# Ask questions
answer = chatbot.ask("What information is in these documents?")
print(answer)

Project Structure

dbuse/
├── utils/
│   ├── document_processor.py  # Document text extraction
│   ├── vector_store.py        # ChromaDB embeddings manager
│   ├── document_base_manager.py # Persistent document bases
│   └── prompt_loader.py       # YAML prompt templates
├── prompts/
│   ├── query_rewriter.yaml    # Rewrite contextual questions
│   └── qa_system.yaml         # Q&A with citations
├── diagrams/                  # Mermaid diagrams
├── test/                      # Test scripts
├── app.py                     # Streamlit UI
├── rag_chatbot.py             # Core RAG implementation
├── run_app.py                 # App launcher
├── demo-notebook.ipynb        # Demo notebook
└── environment.yml            # Dependencies
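
The tree lists utils/document_processor.py as the place where text is extracted from each supported format. As a rough illustration of how such a module might decide which extractor to use, the sketch below routes files by extension before any parsing happens. It is an assumption about one possible design, not the project's code: the `SUPPORTED` mapping, the `.docx` extension, and the `classify` function are hypothetical.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a document kind.
SUPPORTED = {".pdf": "pdf", ".docx": "word", ".xlsx": "excel"}


def classify(file_paths):
    """Split paths into (path, kind) pairs to process and paths to skip."""
    supported, skipped = [], []
    for path in file_paths:
        kind = SUPPORTED.get(Path(path).suffix.lower())
        if kind is None:
            skipped.append(path)
        else:
            supported.append((path, kind))
    return supported, skipped


to_process, to_skip = classify(["document.pdf", "spreadsheet.XLSX", "notes.txt"])
print(to_process)
print(to_skip)
```

Reporting skipped files separately, rather than failing on the first unknown extension, lets a batch upload proceed while still telling the user which files were ignored.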

Future Improvements

Potential enhancements for future versions:

Support for more document formats (e.g., HTML, Markdown, CSV, TXT)
Support for OCR and captioning, both for standalone images and for images embedded in documents
Support for ipynb tutorials
Integration with additional language models beyond OpenAI, like Claude
- Claude DBUSE has a nice ring to it
Improved handling of tables and structured data from Excel files

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This project builds on the following libraries and services:

LangChain for document processing and RAG pipeline
Chroma for vector database functionality
Streamlit for the user interface
OpenAI for language models and embeddings

Contact

For questions or feedback, please open an issue on the GitHub repository or contact the project maintainer.

Repository: keenlychuang/dbuse-proto
