🧠 DocuMind - AI-Powered Knowledge Base Assistant


🚀 Overview

DocuMind is a privacy-focused, self-hosted AI assistant that helps you extract insights from your PDF documents using local LLMs through Ollama. Ask questions about your documents in natural language and receive accurate answers with source citations - all without sending your data to external services.

📄 Upload Documents → 🔍 Ask Questions → 🤖 Get AI Answers

✨ Key Features

  • 🔒 Privacy First: All processing happens locally, with no external API calls
  • 📄 Multi-format PDF Processing: Robust text extraction with OCR support
  • 🔍 Hybrid Retrieval System: Combines semantic and keyword search for accuracy (sketched below)
  • 🤖 Local LLM Integration: Uses Ollama (Llama 3.2 3B) for responses
  • 💬 Conversation Memory: Maintains context across multiple questions
  • 📊 Source Attribution: Shows which documents informed each answer
  • 🔄 Automatic Document Loading: Auto-loads PDFs from the documents directory
  • 🌐 Dual Interfaces: Both a Streamlit UI and an HTML/CSS/JS web interface
  • ⚡ Docker Ready: Simple setup with Docker and GPU acceleration support
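
To make the hybrid retrieval idea concrete, here is a minimal, hypothetical sketch that blends embedding similarity with BM25 keyword scores. The rank_bm25 package is an assumption for illustration only; the project's actual logic lives in src/retriever.py and may differ.

# Hypothetical hybrid-retrieval sketch: blend semantic similarity with
# BM25 keyword scores. The real implementation is in src/retriever.py.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

chunks = [
    "DocuMind processes PDF documents locally.",
    "Ollama serves the Llama 3.2 3B model for answers.",
    "ChromaDB stores the document embeddings.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)
bm25 = BM25Okapi([c.lower().split() for c in chunks])

def hybrid_search(query, alpha=0.5):
    """Rank chunks by a weighted mix of semantic and BM25 keyword scores.
    Scores are not normalized in this sketch; a real system would rescale
    both signals before mixing."""
    semantic = util.cos_sim(model.encode(query, convert_to_tensor=True),
                            chunk_embeddings)[0]
    keyword = bm25.get_scores(query.lower().split())
    mixed = [(alpha * float(s) + (1 - alpha) * float(k), c)
             for s, k, c in zip(semantic, keyword, chunks)]
    return sorted(mixed, reverse=True)

print(hybrid_search("Which model generates the answers?"))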

🚀 Quick Start with Docker (Recommended)

The easiest way to get started is using the included Docker helper script:

# Make the script executable (if needed)
chmod +x run_docker.sh

# Run the script and follow the menu options
./run_docker.sh

Select option 1 from the menu to start DocuMind, then open the web interface in your browser.

💻 Manual Setup (Alternative)

If you prefer to run without Docker:

  1. Install Dependencies

    pip install -r requirements.txt
  2. Install Ollama

    # Follow Ollama installation instructions from https://ollama.ai/
    # Run the Ollama service
    ollama serve
  3. Install OCR Dependencies (Optional)

    pip install pytesseract pdf2image pillow
    brew install tesseract poppler  # For macOS
    # See documentation/OCR_SETUP.md for other OS instructions
  4. Add Documents

    • Place PDF files in the data/documents directory
  5. Run the Application

    # Start the Web Interface
    python api.py
    
    # OR start the Streamlit Interface
    streamlit run app.py

🔧 System Architecture

DocuMind/
├── app.py                     # Main Streamlit application
├── api.py                     # Alternative web interface (HTML/CSS/JavaScript)
├── docker-entrypoint.sh       # Docker container startup script
├── docker-compose.yml         # Container orchestration configuration
├── docker-compose.gpu.yml     # GPU support configuration
├── Dockerfile                 # Container definition
├── run_docker.sh              # Docker helper script
├── src/
│   ├── document_processor.py  # PDF processing and extraction with OCR
│   ├── chunking.py            # Semantic text chunking
│   ├── retriever.py           # Hybrid retrieval system
│   ├── llm_handler.py         # LLM integration and prompts
│   ├── evaluator.py           # Evaluation framework
│   ├── preload_models.py      # Model preloading script
│   └── utils.py               # Utility functions
├── data/
│   ├── documents/             # PDF documents for auto-loading
│   ├── vectorstore/           # Chroma vector database
│   ├── models_cache/          # Hugging Face model cache
│   └── chroma_cache/          # ChromaDB ONNX model cache
├── config/
│   └── settings.py            # Configuration settings
├── documentation/             # Detailed documentation files
├── tests/                     # Testing and diagnostic tools
└── web/                       # Web UI assets (HTML/CSS/JS)

📊 Performance Optimization

Embedding Model Caching

DocuMind pre-downloads and caches embedding models to improve startup and query time:

  • Models are stored in ./data/models_cache/
  • ONNX optimized versions are kept in ./data/chroma_cache/onnx_models/
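
If you need to warm the cache manually, a minimal sketch looks like this; the project ships its own preloader in src/preload_models.py.

# Minimal sketch of warming the embedding-model cache; the project's
# actual script is src/preload_models.py.
from sentence_transformers import SentenceTransformer

# cache_folder stores the weights under the project instead of the default
# Hugging Face cache, so later startups skip the download entirely.
SentenceTransformer("all-MiniLM-L6-v2", cache_folder="./data/models_cache")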

LLM Selection

Choose the right LLM based on your hardware:

  • High-end systems: Use the default llama3.2:3b or a larger model
  • Low-resource systems: Switch to phi3:mini for faster responses (option 5 in the run_docker.sh menu)
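
Outside the Docker menu, one way to fetch and smoke-test an alternative model is through the official ollama Python client; this assumes pip install ollama and a running Ollama service, and the run_docker.sh menu handles the same switch for Docker setups.

# Sketch: pull and smoke-test a smaller model via the ollama Python client.
# Assumes `pip install ollama` and a running Ollama service.
import ollama

ollama.pull("phi3:mini")  # one-time download of the lighter model
reply = ollama.generate(model="phi3:mini", prompt="Reply with one word: ready?")
print(reply["response"])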

πŸ” Advanced Features

Auto-Loading Documents

  • Documents placed in the data/documents directory are automatically loaded when the app starts
  • Configure auto-loading behavior in config/settings.py:
    AUTO_LOAD_DOCUMENTS = True  # Enable/disable auto-loading
    AUTO_LOAD_SKIP_EXISTING = True  # Skip already processed documents
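
Conceptually, the startup pass behaves like the sketch below. The helper names are hypothetical stand-ins, not the project's actual functions.

# Hypothetical sketch of the auto-loading pass; the real settings live in
# config/settings.py and the helpers below are illustrative stand-ins.
from pathlib import Path

AUTO_LOAD_DOCUMENTS = True       # mirrors config/settings.py
AUTO_LOAD_SKIP_EXISTING = True   # mirrors config/settings.py

def is_processed(pdf):
    """Stand-in: the real app would check the vector store for this file."""
    return False

def process_document(pdf):
    """Stand-in: the real app extracts, chunks, and embeds the PDF."""
    print(f"Ingesting {pdf.name}")

def auto_load(docs_dir="data/documents"):
    """Scan the documents directory and ingest any PDFs found there."""
    if not AUTO_LOAD_DOCUMENTS:
        return
    for pdf in sorted(Path(docs_dir).glob("*.pdf")):
        if AUTO_LOAD_SKIP_EXISTING and is_processed(pdf):
            continue
        process_document(pdf)

auto_load()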

OCR Support for Problematic PDFs

  • OCR (Optical Character Recognition) processing for difficult PDFs
  • Automatically detects when a PDF needs OCR and applies it
  • Perfect for PDFs saved from websites that have selectable text but don't parse correctly
  • See OCR Setup Guide for detailed setup instructions
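
The OCR path boils down to rasterizing pages and running Tesseract over the images, roughly as in this sketch with pdf2image and pytesseract; the project's automatic detection in src/document_processor.py is more involved.

# Illustrative OCR fallback: rasterize each page, then OCR the images.
# Requires pytesseract, pdf2image, and the tesseract/poppler binaries.
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(path):
    pages = convert_from_path(path, dpi=300)  # render each page as a PIL image
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

# The path below is a placeholder; point it at any stubborn PDF.
print(ocr_pdf("data/documents/example.pdf")[:500])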

PDF Diagnostic Tool

  • Use tests/check_pdf.py to diagnose problematic PDFs:
    python tests/check_pdf.py path/to/document.pdf
  • Identifies which extraction method works best for each document
  • Determines if OCR processing is recommended
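
As a rough picture of what such a diagnostic does, the sketch below runs each extraction backend on one PDF and compares how much text comes out. It is illustrative only; the real tool is tests/check_pdf.py and may work differently.

# Illustrative comparison of extraction backends on one PDF;
# the actual diagnostic is tests/check_pdf.py.
import sys

import fitz  # PyMuPDF
import pdfplumber
from PyPDF2 import PdfReader

def extracted_chars(path):
    """Count how many characters each backend can pull out of the PDF."""
    counts = {}
    counts["PyPDF2"] = sum(len(p.extract_text() or "")
                           for p in PdfReader(path).pages)
    with fitz.open(path) as doc:
        counts["PyMuPDF"] = sum(len(page.get_text()) for page in doc)
    with pdfplumber.open(path) as pdf:
        counts["pdfplumber"] = sum(len(p.extract_text() or "")
                                   for p in pdf.pages)
    return counts

counts = extracted_chars(sys.argv[1])
print(counts)
if max(counts.values()) < 100:  # rough heuristic, not the tool's actual rule
    print("Very little text found; OCR is probably needed.")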

🛠️ Troubleshooting Common Issues

1. Timeout Error During Query Processing

Symptom: Requests time out with the error "Error generating response: Read timed out."

Solution:

  • Switch to a smaller LLM through option 5 in the run_docker.sh script
  • Restart the containers to apply changes

2. Document Loading Issues

Symptom: Documents fail to load or extract properly

Solution:

  • Check the format of your PDF
  • Run diagnostic tool: python tests/check_pdf.py path/to/document.pdf
  • Enable OCR for problematic documents

3. Ollama Connection Issues

Symptom: Error connecting to Ollama service

Solution:

  • For Docker: Ensure the Ollama container is running (docker ps)
  • For manual setup: Make sure Ollama is running (ollama serve)
  • See Environment Setup Guide for details on connection configuration
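
A quick connectivity probe from Python, assuming Ollama's default port 11434; inside Docker, replace localhost with the Ollama service name.

# Quick connectivity check against Ollama's HTTP API (default port 11434).
import requests

try:
    r = requests.get("http://localhost:11434/api/tags", timeout=5)
    r.raise_for_status()
    models = [m["name"] for m in r.json().get("models", [])]
    print("Ollama is up; installed models:", models)
except requests.RequestException as exc:
    print("Cannot reach Ollama:", exc)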

For more troubleshooting tips, see the Full Documentation.

📚 Documentation

Comprehensive documentation is available in the documentation folder.

📋 Technology Stack

  • Document Processing: PyPDF2, PyMuPDF, pdfplumber, Tesseract OCR
  • Embeddings: Sentence-Transformers (all-MiniLM-L6-v2)
  • Vector Database: ChromaDB
  • LLM: Ollama (Llama 3.2 3B)
  • Frontend: Streamlit, HTML/CSS/JavaScript
  • Backend: Python FastAPI
  • Containers: Docker, Docker Compose

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Developed with ❤️ by Fakhrul Fauzi.
