OpenMined/local-rag

🔍 RAG Document Search

A lightweight, local RAG (Retrieval-Augmented Generation) system for indexing and searching your documents. Built with FastAPI, Docling, and ChromaDB for high-performance semantic search across your local files.

✨ Features

  • 🚀 Lightweight & Fast: Optimized for performance with millions of document chunks
  • 🌐 Beautiful Web Interface: Modern, responsive UI for easy document management and search
  • 📁 Auto File Watching: Automatically indexes new/modified files in watched folders
  • 🔍 Semantic Search: Uses advanced embeddings for intelligent document retrieval
  • 📊 Real-time Stats: Monitor your document index and search performance
  • 🗂️ File Browser: Dropbox-like interface for browsing and selecting files/folders
  • ⚡ Smart Indexing: Avoids re-indexing unchanged files using content hashing
  • 📈 Progress Tracking: Real-time indexing progress with detailed status updates
  • 💾 Persistent Configuration: Automatically saves and restores watched folders
  • 🔧 Configurable: Easy configuration via environment variables
  • 🔐 OAuth Support: Integration with Microsoft OneDrive/SharePoint (via .tokens.json)

🏗️ System Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Frontend      │    │   Backend API    │    │  Vector Store   │
│  (HTML/JS)      │◄──►│   (FastAPI)      │◄──►│   (Chroma)      │
│                 │    │                  │    │                 │
│ • File selector │    │ • File watcher   │    │ • Embeddings    │
│ • Search UI     │    │ • Doc processing │    │ • Metadata      │
│ • Results view  │    │ • Embedding gen  │    │ • Fast search   │
│ • Progress view │    │ • Hash checking  │    │ • Deduplication │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                       ┌──────────────────┐
                       │  File System     │
                       │   Watcher        │
                       │  (watchdog)      │
                       └──────────────────┘

📋 Supported File Formats

  • PDF documents
  • Microsoft Word (.docx)
  • Text files (.txt, .md)
  • HTML files
  • PowerPoint (.pptx)
  • Excel (.xlsx)
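
As a rough illustration of how a format check for the list above might look, the sketch below filters paths by extension. `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, and the `.pdf`, `.html`, and `.htm` entries are assumed mappings for the "PDF documents" and "HTML files" items; the project's actual check may differ.

```python
from pathlib import Path

# Hypothetical extension set derived from the supported formats list.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".txt", ".md", ".html", ".htm", ".pptx", ".xlsx",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension is in the supported set (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

A check like this would typically run before a file is handed to the document parser, so unsupported files can be skipped early.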

🚀 Quick Start

Prerequisites

  • Python 3.12+
  • uv package manager

Installation

  1. Clone the repository:

    git clone <your-repo-url>
    cd syftbox-rag
  2. Install dependencies:

    uv pip install -r requirements.txt
  3. Run the application:

    ./run.sh
  4. Open your browser and go to:

    http://localhost:9000
    

🎯 Usage

Adding Documents

  1. Go to the "Manage Files" tab
  2. Use the file browser to navigate to your desired folder
  3. Select folders or individual files using checkboxes
  4. Click "Add Selected" to start indexing
  5. Monitor progress in real-time with the indexing status indicator

File Browser Features

  • 📂 Navigate through your file system like Dropbox
  • ☑️ Select multiple files and folders with checkboxes
  • 📊 View file sizes and modification dates
  • 🔍 Quick folder expansion/collapse
  • 🏠 Home directory quick access

Searching Documents

  1. Go to the "Search Documents" tab
  2. Enter your search query in natural language
  3. Adjust the results limit if needed (default: 20)
  4. Click "Search" or press Enter
  5. View results with similarity scores and metadata

Search Examples

  • "machine learning algorithms"
  • "project timeline and deadlines"
  • "financial reports Q3"
  • "meeting notes from last week"
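
To give a feel for what a "similarity score" means here, the sketch below ranks documents by cosine similarity between a query vector and document vectors. It is a toy under stated assumptions: the vectors are hand-written rather than produced by the embedding model, `cosine_similarity` and `rank` are hypothetical helpers, and the distance metric actually used by the vector store may be different.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: list[float], docs: dict[str, list[float]], limit: int = 20):
    """Return up to `limit` (score, name) pairs, highest score first."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in docs.items()]
    scored.sort(reverse=True)
    return scored[:limit]
```

In the real system the vectors would come from the configured sentence-transformer model, and the nearest-neighbour lookup would be delegated to the vector database rather than computed in a Python loop.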

⚙️ Configuration

You can customize the system behavior using environment variables:

Database Settings

export VECTOR_DB_PATH="./my_vector_db"  # Vector database location

Embedding Model

export EMBEDDING_MODEL="all-MiniLM-L6-v2"  # Sentence transformer model

Document Processing

export CHUNK_SIZE="500"           # Document chunk size (characters)
export CHUNK_OVERLAP="50"         # Overlap between chunks
export MIN_CHUNK_SIZE="50"        # Minimum chunk size to index
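
To show how these three settings interact, here is a minimal sketch of character-based chunking with overlap. `chunk_text` is a hypothetical helper, not the project's document processor, which may split on sentence or layout boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50,
               min_size: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunk = text[start:start + chunk_size]
        if len(chunk) >= min_size:  # drop fragments below the minimum size
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the values shown above, a 1,000-character document yields windows starting at offsets 0, 450, and 900, so each chunk repeats the last 50 characters of the previous one.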

Server Settings

export HOST="0.0.0.0"            # Server host
export PORT="8080"               # Server port

Performance Tuning

export PROCESSING_DELAY="1.0"         # File processing delay (seconds)
export DEFAULT_SEARCH_LIMIT="20"      # Default search results
export MAX_SEARCH_LIMIT="100"         # Maximum search results

🏃‍♂️ Running the System

Method 1: Using the run script (Recommended)

./run.sh

Method 2: Direct FastAPI

uv run uvicorn backend.main:app --host 127.0.0.1 --port 9000

Method 3: Development mode (with auto-reload)

uv run uvicorn backend.main:app --host 127.0.0.1 --port 9000 --reload

Stopping the Application

./cleanup.sh

📁 Project Structure

syftbox-rag/
├── backend/
│   ├── __init__.py
│   ├── main.py              # FastAPI server with API endpoints
│   ├── config.py            # Configuration settings
│   ├── document_processor.py # Docling integration for parsing
│   ├── embeddings.py        # Sentence transformer embeddings
│   ├── file_watcher.py      # File system monitoring & processing
│   └── vector_store.py      # ChromaDB interface
├── frontend/
│   ├── index.html           # Main interface with tabs
│   ├── app.js               # Frontend logic & file browser
│   └── style.css            # Modern responsive styling
├── data/                    # Application data (logs, PID files)
├── vector_db/               # ChromaDB storage (auto-created)
├── .tokens.json             # OAuth tokens (optional)
├── requirements.txt         # Python dependencies
├── run.sh                   # Application launcher script
├── cleanup.sh               # Application cleanup script
└── README.md                # This file

🔧 API Endpoints

The system exposes the following REST endpoints:

  • GET / - Main web interface
  • POST /api/add-folder - Add folder to watch list
  • GET /api/watched-folders - Get watched folders
  • DELETE /api/watched-folders/{path} - Remove watched folder
  • POST /api/search - Search documents
  • GET /api/stats - Get database statistics
  • GET /api/indexing-status - Get real-time indexing progress
  • POST /api/file-structure - Browse file system
  • GET /api/file-structure/home - Get home directory structure
  • POST /api/folder-selection - Batch add/remove files and folders
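
As an example of calling one of these endpoints from a script, the sketch below posts a query to `/api/search` using only the standard library. The request body fields `query` and `limit`, the base URL, and the assumption that the response is JSON are all guesses that this README does not confirm; check `backend/main.py` for the actual schema.

```python
import json
import urllib.request


def build_search_request(query: str, limit: int = 20,
                         base_url: str = "http://localhost:9000") -> urllib.request.Request:
    """Build a POST request for /api/search (field names are assumed, not confirmed)."""
    payload = json.dumps({"query": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, limit: int = 20, base_url: str = "http://localhost:9000"):
    """Send the request and decode the JSON response; requires a running server."""
    request = build_search_request(query, limit, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

With the server running, `search("financial reports Q3", limit=5)` would return whatever structure the backend produces for that query.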

🎯 Advanced Features

Smart Indexing

  • File hashing prevents re-indexing unchanged documents
  • Chunked processing handles large files efficiently
  • Background processing doesn't block the UI
  • Error recovery handles corrupted or inaccessible files
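
The hash check in the first bullet can be sketched as follows. This is a minimal illustration assuming SHA-256 over the file contents and an in-memory dictionary of known hashes; `file_sha256` and `needs_indexing` are hypothetical names, and the project may use a different hash function or store hashes in the vector database metadata.

```python
import hashlib


def file_sha256(path: str, block_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large files are never fully loaded."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_indexing(path: str, known_hashes: dict[str, str]) -> bool:
    """Return True if the file is new or changed, and record its current hash."""
    current = file_sha256(path)
    if known_hashes.get(path) == current:
        return False  # content unchanged since the last index run
    known_hashes[path] = current
    return True
```

Hashing content rather than comparing modification times means a file that is touched or copied without changes is not re-embedded.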

Real-time Progress Tracking

  • Operation queue with detailed progress information
  • File-level progress with chunk counting
  • Size estimation and processing speed metrics
  • Activity logs for debugging and monitoring

Persistent State Management

  • Watched folders automatically restored on restart
  • Database integrity maintained across sessions
  • Configuration persistence via environment variables
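
One simple way to restore watched folders after a restart is to keep them in a small JSON file. The sketch below assumes that approach; the helper names and the idea of a dedicated state file are hypothetical, and the project may persist this state elsewhere.

```python
import json
from pathlib import Path


def save_watched_folders(folders: list[str], state_file: str) -> None:
    """Write the de-duplicated, sorted folder list to a JSON state file."""
    target = Path(state_file)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(sorted(set(folders)), indent=2), encoding="utf-8")


def load_watched_folders(state_file: str) -> list[str]:
    """Read the folder list back; an absent file means nothing is watched yet."""
    target = Path(state_file)
    if not target.exists():
        return []
    return json.loads(target.read_text(encoding="utf-8"))
```

On startup the server would call `load_watched_folders` and re-register each path with the file watcher.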

🔧 Development

Adding Dependencies

uv add package-name

Running Tests

# Run the test suite (add test files under tests/ first)
uv run python -m pytest tests/

Code Formatting

uv run black backend/
uv run isort backend/

🎯 Performance Considerations

  • Vector Database: ChromaDB provides excellent performance for millions of document chunks
  • Embedding Model: The default all-MiniLM-L6-v2 model balances speed and accuracy
  • Chunking Strategy: 500-character chunks with 50-character overlap work well for most documents
  • File Watching: Files are processed asynchronously to avoid blocking the UI
  • Search Speed: Typical search times are under 1 second for large document collections
  • Smart Caching: File hashing prevents unnecessary re-processing
  • Memory Management: Efficient streaming processing for large documents

🛠️ Troubleshooting

Common Issues

  1. Port already in use:

    export SYFTBOX_ASSIGNED_PORT="8080"  # Use a different port
    ./run.sh
  2. Permission errors when adding folders:

    • Ensure the folder path exists and is readable
    • Check file permissions on the target directory
  3. Slow indexing:

    • Reduce CHUNK_SIZE for faster processing
    • Increase PROCESSING_DELAY to reduce system load
    • Monitor progress in the indexing status panel
  4. Out of memory:

    • Use a smaller embedding model
    • Process fewer files at once
    • Increase system memory
  5. Files not being indexed:

    • Check if file format is supported
    • Verify file permissions and accessibility
    • Monitor activity logs for error messages
  6. Application won't start:

    # Clean up any stale processes and files
    ./cleanup.sh
    # Then try starting again
    ./run.sh

Logs and Debugging

The application logs are stored in ./data/app.log. For more detailed logging:

export LOG_LEVEL="DEBUG"
./run.sh
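
For readers curious how a `LOG_LEVEL` variable is usually consumed on the Python side, here is a minimal sketch using the standard `logging` module. `configure_logging` is a hypothetical helper, and the `INFO` fallback and the log format are assumptions rather than the project's actual settings.

```python
import logging
import os


def configure_logging(env=None) -> int:
    """Set the root log level from LOG_LEVEL, falling back to INFO for unknown names."""
    env = os.environ if env is None else env
    name = env.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        level = logging.INFO  # unknown or misspelled level name
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return level
```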

You can also check the application status:

# Check if application is running
ps aux | grep uvicorn

# View recent logs
tail -f ./data/app.log

OAuth Configuration

If using Microsoft OneDrive/SharePoint integration:

  1. Place your OAuth tokens in .tokens.json
  2. Ensure the tokens grant the required Microsoft Graph permissions (for example, Files.Read or Files.ReadWrite)
  3. Monitor token expiration and refresh as needed

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

🚀 Next Steps

  • Add support for more file formats (CSV, JSON, XML)
  • Implement document preview functionality
  • Add user authentication and multi-user support
  • Create Docker containerization
  • Add automated testing suite
  • Implement document versioning
  • Add advanced search filters and faceting
