🔍 RAG Document Search

A lightweight, local RAG (Retrieval-Augmented Generation) system for indexing and searching your documents. Built with FastAPI, Docling, and ChromaDB for high-performance semantic search across your local files.

✨ Features

🚀 Lightweight & Fast: Optimized for performance with millions of document chunks
🌐 Beautiful Web Interface: Modern, responsive UI for easy document management and search
📁 Auto File Watching: Automatically indexes new/modified files in watched folders
🔍 Semantic Search: Uses advanced embeddings for intelligent document retrieval
📊 Real-time Stats: Monitor your document index and search performance
🗂️ File Browser: Dropbox-like interface for browsing and selecting files/folders
⚡ Smart Indexing: Avoids re-indexing unchanged files using content hashing
📈 Progress Tracking: Real-time indexing progress with detailed status updates
💾 Persistent Configuration: Automatically saves and restores watched folders
🔧 Configurable: Easy configuration via environment variables
🔐 OAuth Support: Integration with Microsoft OneDrive/SharePoint (via .tokens.json)

🏗️ System Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Frontend      │    │   Backend API    │    │  Vector Store   │
│  (HTML/JS)      │◄──►│   (FastAPI)      │◄──►│   (Chroma)      │
│                 │    │                  │    │                 │
│ • File selector │    │ • File watcher   │    │ • Embeddings    │
│ • Search UI     │    │ • Doc processing │    │ • Metadata      │
│ • Results view  │    │ • Embedding gen  │    │ • Fast search   │
│ • Progress view │    │ • Hash checking  │    │ • Deduplication │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                       ┌──────────────────┐
                       │  File System     │
                       │   Watcher        │
                       │  (watchdog)      │
                       └──────────────────┘

📋 Supported File Formats

PDF documents
Microsoft Word (.docx)
Text files (.txt, .md)
HTML files
PowerPoint (.pptx)
Excel (.xlsx)

🚀 Quick Start

Prerequisites

Python 3.12+
uv package manager

Installation

Clone the repository:

git clone <your-repo-url>
cd syftbox-rag

Install dependencies:
```
uv pip install -r requirements.txt
```
Run the application:
```
./run.sh
```
Open your browser and go to:
```
http://localhost:9000
```

🎯 Usage

Adding Documents

Go to the "Manage Files" tab
Use the file browser to navigate to your desired folder
Select folders or individual files using checkboxes
Click "Add Selected" to start indexing
Monitor progress in real-time with the indexing status indicator

File Browser Features

📂 Navigate through your file system like Dropbox
☑️ Select multiple files and folders with checkboxes
📊 View file sizes and modification dates
🔍 Quick folder expansion/collapse
🏠 Home directory quick access

Searching Documents

Go to the "Search Documents" tab
Enter your search query in natural language
Adjust the results limit if needed (default: 20)
Click "Search" or press Enter
View results with similarity scores and metadata
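
The similarity scores shown with each result come from comparing embedding vectors. As a rough illustration of the idea only, the sketch below ranks toy vectors by cosine similarity. It is not the project's code: ChromaDB performs the actual nearest-neighbour search, its distance metric depends on how the collection is configured, and the names `cosine_similarity` and `rank` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs, limit=20):
    """Return up to `limit` (name, score) pairs, highest similarity first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```

A score close to 1 means the chunk points in nearly the same direction as the query in embedding space; the results limit simply truncates the sorted list.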

Search Examples

"machine learning algorithms"
"project timeline and deadlines"
"financial reports Q3"
"meeting notes from last week"

⚙️ Configuration

You can customize the system behavior using environment variables:

Database Settings

export VECTOR_DB_PATH="./my_vector_db"  # Vector database location

Embedding Model

export EMBEDDING_MODEL="all-MiniLM-L6-v2"  # Sentence transformer model

Document Processing

export CHUNK_SIZE="500"           # Document chunk size (characters)
export CHUNK_OVERLAP="50"         # Overlap between consecutive chunks (characters)
export MIN_CHUNK_SIZE="50"        # Minimum chunk size to index
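
To make these three settings concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is an illustration under assumptions: the function name `chunk_text` is hypothetical, and the project's actual chunking in backend/document_processor.py may differ (for example, it may split on sentence or layout boundaries).

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50,
               min_chunk_size: int = 50) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # distance between the starts of consecutive windows
    chunks = []
    start = 0
    while start < len(text):
        piece = text[start:start + chunk_size]
        # Skip fragments too short to be worth indexing (typically a short tail).
        if len(piece) >= min_chunk_size:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the values above, a 1000-character document yields windows starting at offsets 0, 450, and 900, and each window repeats the last 50 characters of the previous one.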

Server Settings

export HOST="0.0.0.0"            # Server host
export PORT="8080"               # Server port

Performance Tuning

export PROCESSING_DELAY="1.0"         # File processing delay (seconds)
export DEFAULT_SEARCH_LIMIT="20"      # Default search results
export MAX_SEARCH_LIMIT="100"         # Maximum search results
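
On the Python side, settings like these are typically read from the environment with a fallback default. The sketch below is an assumption about how that could look, not the contents of backend/config.py. The `Settings` dataclass and `load_settings` function are hypothetical names, the fallback values mirror the examples in this section (plus the `vector_db/` folder from the project layout) and may not match the project's real defaults, and the server host and port are left out.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    vector_db_path: str
    embedding_model: str
    chunk_size: int
    chunk_overlap: int
    min_chunk_size: int
    processing_delay: float
    default_search_limit: int
    max_search_limit: int


def load_settings(env=None) -> Settings:
    """Build Settings from environment variables, falling back to assumed defaults."""
    env = os.environ if env is None else env
    settings = Settings(
        vector_db_path=env.get("VECTOR_DB_PATH", "./vector_db"),
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        # Environment variables are always strings, so numeric values are converted.
        chunk_size=int(env.get("CHUNK_SIZE", "500")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "50")),
        min_chunk_size=int(env.get("MIN_CHUNK_SIZE", "50")),
        processing_delay=float(env.get("PROCESSING_DELAY", "1.0")),
        default_search_limit=int(env.get("DEFAULT_SEARCH_LIMIT", "20")),
        max_search_limit=int(env.get("MAX_SEARCH_LIMIT", "100")),
    )
    if settings.chunk_overlap >= settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return settings
```

Validating the values once at startup, as in the last check, surfaces a bad combination immediately instead of during indexing.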

🏃‍♂️ Running the System

Method 1: Using the run script (Recommended)

./run.sh

Method 2: Direct FastAPI

uv run uvicorn backend.main:app --host 127.0.0.1 --port 9000

Method 3: Development mode (with auto-reload)

uv run uvicorn backend.main:app --host 127.0.0.1 --port 9000 --reload

Stopping the Application

./cleanup.sh

📁 Project Structure

syftbox-rag/
├── backend/
│   ├── __init__.py
│   ├── main.py              # FastAPI server with API endpoints
│   ├── config.py            # Configuration settings
│   ├── document_processor.py # Docling integration for parsing
│   ├── embeddings.py        # Sentence transformer embeddings
│   ├── file_watcher.py      # File system monitoring & processing
│   └── vector_store.py      # ChromaDB interface
├── frontend/
│   ├── index.html           # Main interface with tabs
│   ├── app.js               # Frontend logic & file browser
│   └── style.css            # Modern responsive styling
├── data/                    # Application data (logs, PID files)
├── vector_db/               # ChromaDB storage (auto-created)
├── .tokens.json             # OAuth tokens (optional)
├── requirements.txt         # Python dependencies
├── run.sh                   # Application launcher script
├── cleanup.sh               # Application cleanup script
└── README.md                # This file

🔧 API Endpoints

The backend exposes the following REST endpoints:

GET / - Main web interface
POST /api/add-folder - Add folder to watch list
GET /api/watched-folders - Get watched folders
DELETE /api/watched-folders/{path} - Remove watched folder
POST /api/search - Search documents
GET /api/stats - Get database statistics
GET /api/indexing-status - Get real-time indexing progress
POST /api/file-structure - Browse file system
GET /api/file-structure/home - Get home directory structure
POST /api/folder-selection - Batch add/remove files and folders
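
As an example of calling the API from a script, the sketch below posts a query to the search endpoint using only the Python standard library. The request body is an assumption: this README does not document the JSON schema, so the `query` and `limit` field names are hypothetical, as are the helper names `build_search_request` and `search`. FastAPI apps serve interactive API docs at /docs by default, which should show the real schema if that page has not been disabled.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9000"  # port taken from the Quick Start section


def build_search_request(query: str, limit: int = 20,
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare a POST request for /api/search without sending it."""
    # ASSUMPTION: the endpoint accepts a JSON body with "query" and "limit" keys.
    payload = json.dumps({"query": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, limit: int = 20, base_url: str = BASE_URL):
    """Send the request and return the parsed JSON response."""
    request = build_search_request(query, limit, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With the server running, `search("project timeline and deadlines", limit=5)` would return whatever JSON the backend responds with.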

🎯 Advanced Features

Smart Indexing

File hashing prevents re-indexing unchanged documents
Chunked processing handles large files efficiently
Background processing doesn't block the UI
Error recovery handles corrupted or inaccessible files
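
The hash check in the first point can be sketched as follows. This is a minimal illustration, assuming SHA-256 and an in-memory dictionary of known hashes; the README does not state which hash algorithm the project uses or where hashes are stored, and `file_hash` and `needs_indexing` are hypothetical names.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file's contents in blocks so large files are never fully loaded."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_indexing(path: Path, known_hashes: dict) -> bool:
    """Return True (and record the new hash) if the file is new or its content changed."""
    current = file_hash(path)
    key = str(path.resolve())
    if known_hashes.get(key) == current:
        return False  # same content as last time, skip re-indexing
    known_hashes[key] = current
    return True
```

Hashing content rather than comparing modification times means a file that was merely touched or copied without changes is not processed again.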

Real-time Progress Tracking

Operation queue with detailed progress information
File-level progress with chunk counting
Size estimation and processing speed metrics
Activity logs for debugging and monitoring

Persistent State Management

Watched folders automatically restored on restart
Database integrity maintained across sessions
Configuration persistence via environment variables

🔧 Development

Adding Dependencies

uv add package-name

Running Tests

# Place test files under tests/ first, then run them with pytest
uv run python -m pytest tests/

Code Formatting

uv run black backend/
uv run isort backend/

🎯 Performance Considerations

Vector Database: ChromaDB provides excellent performance for millions of document chunks
Embedding Model: The default all-MiniLM-L6-v2 model balances speed and accuracy
Chunking Strategy: 500-character chunks with 50-character overlap work well for most documents
File Watching: Files are processed asynchronously to avoid blocking the UI
Search Speed: Typical search times are under 1 second for large document collections
Smart Caching: File hashing prevents unnecessary re-processing
Memory Management: Efficient streaming processing for large documents

🛠️ Troubleshooting

Common Issues

Port already in use:

export SYFTBOX_ASSIGNED_PORT="8080"  # Use a different port
./run.sh
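
To confirm that a port really is taken before switching, a quick check from Python can help. This is a small standalone sketch, not part of the project; the helper name `port_in_use` is hypothetical.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

For example, `port_in_use(9000)` checks the port used in the Quick Start URL.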

Permission errors when adding folders:
- Ensure the folder path exists and is readable
- Check file permissions on the target directory

Slow indexing:
- Reduce CHUNK_SIZE for faster processing
- Increase PROCESSING_DELAY to reduce system load
- Monitor progress in the indexing status panel

Out of memory:
- Use a smaller embedding model
- Process fewer files at once
- Increase system memory

Files not being indexed:
- Check if file format is supported
- Verify file permissions and accessibility
- Monitor activity logs for error messages

Application won't start:

# Clean up any stale processes and files
./cleanup.sh
# Then try starting again
./run.sh

Logs and Debugging

The application logs are stored in ./data/app.log. For more detailed logging:

export LOG_LEVEL="DEBUG"
./run.sh

You can also check the application status:

# Check if application is running
ps aux | grep uvicorn

# View recent logs
tail -f ./data/app.log

OAuth Configuration

If using Microsoft OneDrive/SharePoint integration:

Place your OAuth tokens in .tokens.json
Ensure the tokens grant the required Microsoft Graph permissions (Files.Read, Files.ReadWrite, etc.)
Monitor token expiration and refresh as needed

🤝 Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Docling for document processing
ChromaDB for vector storage
Sentence Transformers for embeddings
FastAPI for the web framework
Watchdog for file system monitoring

🚀 Next Steps

Add support for more file formats (CSV, JSON, XML)
Implement document preview functionality
Add user authentication and multi-user support
Create Docker containerization
Add automated testing suite
Implement document versioning
Add advanced search filters and faceting
