A lightweight, local RAG (Retrieval-Augmented Generation) system for indexing and searching your documents. Built with FastAPI, Docling, and ChromaDB for high-performance semantic search across your local files.
- Lightweight & Fast: Optimized for performance with millions of document chunks
- Beautiful Web Interface: Modern, responsive UI for easy document management and search
- Auto File Watching: Automatically indexes new/modified files in watched folders
- Semantic Search: Uses advanced embeddings for intelligent document retrieval
- Real-time Stats: Monitor your document index and search performance
- File Browser: Dropbox-like interface for browsing and selecting files/folders
- Smart Indexing: Avoids re-indexing unchanged files using content hashing
- Progress Tracking: Real-time indexing progress with detailed status updates
- Persistent Configuration: Automatically saves and restores watched folders
- Configurable: Easy configuration via environment variables
- OAuth Support: Integration with Microsoft OneDrive/SharePoint (via .tokens.json)
    ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
    │    Frontend     │    │   Backend API    │    │  Vector Store   │
    │    (HTML/JS)    │◄──►│    (FastAPI)     │◄──►│    (Chroma)     │
    │                 │    │                  │    │                 │
    │ • File selector │    │ • File watcher   │    │ • Embeddings    │
    │ • Search UI     │    │ • Doc processing │    │ • Metadata      │
    │ • Results view  │    │ • Embedding gen  │    │ • Fast search   │
    │ • Progress view │    │ • Hash checking  │    │ • Deduplication │
    └─────────────────┘    └──────────────────┘    └─────────────────┘
                                    │
                           ┌──────────────────┐
                           │   File System    │
                           │     Watcher      │
                           │    (watchdog)    │
                           └──────────────────┘
- PDF documents
- Microsoft Word (.docx)
- Text files (.txt, .md)
- HTML files
- PowerPoint (.pptx)
- Excel (.xlsx)
- Python 3.12+
- uv package manager
- Clone the repository:
      git clone <your-repo-url>
      cd syftbox-rag
- Install dependencies:
      uv pip install -r requirements.txt
- Run the application:
      ./run.sh
- Open your browser and go to:
      http://localhost:9000
- Go to the "Manage Files" tab
- Use the file browser to navigate to your desired folder
- Select folders or individual files using checkboxes
- Click "Add Selected" to start indexing
- Monitor progress in real-time with the indexing status indicator
- Navigate through your file system like Dropbox
- Select multiple files and folders with checkboxes
- View file sizes and modification dates
- Quick folder expansion/collapse
- Home directory quick access
- Go to the "Search Documents" tab
- Enter your search query in natural language
- Adjust the results limit if needed (default: 20)
- Click "Search" or press Enter
- View results with similarity scores and metadata
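The similarity scores shown with each result can be illustrated with a small sketch. It assumes cosine similarity over embedding vectors; this README does not state which metric the backend actually uses, and `cosine_similarity` and `rank_chunks` are hypothetical helpers, not functions from this project. The toy two-dimensional vectors stand in for real sentence embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector, chunk_vectors, limit=20):
    """Return up to `limit` (chunk_id, score) pairs, highest score first.

    The default of 20 mirrors the default results limit mentioned above.
    """
    scored = [
        (chunk_id, cosine_similarity(query_vector, vector))
        for chunk_id, vector in chunk_vectors.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


# Toy example: chunk "a" points the same way as the query, "b" is orthogonal.
toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top_two = rank_chunks([1.0, 0.0], toy_index, limit=2)
```

In a real deployment the ranking is delegated to the vector store; the sketch only shows what a higher score means.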
"machine learning algorithms"
"project timeline and deadlines"
"financial reports Q3"
"meeting notes from last week"
You can customize the system behavior using environment variables:
export VECTOR_DB_PATH="./my_vector_db" # Vector database location
export EMBEDDING_MODEL="all-MiniLM-L6-v2" # Sentence transformer model
export CHUNK_SIZE="500" # Document chunk size (characters)
export CHUNK_OVERLAP="50" # Overlap between chunks
export MIN_CHUNK_SIZE="50" # Minimum chunk size to index
export HOST="0.0.0.0" # Server host
export PORT="8080" # Server port
export PROCESSING_DELAY="1.0" # File processing delay (seconds)
export DEFAULT_SEARCH_LIMIT="20" # Default search results
export MAX_SEARCH_LIMIT="100" # Maximum search results
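As a rough illustration of how these variables might be consumed, the sketch below reads them into a typed settings object. The variable names come from the listing above; the `Settings` class, the `load_settings` function, and the fallback values (copied from the example exports) are assumptions, not the contents of backend/config.py.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    vector_db_path: str
    embedding_model: str
    chunk_size: int
    chunk_overlap: int
    min_chunk_size: int
    host: str
    port: int
    processing_delay: float
    default_search_limit: int
    max_search_limit: int


def load_settings(env=None):
    """Build Settings from a mapping of environment variables.

    Fallback values are assumptions taken from the example exports above.
    """
    env = os.environ if env is None else env
    settings = Settings(
        vector_db_path=env.get("VECTOR_DB_PATH", "./my_vector_db"),
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        chunk_size=int(env.get("CHUNK_SIZE", "500")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "50")),
        min_chunk_size=int(env.get("MIN_CHUNK_SIZE", "50")),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8080")),
        processing_delay=float(env.get("PROCESSING_DELAY", "1.0")),
        default_search_limit=int(env.get("DEFAULT_SEARCH_LIMIT", "20")),
        max_search_limit=int(env.get("MAX_SEARCH_LIMIT", "100")),
    )
    # Reject combinations that cannot produce sensible chunks or limits.
    if not 0 <= settings.chunk_overlap < settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be non-negative and smaller than CHUNK_SIZE")
    if settings.default_search_limit > settings.max_search_limit:
        raise ValueError("DEFAULT_SEARCH_LIMIT must not exceed MAX_SEARCH_LIMIT")
    return settings
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.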
./run.sh
uv run uvicorn backend.main:app --host 127.0.0.1 --port 9000
uv run uvicorn backend.main:app --host 127.0.0.1 --port 9000 --reload
./cleanup.sh
    syftbox-rag/
    ├── backend/
    │   ├── __init__.py
    │   ├── main.py                 # FastAPI server with API endpoints
    │   ├── config.py               # Configuration settings
    │   ├── document_processor.py   # Docling integration for parsing
    │   ├── embeddings.py           # Sentence transformer embeddings
    │   ├── file_watcher.py         # File system monitoring & processing
    │   └── vector_store.py         # ChromaDB interface
    ├── frontend/
    │   ├── index.html              # Main interface with tabs
    │   ├── app.js                  # Frontend logic & file browser
    │   └── style.css               # Modern responsive styling
    ├── data/                       # Application data (logs, PID files)
    ├── vector_db/                  # ChromaDB storage (auto-created)
    ├── .tokens.json                # OAuth tokens (optional)
    ├── requirements.txt            # Python dependencies
    ├── run.sh                      # Application launcher script
    ├── cleanup.sh                  # Application cleanup script
    └── README.md                   # This file
The backend exposes the following REST API endpoints:
- GET / - Main web interface
- POST /api/add-folder - Add folder to watch list
- GET /api/watched-folders - Get watched folders
- DELETE /api/watched-folders/{path} - Remove watched folder
- POST /api/search - Search documents
- GET /api/stats - Get database statistics
- GET /api/indexing-status - Get real-time indexing progress
- POST /api/file-structure - Browse file system
- GET /api/file-structure/home - Get home directory structure
- POST /api/folder-selection - Batch add/remove files and folders
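A minimal sketch of calling the search endpoint from a script, using only the standard library. The endpoint path and the localhost:9000 address come from this README; the JSON field names `query` and `limit` and the shape of the response are assumptions, so check the request model in backend/main.py before relying on them. `build_search_request` and `search` are hypothetical helper names.

```python
import json
import urllib.request


def build_search_request(query, limit=20, base_url="http://localhost:9000"):
    """Prepare a POST request for /api/search.

    The body fields "query" and "limit" are assumed, not confirmed by the README.
    """
    payload = json.dumps({"query": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, limit=20, base_url="http://localhost:9000"):
    """Send the request and return the decoded JSON response."""
    request = build_search_request(query, limit, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires the application to be running locally.
    print(search("financial reports Q3", limit=5))
```

Separating request construction from sending keeps the part that depends on a running server small and makes the payload easy to inspect.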
- File hashing prevents re-indexing unchanged documents
- Chunked processing handles large files efficiently
- Background processing doesn't block the UI
- Error recovery handles corrupted or inaccessible files
- Operation queue with detailed progress information
- File-level progress with chunk counting
- Size estimation and processing speed metrics
- Activity logs for debugging and monitoring
- Watched folders automatically restored on restart
- Database integrity maintained across sessions
- Configuration persistence via environment variables
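The hash-based skip described in the lists above can be sketched as follows. This is an illustration under assumptions, not the project's code: the README does not say which hash algorithm is used (SHA-256 is assumed here), and `file_sha256`, `needs_indexing`, and the in-memory `known_hashes` dict are hypothetical names.

```python
import hashlib
from pathlib import Path


def file_sha256(path, block_size=65536):
    """Hash a file in fixed-size blocks so large files are never fully loaded."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_indexing(path, known_hashes):
    """Return True if the file is new or its content changed since the last call.

    `known_hashes` maps resolved paths to the last seen digest and is updated
    in place; a real system would persist it across restarts.
    """
    key = str(Path(path).resolve())
    current = file_sha256(path)
    if known_hashes.get(key) == current:
        return False
    known_hashes[key] = current
    return True
```

Hashing content rather than comparing modification times means a file that is touched but not changed is still skipped.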
uv add package-name
# Add test files to test the system
uv run python -m pytest tests/
uv run black backend/
uv run isort backend/
- Vector Database: ChromaDB provides excellent performance for millions of document chunks
- Embedding Model: The default all-MiniLM-L6-v2 model balances speed and accuracy
- Chunking Strategy: 500-character chunks with 50-character overlap work well for most documents
- File Watching: Files are processed asynchronously to avoid blocking the UI
- Search Speed: Typical search times are under 1 second for large document collections
- Smart Caching: File hashing prevents unnecessary re-processing
- Memory Management: Efficient streaming processing for large documents
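To make the chunking numbers above concrete, here is a minimal sketch of fixed-size character chunking with overlap. It assumes a simple sliding window; the actual splitting logic in the backend may differ (for example, by respecting sentence or section boundaries), and `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=500, overlap=50, min_chunk_size=50):
    """Split text into overlapping character windows.

    Each window starts chunk_size - overlap characters after the previous one,
    and windows shorter than min_chunk_size are dropped.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if len(chunk) >= min_chunk_size:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A 1200-character document yields windows starting at 0, 450, and 900.
sample_text = "".join(chr(97 + i % 26) for i in range(1200))
sample_chunks = chunk_text(sample_text)
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, which helps when a relevant sentence straddles a chunk boundary.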
- Port already in use:
      export SYFTBOX_ASSIGNED_PORT="8080"  # Use a different port
      ./run.sh
- Permission errors when adding folders:
  - Ensure the folder path exists and is readable
  - Check file permissions on the target directory
- Slow indexing:
  - Reduce CHUNK_SIZE for faster processing
  - Increase PROCESSING_DELAY to reduce system load
  - Monitor progress in the indexing status panel
- Out of memory:
  - Use a smaller embedding model
  - Process fewer files at once
  - Increase system memory
- Files not being indexed:
  - Check if file format is supported
  - Verify file permissions and accessibility
  - Monitor activity logs for error messages
- Application won't start:
      # Clean up any stale processes and files
      ./cleanup.sh
      # Then try starting again
      ./run.sh
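For the "port already in use" case, a quick way to confirm the diagnosis before changing any configuration is to probe the port. The sketch below is a generic standard-library check, not part of this project; `is_port_in_use` is a hypothetical helper, and 9000 is the port used in the startup commands in this README.

```python
import socket


def is_port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    state = "in use" if is_port_in_use(9000) else "free"
    print(f"Port 9000 is {state}")
```

If the port is reported as in use while the application is not running, a stale process may still hold it, which is the situation ./cleanup.sh is meant to handle.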
The application logs are stored in ./data/app.log. For more detailed logging:
export LOG_LEVEL="DEBUG"
./run.sh
You can also check the application status:
# Check if application is running
ps aux | grep uvicorn
# View recent logs
tail -f ./data/app.log
If using Microsoft OneDrive/SharePoint integration:
- Place your OAuth tokens in .tokens.json
- Ensure proper permissions for Files.Read, Files.ReadWrite, etc.
- Monitor token expiration and refresh as needed
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Docling for document processing
- ChromaDB for vector storage
- Sentence Transformers for embeddings
- FastAPI for the web framework
- Watchdog for file system monitoring
- Add support for more file formats (CSV, JSON, XML)
- Implement document preview functionality
- Add user authentication and multi-user support
- Create Docker containerization
- Add automated testing suite
- Implement document versioning
- Add advanced search filters and faceting