A powerful Retrieval-Augmented Generation (RAG) system for searching and retrieving PDF documents from your Zotero library using semantic similarity and natural language queries.
- Semantic Search: Uses sentence transformers for intelligent document retrieval
- Zotero Integration: Automatically extracts metadata from Zotero database
- Fast Indexing: FAISS-powered similarity search for instant results
- Browser Integration: Opens relevant PDFs directly in your browser
- Interactive Interface: Command-line interface with search suggestions
- Rich Metadata: Searches through titles, authors, abstracts, and content
- Persistent Index: Save and load search indices for faster startup
pip install zotero-rag
For faster embedding generation on CUDA-compatible GPUs:
pip install zotero-rag[gpu]
git clone https://github.com/yourusername/zotero-rag.git
cd zotero-rag
pip install -e .[dev]
- Build the search index (first time only):
  zotero-rag --build-index
- Search your library:
  zotero-rag --query "machine learning transformers"
- Interactive mode:
  zotero-rag
# Build index and start interactive search
zotero-rag --build-index
# Direct search from command line
zotero-rag --query "deep learning ICLR 2023"
# Specify custom Zotero directory
zotero-rag --zotero-dir "/path/to/zotero" --build-index
# Save index for faster subsequent runs
zotero-rag --build-index --save-index my_library.json
# Load existing index
zotero-rag --load-index my_library.json --query "neural networks"
When you run zotero-rag without arguments, you get an interactive interface:
=== Zotero RAG Search System ===
Enter your search query (or 'quit' to exit)
Example: 'Proba Unlearn ICLR'
Search query: attention mechanisms transformers
Found 5 results:
------------------------------------------------------------
1. [0.847] Attention Is All You Need
Author: Vaswani et al.
Year: 2017
Journal: NIPS
File: attention_is_all_you_need.pdf
2. [0.783] BERT: Pre-training of Deep Bidirectional Transformers
Author: Devlin et al.
Year: 2019
Journal: NAACL
File: bert_pretraining.pdf
Enter number to open (or press Enter to search again): 1
from zotero_rag import ZoteroRAG
# Initialize the system
rag = ZoteroRAG()
# Build index
rag.build_index()
# Search
results = rag.search("quantum computing", top_k=5)
# Open best match
if results:
    rag.open_pdf(results[0][0]['path'])
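The indexing `results[0][0]['path']` suggests that `search` returns a list of `(metadata, score)` pairs. Assuming that shape, and assuming metadata keys named `title`, `author`, and `year` (only `path` appears in the example above, so the others are guesses), a small helper for printing results could look like this sketch. `format_results` is a hypothetical name, not part of the package.

```python
from typing import Any, Dict, List, Tuple

# Assumed result shape: (metadata dict, similarity score).
Result = Tuple[Dict[str, Any], float]


def format_results(results: List[Result]) -> str:
    """Render search results as numbered lines with scores."""
    lines = []
    for rank, (meta, score) in enumerate(results, start=1):
        # .get() keeps the helper usable when the title key is missing.
        title = meta.get("title", "(untitled)")
        lines.append(f"{rank}. [{score:.3f}] {title}")
        for key in ("author", "year", "path"):
            if key in meta:
                lines.append(f"   {key.capitalize()}: {meta[key]}")
    return "\n".join(lines)


# Hypothetical data standing in for the output of rag.search(...).
sample = [({"title": "Attention Is All You Need", "year": 2017,
            "path": "attention_is_all_you_need.pdf"}, 0.847)]
print(format_results(sample))
```

If the real return type differs, only the tuple unpacking in the loop needs to change.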
The system automatically detects your Zotero directory on:
- Windows: ~/Zotero or ~/AppData/Roaming/Zotero/
- macOS: ~/Zotero or ~/Library/Application Support/Zotero/
- Linux: ~/Zotero or ~/.zotero/
You can override this with --zotero-dir /custom/path.
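To illustrate the detection order described above, here is a minimal sketch that checks the listed locations in turn and honors an explicit override. The function name `find_zotero_dir` and the exact precedence are assumptions for illustration; the package's own detection logic may differ.

```python
from pathlib import Path
from typing import Iterable, Optional

# Candidate locations from the list above, relative to the home directory.
CANDIDATES = (
    "Zotero",
    "AppData/Roaming/Zotero",
    "Library/Application Support/Zotero",
    ".zotero",
)


def find_zotero_dir(home: Path, override: Optional[str] = None,
                    candidates: Iterable[str] = CANDIDATES) -> Optional[Path]:
    """Return the first existing candidate directory, or None."""
    if override:
        # An explicit path (as with --zotero-dir) takes precedence.
        path = Path(override).expanduser()
        return path if path.is_dir() else None
    for rel in candidates:
        path = home / rel
        if path.is_dir():
            return path
    return None


print(find_zotero_dir(Path.home()))
```

Returning `None` instead of raising lets the caller decide how to report a missing directory.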
# Build and save index
zotero-rag --build-index --save-index ~/my_zotero_index.json
# Load existing index (much faster)
zotero-rag --load-index ~/my_zotero_index.json
# Update index with new papers
zotero-rag --build-index --save-index ~/my_zotero_index.json
- Discovery: Scans your Zotero storage directory for PDF files
- Metadata Extraction: Reads bibliographic data from Zotero's SQLite database
- Content Extraction: Extracts text from PDF files using PyPDF2
- Embedding Generation: Creates semantic embeddings using sentence transformers
- Index Building: Builds a FAISS index for fast similarity search
- Query Processing: Converts your search query to embeddings
- Retrieval: Finds most similar documents using cosine similarity
- Presentation: Ranks and displays results with metadata
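The query, retrieval, and ranking steps above can be sketched in a few lines. This is a deliberately simplified stand-in: it swaps the sentence-transformer embeddings for normalized word counts and the FAISS index for a linear scan, so only the cosine-similarity ranking mirrors the described pipeline. All function names here are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: L2-normalized word counts (stand-in for a real model)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two unit vectors equals their cosine similarity."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def search(query: str, docs: Dict[str, str],
           top_k: int = 5) -> List[Tuple[str, float]]:
    """Rank documents by cosine similarity to the query (linear scan)."""
    q = embed(query)
    scored = [(name, cosine(q, embed(text))) for name, text in docs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


docs = {
    "attention.pdf": "attention mechanisms in transformers",
    "quantum.pdf": "quantum computing with superconducting qubits",
}
print(search("attention transformers", docs, top_k=1))
```

A real index would embed each document once at build time rather than on every query; the sketch recomputes embeddings only to stay short.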
- First run: Building the index takes time (5-10 minutes for 1000+ papers)
- Subsequent runs: Load saved indices for instant startup
- GPU acceleration: Install with the [gpu] extra for faster embedding generation
- Memory usage: ~1-2GB RAM for moderate libraries (500-1000 papers)
- PDF files: Primary support with text extraction
- Metadata: Title, authors, journal, year, abstract, DOI, tags
- Zotero database: SQLite-based metadata extraction
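As a rough illustration of SQLite-based metadata extraction, the sketch below joins three tables that are assumed to exist in zotero.sqlite (`itemData`, `fields`, `itemDataValues`) and groups field values by item. The table and field names are assumptions that should be checked against your own database, and `read_metadata` and `open_read_only` are hypothetical helpers, not the package's API.

```python
import sqlite3
from pathlib import Path
from typing import Dict

# Assumed schema: itemData links items to (field, value) pairs.
QUERY = """
SELECT itemData.itemID, fields.fieldName, itemDataValues.value
FROM itemData
JOIN fields ON fields.fieldID = itemData.fieldID
JOIN itemDataValues ON itemDataValues.valueID = itemData.valueID
WHERE fields.fieldName IN ('title', 'abstractNote', 'date', 'DOI')
"""


def read_metadata(conn: sqlite3.Connection) -> Dict[int, Dict[str, str]]:
    """Collect the selected fields per item ID into nested dicts."""
    metadata: Dict[int, Dict[str, str]] = {}
    for item_id, field, value in conn.execute(QUERY):
        metadata.setdefault(item_id, {})[field] = value
    return metadata


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open the database read-only so the indexer cannot modify it."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

Typical use would be `read_metadata(open_read_only("/path/to/zotero/zotero.sqlite"))`, with the path adjusted to your Zotero directory.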
- "Zotero directory not found"
  zotero-rag --zotero-dir "/path/to/your/zotero/directory"
- "No PDF files found"
  - Check that your Zotero library has PDF attachments
  - Verify the storage directory exists: {zotero-dir}/storage/
- "Database not found"
  - Ensure Zotero is closed when running the indexer
  - Check for zotero.sqlite in your Zotero directory
- Memory issues with large libraries
  - Process PDFs in batches
  - Use --save-index to avoid rebuilding
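For the batching advice above, one possible approach when driving indexing from your own Python code is to split the list of PDF paths into fixed-size chunks and handle one chunk at a time. The helper `batched` and the file names are hypothetical; how the package itself processes files is not specified here.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


pdf_paths = [f"paper_{i}.pdf" for i in range(5)]  # hypothetical file names
for batch in batched(pdf_paths, batch_size=2):
    # Extract text and embed one batch here, then release it before the next.
    print(len(batch), batch[0])
```

Smaller batches lower peak memory at the cost of more per-batch overhead.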
# Enable verbose output
zotero-rag --build-index --verbose
# Check detected paths
zotero-rag --info
- Python 3.8+
- 2GB+ RAM recommended
- 1GB+ free disk space for indices
- Internet connection (first run only, for downloading models)
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this software in your research, please cite:
@software{zotero_rag,
title={Zotero RAG Search System},
author={Your Name},
year={2024},
url={https://github.com/yourusername/zotero-rag}
}
- Sentence Transformers for semantic embeddings
- FAISS for efficient similarity search
- Zotero for reference management
- The open-source community for making this possible
- Documentation
- Issue Tracker
- Discussions