# RAG for Research Papers

A comprehensive Retrieval-Augmented Generation (RAG) system designed to search, analyze, and extract insights from academic research papers. This project provides intelligent document processing, semantic search, and AI-powered question answering capabilities for research paper collections.
## Features

- **Multi-Modal Document Processing**: Support for both predefined corpora and user-uploaded PDF papers
- **Advanced Text Processing**: Intelligent section extraction, chunking, and metadata extraction
- **Semantic Search**: SciBERT embeddings with FAISS HNSW indexing for efficient similarity search
- **AI-Powered Q&A**: Integration with Google Gemini for intelligent question answering
- **Metadata Extraction**: Automatic extraction of paper metadata, including authors, abstracts, and citations
- **Flexible Architecture**: Modular design supporting different use cases and data sources
## Table of Contents

- Installation
- Quick Start
- Project Structure
- Usage Examples
- Architecture
- Configuration
- API Reference
- Contributing
- License
## Installation

### Prerequisites

- Python 3.8+
- CUDA-compatible GPU (optional, for faster processing)
### Setup

- Clone the repository:

  ```bash
  git clone <repository-url>
  cd RAG_for_Research_Papers-main
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Set up API keys (optional, for Gemini integration):

  ```bash
  export GOOGLE_API_KEY="your-api-key-here"
  ```
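If you plan to use the Gemini integration, the snippet below is a quick, standalone smoke test for the key. It is a sketch rather than this project's own wiring, and it assumes the `google-generativeai` client is installed; the model name matches the default listed under Configuration.

```python
import os

import google.generativeai as genai

# Fail early if the key exported above is not visible to Python.
api_key = os.environ.get("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("GOOGLE_API_KEY is not set; Gemini-powered answers will be unavailable.")

genai.configure(api_key=api_key)
model = genai.GenerativeModel("gemini-2.0-flash")
print(model.generate_content("Reply with OK if you can read this.").text)
```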
## Quick Start

### Processing Your Own PDFs

- Place your PDF files in the `pdf_folder/pdfs/` directory
- Run the processing pipeline:

  ```bash
  cd pdf_folder
  python main.py
  ```
- Query your papers:

  ```python
  # The system will automatically process your PDFs and create embeddings.
  # You can then ask questions about your papers.
  ```
### Using the Predefined Corpus

- Run with the arxiv dataset:

  ```bash
  cd corpus
  python main.py
  ```
- Query the corpus:

  ```python
  # The system will load the arxiv dataset and allow you to query it.
  ```
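Both pipelines end in the same retrieve-then-answer loop. The sketch below is purely illustrative (the real flow is whatever the respective `main.py` implements) and uses the classes documented under API Reference; `your_papers` is a placeholder for your processed paper records.

```python
from embedding import SciBERTEmbeddings
from faiss_index import DocumentProcessor

embedder = SciBERTEmbeddings()
processor = DocumentProcessor(dataset=your_papers, output_dir="processed_chunks")  # your_papers: placeholder
processor.process_dataset()

while True:
    question = input("Ask a question (blank line to quit): ").strip()
    if not question:
        break
    query_embedding = embedder.embed_query(question)
    distances, indices = processor.search_faiss_index(query_embedding, k=5)
    print("Closest chunks:", indices, "distances:", distances)
```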
## Project Structure

```
RAG_for_Research_Papers-main/
├── corpus/                       # Predefined corpus processing
│   ├── main.py                   # Main script for corpus processing
│   ├── embeddings.py             # Embedding generation
│   ├── faiss_indexing.py         # FAISS index creation
│   ├── text_processing.py        # Text preprocessing utilities
│   ├── vector_db_processing.py   # Vector database operations
│   └── output/                   # Processed chunks output
├── pdf_folder/                   # User PDF processing
│   ├── main.py                   # Main script for PDF processing
│   ├── embedding.py              # SciBERT embeddings
│   ├── faiss_index.py            # FAISS indexing implementation
│   ├── pdf_processing.py         # PDF text extraction
│   ├── text_utils.py             # Text utilities
│   ├── pdfs/                     # User PDF files
│   ├── processed_chunks_json/    # Processed text chunks
│   └── processed_papers.json     # Extracted paper data
├── meta_data/                    # Metadata extraction
│   ├── main.py                   # Metadata extraction script
│   ├── extract_metadata.py       # PDF metadata extraction
│   ├── embedding.py              # Metadata embeddings
│   ├── retrieve.py               # Metadata retrieval
│   ├── pdfs/                     # PDF files for metadata
│   └── pdf_metadata_output.json  # Extracted metadata
├── requirements.txt              # Python dependencies
├── diagram.svg                   # System architecture diagram
└── README.md                     # This file
```
## Usage Examples

### Basic Query

```python
# After processing your papers, you can query them like this:
query = "What are the main findings about transformer architectures?"

# The system will:
# 1. Generate embeddings for your query
# 2. Find the most relevant document chunks
# 3. Generate an AI-powered response
```
### Advanced Usage

```python
from embedding import SciBERTEmbeddings
from faiss_index import DocumentProcessor

# Initialize the embedding model
embedder = SciBERTEmbeddings()

# Process documents
processor = DocumentProcessor(
    dataset=your_papers,
    output_dir="processed_chunks"
)

# Create embeddings and index
processed_documents = processor.process_dataset()

# Search for relevant documents
query_embedding = embedder.embed_query("Your question here")
distances, indices = processor.search_faiss_index(query_embedding, k=5)
```
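To close the RAG loop, the retrieved chunks are assembled into a prompt for Gemini. The continuation below is a sketch of that step, not the project's exact prompt format: it assumes `processed_documents` is indexable by the FAISS ids and that each chunk exposes a `"text"` field, and it expects the Gemini key to be configured as described under Installation and Configuration.

```python
import google.generativeai as genai
import numpy as np

# Flatten the FAISS output (its shape depends on the wrapper) and gather the matching chunks.
# NOTE: the "text" field is an assumption about the processed-chunk format; adjust to match yours.
top_ids = np.asarray(indices).ravel()[:5]
context = "\n\n".join(processed_documents[int(i)]["text"] for i in top_ids)

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    "Question: Your question here"
)

model = genai.GenerativeModel("gemini-2.0-flash")
print(model.generate_content(prompt).text)
```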
## Architecture

The system follows a modular architecture with the following key components:

### Document Processing

- **PDF Extraction**: PyMuPDF-based text extraction with intelligent section detection
- **Text Chunking**: Semantic chunking with configurable overlap
- **Metadata Extraction**: Automatic extraction of paper metadata
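As a rough illustration of this stage (not the project's `pdf_processing.py`, which also detects sections), the sketch below extracts raw text with PyMuPDF and splits it into overlapping word windows. The file path and chunk sizes are placeholders.

```python
from typing import List

import fitz  # PyMuPDF


def extract_text(pdf_path: str) -> str:
    """Concatenate the plain text of every page in the PDF."""
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text() for page in doc)
    doc.close()
    return text


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into ~chunk_size-word chunks, each sharing `overlap` words with the previous one."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


# Placeholder path: any paper dropped into pdf_folder/pdfs/ works the same way.
chunks = chunk_text(extract_text("pdf_folder/pdfs/example_paper.pdf"))
print(f"{len(chunks)} chunks, first chunk starts with: {chunks[0][:80]!r}")
```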
### Embedding Generation

- **SciBERT Model**: Domain-specific embeddings for scientific text
- **Vector Generation**: Efficient embedding generation for documents and queries
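Under the hood, producing a 768-dimensional SciBERT vector with Hugging Face `transformers` follows the usual pattern shown below. Mean pooling over token states is an assumption made for this sketch, not a detail taken from this project's `SciBERTEmbeddings` class.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "allenai/scibert_scivocab_uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()


def embed(texts):
    """Return one mean-pooled 768-dimensional vector per input string."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)         # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)


vectors = embed(["Self-attention replaces recurrence in transformers."])
print(vectors.shape)  # torch.Size([1, 768])
```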
### Retrieval

- **FAISS HNSW Index**: High-performance similarity search
- **Semantic Matching**: Context-aware document retrieval
- **Ranking**: Distance-based relevance scoring

### Answer Generation

- **Google Gemini**: Large language model for answer generation
- **Context Assembly**: Intelligent context preparation for the LLM
- **Response Generation**: Structured answer generation
## Configuration

```bash
# Google Gemini API key (optional)
export GOOGLE_API_KEY="your-api-key"

# OpenMP workaround: allows execution when multiple OpenMP runtimes are loaded
# (common when combining FAISS and PyTorch)
export KMP_DUPLICATE_LIB_OK="TRUE"
```
The system uses the following default models:

- **Embedding Model**: `allenai/scibert_scivocab_uncased`
- **LLM**: `gemini-2.0-flash`
- **FAISS Index**: HNSW with 768 dimensions
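For reference, building an HNSW index over 768-dimensional vectors takes only a few lines with `faiss`. The sketch below uses random vectors and an illustrative neighbor count (`M=32`) rather than this project's exact index settings.

```python
import faiss
import numpy as np

dim = 768           # SciBERT hidden size, matching the default above
num_vectors = 1000  # stand-in for your document-chunk embeddings

# HNSW index with 32 neighbors per node (an illustrative, commonly used value).
index = faiss.IndexHNSWFlat(dim, 32)
index.hnsw.efConstruction = 200  # build-time speed/quality trade-off

vectors = np.random.rand(num_vectors, dim).astype("float32")
index.add(vectors)

# Query with a single embedding; returns L2 distances and vector indices.
query = np.random.rand(1, dim).astype("float32")
distances, indices = index.search(query, 5)
print(indices[0])
```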
## API Reference

### DocumentProcessor

Main class for processing documents and creating search indices.

```python
class DocumentProcessor:
    def __init__(self, dataset, output_dir, index_name="research_papers_index"): ...
    def process_dataset(self): ...
    def search_faiss_index(self, query_embedding, k=5): ...
```
### SciBERTEmbeddings

SciBERT-based embedding generation.

```python
class SciBERTEmbeddings:
    def __init__(self, model_name="allenai/scibert_scivocab_uncased"): ...
    def embed_documents(self, texts): ...
    def embed_query(self, text): ...
```
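As a quick sanity check (assuming the methods above return list- or array-like vectors), you can confirm that document and query embeddings share the 768 dimensions expected by the FAISS index:

```python
from embedding import SciBERTEmbeddings

embedder = SciBERTEmbeddings()

doc_vectors = embedder.embed_documents([
    "Transformers rely on self-attention instead of recurrence.",
    "FAISS enables efficient approximate nearest-neighbor search.",
])
query_vector = embedder.embed_query("How does self-attention work?")

# Both should be 768-dimensional to match the HNSW index.
print(len(doc_vectors), len(doc_vectors[0]), len(query_vector))
```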
## Contributing

We welcome contributions! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- **SciBERT**: For domain-specific embeddings
- **FAISS**: For efficient similarity search
- **Google Gemini**: For AI-powered question answering
- **PyMuPDF**: For PDF text extraction
## Support

If you encounter any issues or have questions, please:

- Check the existing issues
- Create a new issue with detailed information
- Include error messages and system information