# RAG for Research Papers

A comprehensive Retrieval-Augmented Generation (RAG) system designed to search, analyze, and extract insights from academic research papers. This project provides intelligent document processing, semantic search, and AI-powered question answering capabilities for research paper collections.
## Features

- **Multi-Modal Document Processing**: Support for both predefined corpora and user-uploaded PDF papers
- **Advanced Text Processing**: Intelligent section extraction, chunking, and metadata extraction
- **Semantic Search**: SciBERT embeddings with FAISS HNSW indexing for efficient similarity search
- **AI-Powered Q&A**: Integration with Google Gemini for intelligent question answering
- **Metadata Extraction**: Automatic extraction of paper metadata, including authors, abstracts, and citations
- **Flexible Architecture**: Modular design supporting different use cases and data sources
## Table of Contents

- Installation
- Quick Start
- Project Structure
- Usage Examples
- Architecture
- Configuration
- API Reference
- Contributing
- License
## Installation

### Prerequisites

- Python 3.8+
- CUDA-compatible GPU (optional, for faster processing)
### Setup

- Clone the repository:

  ```bash
  git clone <repository-url>
  cd RAG_for_Research_Papers-main
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Set up API keys (optional, for Gemini integration):

  ```bash
  export GOOGLE_API_KEY="your-api-key-here"
  ```
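If you plan to use the Gemini integration, the snippet below is a quick, standalone smoke test for the key. It is a sketch rather than this project's own wiring, and it assumes the `google-generativeai` client is installed; the model name matches the default listed under Configuration.

```python
import os

import google.generativeai as genai

# Fail early if the key exported above is not visible to Python.
api_key = os.environ.get("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("GOOGLE_API_KEY is not set; Gemini-powered answers will be unavailable.")

genai.configure(api_key=api_key)
model = genai.GenerativeModel("gemini-2.0-flash")
print(model.generate_content("Reply with OK if you can read this.").text)
```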
## Quick Start

### Processing Your Own PDFs

- Place your PDF files in the `pdf_folder/pdfs/` directory
- Run the processing pipeline:

  ```bash
  cd pdf_folder
  python main.py
  ```
- Query your papers:

  ```python
  # The system will automatically process your PDFs and create embeddings.
  # You can then ask questions about your papers.
  ```
### Using the Predefined Corpus

- Run with the arxiv dataset:

  ```bash
  cd corpus
  python main.py
  ```
- Query the corpus:

  ```python
  # The system will load the arxiv dataset and allow you to query it.
  ```
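Both pipelines end in the same retrieve-then-answer loop. The sketch below is purely illustrative (the real flow is whatever the respective `main.py` implements) and uses the classes documented under API Reference; `your_papers` is a placeholder for your processed paper records.

```python
from embedding import SciBERTEmbeddings
from faiss_index import DocumentProcessor

embedder = SciBERTEmbeddings()
processor = DocumentProcessor(dataset=your_papers, output_dir="processed_chunks")  # your_papers: placeholder
processor.process_dataset()

while True:
    question = input("Ask a question (blank line to quit): ").strip()
    if not question:
        break
    query_embedding = embedder.embed_query(question)
    distances, indices = processor.search_faiss_index(query_embedding, k=5)
    print("Closest chunks:", indices, "distances:", distances)
```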
## Project Structure

```
RAG_for_Research_Papers-main/
├── corpus/                       # Predefined corpus processing
│   ├── main.py                   # Main script for corpus processing
│   ├── embeddings.py             # Embedding generation
│   ├── faiss_indexing.py         # FAISS index creation
│   ├── text_processing.py        # Text preprocessing utilities
│   ├── vector_db_processing.py   # Vector database operations
│   └── output/                   # Processed chunks output
├── pdf_folder/                   # User PDF processing
│   ├── main.py                   # Main script for PDF processing
│   ├── embedding.py              # SciBERT embeddings
│   ├── faiss_index.py            # FAISS indexing implementation
│   ├── pdf_processing.py         # PDF text extraction
│   ├── text_utils.py             # Text utilities
│   ├── pdfs/                     # User PDF files
│   ├── processed_chunks_json/    # Processed text chunks
│   └── processed_papers.json     # Extracted paper data
├── meta_data/                    # Metadata extraction
│   ├── main.py                   # Metadata extraction script
│   ├── extract_metadata.py       # PDF metadata extraction
│   ├── embedding.py              # Metadata embeddings
│   ├── retrieve.py               # Metadata retrieval
│   ├── pdfs/                     # PDF files for metadata
│   └── pdf_metadata_output.json  # Extracted metadata
├── requirements.txt              # Python dependencies
├── diagram.svg                   # System architecture diagram
└── README.md                     # This file
```
## Usage Examples

### Basic Query

```python
# After processing your papers, you can query them like this:
query = "What are the main findings about transformer architectures?"

# The system will:
# 1. Generate embeddings for your query
# 2. Find the most relevant document chunks
# 3. Generate an AI-powered response
```
### Advanced Usage

```python
from embedding import SciBERTEmbeddings
from faiss_index import DocumentProcessor

# Initialize the embedding model
embedder = SciBERTEmbeddings()

# Process documents
processor = DocumentProcessor(
    dataset=your_papers,
    output_dir="processed_chunks"
)

# Create embeddings and index
processed_documents = processor.process_dataset()

# Search for relevant documents
query_embedding = embedder.embed_query("Your question here")
distances, indices = processor.search_faiss_index(query_embedding, k=5)
```
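To close the RAG loop, the retrieved chunks are assembled into a prompt for Gemini. The continuation below is a sketch of that step, not the project's exact prompt format: it assumes `processed_documents` is indexable by the FAISS ids and that each chunk exposes a `"text"` field, and it expects the Gemini key to be configured as described under Installation and Configuration.

```python
import google.generativeai as genai
import numpy as np

# Flatten the FAISS output (its shape depends on the wrapper) and gather the matching chunks.
# NOTE: the "text" field is an assumption about the processed-chunk format; adjust to match yours.
top_ids = np.asarray(indices).ravel()[:5]
context = "\n\n".join(processed_documents[int(i)]["text"] for i in top_ids)

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    "Question: Your question here"
)

model = genai.GenerativeModel("gemini-2.0-flash")
print(model.generate_content(prompt).text)
```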
## Architecture

The system follows a modular architecture with the following key components:

### Document Processing

- **PDF Extraction**: PyMuPDF-based text extraction with intelligent section detection
- **Text Chunking**: Semantic chunking with configurable overlap
- **Metadata Extraction**: Automatic extraction of paper metadata
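As a rough illustration of this stage (not the project's `pdf_processing.py`, which also detects sections), the sketch below extracts raw text with PyMuPDF and splits it into overlapping word windows. The file path and chunk sizes are placeholders.

```python
from typing import List

import fitz  # PyMuPDF


def extract_text(pdf_path: str) -> str:
    """Concatenate the plain text of every page in the PDF."""
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text() for page in doc)
    doc.close()
    return text


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into ~chunk_size-word chunks, each sharing `overlap` words with the previous one."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


# Placeholder path: any paper dropped into pdf_folder/pdfs/ works the same way.
chunks = chunk_text(extract_text("pdf_folder/pdfs/example_paper.pdf"))
print(f"{len(chunks)} chunks, first chunk starts with: {chunks[0][:80]!r}")
```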
### Embedding Generation

- **SciBERT Model**: Domain-specific embeddings for scientific text
- **Vector Generation**: Efficient embedding generation for documents and queries
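Under the hood, producing a 768-dimensional SciBERT vector with Hugging Face `transformers` follows the usual pattern shown below. Mean pooling over token states is an assumption made for this sketch, not a detail taken from this project's `SciBERTEmbeddings` class.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "allenai/scibert_scivocab_uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()


def embed(texts):
    """Return one mean-pooled 768-dimensional vector per input string."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)         # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)


vectors = embed(["Self-attention replaces recurrence in transformers."])
print(vectors.shape)  # torch.Size([1, 768])
```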
### Retrieval

- **FAISS HNSW Index**: High-performance similarity search
- **Semantic Matching**: Context-aware document retrieval
- **Ranking**: Distance-based relevance scoring

### Answer Generation

- **Google Gemini**: Large language model for answer generation
- **Context Assembly**: Intelligent context preparation for the LLM
- **Response Generation**: Structured answer generation
## Configuration

```bash
# Google Gemini API key (optional)
export GOOGLE_API_KEY="your-api-key"

# OpenMP workaround: allows execution when multiple OpenMP runtimes are loaded
# (common when combining FAISS and PyTorch)
export KMP_DUPLICATE_LIB_OK="TRUE"
```
The system uses the following default models:

- **Embedding Model**: `allenai/scibert_scivocab_uncased`
- **LLM**: `gemini-2.0-flash`
- **FAISS Index**: HNSW with 768 dimensions
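For reference, building an HNSW index over 768-dimensional vectors takes only a few lines with `faiss`. The sketch below uses random vectors and an illustrative neighbor count (`M=32`) rather than this project's exact index settings.

```python
import faiss
import numpy as np

dim = 768           # SciBERT hidden size, matching the default above
num_vectors = 1000  # stand-in for your document-chunk embeddings

# HNSW index with 32 neighbors per node (an illustrative, commonly used value).
index = faiss.IndexHNSWFlat(dim, 32)
index.hnsw.efConstruction = 200  # build-time speed/quality trade-off

vectors = np.random.rand(num_vectors, dim).astype("float32")
index.add(vectors)

# Query with a single embedding; returns L2 distances and vector indices.
query = np.random.rand(1, dim).astype("float32")
distances, indices = index.search(query, 5)
print(indices[0])
```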
## API Reference

### DocumentProcessor

Main class for processing documents and creating search indices.

```python
class DocumentProcessor:
    def __init__(self, dataset, output_dir, index_name="research_papers_index"): ...
    def process_dataset(self): ...
    def search_faiss_index(self, query_embedding, k=5): ...
```
### SciBERTEmbeddings

SciBERT-based embedding generation.

```python
class SciBERTEmbeddings:
    def __init__(self, model_name="allenai/scibert_scivocab_uncased"): ...
    def embed_documents(self, texts): ...
    def embed_query(self, text): ...
```
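As a quick sanity check (assuming the methods above return list- or array-like vectors), you can confirm that document and query embeddings share the 768 dimensions expected by the FAISS index:

```python
from embedding import SciBERTEmbeddings

embedder = SciBERTEmbeddings()

doc_vectors = embedder.embed_documents([
    "Transformers rely on self-attention instead of recurrence.",
    "FAISS enables efficient approximate nearest-neighbor search.",
])
query_vector = embedder.embed_query("How does self-attention work?")

# Both should be 768-dimensional to match the HNSW index.
print(len(doc_vectors), len(doc_vectors[0]), len(query_vector))
```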
## Contributing

We welcome contributions! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- **SciBERT**: For domain-specific embeddings
- **FAISS**: For efficient similarity search
- **Google Gemini**: For AI-powered question answering
- **PyMuPDF**: For PDF text extraction
## Support

If you encounter any issues or have questions, please:

- Check the existing issues
- Create a new issue with detailed information
- Include error messages and system information